
CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought


This is the codebase of the paper: CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought (arXiv).

Author List: Boxuan Zhang, Ruqi Zhang


Figure: Comparison of existing UQ strategies with CoT-UQ.

[2025/02/21] 🔥 We are releasing CoT-UQ version 1.0 for running on Llama Family models.

Getting Started

1. Install Dependencies

Install the required dependencies:

pip install -r requirement.txt

2. Data Preparation

  • The datasets used in the paper are provided in CoT-UQ/dataset/.

  • You can also add your own dataset, in JSON format, to CoT-UQ/dataset/.

  • Configure how your dataset is loaded in CoT-UQ/utils.py.

Example

if args.dataset.lower() == 'gsm8k':
    for idx, line in enumerate(json_data):
        q = line['question']
        a = float(line['answer'])        # GSM8K answers are numeric
        id = 'temp_{}'.format(idx)
        # append inside the loop so every example is kept
        questions.append(q)
        answers.append(a)
        ids.append(id)
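
To load a custom dataset, a similar branch can be added in utils.py. The sketch below is illustrative only: the dataset key my_dataset and the question/answer field names are placeholders for whatever your JSON file actually contains.

elif args.dataset.lower() == 'my_dataset':       # 'my_dataset' is a placeholder key
    for idx, line in enumerate(json_data):
        questions.append(line['question'])       # question text field (adjust to your schema)
        answers.append(line['answer'])           # gold answer field (string or number)
        ids.append('temp_{}'.format(idx))        # simple running id, matching the example above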

3. Running CoT-UQ Pipeline

Download the Llama Family model weights from https://huggingface.co/meta-llama.

run_llama_pipeline.sh is a script that executes all steps of our pipeline on the Llama Family.

The components of our pipeline are:

  • inference_refining.py refines the multi-step inference by extracting keywords from each reasoning step together with their importance scores for the final answer.
  • stepuq.py integrates this reasoning information into two common UQ strategies: aggregated probabilities (AP) and self-evaluation (SE). A toy sketch of the AP idea follows the run command below.

For instance, to run the full pipeline with Llama-3.1-8B, the probas_mean UQ engine, and the hotpotQA dataset, writing outputs to output/llama-3.1-8B/:

sh run_llama_pipeline.sh llama3-1_8B probas_mean hotpotQA output/llama-3.1-8B/
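
As a rough illustration of the aggregated-probabilities (AP) idea, not the repository's exact implementation, the response-level confidence can be computed as an importance-weighted average of the keyword token probabilities extracted during reasoning:

import numpy as np

def weighted_confidence(keyword_probs, importance_scores):
    """Toy sketch: aggregate keyword probabilities weighted by their importance scores."""
    p = np.asarray(keyword_probs, dtype=float)
    w = np.asarray(importance_scores, dtype=float)
    w = w / w.sum()                  # normalize importance scores into weights
    return float(np.sum(w * p))      # weighted mean = response-level confidence

# Example: three extracted keywords with importance scores 8, 3, and 9 (values are illustrative)
print(weighted_confidence([0.92, 0.40, 0.75], [8, 3, 9]))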

4. Analyzing Results

After running the pipeline, use analyze_result.py to compute performance metrics, such as the AUROC.

python analyze_result.py --uq_engine probas_mean --dataset hotpotQA --output_path output/llama-3.1-8B/
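
AUROC here measures how well the confidence scores separate correct from incorrect answers. A minimal sketch of the metric (using scikit-learn; the numbers are made up for illustration):

from sklearn.metrics import roc_auc_score

# 1 = model answer judged correct, 0 = incorrect
labels      = [1, 0, 1, 1, 0]
# confidence scores produced by the chosen uq_engine
confidences = [0.91, 0.35, 0.78, 0.66, 0.42]

# AUROC: probability that a correct answer receives a higher confidence
# than an incorrect one (0.5 = random, 1.0 = perfect separation)
print(roc_auc_score(labels, confidences))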

Main Results


  • CoT-UQ consistently improves UQ performance across all tasks and datasets.
  • This demonstrates that incorporating reasoning into uncertainty quantification enables LLMs to provide more calibrated assessments of the trustworthiness of their generated outputs.
  • In general, CoT-UQ achieves greater improvements when applied to AP strategies compared to SE strategies, particularly for Probas-min, where it increases AUROC by up to 16.8%.

Citation

If you find our paper and repo useful, please cite our paper:

@article{zhang2025cot,
    title={CoT-UQ: Improving Response-wise Uncertainty Quantification in LLMs with Chain-of-Thought},
    author={Zhang, Boxuan and Zhang, Ruqi},
    journal={arXiv preprint arXiv:2502.17214},
    year={2025}
} 
