Dechive

What is Self-Consistency? Improving AI Accuracy Through Multiple Reasoning Paths

When asking the same question multiple times becomes a strategy: how to dramatically increase AI accuracy with Self-Consistency.

Introduction: Turning 'Inconsistency' into a Weapon

There is a common, bewildering moment that beginners in prompt engineering experience. A prompt that worked perfectly yesterday gives a different answer today. You run the same prompt twice and get different results. At first, you think it's a bug. But even after refining the prompt or running it again, the results are slightly different each time.

This is not a bug. It is a designed characteristic of LLMs.

When generating text, language models sample the next token from a probability distribution rather than deterministically selecting only the token with the highest probability. Depending on the configured temperature value, they randomly select from the probability distribution. This is why different answers emerge from the same prompt. In creative writing, this is an advantage. But what about tasks like math problems, logical reasoning, and fact checking?

If an AI gives different answers every time for a problem with one correct answer, which one should you trust?

Self-Consistency (Wang et al., 2022) is a powerful answer to this question. The core intuition is simple: correct reasoning paths converge, while incorrect reasoning paths diverge. When you run the same question multiple times, incorrect answers scatter in different directions, while the correct answer appears repeatedly. By capturing this convergence pattern and selecting the final answer by majority vote, you can achieve much higher accuracy than a single run.


1. Why Single Execution Fails

1.1. Temperature and the Essence of Sampling

Understanding how LLMs generate text gives intuitive insight into why Self-Consistency is effective.

The model computes a probability distribution across the entire vocabulary each time it generates a token. For example, the probability distribution of words that follow "An apple is ___" might be "red(0.55)", "green(0.30)", "yellow(0.10)", "blue(0.02)". There are two strategies here.

Greedy Decoding: Always select the token with the highest probability. It's deterministic and reproducible. However, because it follows the local optimum at each step, the overall quality of the sentence often suffers. To use a travel analogy, it's like always choosing the most popular road at each fork: it seems safe, but you end up on the most generic route.

Sampling: If the temperature value is greater than 0, you randomly sample from the probability distribution. With temperature = 1.0, you sample from the original distribution as-is, and as the value increases, lower probability tokens become more likely to be selected. Thanks to this randomness, the same prompt produces different, sometimes more creative results each time.

The problem lies in reasoning tasks. When you need a correct answer rather than a creative one, the randomness of sampling becomes a source of errors. If a single sampling enters an incorrect reasoning path, that error carries through to the final answer.
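The mechanics above can be sketched in a few lines of Python. The toy distribution is the "An apple is ___" example from earlier; the re-scaling rule (raising each probability to the power 1/T and renormalizing) is mathematically equivalent to the softmax-over-scaled-logits computation real models perform, but everything here is a simplified illustration, not any particular model's implementation.

```python
import random

def sample_token(probs, temperature):
    """Pick the next token from a {token: probability} dict.

    temperature == 0 is greedy decoding (always the argmax);
    T > 0 re-scales each probability to p**(1/T) and renormalizes,
    which is equivalent to applying softmax(logits / T).
    """
    if temperature == 0:
        return max(probs, key=probs.get)
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    r = random.uniform(0, total)
    cum = 0.0
    for tok, w in scaled.items():
        cum += w
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding

# The toy "An apple is ___" distribution from above.
dist = {"red": 0.55, "green": 0.30, "yellow": 0.10, "blue": 0.02}

print(sample_token(dist, 0))    # always "red" (greedy)
print(sample_token(dist, 1.0))  # usually "red", sometimes another color
```

Raising the temperature above 1.0 flattens the scaled distribution, so "blue" gets picked more often; lowering it toward 0 sharpens the distribution toward greedy behavior.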

1.2. Vulnerability of Single Execution

Let's look at a concrete example. You ask an AI this math problem:

A store bought 12 apples. You ate 1/3 of them,
and gave half of the remaining to a friend. How many apples are left?

In a single execution, the model can take any of several reasoning paths.

  • Path A (Correct): 12 → eat 1/3(4) → 8 left → give half(4) → 4 left
  • Path B (Error): 12 → eat 1/3(4) → 8 left → misunderstand "half of remaining" as half of total → 2 left
  • Path C (Error): 12 → "ate 1/3 and gave half" direct calculation → 12 × (1 - 1/3 - 1/2) = 2 → 2 left

A single execution randomly takes one of these three paths. Which path it takes depends on the luck of sampling. But what if we run it multiple times?
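To see how repetition exposes this structure, here is a small simulation. The path probabilities are illustrative assumptions, not measured values: each run stochastically picks one of the three paths above, and a 5-run majority vote recovers the correct answer noticeably more often than a single run does.

```python
import random
from collections import Counter

# Illustrative (assumed, not measured) probabilities for the three paths.
PATHS = [
    ("A", 4, 0.70),  # correct path -> answer 4
    ("B", 2, 0.20),  # error path   -> answer 2
    ("C", 2, 0.10),  # error path   -> answer 2
]

def single_run():
    """One execution: stochastically pick a reasoning path."""
    r = random.random()
    cum = 0.0
    for _, answer, p in PATHS:
        cum += p
        if r < cum:
            return answer
    return PATHS[-1][1]

def majority_vote(n_runs):
    votes = Counter(single_run() for _ in range(n_runs))
    return votes.most_common(1)[0][0]

random.seed(0)
single = sum(single_run() == 4 for _ in range(1000)) / 1000
voted = sum(majority_vote(5) == 4 for _ in range(1000)) / 1000
print(single, voted)  # voted lands noticeably higher than single
```

With a 70% chance per run of taking the correct path, the simulated 5-run majority lands near the ~84% figure derived analytically in Section 2.2.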


2. How Self-Consistency Works

2.1. Sample → Reason → Aggregate

Self-Consistency operates in three stages.

Step 1 — Sample Multiple Reasoning Paths

Execute the identical prompt multiple times. Set temperature to a non-zero value (usually 0.7–0.9) to encourage each execution to explore different reasoning paths. The number of executions depends on the task difficulty and cost tolerance, but generally 5–20 runs provide sufficient convergence.

Step 2 — Reason Independently on Each Path

Each execution must solve the problem independently without referencing the others. When combined with Chain of Thought, it becomes even more powerful. Adding the instruction "think step by step" makes each execution explicitly describe the reasoning process, allowing you to trace the path by which the final answer is derived.

Step 3 — Aggregate Final Answer

Collect the final answers from all executions and determine the final answer by majority vote. If 5 executions yield 4, 4, 2, 4, 4, the final answer is "4". In this process, the reasoning paths are discarded and only the final answers are aggregated. The fact that diverse paths converged on the same answer increases the credibility of that answer.
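The three stages can be sketched end to end as follows. `call_model` is a placeholder for whatever LLM API you actually use (an assumption of this sketch); here it cycles through canned responses mirroring the 4, 4, 2, 4, 4 example above so the code is self-contained.

```python
import itertools
import re
from collections import Counter

# Canned completions standing in for real LLM outputs; in real use,
# call_model would hit an actual chat-completion API with T > 0.
_canned = itertools.cycle([
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 2",  # one run takes an error path
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
])

def call_model(prompt, temperature=0.7):
    """Placeholder for a real chat-completion call (hypothetical)."""
    return next(_canned)

def self_consistency(prompt, n_samples=5):
    # Step 1: sample multiple reasoning paths (same prompt, T > 0)
    completions = [call_model(prompt) for _ in range(n_samples)]
    # Step 2: each completion reasons independently; extract its answer
    answers = [m.group(1) for m in
               (re.search(r"Answer:\s*(\S+)", c) for c in completions) if m]
    # Step 3: aggregate the final answers by majority vote
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("A store bought 12 apples... How many are left?"))  # 4
```

Note that the reasoning text is discarded at Step 3; only the extracted final answers participate in the vote, exactly as described above.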

2.2. Why Majority Voting Increases Accuracy

Let's understand it with mathematical intuition.

Assume there's a problem where a single execution has a 70% chance of producing the correct answer. What would be the accuracy if you ran this problem 5 times independently and took the majority vote?

For the majority vote to be wrong, 3 or more of the 5 runs must be wrong. Computing this probability:

P(majority wrong) = P(3+ errors)
= C(5,3)×(0.3)³×(0.7)² + C(5,4)×(0.3)⁴×(0.7)¹ + C(5,5)×(0.3)⁵
= 0.1323 + 0.0284 + 0.0024
≈ 0.163

In other words, a single-execution accuracy of 70% improves to approximately 84% through 5-run majority voting. Increasing to 9 runs pushes it past 90%.
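The computation generalizes to any single-run accuracy p and any odd run count n (odd so that no ties occur); a minimal sketch:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that the majority of n independent runs is correct,
    given single-run accuracy p. n should be odd so ties are impossible."""
    k_min = n // 2 + 1  # smallest number of correct runs that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

print(round(majority_accuracy(0.7, 5), 3))  # 0.837
print(round(majority_accuracy(0.7, 9), 3))  # 0.901
```

The same function also shows the failure mode discussed next: for p below 0.5, `majority_accuracy` *decreases* as n grows, because the vote converges on the wrong answer.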

This effect only works on tasks where single-run accuracy clearly exceeds 50%. If a problem is more likely to be answered wrong than right in a single execution, the majority vote will also converge on the wrong answer. This is why you must distinguish between tasks where Self-Consistency can be applied and tasks where it cannot.


3. When to Use and When to Abandon

As powerful as Self-Consistency is, applying it indiscriminately is inefficient. You must clearly distinguish between suitable and unsuitable tasks.

3.1. Effective Tasks

Self-Consistency is most effective on tasks where there is objectively one correct answer or verifiable criteria exist.

| Task Type | Examples | Effectiveness |
| --- | --- | --- |
| Math calculations and reasoning | Multi-step operations, probability problems, geometry | Very high |
| Logical reasoning | Syllogisms, conditional reasoning, puzzles | Very high |
| Fact-based classification | Sentiment classification, category classification, error detection | High |
| Code debugging | Identifying bug causes, determining fixes | High |
| Medical diagnostic assistance | Symptom-based possibility reasoning | Medium~High |

3.2. Tasks Where It's Ineffective or Counterproductive

| Task Type | Reason |
| --- | --- |
| Creative writing | No correct answer; diversity itself is valuable |
| Subjective evaluation | Taste and style are not subjects of majority voting |
| Open-ended questions | Questions like "What is the meaning of life?" |
| Simple fact lookup | A single execution suffices for obvious facts |
| Context-dependent responses | Cases where previous conversation context must be remembered |

Applying Self-Consistency to creative work results in the most "average" idea being selected. Among multiple results, the most unique and creative ones are eliminated in majority voting, while the most mundane is selected—a counterproductive effect.


4. Practical Implementation: Three Patterns

4.1. Pattern 1 — Simple Majority Vote

The most basic form. Execute the same prompt N times and choose the most frequent answer.

[Self-Consistency Prompt Template]

You are an expert at solving math/logic problems step by step.
When given a problem, you must follow this procedure:

1. Re-read the problem and organize the key conditions
2. Explicitly describe the solution process step by step (CoT)
3. Conclude with the final answer in the format "Answer: [answer]"

Problem: {{problem}}

Run this prompt 5–10 times, then aggregate the values appearing after "Answer: ___". The most frequently appearing value is the final answer.

Actual Execution Example:

Run 1: Answer: 4
Run 2: Answer: 4
Run 3: Answer: 2  ← Error path
Run 4: Answer: 4
Run 5: Answer: 4

→ Majority vote: 4 (4/5)  ✅
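A minimal aggregation script for this pattern might look like the following; the run outputs are abbreviated stand-ins mirroring the execution example above.

```python
import re
from collections import Counter

# Raw outputs of the five runs (abbreviated to their final lines).
run_outputs = [
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 2",
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 4",
]

def extract_answer(text):
    """Pull the value after 'Answer:' from one run's output."""
    m = re.search(r"Answer:\s*(.+)", text)
    return m.group(1).strip() if m else None

votes = Counter(a for a in map(extract_answer, run_outputs) if a is not None)
answer, count = votes.most_common(1)[0]
print(f"Majority vote: {answer} ({count}/{len(run_outputs)})")  # Majority vote: 4 (4/5)
```

Runs whose output fails to match the "Answer:" format are simply dropped from the vote; if that happens often, tightening the output format (the topic of Part 9) is the fix.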

4.2. Pattern 2 — Confidence-Weighted Aggregation

A more sophisticated approach than simple majority voting. Have the model express its confidence in the answer for each run, and assign greater weight to answers with higher confidence levels.

[Confidence-Weighted Prompt]

Solve the following problem. Along with the solution process, 
you must conclude with the following format:

Answer: [value]
Confidence: [Low / Medium / High]
Reasoning: [One-line reason for choosing this confidence level]

Problem: {{problem}}

During aggregation, apply weight 3 for "High", 2 for "Medium", and 1 for "Low". Even if the same answer appears multiple times, a single answer with "High" confidence is more trustworthy than multiple answers with "Low" confidence.
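Assuming the answer and confidence label have already been parsed out of each run, the weighted tally might look like this (the weights follow the 3/2/1 scheme above; the parsed pairs are illustrative):

```python
from collections import defaultdict

# Weight 3 for High, 2 for Medium, 1 for Low, as described above.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

# (answer, confidence) pairs already parsed from each run; illustrative values.
parsed_runs = [
    ("4", "High"),
    ("4", "Medium"),
    ("2", "Low"),
    ("4", "High"),
    ("2", "Low"),
]

def weighted_vote(runs):
    scores = defaultdict(int)
    for answer, confidence in runs:
        scores[answer] += WEIGHTS.get(confidence, 1)  # unknown labels count as Low
    return max(scores, key=scores.get)

print(weighted_vote(parsed_runs))  # 4  (score 8 vs. 2)
```

Here "4" wins 8 to 2; note that even if the split were 3 votes to 2, two Low-confidence answers could be outweighed by a single High-confidence one.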

This pattern is especially effective when the model is good at recognizing its own uncertainty. State-of-the-art models (GPT-4o, Claude Opus, etc.) have superior metacognitive abilities, making the reliability of confidence descriptions higher.

4.3. Pattern 3 — Meta-Judge Pattern

The most powerful but most expensive approach. Collect all results from multiple executions and pass them to a separate prompt, allowing the AI to judge which answer is most trustworthy.

[Meta-Judge Prompt]

Below are 5 independently conducted solutions to the same problem.
Review each solution and select the most logically sound answer.

Problem: {{problem}}

Solution 1: {{solution_1}}
Solution 2: {{solution_2}}
Solution 3: {{solution_3}}
Solution 4: {{solution_4}}
Solution 5: {{solution_5}}

Evaluation criteria:
- Is the logical connection of each step consistent?
- Does it correctly reflect all conditions of the problem?
- Are there any calculation errors?

Final Judgment:
- Most trustworthy solution: [number]
- Reasoning: [brief explanation]
- Final answer: [value]

The meta-judge pattern, unlike simple majority voting, also provides an explanation for "why is this answer more trustworthy?" It goes beyond simply obtaining an answer and is especially useful when you want to evaluate the quality of the reasoning process itself.
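Assembling the meta-judge prompt from the collected run outputs is straightforward string templating; a sketch (the function name `build_meta_judge_prompt` is illustrative, and the template mirrors the one above):

```python
META_JUDGE_TEMPLATE = """\
Below are {n} independently conducted solutions to the same problem.
Review each solution and select the most logically sound answer.

Problem: {problem}

{solutions}

Evaluation criteria:
- Is the logical connection of each step consistent?
- Does it correctly reflect all conditions of the problem?
- Are there any calculation errors?
"""

def build_meta_judge_prompt(problem, solutions):
    """Join the collected run outputs into the meta-judge prompt."""
    body = "\n\n".join(
        f"Solution {i}: {s}" for i, s in enumerate(solutions, start=1)
    )
    return META_JUDGE_TEMPLATE.format(
        n=len(solutions), problem=problem, solutions=body
    )

prompt = build_meta_judge_prompt(
    "A store bought 12 apples...", ["solution text 1", "solution text 2"]
)
print("Solution 2: solution text 2" in prompt)  # True
```

Because the number of solutions is computed from the list, the same builder works whether you collected 3 samples or 20.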


5. Temperature and Number of Executions Settings

Two questions most commonly arise when applying Self-Consistency in practice: What temperature should I set, and how many times should I run it?

5.1. Temperature Settings

Setting temperature = 0 makes the model always produce the same answer. It approaches Greedy Decoding. In this case, running multiple times yields identical answers, so Self-Consistency becomes meaningless.

Conversely, setting temperature too high (1.5 or above) increases diversity across executions but also increases noise. The model operates too randomly and fails to maintain proper reasoning paths.

Empirically, the optimal temperature range for Self-Consistency is 0.5 ~ 0.9.

| Temperature | Characteristics | Self-Consistency Suitability |
| --- | --- | --- |
| 0.0 | Deterministic, no diversity | ❌ Meaningless |
| 0.1 ~ 0.4 | Stable but low diversity | 🔺 Limited effect |
| 0.5 ~ 0.7 | Balance of diversity and stability | ✅ Recommended (fact-based tasks) |
| 0.7 ~ 0.9 | High diversity | ✅ Recommended (logical reasoning tasks) |
| 1.0+ | High creativity, increased noise | ❌ Not recommended |

5.2. Number of Executions Settings

More executions lead to higher accuracy, but cost and time increase linearly.

The original Self-Consistency paper experimented with 10–40 samples depending on the task. However, in practical applications, 5–10 runs is the optimal balance between cost and effectiveness. For less difficult tasks, 5 runs alone can achieve sufficient convergence.

Determine the number of executions using these criteria.

  • 3 runs: Exploratory judgment. When grasping rough trends.
  • 5 runs: Sufficient confidence for general reasoning tasks.
  • 10 runs: When high confidence is required for important decisions.
  • 20+ runs: Tasks where error costs are extremely high, such as medical or legal contexts.
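The binomial math from Section 2.2 can also run in reverse: given a single-run accuracy and a target confidence, find the smallest odd run count that reaches it. A sketch (the function names are illustrative):

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent runs is correct) for single-run accuracy p.
    n should be odd so ties are impossible."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def runs_needed(p, target, max_runs=41):
    """Smallest odd run count whose majority-vote accuracy reaches target.

    Returns None when p <= 0.5 (voting converges on the wrong answer)
    or when the target is unreachable within max_runs.
    """
    if p <= 0.5:
        return None
    for n in range(1, max_runs + 1, 2):
        if majority_accuracy(p, n) >= target:
            return n
    return None

print(runs_needed(0.7, 0.90))  # 9
print(runs_needed(0.7, 0.95))
```

This makes the cost trade-off explicit: for a task with 70% single-run accuracy, 9 runs are enough for 90% confidence, while higher targets grow the run count quickly.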

Conclusion: Design the Execution Method, Not the Prompt

Self-Consistency differs in character from the techniques covered so far. It doesn't change the content of the prompt but rather how the prompt is executed. Instead of asking the same question just once, you explore through multiple paths and select the answer that these paths commonly indicate.

This matters because the source of credibility changes. When an AI says "the answer is 4" in a single execution, that confidence is internal certainty encoded in the model's weights. There's no way to verify it directly. In contrast, the fact that 9 out of 10 Self-Consistency runs converge on "4" is externally observable credibility. It can be expressed numerically, reproduced, and compared.

From this perspective, Self-Consistency transcends being a simple accuracy-improvement technique and becomes a methodology for quantifying the credibility of AI output.

Connection with Part 7: Automating with Templates

Using the variables and templates learned in Part 7, you can structure a Self-Consistency pipeline. By making the sampling prompt, aggregation logic, and meta-judge prompt each independent modules and combining them, you can complete a reusable Self-Consistency system.

Toward Part 9: Structured Output

There is one practical challenge in the aggregation stage of Self-Consistency: how to parse the final answer from each execution. If the model provides answers in free-form sentences, it becomes difficult to write aggregation scripts and errors easily occur.

In the following [Part 9: Structured Output – Reliably Getting JSON 99% of the Time], we cover techniques for fixing AI output format to JSON or predefined schemas. This is essential technology not only for Self-Consistency but for all situations where AI output must be processed by code.
