Dechive

What is Self-Consistency? Improving AI Accuracy Through Multiple Reasoning Paths

When asking the same question multiple times becomes a strategy: how to dramatically increase AI accuracy with Self-Consistency.

Introduction: Turning 'Inconsistency' into a Weapon

There is a common, bewildering moment that beginners in prompt engineering experience. A prompt that worked perfectly yesterday gives a different answer today. You run the same prompt twice and get different results. At first, you think it's a bug. But even after refining the prompt or running it again, the results are slightly different each time.

This is not a bug. It is a designed characteristic of LLMs.

When generating text, language models sample the next token from a probability distribution rather than deterministically selecting only the token with the highest probability. Depending on the configured temperature value, they randomly select from the probability distribution. This is why different answers emerge from the same prompt. In creative writing, this is an advantage. But what about tasks like math problems, logical reasoning, and fact checking?

If an AI gives different answers every time for a problem with one correct answer, which one should you trust?

Self-Consistency (Wang et al., 2022) is a powerful answer to this question. The core intuition is simple: correct reasoning paths converge, while incorrect reasoning paths diverge. When you run the same question multiple times, incorrect answers scatter in different directions, while the correct answer appears repeatedly. By capturing this convergence pattern and selecting the final answer by majority vote, you can achieve much higher accuracy than a single run.


1. Why Single Execution Fails

1.1. Temperature and the Essence of Sampling

Understanding how LLMs generate text gives intuitive insight into why Self-Consistency is effective.

The model computes a probability distribution across the entire vocabulary each time it generates a token. For example, the probability distribution of words that follow "An apple is ___" might be "red(0.55)", "green(0.30)", "yellow(0.10)", "blue(0.02)". There are two strategies here.

Greedy Decoding: Always select the token with the highest probability. It's deterministic and reproducible. However, because it follows the local optimum at each step, the overall quality of the sentence often suffers. To use a travel analogy, it's like always choosing the most popular road at each fork: it seems safe, but you end up on the most generic route.

Sampling: If the temperature value is greater than 0, you randomly sample from the probability distribution. With temperature = 1.0, you sample from the original distribution as-is, and as the value increases, lower probability tokens become more likely to be selected. Thanks to this randomness, the same prompt produces different, sometimes more creative results each time.

The problem lies in reasoning tasks. When you need a correct answer rather than a creative one, the randomness of sampling becomes a source of errors. If a single sampling enters an incorrect reasoning path, that error carries through to the final answer.
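The mechanics above can be sketched in a few lines of Python. The toy distribution is the "An apple is ___" example from earlier; the re-scaling rule (raising each probability to the power 1/T and renormalizing) is mathematically equivalent to the softmax-over-scaled-logits computation real models perform, but everything here is a simplified illustration, not any particular model's implementation.

```python
import random

def sample_token(probs, temperature):
    """Pick the next token from a {token: probability} dict.

    temperature == 0 is greedy decoding (always the argmax);
    T > 0 re-scales each probability to p**(1/T) and renormalizes,
    which is equivalent to applying softmax(logits / T).
    """
    if temperature == 0:
        return max(probs, key=probs.get)
    scaled = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
    total = sum(scaled.values())
    r = random.uniform(0, total)
    cum = 0.0
    for tok, w in scaled.items():
        cum += w
        if r <= cum:
            return tok
    return tok  # guard against floating-point rounding

# The toy "An apple is ___" distribution from above.
dist = {"red": 0.55, "green": 0.30, "yellow": 0.10, "blue": 0.02}

print(sample_token(dist, 0))    # always "red" (greedy)
print(sample_token(dist, 1.0))  # usually "red", sometimes another color
```

Raising the temperature above 1.0 flattens the scaled distribution, so "blue" gets picked more often; lowering it toward 0 sharpens the distribution toward greedy behavior.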

1.2. Vulnerability of Single Execution

Let's look at a concrete example. You ask an AI this math problem:

A store bought 12 apples. You ate 1/3 of them,
and gave half of the remaining to a friend. How many apples are left?

In a single execution, the model can take any of several reasoning paths.

  • Path A (Correct): 12 → eat 1/3(4) → 8 left → give half(4) → 4 left
  • Path B (Error): 12 → eat 1/3(4) → 8 left → misunderstand "half of remaining" as half of total → 2 left
  • Path C (Error): 12 → "ate 1/3 and gave half" direct calculation → 12 × (1 - 1/3 - 1/2) = 2 → 2 left

A single execution randomly takes one of these three paths. Which path it takes depends on the luck of sampling. But what if we run it multiple times?
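To see how repetition exposes this structure, here is a small simulation. The path probabilities are illustrative assumptions, not measured values: each run stochastically picks one of the three paths above, and a 5-run majority vote recovers the correct answer noticeably more often than a single run does.

```python
import random
from collections import Counter

# Illustrative (assumed, not measured) probabilities for the three paths.
PATHS = [
    ("A", 4, 0.70),  # correct path -> answer 4
    ("B", 2, 0.20),  # error path   -> answer 2
    ("C", 2, 0.10),  # error path   -> answer 2
]

def single_run():
    """One execution: stochastically pick a reasoning path."""
    r = random.random()
    cum = 0.0
    for _, answer, p in PATHS:
        cum += p
        if r < cum:
            return answer
    return PATHS[-1][1]

def majority_vote(n_runs):
    votes = Counter(single_run() for _ in range(n_runs))
    return votes.most_common(1)[0][0]

random.seed(0)
single = sum(single_run() == 4 for _ in range(1000)) / 1000
voted = sum(majority_vote(5) == 4 for _ in range(1000)) / 1000
print(single, voted)  # voted lands noticeably higher than single
```

With a 70% chance per run of taking the correct path, the simulated 5-run majority lands near the ~84% figure derived analytically in Section 2.2.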


2. How Self-Consistency Works

2.1. Sample → Reason → Aggregate

Self-Consistency operates in three stages.

Step 1 — Sample Multiple Reasoning Paths

Execute the identical prompt multiple times. Set temperature to a non-zero value (usually 0.7–0.9) to encourage each execution to explore different reasoning paths. The number of executions depends on the task difficulty and cost tolerance, but generally 5–20 runs provide sufficient convergence.

Step 2 — Reason Independently on Each Path

Each execution must solve the problem independently without referencing the others. When combined with Chain of Thought, it becomes even more powerful. Adding the instruction "think step by step" makes each execution explicitly describe the reasoning process, allowing you to trace the path by which the final answer is derived.

Step 3 — Aggregate Final Answer

Collect the final answers from all executions and determine the final answer by majority vote. If 5 executions yield 4, 4, 2, 4, 4, the final answer is "4". In this process, the reasoning paths are discarded and only the final answers are aggregated. The fact that diverse paths converged on the same answer increases the credibility of that answer.
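The three stages can be sketched end to end as follows. `call_model` is a placeholder for whatever LLM API you actually use (an assumption of this sketch); here it cycles through canned responses mirroring the 4, 4, 2, 4, 4 example above so the code is self-contained.

```python
import itertools
import re
from collections import Counter

# Canned completions standing in for real LLM outputs; in real use,
# call_model would hit an actual chat-completion API with T > 0.
_canned = itertools.cycle([
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 2",  # one run takes an error path
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
    "Step 1: ...\nStep 2: ...\nAnswer: 4",
])

def call_model(prompt, temperature=0.7):
    """Placeholder for a real chat-completion call (hypothetical)."""
    return next(_canned)

def self_consistency(prompt, n_samples=5):
    # Step 1: sample multiple reasoning paths (same prompt, T > 0)
    completions = [call_model(prompt) for _ in range(n_samples)]
    # Step 2: each completion reasons independently; extract its answer
    answers = [m.group(1) for m in
               (re.search(r"Answer:\s*(\S+)", c) for c in completions) if m]
    # Step 3: aggregate the final answers by majority vote
    return Counter(answers).most_common(1)[0][0]

print(self_consistency("A store bought 12 apples... How many are left?"))  # 4
```

Note that the reasoning text is discarded at Step 3; only the extracted final answers participate in the vote, exactly as described above.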

2.2. Why Majority Voting Increases Accuracy

Let's understand it with mathematical intuition.

Assume there's a problem where a single execution has a 70% chance of producing the correct answer. What would be the accuracy if you ran this problem 5 times independently and took the majority vote?

For the majority vote to be wrong, 3 or more of the 5 runs must be wrong. Computing this probability:

P(majority wrong) = P(3+ errors)
= C(5,3)×(0.3)³×(0.7)² + C(5,4)×(0.3)⁴×(0.7)¹ + C(5,5)×(0.3)⁵
= 0.1323 + 0.0284 + 0.0024
≈ 0.163

In other words, a single-execution accuracy of 70% improves to approximately 84% through 5-run majority voting. Increasing to 9 runs pushes it past 90%.
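The computation generalizes to any single-run accuracy p and any odd run count n (odd so that no ties occur); a minimal sketch:

```python
from math import comb

def majority_accuracy(p, n):
    """Probability that the majority of n independent runs is correct,
    given single-run accuracy p. n should be odd so ties are impossible."""
    k_min = n // 2 + 1  # smallest number of correct runs that wins the vote
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(k_min, n + 1))

print(round(majority_accuracy(0.7, 5), 3))  # 0.837
print(round(majority_accuracy(0.7, 9), 3))  # 0.901
```

The same function also shows the failure mode discussed next: for p below 0.5, `majority_accuracy` *decreases* as n grows, because the vote converges on the wrong answer.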

This effect only works on tasks where single-run accuracy clearly exceeds 50%. If a problem is more likely to be answered wrong than right in a single execution, the majority vote will also converge on the wrong answer. This is why you must distinguish between tasks where Self-Consistency can be applied and tasks where it cannot.


3. When to Use and When to Abandon

As powerful as Self-Consistency is, applying it indiscriminately is inefficient. You must clearly distinguish between suitable and unsuitable tasks.

3.1. Effective Tasks

Self-Consistency is most effective on tasks where there is objectively one correct answer or verifiable criteria exist.

| Task Type | Examples | Effectiveness |
| --- | --- | --- |
| Math calculations and reasoning | Multi-step operations, probability problems, geometry | Very high |
| Logical reasoning | Syllogisms, conditional reasoning, puzzles | Very high |
| Fact-based classification | Sentiment classification, category classification, error detection | High |
| Code debugging | Identifying bug causes, determining fixes | High |
| Medical diagnostic assistance | Symptom-based possibility reasoning | Medium~High |

3.2. Tasks Where It's Ineffective or Counterproductive

| Task Type | Reason |
| --- | --- |
| Creative writing | No correct answer; diversity itself is valuable |
| Subjective evaluation | Taste and style are not subjects of majority voting |
| Open-ended questions | Questions like "What is the meaning of life?" |
| Simple fact lookup | A single execution suffices for obvious facts |
| Context-dependent responses | Cases where previous conversation context must be remembered |

Applying Self-Consistency to creative work results in the most "average" idea being selected. Among multiple results, the most unique and creative ones are eliminated in majority voting, while the most mundane is selected—a counterproductive effect.


4. Practical Implementation: Three Patterns

4.1. Pattern 1 — Simple Majority Vote

The most basic form. Execute the same prompt N times and choose the most frequent answer.

[Self-Consistency Prompt Template]

You are an expert at solving math/logic problems step by step.
When given a problem, you must follow this procedure:

1. Re-read the problem and organize the key conditions
2. Explicitly describe the solution process step by step (CoT)
3. Conclude with the final answer in the format "Answer: [answer]"

Problem: {{problem}}

Run this prompt 5–10 times, then aggregate the values appearing after "Answer: ___". The most frequently appearing value is the final answer.

Actual Execution Example:

Run 1: Answer: 4
Run 2: Answer: 4
Run 3: Answer: 2  ← Error path
Run 4: Answer: 4
Run 5: Answer: 4

→ Majority vote: 4 (4/5)  ✅
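A minimal aggregation script for this pattern might look like the following; the run outputs are abbreviated stand-ins mirroring the execution example above.

```python
import re
from collections import Counter

# Raw outputs of the five runs (abbreviated to their final lines).
run_outputs = [
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 2",
    "…solution steps… Answer: 4",
    "…solution steps… Answer: 4",
]

def extract_answer(text):
    """Pull the value after 'Answer:' from one run's output."""
    m = re.search(r"Answer:\s*(.+)", text)
    return m.group(1).strip() if m else None

votes = Counter(a for a in map(extract_answer, run_outputs) if a is not None)
answer, count = votes.most_common(1)[0]
print(f"Majority vote: {answer} ({count}/{len(run_outputs)})")  # Majority vote: 4 (4/5)
```

Runs whose output fails to match the "Answer:" format are simply dropped from the vote; if that happens often, tightening the output format (the topic of Part 9) is the fix.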

4.2. Pattern 2 — Confidence-Weighted Aggregation

A more sophisticated approach than simple majority voting. Have the model express its confidence in the answer for each run, and assign greater weight to answers with higher confidence levels.

[Confidence-Weighted Prompt]

Solve the following problem. Along with the solution process, 
you must conclude with the following format:

Answer: [value]
Confidence: [Low / Medium / High]
Reasoning: [One-line reason for choosing this confidence level]

Problem: {{problem}}

During aggregation, apply weight 3 for "High", 2 for "Medium", and 1 for "Low". Even if the same answer appears multiple times, a single answer with "High" confidence is more trustworthy than multiple answers with "Low" confidence.
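Assuming the answer and confidence label have already been parsed out of each run, the weighted tally might look like this (the weights follow the 3/2/1 scheme above; the parsed pairs are illustrative):

```python
from collections import defaultdict

# Weight 3 for High, 2 for Medium, 1 for Low, as described above.
WEIGHTS = {"High": 3, "Medium": 2, "Low": 1}

# (answer, confidence) pairs already parsed from each run; illustrative values.
parsed_runs = [
    ("4", "High"),
    ("4", "Medium"),
    ("2", "Low"),
    ("4", "High"),
    ("2", "Low"),
]

def weighted_vote(runs):
    scores = defaultdict(int)
    for answer, confidence in runs:
        scores[answer] += WEIGHTS.get(confidence, 1)  # unknown labels count as Low
    return max(scores, key=scores.get)

print(weighted_vote(parsed_runs))  # 4  (score 8 vs. 2)
```

Here "4" wins 8 to 2; note that even if the split were 3 votes to 2, two Low-confidence answers could be outweighed by a single High-confidence one.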

This pattern is especially effective when the model is good at recognizing its own uncertainty. State-of-the-art models (GPT-4o, Claude Opus, etc.) have superior metacognitive abilities, making the reliability of confidence descriptions higher.

4.3. Pattern 3 — Meta-Judge Pattern

The most powerful but most expensive approach. Collect all results from multiple executions and pass them to a separate prompt, allowing the AI to judge which answer is most trustworthy.

[Meta-Judge Prompt]

Below are 5 independently conducted solutions to the same problem.
Review each solution and select the most logically sound answer.

Problem: {{problem}}

Solution 1: {{solution_1}}
Solution 2: {{solution_2}}
Solution 3: {{solution_3}}
Solution 4: {{solution_4}}
Solution 5: {{solution_5}}

Evaluation criteria:
- Is the logical connection of each step consistent?
- Does it correctly reflect all conditions of the problem?
- Are there any calculation errors?

Final Judgment:
- Most trustworthy solution: [number]
- Reasoning: [brief explanation]
- Final answer: [value]

The meta-judge pattern, unlike simple majority voting, also provides an explanation for "why is this answer more trustworthy?" It goes beyond simply obtaining an answer and is especially useful when you want to evaluate the quality of the reasoning process itself.
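Assembling the meta-judge prompt from the collected run outputs is straightforward string templating; a sketch (the function name `build_meta_judge_prompt` is illustrative, and the template mirrors the one above):

```python
META_JUDGE_TEMPLATE = """\
Below are {n} independently conducted solutions to the same problem.
Review each solution and select the most logically sound answer.

Problem: {problem}

{solutions}

Evaluation criteria:
- Is the logical connection of each step consistent?
- Does it correctly reflect all conditions of the problem?
- Are there any calculation errors?
"""

def build_meta_judge_prompt(problem, solutions):
    """Join the collected run outputs into the meta-judge prompt."""
    body = "\n\n".join(
        f"Solution {i}: {s}" for i, s in enumerate(solutions, start=1)
    )
    return META_JUDGE_TEMPLATE.format(
        n=len(solutions), problem=problem, solutions=body
    )

prompt = build_meta_judge_prompt(
    "A store bought 12 apples...", ["solution text 1", "solution text 2"]
)
print("Solution 2: solution text 2" in prompt)  # True
```

Because the number of solutions is computed from the list, the same builder works whether you collected 3 samples or 20.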


5. Temperature and Number of Executions Settings

Two questions most commonly arise when applying Self-Consistency in practice: What temperature should I set, and how many times should I run it?

5.1. Temperature Settings

Setting temperature = 0 makes the model always produce the same answer. It approaches Greedy Decoding. In this case, running multiple times yields identical answers, so Self-Consistency becomes meaningless.

Conversely, setting temperature too high (1.5 or above) increases diversity across executions but also increases noise. The model operates too randomly and fails to maintain proper reasoning paths.

Empirically, the optimal temperature range for Self-Consistency is 0.5 ~ 0.9.

| Temperature | Characteristics | Self-Consistency Suitability |
| --- | --- | --- |
| 0.0 | Deterministic, no diversity | ❌ Meaningless |
| 0.1 ~ 0.4 | Stable but low diversity | 🔺 Limited effect |
| 0.5 ~ 0.7 | Balance of diversity and stability | ✅ Recommended (fact-based tasks) |
| 0.7 ~ 0.9 | High diversity | ✅ Recommended (logical reasoning tasks) |
| 1.0+ | High creativity, increased noise | ❌ Not recommended |

5.2. Number of Executions Settings

More executions lead to higher accuracy, but cost and time increase linearly.

The original Self-Consistency paper experimented with 10–40 samples depending on the task. However, in practical applications, 5–10 runs is the optimal balance between cost and effectiveness. For less difficult tasks, 5 runs alone can achieve sufficient convergence.

Determine the number of executions using these criteria.

  • 3 runs: Exploratory judgment. When grasping rough trends.
  • 5 runs: Sufficient confidence for general reasoning tasks.
  • 10 runs: When high confidence is required for important decisions.
  • 20+ runs: Tasks where error costs are extremely high, such as medical or legal contexts.
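The binomial math from Section 2.2 can also run in reverse: given a single-run accuracy and a target confidence, find the smallest odd run count that reaches it. A sketch (the function names are illustrative):

```python
from math import comb

def majority_accuracy(p, n):
    """P(majority of n independent runs is correct) for single-run accuracy p.
    n should be odd so ties are impossible."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def runs_needed(p, target, max_runs=41):
    """Smallest odd run count whose majority-vote accuracy reaches target.

    Returns None when p <= 0.5 (voting converges on the wrong answer)
    or when the target is unreachable within max_runs.
    """
    if p <= 0.5:
        return None
    for n in range(1, max_runs + 1, 2):
        if majority_accuracy(p, n) >= target:
            return n
    return None

print(runs_needed(0.7, 0.90))  # 9
print(runs_needed(0.7, 0.95))
```

This makes the cost trade-off explicit: for a task with 70% single-run accuracy, 9 runs are enough for 90% confidence, while higher targets grow the run count quickly.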

Conclusion: Design the Execution Method, Not the Prompt

Self-Consistency differs in character from the techniques covered so far. It doesn't change the content of the prompt but rather how the prompt is executed. Instead of asking the same question just once, you explore through multiple paths and select the answer that these paths commonly indicate.

This matters because the source of credibility changes. When an AI says "the answer is 4" in a single execution, that confidence is internal certainty encoded in the model's weights. There's no way to verify it directly. In contrast, the fact that 9 out of 10 Self-Consistency runs converge on "4" is externally observable credibility. It can be expressed numerically, reproduced, and compared.

From this perspective, Self-Consistency transcends being a simple accuracy-improvement technique and becomes a methodology for quantifying the credibility of AI output.

Connection with Part 7: Automating with Templates

Using the variables and templates learned in Part 7, you can structure a Self-Consistency pipeline. By making the sampling prompt, aggregation logic, and meta-judge prompt each independent modules and combining them, you can complete a reusable Self-Consistency system.

Toward Part 9: Structured Output

There is one practical challenge in the aggregation stage of Self-Consistency: how to parse the final answer from each execution. If the model provides answers in free-form sentences, it becomes difficult to write aggregation scripts and errors easily occur.

In the following [Part 9: Structured Output – Reliably Getting JSON 99% of the Time], we cover techniques for fixing AI output format to JSON or predefined schemas. This is essential technology not only for Self-Consistency but for all situations where AI output must be processed by code.
