Dechive
Dev · #dechive #llm #prompt #prompt-engineering #few-shot #chain-of-thought #cot #reasoning

Few-shot Learning and Chain of Thought – Improving AI Accuracy with Examples

Few-shot and CoT, When to Use and When to Abandon Them in the Latest Models

Introduction: Verified Classics, But Blind Faith is Forbidden

In Part 4, we designed the System Prompt layer and covered frameworks that constrain AI's freedom to execute only the designer's intentions.

This edition changes direction and dissects two classic prompt engineering techniques:

  • Few-shot Prompting: Teaching desired format by showing examples
  • Chain of Thought (CoT): Improving reasoning quality by asking to "think step by step"

Both are techniques that exploded in research and validation between 2020-2023. There are countless papers, blogs, and YouTube videos about them.

But should we use them the same way in 2025?

Let's analyze this coldly.


1. Few-shot Prompting: The Art of Training Models with Examples

Principle

Few-shot is a method of directly including examples in the prompt in the form of "this input gets this output."

Input: "Apple"
Output: "Fruit"

Input: "BMW"
Output: "Car"

Input: "Python"
Output:

The model sees this pattern and infers "Python → Programming Language." It outputs in a much more accurate format than zero-shot (asking without examples).
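The pattern above can be assembled mechanically. Below is a minimal sketch of a helper that renders example pairs plus the new input into a single prompt string; the `build_few_shot_prompt` name and formatting are illustrative, not part of any SDK.

```python
def build_few_shot_prompt(examples, query):
    """Render (input, output) example pairs followed by the new input."""
    lines = []
    for inp, out in examples:
        lines.append(f'Input: "{inp}"')
        lines.append(f'Output: "{out}"')
        lines.append("")  # blank line separates examples
    lines.append(f'Input: "{query}"')
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)

examples = [("Apple", "Fruit"), ("BMW", "Car")]
prompt = build_few_shot_prompt(examples, "Python")
print(prompt)
```

The prompt ends with a dangling `Output:`, which is what nudges the model to continue the pattern rather than chat about it.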

When It's Effective

Few-shot shines in a few clearly defined situations.

1. When Output Format is Non-standard

When you require an output format that models have rarely seen in training data. For example, proprietary in-house JSON schemas, unique classification systems, or company-specific terminology notations.

Example:
Input: "2024-01-15 Meeting Minutes"
Output: {"doc_type": "MOM", "date": "20240115", "priority": "normal"}

Input: "Urgent: Server Outage Report"
Output: {"doc_type": "IR", "date": "today's_date", "priority": "critical"}

Input: "Q3 Sales Analysis Report"
Output:

This format is difficult for models to get right through pre-training alone. 2-3 examples become a perfect guide.
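In practice, few-shot examples like these are often sent as alternating user/assistant chat messages, and the model's reply is validated against the schema. A hedged sketch, assuming the common chat-message role convention; the `few_shot_messages` and `validate_reply` helpers and the required keys are illustrative, mirroring the example above.

```python
import json

REQUIRED_KEYS = {"doc_type", "date", "priority"}

def few_shot_messages(query):
    """Few-shot pairs as chat turns, with the new input as the last user turn."""
    return [
        {"role": "user", "content": "2024-01-15 Meeting Minutes"},
        {"role": "assistant", "content": json.dumps(
            {"doc_type": "MOM", "date": "20240115", "priority": "normal"})},
        {"role": "user", "content": "Urgent: Server Outage Report"},
        {"role": "assistant", "content": json.dumps(
            {"doc_type": "IR", "date": "today's_date", "priority": "critical"})},
        {"role": "user", "content": query},
    ]

def validate_reply(reply_text):
    """Parse the model's reply and verify it matches the in-house schema."""
    doc = json.loads(reply_text)
    missing = REQUIRED_KEYS - doc.keys()
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return doc
```

Validating the reply catches the failure mode few-shot cannot fully eliminate: the model drifting from the schema on an unusual input.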

2. When Fixing Tone and Style

In writing tasks, few-shot is far more powerful than text instructions ("write politely"). The examples themselves define the precise tone standard.

Example:
Input: Refund Rejection Situation
Output: "We appreciate your valuable feedback. Upon review, we find it does not meet our refund policy criteria, so we cannot proceed. We sincerely apologize for the inconvenience."

Input: Shipping Delay Situation
Output:

3. Classification/Labeling Tasks

In well-defined labeling tasks like sentiment analysis, category classification, and intent recognition, few-shot demonstrates stable performance.

Caution: Example Quality Is Everything

The biggest trap with few-shot is that bad examples produce bad outputs.

  • If inconsistent formats are mixed within examples, the model becomes confused
  • Too many examples waste context window and become noise instead
  • 2-5 examples are optimal. More than 10 is counterproductive

2. Chain of Thought (CoT): The Real Meaning of "Think Step by Step"

Background: Where CoT Came From

CoT was first systematized in Google Brain's 2022 paper by Wei et al., "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." The core finding was simple: instead of having models output only the final answer, accuracy dramatically increases when they also output intermediate reasoning steps.

The same year, Kojima et al. took it a step further and discovered something even more remarkable: the effect works even without examples—just by adding a single sentence at the end of the prompt.

"Let's think step by step."

This single sentence increased GPT-3's accuracy on the MultiArith math benchmark from 17.7% to 78.7%.

Why It Works: Connected to How LLMs Operate

To understand this, you need to know how LLMs generate text.

LLMs generate text token by token in sequence. When predicting the next token, all previously generated tokens are used as context. In other words, what the model "outputs" itself becomes input for the next reasoning step.

Without CoT, the model must generate an answer immediately upon seeing a problem. It jumps straight to the final value without intermediate calculations. In complex problems, this process fails.

With CoT, you force the model to actually generate intermediate steps as tokens. Each intermediate step remains in the context and supports the reasoning of the next step. The model gets "thinking space."
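The mechanism can be illustrated with a toy decoding loop. The `TRANSITIONS` table below is a crude stand-in for a model that can only predict the next token from the most recent one; the point is that each generated token is appended to the context, so the final "Answer" is only reachable if the intermediate steps were actually emitted first.

```python
# Stand-in for a model: next token keyed by the most recent token in context.
TRANSITIONS = {
    "Q": "Step1",
    "Step1": "Step2",
    "Step2": "Answer",
}

def generate(prompt, max_tokens=10):
    """Autoregressive loop: the model's own output feeds the next prediction."""
    context = list(prompt)
    for _ in range(max_tokens):
        nxt = TRANSITIONS.get(context[-1])
        if nxt is None:
            break
        context.append(nxt)  # intermediate step stays in context
    return context

print(generate(["Q"]))  # ['Q', 'Step1', 'Step2', 'Answer']
```

A real LLM conditions on the entire context, not just the last token, but the structure is the same: skipping "Step1" and "Step2" removes exactly the context the final answer depends on.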

Principle: Direct Comparison

Simple problems don't show the difference. You need something more complex.

# Zero-shot (without CoT)
Q: At a store, apples cost 1,200 won each and pears cost 2,500 won each.
   If you buy 3 apples and 2 pears, you get a 10% discount on the total.
   What's the final payment amount?
A: 8,730 won  ← Wrong (mental math error)

# CoT Applied
Q: Same problem. Calculate step by step.
A:
Step 1: 3 apples = 1,200 × 3 = 3,600 won
Step 2: 2 pears = 2,500 × 2 = 5,000 won
Step 3: Total = 3,600 + 5,000 = 8,600 won
Step 4: 10% discount = 8,600 × 0.1 = 860 won
Step 5: Final amount = 8,600 - 860 = 7,740 won
Final answer: 7,740 won  ← Correct
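The CoT trace above can be checked directly in code; each intermediate value is exactly what a model emitting these steps would have available in context for the next step.

```python
apples = 1200 * 3          # Step 1: 3,600 won
pears = 2500 * 2           # Step 2: 5,000 won
total = apples + pears     # Step 3: 8,600 won
discount = total * 0.10    # Step 4: 860 won
final = total - discount   # Step 5: final amount
print(int(final))          # 7740
```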

It's not just about numerical calculation. The same pattern appears in complex reasoning tasks like legal interpretation, code debugging, and multi-stage decision-making. Without CoT, the model skips intermediate logic and rushes to a plausible conclusion, getting it wrong.

2025 Perspective: Is CoT Still Necessary?

Here we need a cold analysis.

Bottom line: It depends on the model.

General models (as of August 2025: GPT-4o, Claude 3.7 Sonnet, Gemini 2.5 Pro) are trained to internally perform CoT to some degree. However, reasoning-specialized models are on a different level.

Reasoning-Specialized Models (August 2025 baseline):

  • OpenAI o3 — Uses thousands of internal reasoning tokens to derive answers
  • Anthropic Claude 3.7 Sonnet (Extended Thinking) — Activates internal reasoning process in extended thinking mode
  • Google Gemini 2.5 Pro — Supports Thinking mode

These models automatically perform thousands of tokens of reasoning internally even without the user requesting CoT. It's just not visible externally.

When you add "think step by step" to these models:

  • You're asking them to do what they're already doing
  • Reasoning tokens double up, increasing costs
  • Output becomes excessively long, degrading UX

Does that mean CoT is no longer necessary?

No. Distinction is needed.

| Situation | CoT Effectiveness |
| --- | --- |
| GPT-3.5 Turbo, Claude 2-era legacy models | Highly effective (stark performance difference) |
| GPT-4o, Claude 3.7 Sonnet general tasks | Moderate (helps depending on complexity) |
| o3, Claude 3.7 Extended Thinking, Gemini 2.5 Pro Thinking | Unnecessary (automatically performed internally) |
| Complex math/logic/multi-step reasoning | Effective regardless of model |
| Simple classification/format conversion | Can be counterproductive (unnecessary verbosity) |
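The table reduces to a small routing decision. Below is a hedged sketch of that logic; the model-name sets and the `complex_reasoning` flag are illustrative placeholders you would tune for your own stack.

```python
# Illustrative model buckets -- adjust to the models you actually deploy.
REASONING_MODELS = {"o3", "claude-3.7-extended-thinking", "gemini-2.5-pro-thinking"}
LEGACY_MODELS = {"gpt-3.5-turbo", "claude-2"}

def should_add_cot(model: str, complex_reasoning: bool) -> bool:
    if model in REASONING_MODELS:
        return False          # already reasons internally; explicit CoT doubles cost
    if model in LEGACY_MODELS:
        return True           # stark performance difference on these models
    return complex_reasoning  # modern general models: only when the task warrants it

def apply_cot(prompt: str, model: str, complex_reasoning: bool) -> str:
    if should_add_cot(model, complex_reasoning):
        return prompt + "\n\nLet's think step by step."
    return prompt
```

Centralizing this in one function means that when a model is upgraded, the prompting behavior changes in one place instead of across every call site.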

3. In Practice: When to Use and When to Abandon

When to Use Few-shot

✅ Fix non-standard output format
✅ Precisely replicate tone/style
✅ Define classification labels
✅ Unify domain-specific terminology notation

When to Abandon Few-shot

❌ General writing tasks (instructions are sufficient)
❌ When context window is already full
❌ When the cost of creating examples is high
❌ When latest models already excel at zero-shot

When to Use CoT

✅ Math, logic, multi-step reasoning problems
✅ Using legacy models (GPT-3.5 level)
✅ When you want to debug/verify the model's reasoning process
✅ When output must include explanation/rationale

When to Abandon CoT

❌ Using o1, o3, Extended Thinking models
❌ Simple tasks (format conversion, translation, summarization)
❌ When fast response is critical in real-time systems
❌ When output length must be minimized

4. Practical Pattern: Combining Both Techniques

Few-shot and CoT can be used together. This is called Few-shot CoT.

Example:
Q: Product A's monthly subscription is 9,900 won, with 10% discount when converting to annual. What's the annual cost?
A: 
Step 1: Monthly subscription = 9,900 won
Step 2: Annual total = 9,900 × 12 = 118,800 won
Step 3: Apply 10% discount = 118,800 × 0.9 = 106,920 won
Final: 106,920 won

Q: Product B's monthly subscription is 29,000 won, with 15% discount when converting to annual. What's the annual cost?
A:

The model learns the pattern "reason this way" from the example and applies the same reasoning structure to new problems. This is especially powerful with pre-2023 models.
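A Few-shot CoT prompt of this shape can be assembled the same way as a plain few-shot prompt, with the worked example's reasoning steps inlined verbatim before the new question. The `few_shot_cot` helper below is an illustrative sketch, not a library function.

```python
def few_shot_cot(example_q, example_steps, example_final, new_q):
    """One worked example (question, numbered steps, final line), then the new question."""
    parts = [f"Q: {example_q}", "A:"]
    parts += [f"Step {i}: {step}" for i, step in enumerate(example_steps, 1)]
    parts.append(f"Final: {example_final}")
    parts += ["", f"Q: {new_q}", "A:"]  # trailing "A:" left open for the model
    return "\n".join(parts)

prompt = few_shot_cot(
    "Product A's monthly subscription is 9,900 won, with 10% discount "
    "when converting to annual. What's the annual cost?",
    ["Monthly subscription = 9,900 won",
     "Annual total = 9,900 x 12 = 118,800 won",
     "Apply 10% discount = 118,800 x 0.9 = 106,920 won"],
    "106,920 won",
    "Product B's monthly subscription is 29,000 won, with 15% discount "
    "when converting to annual. What's the annual cost?",
)
```

Because the example demonstrates the reasoning structure rather than just the answer format, the model tends to reproduce the same step-by-step layout for Product B.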


5. Realistic Decision Criteria in 2025

Always ask three things when choosing a prompting technique.

1. Which model are you using?

Latest reasoning models work well without techniques. Legacy models' performance is determined by techniques.

2. How complex is the task?

Attaching CoT to simple tasks wastes tokens. Using only few-shot on complex reasoning is insufficient.

3. Are costs and speed acceptable?

Few-shot increases context costs. CoT increases response time. In production environments, you must calculate both.
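The few-shot side of that cost is easy to estimate on the back of an envelope, because the example text is resent on every call. A rough sketch; the 4-characters-per-token heuristic and the per-token price are placeholder assumptions, not real tokenizer output or a real provider's rate card.

```python
CHARS_PER_TOKEN = 4          # crude heuristic; a real tokenizer will differ
PRICE_PER_1K_INPUT = 0.0025  # illustrative rate -- check your provider's pricing

def extra_monthly_cost(example_text: str, calls_per_month: int) -> float:
    """Added input-token cost from resending few-shot examples on every call."""
    tokens = len(example_text) / CHARS_PER_TOKEN
    return tokens / 1000 * PRICE_PER_1K_INPUT * calls_per_month

# e.g. ~800 characters of examples, 100k calls per month
cost = extra_monthly_cost("x" * 800, 100_000)
print(round(cost, 2))  # 50.0
```

Even at these modest assumptions the examples add a recurring cost, which is why "when context window is already full" and "when latest models already excel at zero-shot" appear on the abandon list above.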


Conclusion: Techniques Are Tools, Context Makes the Judgment

Few-shot and CoT remain effective. However, applying them unconditionally is 2020s thinking.

A 2025 prompt engineer understands how models work internally and selectively uses techniques suited to that model. Not attaching CoT to every situation, but distinguishing the moments when CoT is actually needed. That's expertise.

In Part 6, we'll cover the most frustrating problem with latest models: blocking hallucination. How to prevent models from fabricating non-existent facts at the prompt level.
