Dechive

What is Context Engineering? Context Window Design Strategy

The era of writing good prompts has ended. Now is the era of designing context windows.

Introduction: Writing Prompts Well Is Not Enough

In Part 9, we learned that to process AI output as code, we must design for trust. We implemented techniques like enforcing JSON schemas and building recovery pipelines for parsing failures.

But at this point, we must ask a fundamental question.

"Is writing good prompts really the whole problem?"

No. A prompt is only a part of the total information an AI processes. What actually determines AI performance is not "what you said" but rather "what information lies before the model at the moment of inference".

This is Context Engineering.

Andrej Karpathy (former Tesla AI Director, OpenAI founding member) said this in 2025:

"The hottest new programming language is English. And the real skill is not prompt engineering — it's context engineering."

From the skill of writing prompts to the skill of designing the entire context window. This shift is the core of the harness era, and Part 10 walks through the transition in full.


1. Context Window: The AI's Work Desk

1.1. AI Does Not Remember the Past

The most important fact for understanding LLMs:

AI does not remember the past. It only sees the information given at each inference.

The way AI works is like receiving a new desk each time. It can only see the documents placed on the desk, and has no idea what was on the desk yesterday. Yesterday's conversation, yesterday's search results — when the desk is reset, it all disappears.

The size of that desk is the context window.

Things on the Desk
📋 Instructions (System Prompt)
📄 Reference Documents (Retrieved Documents)
📜 Conversation History (Conversation History)
🔧 Tool Execution Results (Tool Outputs)
💬 Current Question (Current User Turn)

It reasons based only on what's inside.

1.2. Bigger Desks Aren't Always Better

As of 2026, context limits for major models:

| Model | Max Context |
| --- | --- |
| Claude Sonnet 4.6 | 200K tokens (≈ one novel) |
| GPT-4o | 128K tokens |
| Gemini 2.5 Pro | 1M tokens |

Larger limits allow more information, but simultaneously require more careful design of what goes where. With a small desk, naturally only important things get placed. The wider the desk, the more design becomes necessary. This is why context engineering emerged.


2. Prompt ≠ Context

2.1. Clear Distinction Between Two Concepts

These terms are often used interchangeably, but they're completely different.

| | Prompt | Context |
| --- | --- | --- |
| Scope | User-input instructions | All information the model sees |
| Relationship | Part of context | The larger concept containing the prompt |
| Example | "Summarize this text" | Instructions + text content + previous conversation + retrieved documents + ... |

2.2. Why the Distinction Matters

Everything taught in Parts 1-9 was about "writing good prompts." But no matter how well you write a prompt, if wrong documents are piled on the desk or important materials are missing, the AI cannot work properly.

Context engineering is the technique of designing all the documents on the desk. A prompt is just one of those documents.


3. Four Layers of Context

3.1. Layer Structure Overview

Classifying documents on the desk by role yields four layers.

| Layer | Role | Characteristics |
| --- | --- | --- |
| Layer 1: System Prompt | Defines model role, rules, output format | Fixed across all conversations |
| Layer 2: Retrieved Context | Retrieved external documents, API responses, DB query results | Injected in real time; takes the most tokens |
| Layer 3: Conversation History | Previous conversation content | Must not accumulate infinitely |
| Layer 4: Current User Turn | The user's current input message | Located at the bottom of context |

The model reasons while seeing all four simultaneously.
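The four layers map naturally onto a chat-style message list. A minimal sketch, assuming a generic chat API shape (the function name `assemble_context` is illustrative, not from any particular SDK):

```python
def assemble_context(system_prompt: str,
                     retrieved_docs: list[str],
                     history: list[dict],
                     user_turn: str) -> list[dict]:
    """Assemble the four layers into a chat-style message list.

    Layer order mirrors the table above: system prompt first,
    retrieved documents next, then conversation history, and
    the current user turn at the very bottom.
    """
    knowledge = "\n\n".join(retrieved_docs)
    messages = [
        {"role": "system", "content": system_prompt},                         # Layer 1
        {"role": "system", "content": f"Reference documents:\n{knowledge}"},  # Layer 2
    ]
    messages.extend(history)                                                  # Layer 3
    messages.append({"role": "user", "content": user_turn})                   # Layer 4
    return messages
```
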

3.2. Layer 1: System Prompt — The Unchanging Constitution

The fundamental instructions applied identically across all conversations. The model's identity, behavioral rules, prohibitions, and output format go here.

[System Prompt Example]

You are the knowledge librarian of Dechive.

## Absolute Rules
- Do not answer information not in context
- Say "verification needed" when uncertain
- Answer only in Korean

## Output Format
- Core answer: within 3 sentences
- Explicitly cite source documents like [Doc 1]

Most common mistake: hardcoding dynamic information in the system prompt.

# ❌ Wrong approach
System Prompt: "Today is April 18, 2026."
→ Becomes false information tomorrow

# ✅ Correct approach
System Prompt contains only immutable rules.
Dynamic information like dates is injected in the Retrieved Context layer.
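The correct approach can be sketched as follows: the system prompt stays immutable, while volatile facts such as today's date are built fresh per request and injected alongside retrieved context. The helper name is hypothetical:

```python
from datetime import date

# Immutable rules only — never hardcode dates or other volatile facts here.
STATIC_SYSTEM_PROMPT = "You are the knowledge librarian of Dechive."

def build_dynamic_context() -> str:
    """Build the per-request facts block injected in the
    Retrieved Context layer, so it is always current."""
    return f"[Runtime facts]\nToday's date: {date.today().isoformat()}"
```
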

3.3. Layer 2: Retrieved Context — Real-time Injected Knowledge

The layer where external knowledge the model needs is injected. It typically consumes the most tokens; Part 11 on RAG covers how to fill this layer in detail.

<doc id="1" relevance="0.95">
Context engineering involves designing what the model sees when inferring...
</doc>
<doc id="2" relevance="0.82">
The Lost in the Middle phenomenon was discovered in a 2023 Stanford study...
</doc>

When documents have relevance scores, the model prioritizes higher-confidence documents.
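One way to produce such `<doc>` blocks is to sort retrieved documents by relevance before rendering them, so higher-confidence documents come first. A minimal sketch:

```python
def format_docs(docs: list[tuple[str, float]]) -> str:
    """Render (text, relevance) pairs as <doc> blocks,
    sorted so higher-confidence documents come first."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    return "\n".join(
        f'<doc id="{i}" relevance="{score:.2f}">{text}</doc>'
        for i, (text, score) in enumerate(ranked, start=1)
    )
```
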

3.4. Layer 3: Conversation History — Memory That Needs Management

As conversation accumulates, this layer encroaches on context. If left unmanaged, two problems arise.

  1. Context limit exceeded → response cuts off mid-stream
  2. Old chatter crowds out important recent information

Management strategy is essential. Part 5 covers this in detail.

3.5. Layer 4: Current User Turn — This Question Now

The message the user just typed. Located at the very end of context, the model generates responses based on it by referring to the three layers above.


4. Lost in the Middle: The Trap of Long Contexts

4.1. Discovery of the Phenomenon

Discovered by Stanford researchers in 2023: when given long contexts, LLMs make good use of information at the beginning and end, but tend to ignore information in the middle.

Information utilization by model based on position in context:

Beginning   ████████████████ High   ✅
Middle      ████░░░░░░░░░░  Low    ⚠️
End         ████████████████ High   ✅

In RAG systems, where multiple retrieved documents are inserted into context, placing the most important document in the middle can cause the model to ignore it.


4.2. Response Strategies

Strategy 1 — Place Important Information at Beginning or End

[System Prompt]
[★ Most important document → beginning]
[Supporting documents → middle]
[★ Second most important document → end]
[User Message]
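Strategy 1 reduces to a small reordering step. Assuming documents arrive sorted by importance, a sketch:

```python
def order_for_edges(docs_by_importance: list[str]) -> list[str]:
    """Place the most important doc at the beginning, the second
    most important at the end, and the rest in the middle,
    where attention is weakest."""
    if len(docs_by_importance) < 3:
        return docs_by_importance  # too few docs to reorder
    first, second, *rest = docs_by_importance
    return [first, *rest, second]
```
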

Strategy 2 — Explicitly Tell Model What to Look At

# Vague instruction (avoid this)
"Answer using the documents below."

# Clear instruction
"Use item 3 from [Doc 1] and the conclusion from [Doc 3] as core evidence."

5. Context Pollution: Noise Lowers Performance

5.1. The Misconception That More Information Is Better

Injecting irrelevant information into context actually degrades performance. This is called Context Pollution.

Measured results show:

  • Adding 5 irrelevant documents → 15-20% accuracy drop
  • Long, verbose system prompt < short, accurate system prompt

5.2. Why This Happens

The model distributes attention across all information in context. When there's much irrelevant information, attention that should focus on truly important information becomes scattered.

Every token entering context must have a reason.
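A simple defense against context pollution is a relevance cutoff before anything enters the window. The threshold value here is a hypothetical starting point, to be tuned per corpus:

```python
RELEVANCE_THRESHOLD = 0.70  # hypothetical cutoff; tune per corpus

def drop_noise(docs: list[tuple[str, float]],
               threshold: float = RELEVANCE_THRESHOLD) -> list[tuple[str, float]]:
    """Keep only (text, relevance) pairs that clear the threshold:
    every token entering context must have a reason."""
    return [d for d in docs if d[1] >= threshold]
```
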


6. Token Budget: Allocating Limited Desk Space

6.1. Viewing as a Budget Concept

Think of the context window as a limited desk space. Example allocation based on 200K tokens:

Total Budget: 200,000 tokens
├── System Prompt:          10,000 tokens  (5%)
├── Retrieved Documents:    80,000 tokens  (40%)
├── Conversation History:   50,000 tokens  (25%)
├── Tool Outputs:           20,000 tokens  (10%)
├── Current User Turn:       5,000 tokens  (2.5%)
└── Output Buffer:           35,000 tokens  (17.5%)

Must always reserve output buffer. If input + output exceeds context limit, the model cuts off mid-response.
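One way to enforce the output buffer is a pre-flight check before sending the request. The ~4 characters per token estimate is a rough heuristic for English text; a real system would use the model's own tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token for English).
    A real system would use the model's tokenizer instead."""
    return max(1, len(text) // 4)

def reserve_output_buffer(sections: dict[str, str],
                          limit: int = 200_000,
                          output_buffer: int = 35_000) -> bool:
    """Return True if all input sections fit while leaving
    the output buffer untouched."""
    used = sum(estimate_tokens(t) for t in sections.values())
    return used + output_buffer <= limit
```
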


6.2. Dynamic Allocation by Query Type

def build_context_budget(query_type: str, total: int) -> dict[str, int]:
    """Allocate the token budget per layer by query type.
    Fractions intentionally sum to less than 1.0, leaving
    slack for tool outputs and a safety margin."""
    if query_type == "simple_qa":
        # Simple question → fewer retrieved docs, larger output buffer
        return {
            "system":    int(total * 0.05),
            "retrieved": int(total * 0.25),
            "history":   int(total * 0.10),
            "output":    int(total * 0.30),
        }
    elif query_type == "deep_analysis":
        # Deep analysis → maximize retrieved docs
        return {
            "system":    int(total * 0.05),
            "retrieved": int(total * 0.55),
            "history":   int(total * 0.15),
            "output":    int(total * 0.20),
        }
    else:
        # Unknown type → fail loudly instead of silently returning None
        raise ValueError(f"unknown query_type: {query_type}")

7. Conversation History Management: Three Strategies

7.1. Sliding Window — Keep Only Recent N Turns

MAX_TURNS = 10  # Keep only recent 10 turns

def trim_history(messages: list) -> list:
    return messages[-(MAX_TURNS * 2):]  # user + assistant pairs

Simplest and easiest to implement. Drawback: important information from early conversation can be cut off.

7.2. Summarization Compression — Compress Old Conversation into One Line

Summarize old conversations with LLM to save tokens.

[Conversation Summary]
User is a Python backend developer building a FastAPI project.
Had JWT authentication error, resolved in middleware layer.

[Recent 3 Turns]
User: I want to see DB connection pooling settings this time.
Assistant: For pooling configuration in SQLAlchemy...

[Current Question]
User: What happens if I switch to async mode?

Minimal information loss and good token savings. Most commonly used in real services.
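The compression pattern above can be sketched as a function that replaces old turns with a single summary message. The `summarize` callable stands in for an actual LLM call; the trigger and keep-recent values are illustrative:

```python
SUMMARY_TRIGGER = 10  # summarize once history exceeds this many messages

def compress_history(messages: list[dict],
                     summarize,          # callable: list[dict] -> str, e.g. an LLM call
                     keep_recent: int = 6) -> list[dict]:
    """Replace old turns with one summary message, keeping
    the most recent turns verbatim."""
    if len(messages) <= SUMMARY_TRIGGER:
        return messages  # short history: nothing to compress
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary = {"role": "system",
               "content": f"[Conversation Summary]\n{summarize(old)}"}
    return [summary, *recent]
```
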

7.3. Relevance-Based Retrieval — Inject Only What's Relevant Now

Store all conversations, but inject into context only the turns with high relevance to the current question. This uses vector embeddings for similarity search, covered in detail in Part 11 on RAG.


8. Practical Patterns

8.1. Role-Tagged Structure

Use tags so the model clearly recognizes each section's role in long context.

<instructions>
  You are the knowledge librarian of Dechive.
  You do not answer information not in context.
</instructions>

<knowledge>
  <doc id="1" relevance="0.95">Context engineering involves...</doc>
  <doc id="2" relevance="0.82">The Lost in the Middle phenomenon...</doc>
</knowledge>

<history>
  User: What's a context window?
  Assistant: A context window is...
</history>

<task>
  Current user question
</task>
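A helper that assembles this tagged structure might look like the following sketch; the tag names match the example above, while the function itself is illustrative:

```python
def build_tagged_context(instructions: str, docs: list[str],
                         history: list[tuple[str, str]], task: str) -> str:
    """Wrap each section in a role tag so the model can tell
    instructions, knowledge, history, and the current task apart."""
    knowledge = "\n".join(docs)
    turns = "\n".join(f"{role}: {text}" for role, text in history)
    return (f"<instructions>\n{instructions}\n</instructions>\n\n"
            f"<knowledge>\n{knowledge}\n</knowledge>\n\n"
            f"<history>\n{turns}\n</history>\n\n"
            f"<task>\n{task}\n</task>")
```
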

8.2. Dynamic System Prompt

Parts of the system prompt that vary by user are filled at runtime.

from dataclasses import dataclass

@dataclass
class User:
    name: str
    preferred_lang: str
    plan: str  # "free" or "premium"

def build_system_prompt(user: User) -> str:
    return f"""You are the knowledge librarian of Dechive.

Current User: {user.name}
Preferred Language: {user.preferred_lang}

## Rules
{"- Can access premium content" if user.plan == "premium" else "- Only free content guidance"}
"""

Conclusion: AI Works Properly When Context Is Designed

No matter how well you write a prompt, if wrong documents are piled on the desk, the AI cannot work properly. Conversely, even with a simple prompt, if the desk is perfectly set up, AI produces better-than-expected results.

The perspective shift in context engineering is this:

"What will I say" → "What will the model see when inferring"

Core Principles

| Principle | Content |
| --- | --- |
| Distinguish Layers | System / Retrieved / History / Current — each has a different role |
| Important at Front and Back | To avoid Lost in the Middle, don't place key information in the middle |
| Remove Noise | Irrelevant information scatters attention and lowers performance |
| Design Budget | Always leave an output buffer |
| Manage Memory | Don't let conversation history accumulate infinitely |

Toward Part 11: RAG and Prompts

In Part 10, we covered context structure and design principles. Among those layers, one question remains open: how to fill the Retrieved Context layer with external knowledge.

In [Part 11: RAG and Prompts – Injecting External Knowledge in Real-Time], we cover the techniques for filling this layer: which documents to retrieve, how to refine them, and in what order to arrange them so the model uses them best. This is where context engineering and RAG meet.
