What is Context Engineering? Context Window Design Strategy
The era of writing good prompts has ended. Now is the era of designing context windows.
Introduction: Writing Prompts Well Is Not Enough
In Part 9, we learned that to process AI output as code, we must design for trust. We implemented techniques like enforcing JSON schemas and building recovery pipelines for parsing failures.
But at this point, we must ask a fundamental question.
"Is writing good prompts really the whole problem?"
No. A prompt is only a part of the total information an AI processes. What actually determines AI performance is not "what you said" but rather "what information lies before the model at the moment of inference".
This is Context Engineering.
Andrej Karpathy (former Tesla AI Director and a founding member of OpenAI) put it this way in 2025:
"The hottest new programming language is English. And the real skill is not prompt engineering — it's context engineering."
From the skill of writing prompts to the skill of designing the entire context window: this shift is at the core of the harness era, and Part 10 is devoted to understanding it.
1. Context Window: The AI's Work Desk
1.1. AI Does Not Remember the Past
The most important fact for understanding LLMs:
AI does not remember the past. It only sees the information given at each inference.
The way AI works is like receiving a new desk each time. It can only see the documents placed on the desk, and has no idea what was on the desk yesterday. Yesterday's conversation, yesterday's search results — when the desk is reset, it all disappears.
The size of that desk is the context window.
| Things on the Desk |
|---|
| 📋 Instructions (System Prompt) |
| 📄 Reference Documents (Retrieved Documents) |
| 📜 Conversation History (Conversation History) |
| 🔧 Tool Execution Results (Tool Outputs) |
| 💬 Current Question (Current User Turn) |
It reasons based only on what's inside.
1.2. Bigger Desks Aren't Always Better
As of 2026, context limits for major models:
| Model | Max Context |
|---|---|
| Claude Sonnet 4.6 | 200K tokens (≈ one novel) |
| GPT-4o | 128K tokens |
| Gemini 2.5 Pro | 1M tokens |
Larger limits allow more information, but simultaneously require more careful design of what goes where. With a small desk, naturally only important things get placed. The wider the desk, the more design becomes necessary. This is why context engineering emerged.
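Budgeting starts with counting. A quick way to reason about how much fits on the desk is a rough characters-per-token heuristic; this is a sketch only, and a real service should use the provider's actual tokenizer (for example, OpenAI's tiktoken library) rather than this approximation. The 200K limit and 35K output buffer below are illustrative numbers taken from the examples later in this article.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    This is a budgeting heuristic only; use the provider's real
    tokenizer for anything that must be exact.
    """
    return max(1, len(text) // 4)


def fits_in_window(text: str, limit: int = 200_000,
                   output_buffer: int = 35_000) -> bool:
    """Check whether the input still leaves room for the model's output."""
    return estimate_tokens(text) <= limit - output_buffer
```

Even a crude estimate like this is enough to catch the common failure mode of stuffing the window full and leaving no room for the answer.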
2. Prompt ≠ Context
2.1. Clear Distinction Between Two Concepts
These terms are often used interchangeably, but they're completely different.
| | Prompt | Context |
|---|---|---|
| Scope | User-input instructions | All information the model sees |
| Relationship | Part of context | Larger concept containing prompt |
| Example | "Summarize this text" | Instructions + text content + previous conversation + searched documents + ... |
2.2. Why the Distinction Matters
Everything taught in Parts 1-9 was about "writing good prompts." But no matter how well you write a prompt, if wrong documents are piled on the desk or important materials are missing, the AI cannot work properly.
Context engineering is the technique of designing all the documents on the desk. A prompt is just one of those documents.
3. Four Layers of Context
3.1. Layer Structure Overview
Classifying documents on the desk by role yields four layers.
| Layer | Role | Characteristics |
|---|---|---|
| Layer 1 System Prompt | Define model role, rules, output format | Fixed across all conversations |
| Layer 2 Retrieved Context | Searched external documents, API responses, DB query results | Real-time injection, most tokens |
| Layer 3 Conversation History | Previous conversation content | Must not accumulate infinitely |
| Layer 4 Current User Turn | User's current input message | Located at bottom of context |
The model reasons while seeing all four simultaneously.
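The four layers map naturally onto a chat-API-style message list. A minimal sketch of the assembly step follows; whether retrieved documents go into a second system message or get prepended to the user turn varies by provider, so the second-system-message placement here is an assumption, not a rule.

```python
def assemble_context(
    system_prompt: str,
    retrieved_docs: list[str],
    history: list[dict],   # [{"role": "user"|"assistant", "content": ...}, ...]
    user_turn: str,
) -> list[dict]:
    """Assemble the four layers into a chat-style message list.

    Order mirrors the table above: system rules first, retrieved
    knowledge next, then conversation history, and the current
    user turn at the very end.
    """
    knowledge = "\n\n".join(retrieved_docs)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "system", "content": f"Reference documents:\n{knowledge}"},
        *history,
        {"role": "user", "content": user_turn},
    ]
```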
3.2. Layer 1: System Prompt — The Unchanging Constitution
The fundamental instructions applied identically across all conversations. The model's identity, behavioral rules, prohibitions, and output format go here.
[System Prompt Example]
You are the knowledge librarian of Dechive.
## Absolute Rules
- Do not answer information not in context
- Say "verification needed" when uncertain
- Answer only in Korean
## Output Format
- Core answer: within 3 sentences
- Explicitly cite source documents like [Doc 1]
Most common mistake: hardcoding dynamic information in the system prompt.
# ❌ Wrong approach
System Prompt: "Today is April 18, 2026."
→ Becomes false information tomorrow
# ✅ Correct approach
System Prompt contains only immutable rules.
Dynamic information like dates is injected in the Retrieved Context layer.
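The correct approach above can be sketched in code: keep the system prompt as a constant of immutable rules, and generate volatile facts like today's date at request time for the Retrieved Context layer. The prompt text is taken from the example above; the function name is illustrative.

```python
from datetime import date

# Immutable rules only -- nothing here goes stale.
STATIC_SYSTEM_PROMPT = """You are the knowledge librarian of Dechive.
Do not answer information not in context."""


def build_dynamic_context() -> str:
    """Inject volatile facts (like today's date) at request time,
    in the Retrieved Context layer rather than the system prompt."""
    return f"[Current date: {date.today().isoformat()}]"
```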
3.3. Layer 2: Retrieved Context — Real-time Injected Knowledge
The layer where external knowledge the model needs is injected. It typically consumes the most tokens, and Part 11 on RAG covers how to fill it in detail.
<doc id="1" relevance="0.95">
Context engineering involves designing what the model sees when inferring...
</doc>
<doc id="2" relevance="0.82">
The Lost in the Middle phenomenon was discovered in a 2023 Stanford study...
</doc>
When documents carry relevance scores, the model has a cue for weighting higher-confidence documents more heavily.
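A sketch of producing the `<doc>` blocks shown above from retrieval results, sorted so higher-confidence documents come first. The input shape (text, score) and the function name are assumptions for illustration.

```python
def format_retrieved_docs(docs: list[tuple[str, float]]) -> str:
    """Render retrieved documents as <doc> blocks with relevance
    scores, ordered from highest to lowest confidence."""
    ordered = sorted(docs, key=lambda d: d[1], reverse=True)
    blocks = [
        f'<doc id="{i}" relevance="{score:.2f}">\n{text}\n</doc>'
        for i, (text, score) in enumerate(ordered, start=1)
    ]
    return "\n".join(blocks)
```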
3.4. Layer 3: Conversation History — Memory That Needs Management
As conversation accumulates, this layer encroaches on context. If left unmanaged, two problems arise.
- Context limit exceeded → response cuts off mid-stream
- Old chatter crowds out important recent information
Management strategy is essential. Part 5 covers this in detail.
3.5. Layer 4: Current User Turn — This Question Now
The message the user just typed. Located at the very end of context, the model generates responses based on it by referring to the three layers above.
4. Lost in the Middle: The Trap of Long Contexts
4.1. Discovery of the Phenomenon
Discovered by Stanford researchers in 2023 (Liu et al., "Lost in the Middle: How Language Models Use Long Contexts"). When LLMs are given long inputs, they make good use of information at the beginning and end but tend to overlook information in the middle.
Information utilization by model based on position in context:
Beginning ████████████████ High ✅
Middle    ████░░░░░░░░░░░░ Low  ⚠️
End       ████████████████ High ✅
In a RAG system that retrieves multiple documents and inserts them into the context, placing the most important document in the middle can cause the model to ignore it.

4.2. Response Strategies
Strategy 1 — Place Important Information at Beginning or End
[System Prompt]
[★ Most important document → beginning]
[Supporting documents → middle]
[★ Second most important document → end]
[User Message]
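Strategy 1 can be expressed as a small reordering function: sort by relevance, then "sandwich" the two strongest documents at the beginning and end. This is a sketch under the assumption that each document arrives as a (text, score) pair.

```python
def order_for_recall(docs: list[tuple[str, float]]) -> list[str]:
    """Arrange documents so the two highest-relevance ones sit at
    the beginning and the end, with the rest in the middle."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    texts = [t for t, _ in ranked]
    if len(texts) < 3:
        return texts  # nothing to sandwich
    first, second, *rest = texts
    return [first, *rest, second]
```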
Strategy 2 — Explicitly Tell Model What to Look At
# Vague instruction (avoid this)
"Answer using the documents below."
# Clear instruction
"Use item 3 from [Doc 1] and the conclusion from [Doc 3] as core evidence."
5. Context Pollution: Noise Lowers Performance
5.1. The Misconception That More Information Is Better
Injecting irrelevant information into context actually degrades performance. This is called Context Pollution.
Measured results show:
- Adding 5 irrelevant documents → 15-20% accuracy drop
- Long, verbose system prompt < short, accurate system prompt
5.2. Why This Happens
The model distributes attention across all information in context. When there's much irrelevant information, attention that should focus on truly important information becomes scattered.
Every token entering context must have a reason.
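One practical defense against context pollution is a relevance gate in front of the context builder. The 0.5 threshold and 5-document cap below are illustrative defaults, not tuned values; calibrate both against your own retriever's score distribution.

```python
def filter_noise(docs: list[tuple[str, float]],
                 threshold: float = 0.5,
                 max_docs: int = 5) -> list[tuple[str, float]]:
    """Drop low-relevance documents before they enter the context,
    and cap the total count so attention stays concentrated."""
    kept = [d for d in docs if d[1] >= threshold]
    kept.sort(key=lambda d: d[1], reverse=True)
    return kept[:max_docs]
```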
6. Token Budget: Allocating Limited Desk Space
6.1. Viewing as a Budget Concept
Think of the context window as a limited desk space. Example allocation based on 200K tokens:
Total Budget: 200,000 tokens
├── System Prompt: 10,000 tokens (5%)
├── Retrieved Documents: 80,000 tokens (40%)
├── Conversation History: 50,000 tokens (25%)
├── Tool Outputs: 20,000 tokens (10%)
├── Current User Turn: 5,000 tokens (2.5%)
└── Output Buffer: 35,000 tokens (17.5%)
Always reserve an output buffer. If input plus output exceeds the context limit, the response is cut off mid-stream.
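The buffer arithmetic is trivial but worth making explicit, because forgetting it is exactly how responses get truncated. A minimal sketch:

```python
def max_input_tokens(context_limit: int, output_buffer: int) -> int:
    """Tokens available for input after reserving the output buffer."""
    if output_buffer >= context_limit:
        raise ValueError("output buffer must be smaller than the context limit")
    return context_limit - output_buffer
```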

6.2. Dynamic Allocation by Query Type
```python
def build_context_budget(query_type: str, total: int) -> dict[str, int]:
    if query_type == "simple_qa":
        # Simple question → fewer search docs, larger output buffer
        return {
            "system": int(total * 0.05),
            "retrieved": int(total * 0.25),
            "history": int(total * 0.10),
            "output": int(total * 0.30),
        }
    elif query_type == "deep_analysis":
        # Deep analysis → maximize search docs
        return {
            "system": int(total * 0.05),
            "retrieved": int(total * 0.55),
            "history": int(total * 0.15),
            "output": int(total * 0.20),
        }
    else:
        # Fail loudly rather than silently returning None
        raise ValueError(f"unknown query type: {query_type}")
```
7. Conversation History Management: Three Strategies
7.1. Sliding Window — Keep Only Recent N Turns
```python
MAX_TURNS = 10  # Keep only the 10 most recent turns


def trim_history(messages: list) -> list:
    # Each turn is a user + assistant pair, hence * 2
    return messages[-(MAX_TURNS * 2):]
```
Simplest and easiest to implement. Drawback: important information from early conversation can be cut off.
7.2. Summarization Compression — Compress Old Conversation into One Line
Summarize old conversations with LLM to save tokens.
[Conversation Summary]
User is a Python backend developer building a FastAPI project.
Had JWT authentication error, resolved in middleware layer.
[Recent 3 Turns]
User: I want to see DB connection pooling settings this time.
Assistant: For pooling configuration in SQLAlchemy...
[Current Question]
User: What happens if I switch to async mode?
Minimal information loss and good token savings. Most commonly used in real services.
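The compression step above can be sketched as a function that replaces old turns with a single summary message while keeping recent turns verbatim. The `summarize_fn` parameter stands in for a real LLM summarization call (hypothetical); the default here is a trivial truncation so the sketch stays runnable.

```python
def compress_history(messages: list[dict], keep_recent: int = 6,
                     summarize_fn=None) -> list[dict]:
    """Replace old turns with one summary message; keep recent
    turns verbatim. `summarize_fn` is a stand-in for an LLM call."""
    if len(messages) <= keep_recent:
        return messages
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    if summarize_fn is None:
        # Trivial fallback: first 40 chars of each old message
        summarize_fn = lambda msgs: " / ".join(m["content"][:40] for m in msgs)
    summary = {"role": "system",
               "content": f"[Conversation summary] {summarize_fn(old)}"}
    return [summary, *recent]
```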

7.3. Selective Preservation — Retrieve Only Related Past
Store all conversations but inject only those with high relevance to current question into context. Uses vector embedding for similarity search, covered in detail in Part 11 on RAG.
8. Practical Patterns
8.1. Role-Tagged Structure
Use tags so the model clearly recognizes each section's role in long context.
<instructions>
You are the knowledge librarian of Dechive.
You do not answer information not in context.
</instructions>
<knowledge>
<doc id="1" relevance="0.95">Context engineering involves...</doc>
<doc id="2" relevance="0.82">The Lost in the Middle phenomenon...</doc>
</knowledge>
<history>
User: What's a context window?
Assistant: A context window is...
</history>
<task>
Current user question
</task>
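A builder for the tagged structure above might look like this; it is a sketch that simply wraps each layer in its role tag, using the tag names from the example.

```python
def build_tagged_prompt(instructions: str, docs: list[str],
                        history: list[tuple[str, str]], task: str) -> str:
    """Wrap each context layer in a role tag so the model can
    clearly tell the sections apart."""
    knowledge = "\n".join(docs)
    dialogue = "\n".join(f"{role}: {text}" for role, text in history)
    return (
        f"<instructions>\n{instructions}\n</instructions>\n"
        f"<knowledge>\n{knowledge}\n</knowledge>\n"
        f"<history>\n{dialogue}\n</history>\n"
        f"<task>\n{task}\n</task>"
    )
```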
8.2. Dynamic System Prompt
Parts of the system prompt that vary by user are filled at runtime.
```python
def build_system_prompt(user: User) -> str:
    return f"""You are the knowledge librarian of Dechive.
Current User: {user.name}
Preferred Language: {user.preferred_lang}

## Rules
{"- Can access premium content" if user.plan == "premium" else "- Only free content guidance"}
"""
```
Conclusion: AI Works Properly When Context Is Designed
No matter how well you write a prompt, if wrong documents are piled on the desk, the AI cannot work properly. Conversely, even with a simple prompt, if the desk is perfectly set up, AI produces better-than-expected results.
The perspective shift in context engineering is this:
"What will I say" → "What will the model see when inferring"
Core Principles
| Principle | Content |
|---|---|
| Distinguish Layers | System / Retrieved / History / Current — each has a different role |
| Important at Front and Back | Place key information at the beginning or end to avoid Lost in the Middle |
| Remove Noise | Irrelevant information scatters attention and lowers performance |
| Design Budget | Always leave output buffer |
| Manage Memory | Don't let conversation history accumulate infinitely |
Toward Part 11: RAG and Prompts
In Part 10, we covered context structure and design principles. One question remains open: how to fill the Retrieved Context layer with external knowledge.
In [Part 11: RAG and Prompts – Injecting External Knowledge in Real-Time], we cover the techniques for filling this layer: which documents to retrieve, how to refine them, and in what order to arrange them so the model uses them best. This is where context engineering and RAG meet.
