What is RAG - Injecting External Knowledge into AI in Real-Time
Technology that lets models know what they don't know: from RAG principles to prompt design.
Introduction: Layer 2 Was Empty
In Part 10, we divided the context window into four layers: system prompt, retrieved external documents, conversation history, and current user message. And we ended with one honest admission:
"How to fill Layer 2, the retrieved external documents, will be covered in Part 11."
This article keeps that promise.
Models know nothing of the world after their training cutoff. They don't know company-internal documents. They can't know policies updated yesterday or news announced today. No matter how well-crafted a prompt is, a model cannot produce information it has no access to. If it produces it anyway, that's hallucination.
RAG (Retrieval-Augmented Generation) is the technology that solves this problem. By finding relevant documents before the model reasons and putting them in the context, we make the model know what it doesn't know.
This article covers from beginning to end what RAG is, how it works, and how to design Layer 2 so the model can use it best.
1. What is RAG
1.1. Understanding the Name
RAG is a combination of three words.
- Retrieval: Search for relevant documents from external storage
- Augmented: Enrich the context with those documents
- Generation: Generate answers based on the enriched context
To summarize in one sentence:
"Before the model answers, find relevant documents and put them in the context."
It may sound simple, but this flow changes the paradigm of LLM utilization. Rather than changing the model itself, we design the information given to the model. The context emphasized in Part 10—not "what to say" but "what the model should see when reasoning"—is most directly realized in RAG.
1.2. Why It's Necessary
LLMs have two fundamental limitations.
The first is Knowledge Cutoff. Models cannot know information after their training ends. GPT-4o's knowledge stops at a certain date, and news announced today or product specifications changed yesterday don't exist for the model. Even if you try to solve this with a prompt, the limitation is clear: a prompt cannot create knowledge the model doesn't have.
The second is Absence of Internal Knowledge. Company internal documents, personal notes, and proprietary data from specific domains are not included in training data. A legal team's contract database, five years of customer service records, meeting notes written yesterday—these are information no powerful model can know.
RAG solves both limitations simultaneously. Not by updating the model, but by directly injecting necessary information at inference time.
2. The 3-Stage RAG Pipeline
Every RAG system follows the same flow: retrieve, refine, and inject. Let's examine each stage concretely.

2.1. Stage 1 — Retrieve
Find documents related to the user's question from external storage. Here, "external storage" typically refers to a vector database. The goal of this stage is to find documents semantically closest to the question.
User: "What's the context window size of Claude?"
→ Search for semantically similar documents in vector DB
→ [Doc 1] Claude Model Specification Document (Similarity: 0.94)
→ [Doc 2] LLM Context Size Comparison Table (Similarity: 0.87)
→ [Doc 3] Claude Latest Update Notes (Similarity: 0.81)
The quality of the search determines the quality of the entire RAG system. If wrong documents are retrieved, even a good model generates wrong answers. This stage is covered in more detail in Section 3.
2.2. Stage 2 — Refine
If you put retrieved documents directly into the context as-is, tokens explode. Three 30-page PDFs put in whole will far exceed the context limit. This stage removes low-relevance parts and keeps only the essence.
[Doc 1] Full 38 pages → Extract only 3 relevant paragraphs → ~800 tokens
[Doc 2] Full 15 pages → Extract only 1 relevant table → ~300 tokens
[Doc 3] Similarity 0.81 → Below threshold (0.85) → Excluded
It's important to filter out documents below a threshold. The context pollution problem covered in Part 10 applies here too. When irrelevant documents are injected, the model's attention scatters and accuracy actually drops.
2.3. Stage 3 — Inject
Place refined documents structurally in context Layer 2, and the model generates answers based on them. Rather than simply pasting text, you should design the structure so the model can clearly recognize the role of each document.
<knowledge>
<doc id="1" relevance="0.94" source="claude-specs.pdf">
Claude Sonnet 4.6's context window is 200K tokens.
This is a size that can process approximately 150,000 words of text at once...
</doc>
<doc id="2" relevance="0.87" source="llm-comparison.xlsx">
| Model | Max Context |
|---|---|
| Claude Sonnet 4.6 | 200K tokens |
| GPT-4o | 128K tokens |
| Gemini 2.5 Pro | 1M tokens |
</doc>
</knowledge>
Including relevance and source attributes allows the model to consider both the credibility and source of each document when generating answers. Injection design is covered in detail in Section 5.
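The three stages can be strung together in a short end-to-end sketch. Everything here is illustrative: the corpus, the precomputed relevance scores (standing in for real vector-similarity results), and the 0.85 threshold are assumptions, not a fixed API.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    id: int
    text: str
    relevance: float  # stand-in for a similarity score from a vector DB

# Stage 1: Retrieve — in a real system this is a vector DB query
def retrieve(query: str, corpus: list[Doc], top_k: int = 3) -> list[Doc]:
    return sorted(corpus, key=lambda d: d.relevance, reverse=True)[:top_k]

# Stage 2: Refine — drop documents below the similarity threshold
def refine(docs: list[Doc], threshold: float = 0.85) -> list[Doc]:
    return [d for d in docs if d.relevance >= threshold]

# Stage 3: Inject — place surviving documents into Layer 2 of the context
def inject(docs: list[Doc]) -> str:
    return "\n".join(f"[Doc {d.id}] {d.text}" for d in docs)

corpus = [
    Doc(1, "Claude Sonnet 4.6's context window is 200K tokens.", 0.94),
    Doc(2, "Context size comparison across LLMs.", 0.87),
    Doc(3, "Claude latest update notes.", 0.81),
]
layer2 = inject(refine(retrieve("Claude context window size?", corpus)))
# Doc 3 (0.81) falls below the 0.85 threshold and is excluded
```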
3. Vector Search: Why Keyword Search Isn't Enough
3.1. Fundamental Limitations of Keyword Search
Traditional search engines work based on word matching. If words in the question appear in the document, it's found; if not, it's not. Using this method in RAG creates serious problems.
Question: "How big is the context window?"
Document: "The token limit the model can process at once is 200K."
→ Keyword search: The word "context window" doesn't exist → Search fails
→ Vector search: The meaning is the same → Search succeeds
Documents expressing the same concept in different ways are completely missed by keyword search. This problem becomes more severe with internal documents, where expressions are more varied and inconsistent.
3.2. Principles of Vector Search
When text is converted into a numerical array (vector) of hundreds to thousands of dimensions, sentences with similar meanings are positioned close in vector space. This conversion process is called Embedding.
"Context window size" → [0.23, 0.87, 0.41, -0.15, ...]
"Token processing limit" → [0.21, 0.89, 0.44, -0.13, ...] ← Close
"What should I eat for lunch" → [0.91, 0.12, 0.77, 0.62, ...] ← Far
Semantic distance is calculated using the angle between two vectors (cosine similarity). Even if words differ, if the meaning is the same, the similarity score is high. This is why vector search is more suitable for RAG than keyword search.
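Cosine similarity itself is only a few lines of arithmetic. The vectors below reuse the illustrative four-dimensional numbers above; real embeddings have hundreds to thousands of dimensions, but the computation is identical.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: dot product / product of norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v_context = [0.23, 0.87, 0.41, -0.15]  # "Context window size"
v_token   = [0.21, 0.89, 0.44, -0.13]  # "Token processing limit"
v_lunch   = [0.91, 0.12, 0.77, 0.62]   # "What should I eat for lunch"

cosine_similarity(v_context, v_token)  # near 1.0: same meaning
cosine_similarity(v_context, v_lunch)  # much lower: unrelated
```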
3.3. Practical Embedding Knowledge
Embedding is generated once when documents are stored and reused afterward. At search time, the question is also converted using the same embedding model and compared with document vectors. You must use the same model for embedding to align vector spaces. Comparing vectors embedded with different models won't produce meaningful similarity.
| Item | Content |
|---|---|
| Embedding Model | OpenAI text-embedding-3-small, Cohere Embed, BGE, E5, etc. |
| Vector DB | Pinecone, Supabase Vector, Weaviate, Chroma |
| Document Embedding | Generated once at storage, reused afterward |
| Query Embedding | Generated at each search, same model required |
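The storage-versus-query split looks like this in code. The embedder below is a toy word-count stand-in, not a real model; the point it illustrates is that documents are embedded once at storage time while each query is embedded fresh, with the same embedder, so both live in the same vector space.

```python
import math

VOCAB = ("token", "limit", "context", "window", "lunch")  # toy vocabulary

def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. text-embedding-3-small);
    # real embeddings are dense, high-dimensional, and capture meaning.
    t = text.lower()
    return [float(t.count(w)) for w in VOCAB]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

doc_vectors: dict[str, list[float]] = {}  # built once at storage time

def index_document(doc_id: str, text: str) -> None:
    doc_vectors[doc_id] = embed(text)

def search(query: str, top_k: int = 2) -> list[str]:
    q = embed(query)  # MUST use the same embedder as the documents
    ranked = sorted(doc_vectors, key=lambda d: cosine(q, doc_vectors[d]),
                    reverse=True)
    return ranked[:top_k]

index_document("spec", "The token limit the model can process at once is 200K.")
index_document("menu", "What should I eat for lunch today?")
```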
4. Chunking: How to Split Documents
4.1. Why Chunking Matters
The retrieval unit in vector search is not the entire document but a slice of the document, called a Chunk. Let's think about what happens if you make a 100-page PDF into a single vector.
That PDF contains introduction, technical specifications, pricing policy, FAQ, and disclaimers all mixed together. Compressing this into a single vector results in a mediocre vector that's somewhat similar to any question and exactly matches none. Even if this PDF is retrieved when asked about "pricing policy," the actual pricing section is buried somewhere in 100 pages.
Chunking solves this problem by dividing documents into meaningful pieces. The smaller chunks are, the higher the search precision but the less context; the larger they are, the richer the context but the more irrelevant content is mixed in. Finding the appropriate size is the essence of chunking.
4.2. Chunking Strategies
Fixed-size chunking is the simplest approach: text is cut into pieces of a fixed length. It's simple and fast to implement, but sentences can be severed at chunk boundaries. An overlap parameter mitigates this by repeating a small window of text across adjacent chunks, preserving continuity.
```python
def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    chunks = []
    for i in range(0, len(text), size - overlap):
        chunks.append(text[i:i + size])
    return chunks
```
Semantic chunking leverages document structure: markdown is split at ## headings, HTML at <section> tags. Because it follows natural semantic boundaries, search quality is higher. This is the approach recommended for real services.
```python
import re

def semantic_chunk(markdown: str) -> list[str]:
    # Split by ## headings
    sections = re.split(r'\n## ', markdown)
    # Drop sections too short to be meaningful chunks
    return [s.strip() for s in sections if len(s.strip()) > 100]
```
Hierarchical chunking combines the strengths of both approaches. Store documents in both large units (sections) and small units (paragraphs). Search is done precisely with small units, and when actually injecting into context, the entire section containing that paragraph is included. You get both search precision and rich context.
[Parent chunk] ## 4. Pricing Policy (entire section, ~1000 tokens)
├── [Child chunk] Basic plan explanation (paragraph, ~200 tokens)
├── [Child chunk] Premium plan (paragraph, ~200 tokens)
└── [Child chunk] Refund policy (paragraph, ~150 tokens)
Search: "How does refund work?" → Match child chunk "Refund policy"
Inject: Insert entire "Pricing Policy" section, parent of child, into context
5. Injection Design: How to Put Found Documents In
After searching and refining, what remains is how to position it in the context. Rather than simply concatenating documents, you should design it structurally so the model can use it best.
5.1. Placement Order Changes Results
The Lost in the Middle phenomenon covered in Part 10 applies here too. Models tend to utilize information at the beginning and end of the context well, while relatively ignoring information in the middle.
For this reason, simply listing by relevance can bury the most important document in the middle and leave it unused. It's more effective to place important documents at the beginning or end, and supplementary materials in the middle.
[System Prompt]
[★ Highest relevance document → Place at beginning]
[Medium-level relevance documents → Middle]
[★ Second most important document → Place at end]
[User Message]
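One way to realize this layout, assuming the retriever returns documents sorted best-first:

```python
def order_for_injection(docs: list[str]) -> list[str]:
    # Counter "lost in the middle": best document goes first,
    # second-best goes last, everything else sits in between.
    if len(docs) < 3:
        return docs
    return [docs[0]] + docs[2:] + [docs[1]]

order_for_injection(["doc_a", "doc_b", "doc_c", "doc_d"])
# → ["doc_a", "doc_c", "doc_d", "doc_b"]
```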
5.2. Use Tag Structure to Clarify Roles
When putting documents into the context, structured tags let the model clearly recognize the role of each region. Claude's documentation in particular officially recommends separating context with XML tags.
<instructions>
Answer based only on documents inside the <knowledge> tag.
If content is not in the documents, say "This requires verification."
Mark the source for all factual claims in [Doc N] format.
If dates conflict, prioritize the latest document.
</instructions>
<knowledge>
<doc id="1" relevance="0.94" date="2026-04">
...document content...
</doc>
<doc id="2" relevance="0.87" date="2025-11">
...document content...
</doc>
</knowledge>
<query>
User question
</query>
Including relevance and date attributes allows the model to consider both recency and relevance. Especially when multiple documents cover the same topic with different dates, the date attribute lets the model prioritize newer information.
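Rendering retrieved documents into this tag structure is mechanical. A sketch, with illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class RetrievedDoc:
    id: int
    text: str
    relevance: float
    date: str  # e.g. "2026-04"

def render_knowledge(docs: list[RetrievedDoc]) -> str:
    # Each document carries relevance and date attributes so the model
    # can weigh credibility and recency per document.
    body = "\n".join(
        f'<doc id="{d.id}" relevance="{d.relevance:.2f}" date="{d.date}">\n'
        f"{d.text}\n</doc>"
        for d in docs
    )
    return f"<knowledge>\n{body}\n</knowledge>"
```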
5.3. Enforce Source Citation
One of the most effective ways to suppress hallucination in RAG systems is to enforce source citation. By instructing the model to mark sources in the format "this information is in [Doc 1]," it naturally creates constraints against generating content not in documents. It becomes difficult to make claims without being able to provide evidence.
# Instruction without citation (avoid)
"Answer with reference to the documents below."
# Instruction enforcing citation (recommended)
"Mark [Doc N] for all factual claims in your answer.
Never add content not found in documents.
If not found in documents, explicitly state 'This information is not verified in the documents.'"
6. RAG Failure Patterns
When RAG is implemented but performance falls short of expectations, it usually fits one of these four patterns.
6.1. Wrong Search — Garbage In, Garbage Out
The most common failure pattern. When irrelevant documents are retrieved, the model generates answers based on them. If the search itself is wrong, subsequent stages are meaningless.
Question: "What's the refund policy?"
Search result: Shipping policy document (matched keyword 'policy')
→ Model: Confuses shipping policy with refund policy and answers
Setting a similarity threshold to exclude documents below a certain level, and if necessary, adding a Reranker model to re-rank first-pass results is effective.
6.2. Too Much Input — Context Pollution
It might seem that adding as many relevant documents as possible is better, but the opposite is true. As covered in Part 10, irrelevant information in the context scatters the model's attention; distractor studies have reported accuracy drops on the order of 15–20% once a handful of irrelevant documents are mixed in.
Using only the top 3–5 documents and managing quality with a similarity threshold produces better results.
```python
TOP_K = 5
RELEVANCE_THRESHOLD = 0.75

docs = vector_search(query, top_k=TOP_K)
docs = [d for d in docs if d.relevance >= RELEVANCE_THRESHOLD]
```
6.3. Chunk Boundary Problems — Incomplete Information
When chunks are cut in the middle of important sentences, the model receives incomplete information. In this case, the model may reason based only on the truncated beginning and reach wrong conclusions.
# Chunk cut at boundary
"Refunds can be requested within 30 days of purchase,
only if the product is unopened..."
→ [Chunk boundary] → Rest of condition is in next chunk
# Model's interpretation: Refunds possible within 30 days with no conditions
You can mitigate this problem by setting overlap in chunks, chunking based on sentence boundaries, or using hierarchical chunking as explained earlier.
6.4. Date Conflict — Older Documents Win
When multiple documents cover the same topic with different dates, the model sometimes chooses content from older documents. This is particularly risky for information that changes frequently like pricing, policies, and specifications.
Explicitly including a date attribute and adding instructions in the system prompt like "prioritize the latest document when dates conflict" is the simplest and most effective solution.
7. Real-World RAG Prompt Template
Summarizing everything so far into one template:
<instructions>
You are the knowledge base AI for [Service Name].
## Answer Rules
- Answer only based on documents inside the <knowledge> tag
- For content not in documents, answer "This information is not verified in the documents"
- Mark the source as [Doc N] for all factual claims
- When dates conflict, prioritize the latest document
- Convey document content as-is, don't summarize or interpret it
</instructions>
<knowledge>
{{retrieved_documents}}
</knowledge>
<history>
{{conversation_history}}
</history>
<query>
{{user_message}}
</query>
{{variables}} are filled dynamically at execution time. Retrieved documents, previous conversation, and user message each take their place to form one context.
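Filling the template is plain string substitution. A sketch, with the answer rules abbreviated and the service name as a placeholder:

```python
def build_prompt(retrieved_documents: str,
                 conversation_history: str,
                 user_message: str,
                 service_name: str = "YourService") -> str:
    # Assemble the four layers into one context string
    return f"""<instructions>
You are the knowledge base AI for {service_name}.
- Answer only based on documents inside the <knowledge> tag
- Mark the source as [Doc N] for all factual claims
- When dates conflict, prioritize the latest document
</instructions>
<knowledge>
{retrieved_documents}
</knowledge>
<history>
{conversation_history}
</history>
<query>
{user_message}
</query>"""
```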
Conclusion: RAG Is the Core Engine of Context Engineering
After Part 10, one question remained: we divided context into 4 layers, but how do we fill Layer 2? RAG is the answer.
Writing good prompts is still important. But there's something more fundamental than that: what information should the model see when it reasons? RAG is the method to systematically fill that Layer 2.
Rather than making models bigger or fine-tuning them, injecting correct information in the right way into the context is a faster and more cost-effective solution for most practical problems. This is why RAG has rapidly become commonplace since 2023.
Summary of Core Principles
| Principle | Key Point |
|---|---|
| Vector Search | Find by meaning, not keywords. Find even if expressions differ as long as meaning is the same |
| Chunking | Cut by semantic units. Supplement boundaries with overlap |
| Refinement | Exclude documents below threshold. Keep to 3–5 or fewer |
| Placement | Important documents at beginning or end. Middle for supplementary materials |
| Enforce Citation | Make source marking mandatory to suppress hallucination |
Looking Forward to Part 12
Having learned how to inject external knowledge with RAG, a natural next question arises:
"Can we make the AI judge and perform searches itself?"
Until now, RAG has been a pattern where external systems handle the search and pass results to the model. But there's a pattern where the AI itself decides "which tool to use," performs searches directly, sees results, and judges the next action. This is ReAct — Reasoning and Acting.
[Part 12: ReAct Pattern – The Core of Agents That Interweave Reasoning and Tool Use] covers how AI evolves from a simple response machine to an agent that independently reasons and acts.
