Dechive Logo
Dechive
Dev#dechive#llm#prompt#prompt-engineering#structured-output#json#pydantic#tool-use#ai-parsing

AI JSON Output Stabilization – Complete Mastery of Structured Output

To process AI output as code, the format must be guaranteed. This covers a complete 5-step strategy for reliably receiving JSON.

Structured Output: How to Reliably Extract Data from LLMs

Introduction: Why "Just Give Me JSON" Doesn't Work

Imagine calling an AI from your code to extract data. You need to pull sentiment (positive/negative), product names mentioned, and key complaints from customer review text and store them in a database. Naturally, you add to the end of your prompt:

Please output the result in JSON format.

Then the AI responds with:

Of course! Here's the analysis result you requested in JSON format:

```json
{
  "sentiment": "negative",
  "product": "Notebook Battery",
  "complaint": "Battery lasts less than 3 hours"
}

I hope this analysis was helpful. Please let me know if you need anything else!


One line before the JSON, two lines after. Code block markers (` ``` `) included. `JSON.parse(response)` immediately throws an exception.

This is the essence of the Structured Output problem. **Instructing an AI on format and guaranteeing that format is parseable by code are two completely different issues.** As long as you conflate these two things, you cannot build a system that reliably processes AI output.

This article explains how to bridge that gap across five layers. From tweaking a single line of prompt to enforcing at the API level to building error recovery pipelines. After reading this, you should be able to make clear decisions about which method to use in any situation.

---

## 1. Why LLMs Can't Follow JSON Rules

### 1.1. LLM output is probabilistic text generation

As covered in Part 8, LLMs generate text by sampling the next token from a probability distribution. From this perspective, JSON is a very special format for LLMs.

JSON is not natural language. The vast majority of data that models trained on is natural language text. Structured formats like JSON, YAML, CSV are included in training data, but make up a much smaller proportion compared to natural language. Because of this, when instructed to "output in JSON format," the model tries to generate JSON while simultaneously struggling to suppress a powerful tendency to respond in natural language. Expressions like "Of course!", "Here below", "I hope this analysis was helpful" appear overwhelmingly frequently in question-answer pairs in training data.

This is not a model defect. The model faithfully reflects patterns in its training data. The problem is that we're asking for special output formats that deviate from those patterns.

### 1.2. Three failure patterns that break structure

In practice, JSON output failures typically follow one of three patterns.

**Pattern 1 — Preamble**
Natural language explanations appear before the JSON. Sentences like "Of course!", "Here's the result" are typical. You can work around this in code by finding the first `{` and parsing from there, but this itself is an unstable hack.

**Pattern 2 — Postamble**
Explanations are added after the JSON. Sentences like "This analysis is...", "By the way...", "If you need additional information..." are common. More difficult to handle than preambles. You could find the last `}` and truncate, but with nested JSON structures you might cut at the wrong position.

**Pattern 3 — Malformed JSON**
The structure looks like JSON but is syntactically invalid. Missing quotes on string values, missing quotes on keys, trailing commas after final items, single quotes instead of double quotes, comment insertion—these occur frequently.

```json
// Common Malformed JSON examples

{
  sentiment: "negative",        // No quotes on key
  'product': "Notebook Battery", // Single quotes used
  "complaint": "Battery life",   // Trailing comma
}

These three patterns aren't solvable by "giving clearer instructions." Due to the model's probabilistic nature, these patterns occur at low probability even with crystal-clear instructions. If you want 99% reliability, prompt instructions alone aren't enough. Multiple layers of defense are needed.

1.3. The Reliability Stack concept

The Structured Output problem isn't solved by a single technique. Instead, approach it with a Reliability Stack built from multiple layers.

Level 5 — Error Recovery Pipeline      ← Most powerful
Level 4 — API-level Enforcement (Tool Use)
Level 3 — Explicit Schema + Few-shot
Level 2 — Output Separation Structure
Level 1 — Imperative Instructions       ← Least powerful

Each level can be used independently, but high-reliability systems combine multiple levels. Which level to choose depends on task importance, error tolerance, and the capabilities of the model you're using.


2. Level 1 — Imperative Instructions: Getting the Basics Right

Quite a lot can be improved with instructions alone. But it must be far more specific than "give me JSON."

2.1. Position matters

The position of format instructions within a prompt has more impact than you might think.

LLMs process prompts sequentially and tend to weight the beginning more heavily. If format instructions appear only at the end, the model has already begun preparing a natural language response before applying the format. Stating it both at the beginning and end significantly increases format compliance.

[Bad example — format instruction only at end]
Extract sentiment, product name, and complaints from the following text.
Text: {{text}}
Please output in JSON format.


[Good example — specified at both beginning and end]
Analyze the following text and output the result in JSON only. No explanations, preambles, or postambles—just pure JSON.

Text: {{text}}

Output in exactly this format. Never include any other text:
{"sentiment": "...", "product": "...", "complaint": "..."}

2.2. Explicit prohibitions are more effective than explicit permissions

"Output in JSON format" is less effective than "Output JSON only without explanation." And "JSON only" is less effective than "Output pure JSON with no preamble, postamble, or code block markers."

When you explicitly enumerate what not to do, you effectively block the patterns the model naturally tends to choose.

[Output format instruction reinforced]

Output rules (strictly enforce):
✗ Never do this: Add preambles like "Of course!", "Here below", "This analysis", "Additionally"
✗ Never do this: Use ```json code block markers
✗ Never do this: Add // comments
✓ Do this: Output pure JSON only, starting with { and ending with }

3. Level 2 — Output Separation Structure: Separate Thinking from Format

One root cause of format errors despite clear prompt instructions is that the model carries the burden of thinking while simultaneously maintaining format. When reasoning and final output are written in the same space, format deteriorates.

3.1. Free thinking, fixed output

The key pattern for solving this is explicitly separating reasoning from output. Instruct the model to first analyze freely, then output JSON only after that analysis is complete.

[Reasoning-output separation pattern]

Analyze the following customer review.

[Analysis process]
First, freely analyze the following without any formatting:
- Overall emotional tone
- Products or features mentioned
- Key complaints or praise

[Final output]
Based on the above analysis, output only the following JSON schema. No explanation, JSON only:
{
  "sentiment": "positive" | "negative" | "neutral",
  "product": "string or null",
  "main_point": "key point in one sentence"
}

Review: {{review_text}}

When the model freely explores thinking in the [Analysis process] section then generates JSON in the [Final output] section, format error rates drop significantly. Since analysis is already complete, the model feels less compulsion to "add explanations."

3.2. Use XML tags to clearly delineate areas

This approach is recommended in Anthropic's official prompting guidelines. XML tags help the model recognize what should go in each area more clearly.

[XML tag separation pattern]

<task>
Extract information from the following text.
</task>

<text>
{{input_text}}
</text>

<thinking>
Write your analysis process here. Format doesn't matter.
</thinking>

<output>
Write only JSON here. No other text.
{"key": "value"}
</output>

This pattern explicitly signals to the model the purpose of each tag space. Within the <output> tag, the signal is to focus only on maintaining JSON format.


4. Level 3 — Explicit Schema and Few-shot: Examples Trump Instructions

4.1. Define expected values with JSON Schema

Stating "the key must be a string" is far less effective than including a precise JSON Schema in the prompt. JSON Schema explicitly defines each field's type, whether it's required, and allowed values.

[Prompt including JSON Schema]

Analyze the following text and output JSON that strictly follows the schema below.

Schema:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["sentiment", "confidence", "entities"],
  "properties": {
    "sentiment": {
      "type": "string",
      "enum": ["positive", "negative", "neutral"]
    },
    "confidence": {
      "type": "number",
      "minimum": 0,
      "maximum": 1
    },
    "entities": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "type"],
        "properties": {
          "name": {"type": "string"},
          "type": {"type": "string", "enum": ["product", "person", "place"]}
        }
      }
    }
  }
}

Text: {{input_text}}

Output pure JSON only. No additional fields beyond the schema.

Models have learned JSON Schema syntax from training data and possess the ability to structure output based on schema definitions. In particular, restricting allowed values with enum is very effective for classification tasks.

4.2. Few-shot examples overpower instructions

While covered in Part 5, this deserves special emphasis in the Structured Output context. A single concrete input-output example defeats even the most detailed format instructions.

[Few-shot Structured Output pattern]

Extract information from the text and output as JSON. Refer to the examples.

---
Input: "Galaxy S24 battery drains too fast. Doesn't last even a day."
Output: {"sentiment":"negative","product":"Galaxy S24","issue":"battery life"}
---
Input: "iPhone camera is amazing. Night mode is incredible."
Output: {"sentiment":"positive","product":"iPhone","issue":null}
---
Input: "Fast shipping is great, but the packaging was a bit sloppy."
Output: {"sentiment":"neutral","product":null,"issue":"packaging quality"}
---

Now process the following:
Input: {{new_text}}
Output:

With nothing after Output:, the model naturally continues that pattern by writing JSON. All preceding examples started directly with JSON without preamble. This technique is called Prefix Forcing in prompt engineering. By filling in the start of the text the model should generate with examples, you guide the format of what follows.


5. Level 4 — API-level Enforcement: Guarantees Outside Prompts

No matter how sophisticated prompt-level techniques are, they fail at low probability. For systems requiring 99% reliability, that 1% failure may be unacceptable. To address this, model providers offer features that enforce structured output at the API level.

5.1. OpenAI Structured Outputs

OpenAI provides two approaches.

JSON Mode (response_format: { type: "json_object" }): Guarantees output is valid JSON. However, it doesn't guarantee what structure the JSON has. You might get completely different structures like {"answer": "42"}. You must specify the expected schema in your prompt alongside this.

Structured Outputs (supported in gpt-4o-2024-08-06 and above): When you pass a JSON Schema as an API parameter, the model generates JSON that 100% follows that schema. Invalid fields, wrong types, and missing required fields cannot occur. Internally, the decoding stage restricts token choices based on the schema.

from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class ReviewAnalysis(BaseModel):
    sentiment: str  # "positive" | "negative" | "neutral"
    product: str | None
    main_issue: str | None

response = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "user", "content": f"Analyze the following review: {review_text}"}
    ],
    response_format=ReviewAnalysis,
)

result = response.choices[0].message.parsed
# result is guaranteed to be a ReviewAnalysis type object
print(result.sentiment)  # "negative"

The Pydantic model definition doubles as the schema definition. You receive an object with guaranteed type without needing parsing code.

5.2. Anthropic Tool Use enforces JSON

Anthropic Claude uses Tool Use (Function Calling) instead of a separate Structured Outputs API to achieve the same effect. The core idea is to make the AI "call a tool." Since tool call parameters are defined with JSON Schema, structured JSON naturally emerges as the model fills in tool parameters.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "analyze_review",
        "description": "Analyze a customer review to extract sentiment, product, and issues",
        "input_schema": {
            "type": "object",
            "properties": {
                "sentiment": {
                    "type": "string",
                    "enum": ["positive", "negative", "neutral"],
                    "description": "Overall emotional tone"
                },
                "product": {
                    "type": ["string", "null"],
                    "description": "Product name mentioned. null if none"
                },
                "main_issue": {
                    "type": ["string", "null"],
                    "description": "Key complaint or praise. null if none"
                }
            },
            "required": ["sentiment", "product", "main_issue"]
        }
    }
]

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "analyze_review"},  # Must use this tool
    messages=[
        {"role": "user", "content": f"Analyze the following review: {review_text}"}
    ]
)

# Extract input from tool_use block
tool_use = next(b for b in response.content if b.type == "tool_use")
result = tool_use.input  # Dictionary with schema guaranteed

By specifying tool_choice: {"type": "tool", "name": "analyze_review"}, you force the model to use this tool. The API returns an error if tool parameters don't follow the schema.

5.3. instructor library: Unified abstraction

instructor, created by Jason Liu, is an open-source library that lets you use multiple LLM clients (OpenAI, Anthropic, Gemini, etc.) with Pydantic-based Structured Output through a unified interface. Internally, it automatically selects each provider's optimal approach (Structured Outputs for OpenAI, Tool Use for Anthropic).

import instructor
from anthropic import Anthropic
from pydantic import BaseModel
from typing import Literal

client = instructor.from_anthropic(Anthropic())

class ReviewAnalysis(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"]
    product: str | None
    main_issue: str | None

result = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    response_model=ReviewAnalysis,
    messages=[
        {"role": "user", "content": f"Analyze the following review: {review_text}"}
    ]
)

# result is a ReviewAnalysis type object. No parsing code needed.
print(result.sentiment)   # "negative"
print(result.product)     # "Galaxy S24"

instructor's core value comes from two things. First, a single Pydantic model definition handles schema, type validation, and parsing. Second, it includes built-in functionality that automatically retries on parse failure and feeds error messages back to the model so it self-corrects.


6. Level 5 — Error Recovery Pipeline: Design for Failure

There are situations where you can't use API-level enforcement. Legacy models, third-party API wrappers, cost constraints—situations where you must depend on prompt level only. Or even using API-level enforcement, you may need to handle unexpected exception scenarios.

The last line of defense for these situations is the Error Recovery Pipeline.

6.1. Parse → Validate → Recover loop

The basic principle is simple: if parsing fails, don't give up. Tell the model about the error and let it fix itself.

import json
import re

def extract_json(text: str) -> dict | None:
    """Attempt to extract JSON from text. Remove preamble/postamble then parse."""
    # Step 1: Try parsing as-is
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError:
        pass

    # Step 2: Extract from first { to last }
    match = re.search(r'\{.*\}', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass

    # Step 3: Remove code block markers then retry
    cleaned = re.sub(r'```(?:json)?\n?', '', text).strip()
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        return None


def get_structured_output(prompt: str, schema: dict, max_retries: int = 3) -> dict:
    """Retry loop to obtain structured output."""
    last_error = None

    for attempt in range(max_retries):
        if attempt == 0:
            current_prompt = prompt
        else:
            # Provide previous error as feedback
            current_prompt = f"""{prompt}

[Previous attempt error]
I tried to parse your response but encountered this error:
{last_error}

Please output pure JSON only. No preamble, postamble, or code block markers."""

        response_text = call_llm(current_prompt)  # Actual LLM call
        result = extract_json(response_text)

        if result is not None:
            # Schema validation (optional)
            if validate_against_schema(result, schema):
                return result
            else:
                last_error = f"Schema mismatch: {get_schema_errors(result, schema)}"
        else:
            last_error = f"JSON parse failed. Raw response: {response_text[:200]}"

    raise ValueError(f"Max retries exceeded. Last error: {last_error}")

The key in this pattern is including the error message as context for the next prompt. The model knows what it did wrong and fixes it. Empirically, the vast majority of errors get corrected on the first retry.

6.2. Strategy selection guide by use case

Which level of strategy to choose depends on the situation.

SituationRecommended Strategy
Simple classification, errors tolerableLevel 1~2 (Imperatives + Separation)
Important data extraction, occasional failures okayLevel 3 (Schema + Few-shot)
Production service, high failure costLevel 4 (API enforcement)
API enforcement unavailable + high reliability neededLevel 5 (Recovery pipeline)
Highest reliability requiredLevel 4 + Level 5 combined

Conclusion: To Trust AI Output, Design Trust Into It

Structured Output isn't "how to ask AI for JSON." It's system design to raise reliability of AI output to a level code can handle.

The difference between these two seems small but is decisive in real systems. Simply instructing "give me JSON" is using AI as a tool. Designing a reliability stack is integrating AI as a stable system component.

This perspective shift matters because as AI's role expands, pipelines, agents, and automated systems become more important than one-off prompts. All these systems must process AI output programmatically, and that processing only works reliably when format is guaranteed.

Practical application order

If implementing Structured Output for the first time, follow this order:

  1. Start with Level 2 (output separation structure). Simply restructuring your prompt significantly increases format compliance.
  2. If more reliability is needed, add Level 3 (JSON Schema + Few-shot).
  3. For production systems, switch to Level 4 (API enforcement). If using OpenAI, use Structured Outputs; if Anthropic, use Tool Use + instructor.
  4. If failure still cannot be tolerated, add Level 5 (recovery pipeline) as your final defense.

Toward Part 10: Context Engineering

Through Part 9, we've covered content (what to write) and format (how to receive) of prompts.

But every AI system has another decisive constraint: Context Window. The amount of text a model can process at once is finite. When dealing with long documents, extended conversations, and complex system prompts simultaneously, the context window depletes quickly.

The following [Part 10: Context Engineering – Designing Context Windows] covers techniques for using limited context as efficiently as possible. What to include in context, what to discard, what order to arrange. Viewing the context window as an object of design is the core of Part 10.

사서Dechive 사서