Dechive

Complete Mastery of Multimodal Prompting — How to Communicate with AI Using Images, Audio, and Video

Beyond text to images, audio, and video. Complete mastery of prompt strategies for properly utilizing multimodal AI.

Introduction: The Question Left by Episode 14

In Episode 14, while covering Reasoning Models, I closed with this preview:

"There remains an area where the target of prompting expands from text to images, audio, and video."

Until now, every prompt covered in this series assumed text as input. Clear instructions, structured outputs, providing examples, context design: all of it was built on text.

But real-world problems aren't expressed by text alone.

Sometimes you want to show a screenshot and ask "How could we improve this UI?" Sometimes you want to share a meeting recording and say "Summarize just the key decisions." Sometimes you want to pass a video and say "Extract the important content from this lecture."

This is multimodal prompting. The technique of transmitting non-text inputs (images, audio, video) to AI and using them effectively.

This edition covers how multimodal AI works, prompt strategies for each of images, audio, and video, and the most common mistakes, from start to finish.


1. What is Multimodal?

1.1. First, what is modality?

Modality refers to the way information is conveyed. Text, images, audio, and video are each one modality.

Earlier LLMs understood only text. They were single-modal systems that received text and output text.

Multimodal means processing two or more modalities simultaneously: receiving an image and text together, or audio and text together.

# Single-modal (traditional)
Input: Text → Output: Text

# Multimodal (current)
Input: Image + Text → Output: Text
Input: Audio + Text → Output: Text
Input: Video + Text → Output: Text

1.2. Which models support what

Model                 | Image | Audio | Video
GPT-4o                | ✅    | ✅    | ❌
Claude Sonnet/Opus    | ✅    | ❌    | ❌
Gemini 1.5 Pro / 2.0  | ✅    | ✅    | ✅

As of 2026, all three models support images, but Gemini supports audio and video most broadly. This edition focuses on prompt strategies for each modality.


2. Image Prompting

Image prompting is the most mature area of multimodal prompting. It can be divided into three directions: analyzing, generating, and editing images.

2.1. Image Analysis — What to ask and how

The most common mistake when asking AI to analyze an image is asking too vaguely.

# ❌ Vague question
"Analyze this image"
→ The model doesn't know what to focus on. It lists everything or misses what's important.

# ✅ Specific question
"Find 3 elements in this UI screenshot that hinder user experience"
→ The analysis purpose is clear, resulting in a focused answer.

It's also important to specify particular areas.

# ❌ Area not specified
"Explain this graph"

# ✅ Area specified
"Why does the blue line in the upper right of this graph drop sharply after 2024?
 Explain while referencing the axis labels and legend below"

Example of an actual image analysis prompt:

This screenshot is a mobile app sign-up screen.

Analyze it from these perspectives:
1. Is the input field layout natural?
2. Are error states clearly displayed?
3. Is the CTA button prominent enough?

Answer each item in "Problem / Solution" format.
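Programmatically, the same kind of focused instruction can be sent alongside a screenshot. A minimal sketch using the OpenAI Python SDK's data-URL image format (the file name and usage lines are illustrative):

```python
import base64

def build_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> list:
    """Build an OpenAI-style chat message that pairs a focused text
    instruction with an inline base64 data-URL image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]

# Usage (requires the `openai` package and an API key):
# client = openai.OpenAI()
# with open("signup_screen.png", "rb") as f:  # hypothetical file
#     messages = build_image_message(
#         "Find 3 elements in this UI screenshot that hinder user experience",
#         f.read(),
#     )
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

The specific, purpose-driven question goes in the text part; the image rides along as context rather than standing alone.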

2.2. OCR and Document Analysis

Multimodal is also useful for extracting text within images or analyzing documents.

# Receipt analysis
Extract the following information from this receipt image and return it as JSON:
{
  "date": "",
  "business_name": "",
  "items": [{"product": "", "quantity": 0, "price": 0}],
  "total": 0
}
Mark unclear sections as "unclear".

# Business card analysis
Extract name, title, company name, email, and phone number from this business card.
Mark missing information as null.
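Because the model may return "unclear" or null for fields it cannot read, a small post-processing step helps route those records to human review before anything is saved. A sketch (the helper name is made up; field names follow the receipt example above):

```python
import json

def fields_needing_review(extracted: str) -> list:
    """Return names of top-level fields whose value is null, empty, or
    "unclear", so a human can verify them before the data is saved."""
    data = json.loads(extracted)
    return [key for key, value in data.items()
            if value in (None, "", "unclear")]

raw = '{"date": "2026-01-15", "business_name": "unclear", "total": 0}'
print(fields_needing_review(raw))  # → ['business_name']
```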

2.3. Code and Diagram Analysis

Particularly useful for developers.

# Whiteboard architecture diagram analysis
This whiteboard photo is a service architecture diagram.

Analyze the following:
1. The role of each component
2. Data flow direction
3. Are there any potential single points of failure (SPOF)?

Base your explanation on arrow directions and labels as much as possible.

# Error screenshot analysis
Look at this error screenshot and tell me the cause and solution.
Reference the entire stack trace, and pay special attention to the red-highlighted lines.

2.4. Image Generation Prompting

When writing prompts for image generation AI like DALL-E, Midjourney, or Stable Diffusion, a different strategy is needed.

Structure of a good image generation prompt:

[Subject] + [Style] + [Mood/Lighting] + [Composition] + [Technical Parameters]

# Bad prompt
"Draw a cat"

# Good prompt
"A small orange tabby cat sitting on a wooden desk,
 soft morning sunlight coming through a window,
 watercolor illustration style,
 warm and cozy atmosphere,
 close-up shot, detailed fur texture,
 high quality, 8k"
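The template above can be captured in a small helper that assembles the segments in order and skips any that are missing (a sketch; the function name is made up):

```python
def build_image_prompt(subject, style=None, mood=None, composition=None, params=None):
    """Join the [Subject] + [Style] + [Mood/Lighting] + [Composition] +
    [Technical Parameters] segments into one comma-separated prompt,
    skipping any segment that was not provided."""
    segments = [subject, style, mood, composition, params]
    return ", ".join(s for s in segments if s)

print(build_image_prompt(
    "a small orange tabby cat sitting on a wooden desk",
    style="watercolor illustration style",
    mood="soft morning sunlight, warm and cozy atmosphere",
    composition="close-up shot, detailed fur texture",
    params="high quality, 8k",
))
```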

Style specification is key.

# Style examples
photorealistic         → Photorealistic
oil painting           → Oil painting feel
watercolor             → Watercolor
3D render, Pixar style → Pixar 3D style
cyberpunk              → Cyberpunk futuristic city
Studio Ghibli style    → Studio Ghibli animation feel

Use negative prompts to remove unwanted elements.

Prompt: "portrait of a woman, natural lighting, professional photo"
Negative Prompt: "blurry, low quality, extra fingers, deformed hands, watermark"

3. Audio Prompting

3.1. Leveraging models that support audio input

GPT-4o and Gemini can directly receive audio files as input. They understand audio itself without converting speech to text.

Key use cases:

# Meeting recording summary
Analyze this meeting recording and organize it in the following format:

1. Attendees (if identifiable by voice)
2. Key Discussion Points (3 lines or less each)
3. Decided Items
4. Next Action Items (including assignees)

# Lecture content extraction
Extract just the key concepts from this lecture audio.
Organize each concept in "Term / Definition / Example" format.

3.2. Speech-to-Text (STT) then Prompting

When using a model that doesn't directly support audio, first convert it to text using an STT tool like Whisper, then prompt.

import openai

client = openai.OpenAI()

# Step 1: Audio → Text
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="ko"
    )

# Step 2: Text → Analysis
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Below is a meeting transcript. "
                       "Organize the key decisions and action items.\n\n"
                       + transcript.text
        }
    ]
)

4. Video Prompting

Video prompting is currently led by Gemini. It can understand and analyze entire video files.

4.1. Video Analysis Strategy

Video contains much more information than images. Therefore, you must specify more clearly what to focus on.

# ❌ Scope too broad
"Analyze this video"
→ Unclear where to start and end the analysis

# ✅ Purpose clear
"In this product demo video, list what happens each time the user clicks a button
 in chronological order. Highlight any UI errors or unexpected behaviors."

Timestamp-based analysis requests:

Analyze this lecture video and divide it into chapters in the following format:

[00:00] Chapter Title — One-line summary
[05:30] Chapter Title — One-line summary
...

Also mark the timestamps where important concepts first appear.
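If the model follows the `[MM:SS] Title — summary` format requested above, its reply can be parsed back into structured chapters for seeking into the video (a sketch; the regex assumes that exact format and under-an-hour timestamps):

```python
import re

CHAPTER_RE = re.compile(r"\[(\d{1,2}):(\d{2})\]\s*(.+?)\s*—\s*(.+)")

def parse_chapters(text: str) -> list:
    """Turn '[MM:SS] Title — summary' lines into dicts with the
    start time converted to seconds."""
    chapters = []
    for line in text.splitlines():
        m = CHAPTER_RE.match(line.strip())
        if m:
            minutes, seconds, title, summary = m.groups()
            chapters.append({
                "start_sec": int(minutes) * 60 + int(seconds),
                "title": title,
                "summary": summary,
            })
    return chapters

reply = "[00:00] Intro — What the course covers\n[05:30] Setup — Installing the tools"
print(parse_chapters(reply)[1]["start_sec"])  # → 330
```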

4.2. Combining Video + Text Context

Providing related text context along with the video yields much more accurate analysis than video alone.

# Product demo video analysis
[Video attached]

This video is a demo of our service's onboarding flow.
Currently, our user drop-off rate is 60% at the third step.

Watch the video and:
1. Find elements that could confuse users at step 3
2. Check if progress indication is sufficient
3. Propose 3 specific improvement approaches

5. Core Principles of Multimodal Prompting

Whether it's images, audio, or video, there are principles common to all multimodal prompting.

5.1. What the model sees and what I see can be different

Humans use context, experience, and reasoning simultaneously when viewing images. Models analyze pixel patterns. What seems "obvious" to me may be unclear to the model.

# What I think is obvious
"Don't you see something odd in this graph?"
→ The model might see something different from what I think is "odd"

# Specify clearly
"In this graph, the value suddenly drops to 0 in March 2023.
 How can we tell if this is a data error or a real phenomenon?"

5.2. Supplement visual information with text context

Providing background explanation along with an image yields much more accurate answers than throwing an image alone.

# Image only
[Screenshot attached]
"Why doesn't this work?"

# Image + context
[Screenshot attached]
"I'm getting this error in a Next.js 14 app.
 I'm calling fetch in a Server Component,
 it works locally but only after Vercel deployment do I get this error.
 Tell me the cause and solution."

5.3. Specify output format clearly

Multimodal analysis results should also be received in the desired format for easy use.

# Format not specified — different structure each time
"Analyze this dashboard"

# Format specified — consistent structure
"Analyze this dashboard and return it as the following JSON:
{
  'key_metrics': [{'metric_name': '', 'current_value': '', 'trend': 'up/down/stable'}],
  'anomalies_detected': ['list of anomalies'],
  'recommended_actions': ['list in priority order']
}"
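A fixed structure also makes the reply machine-checkable: you can validate that the requested keys are present before using the analysis (a sketch; the key names follow the dashboard example above, the function name is made up):

```python
import json

EXPECTED_KEYS = {"key_metrics", "anomalies_detected", "recommended_actions"}

def parse_dashboard_reply(reply: str) -> dict:
    """Parse the model's JSON reply and fail loudly if any requested
    top-level key is missing, instead of silently propagating a
    malformed analysis downstream."""
    data = json.loads(reply)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model reply missing keys: {sorted(missing)}")
    return data
```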

6. Multimodal Prompting Failure Patterns

6.1. Trusting hallucinated details

AI sometimes misreads text or numbers within images. This is particularly common with handwriting, small text, and low-quality images.

# Dangerous usage
Extract date and amount from contract image and save directly to DB

# Safe usage
Extract the date and amount from this contract image.
However, be sure to mark any sections that are unclear or difficult to confirm as [UNCLEAR].
→ Humans manually verify [UNCLEAR] items

6.2. Asking too many things from one image

# ❌ Too many questions
"Analyze trend in this dashboard,
 find anomalies, predict next month,
 propose improvements,
 and compare with competitors"

# ✅ Focus on one at a time
"From this dashboard, identify just 2 of the most noteworthy changes in the past 6 months"

6.3. Not considering image quality

Low-quality, blurry, or partially cut-off images reduce accuracy. Your prompts should account for this.

This image has low quality and some content may be unclear.
Don't guess for hard-to-read sections; mark them as "unclear".
Only extract parts that are clearly readable.

6.4. Not considering total video length

When analyzing very long videos, consider the model's context limitations and processing cost.

# Long video processing strategy
1. First request overall summary → identify which parts are important
2. Re-analyze important parts by specifying timestamps
3. If needed, process video in sections
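Step 3 above amounts to splitting the runtime into windows, ideally with a small overlap so context at the boundaries isn't lost. A minimal sketch (the window and overlap sizes are arbitrary choices):

```python
def video_segments(duration_sec: int, window_sec: int = 600, overlap_sec: int = 30):
    """Split a video's runtime into overlapping (start, end) windows,
    so each segment can be analyzed and re-prompted separately."""
    segments = []
    start = 0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        segments.append((start, end))
        if end == duration_sec:
            break
        start = end - overlap_sec
    return segments

print(video_segments(1500))  # 25-minute video → [(0, 600), (570, 1170), (1140, 1500)]
```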

7. Real-World Use Case Collection

7.1. Developer — Code Review + Error Analysis

[Error screenshot attached]

Analyze this error.
- Language/Framework: Python FastAPI
- Occurs at: POST /api/upload endpoint call
- Focus on the last 3 lines of the stack trace to find the cause
- Show fix code as an example

7.2. Designer — UI/UX Feedback

[Design draft attached]

Review this landing page draft from Nielsen's 10 usability heuristics perspective.
If there are principle violations, explain them with the element's location (upper left, center, etc.).
Provide specific improvement suggestions.

7.3. Marketer — Competitor Analysis

[Competitor website screenshot attached]

Analyze this competitor's landing page:
1. What is the main message?
2. Who is the target customer? (speculation)
3. 3 most emphasized features/benefits
4. What we can learn and our differentiation points

7.4. Data Analyst — Chart Interpretation

[Data chart attached]

Analyze this chart.

First, identify the chart type, X-axis, and Y-axis units.
Then answer these questions:
- What is the overall trend?
- Are there any anomalies (spikes, drops, outliers)?
- What are the business implications of this data?

Conclusion: Beyond Text — Making the World Your Input

From Episode 1 to 14, the focus was on how well to structure text. Multimodal breaks those boundaries.

Now AI can understand the screens we see, the conversations we hear, and the videos we watch together. This doesn't simply mean "we can input more things." It means the way we collaborate with AI has fundamentally changed.

What takes 30 minutes to explain in text can now be delivered with a single screenshot. An hour-long meeting can be transcribed and action items extracted in 5 minutes. We can extract just what we need from long, complex videos.

The key comes down to one thing: for any input, clarify the purpose, provide sufficient context, and specify the output format concretely.

Core Principles Summary

Principle             | Essence
Clarify Purpose       | Not "analyze this image" but specifically what you want to know
Specify Area          | Clearly indicate which part of the image to focus on
Supplement Context    | Don't just attach the image/audio/video; include background explanation
Specify Format        | Be explicit about the structure you want the analysis returned in
Prevent Hallucination | Instruct the model to mark unclear parts as such, not to guess

Toward Episode 16

With multimodal covered, one essential security issue remains that you'll inevitably face when putting AI to use.

"If someone tries to maliciously manipulate my AI, how do I stop them?"

[Episode 16: Prompt Injection Defense — Protecting Your AI from Attack] covers the principles of prompt injection attacks and defense strategies.
