Dechive

Complete Mastery of Multimodal Prompting — How to Communicate with AI Using Images, Audio, and Video

Beyond text to images, audio, and video. Complete mastery of prompt strategies for properly utilizing multimodal AI.

Introduction: The Question Left by Episode 14

In Episode 14, while covering Reasoning Models, I closed with this preview:

"There remains an area where the target of prompting expands from text to images, audio, and video."

Until now, every prompt covered in this series assumed text as input. Clear instructions, structured outputs, providing examples, context design: all of it was built on text.

But real-world problems aren't expressed by text alone.

Sometimes you want to show a screenshot and ask "How could we improve this UI?" Sometimes you want to share a meeting recording and say "Summarize just the key decisions." Sometimes you want to pass a video and say "Extract the important content from this lecture."

This is multimodal prompting. The technique of transmitting non-text inputs (images, audio, video) to AI and using them effectively.

This edition covers how multimodal AI works, prompt strategies for each of images, audio, and video, and the most common mistakes, from start to finish.


1. What is Multimodal?

1.1. First, what is modality?

Modality refers to the way information is conveyed. Text, images, audio, and video are each one modality.

Earlier LLMs understood only text. They were single-modal systems that received text and output text.

Multimodal means processing two or more modalities simultaneously: receiving an image and text together, or audio and text together.

# Single-modal (traditional)
Input: Text → Output: Text

# Multimodal (current)
Input: Image + Text → Output: Text
Input: Audio + Text → Output: Text
Input: Video + Text → Output: Text

1.2. Which models support what

Model                 | Image | Audio | Video
GPT-4o                | ✅    | ✅    | ❌
Claude Sonnet/Opus    | ✅    | ❌    | ❌
Gemini 1.5 Pro / 2.0  | ✅    | ✅    | ✅

As of 2026, all three models support images, but Gemini supports audio and video most broadly. This edition focuses on prompt strategies for each modality.


2. Image Prompting

Image prompting is the most mature area of multimodal prompting. It can be divided into three directions: analyzing, generating, and editing images.

2.1. Image Analysis — What to ask and how

The most common mistake when asking AI to analyze an image is asking too vaguely.

# ❌ Vague question
"Analyze this image"
→ The model doesn't know what to focus on. It lists everything or misses what's important.

# ✅ Specific question
"Find 3 elements in this UI screenshot that hinder user experience"
→ The analysis purpose is clear, resulting in a focused answer.

It's also important to specify particular areas.

# ❌ Area not specified
"Explain this graph"

# ✅ Area specified
"Why does the blue line in the upper right of this graph drop sharply after 2024?
 Explain while referencing the axis labels and legend below"

Example of an actual image analysis prompt:

This screenshot is a mobile app sign-up screen.

Analyze it from these perspectives:
1. Is the input field layout natural?
2. Are error states clearly displayed?
3. Is the CTA button prominent enough?

Answer each item in "Problem / Solution" format.
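Programmatically, the same kind of focused instruction can be sent alongside a screenshot. A minimal sketch using the OpenAI Python SDK's data-URL image format (the file name and usage lines are illustrative):

```python
import base64

def build_image_message(prompt: str, image_bytes: bytes, mime: str = "image/png") -> list:
    """Build an OpenAI-style chat message that pairs a focused text
    instruction with an inline base64 data-URL image."""
    data_url = f"data:{mime};base64," + base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }]

# Usage (requires the `openai` package and an API key):
# client = openai.OpenAI()
# with open("signup_screen.png", "rb") as f:  # hypothetical file
#     messages = build_image_message(
#         "Find 3 elements in this UI screenshot that hinder user experience",
#         f.read(),
#     )
# response = client.chat.completions.create(model="gpt-4o", messages=messages)
```

The specific, purpose-driven question goes in the text part; the image rides along as context rather than standing alone.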

2.2. OCR and Document Analysis

Multimodal is also useful for extracting text within images or analyzing documents.

# Receipt analysis
Extract the following information from this receipt image and return it as JSON:
{
  "date": "",
  "business_name": "",
  "items": [{"product": "", "quantity": 0, "price": 0}],
  "total": 0
}
Mark unclear sections as "unclear".

# Business card analysis
Extract name, title, company name, email, and phone number from this business card.
Mark missing information as null.
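Because the model may return "unclear" or null for fields it cannot read, a small post-processing step helps route those records to human review before anything is saved. A sketch (the helper name is made up; field names follow the receipt example above):

```python
import json

def fields_needing_review(extracted: str) -> list:
    """Return names of top-level fields whose value is null, empty, or
    "unclear", so a human can verify them before the data is saved."""
    data = json.loads(extracted)
    return [key for key, value in data.items()
            if value in (None, "", "unclear")]

raw = '{"date": "2026-01-15", "business_name": "unclear", "total": 0}'
print(fields_needing_review(raw))  # → ['business_name']
```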

2.3. Code and Diagram Analysis

Particularly useful for developers.

# Whiteboard architecture diagram analysis
This whiteboard photo is a service architecture diagram.

Analyze the following:
1. The role of each component
2. Data flow direction
3. Are there any potential single points of failure (SPOF)?

Base your explanation on arrow directions and labels as much as possible.

# Error screenshot analysis
Look at this error screenshot and tell me the cause and solution.
Reference the entire stack trace, and pay special attention to the red-highlighted lines.

2.4. Image Generation Prompting

When writing prompts for image generation AI like DALL-E, Midjourney, or Stable Diffusion, a different strategy is needed.

Structure of a good image generation prompt:

[Subject] + [Style] + [Mood/Lighting] + [Composition] + [Technical Parameters]

# Bad prompt
"Draw a cat"

# Good prompt
"A small orange tabby cat sitting on a wooden desk,
 soft morning sunlight coming through a window,
 watercolor illustration style,
 warm and cozy atmosphere,
 close-up shot, detailed fur texture,
 high quality, 8k"
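The template above can be captured in a small helper that assembles the segments in order and skips any that are missing (a sketch; the function name is made up):

```python
def build_image_prompt(subject, style=None, mood=None, composition=None, params=None):
    """Join the [Subject] + [Style] + [Mood/Lighting] + [Composition] +
    [Technical Parameters] segments into one comma-separated prompt,
    skipping any segment that was not provided."""
    segments = [subject, style, mood, composition, params]
    return ", ".join(s for s in segments if s)

print(build_image_prompt(
    "a small orange tabby cat sitting on a wooden desk",
    style="watercolor illustration style",
    mood="soft morning sunlight, warm and cozy atmosphere",
    composition="close-up shot, detailed fur texture",
    params="high quality, 8k",
))
```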

Style specification is key.

# Style examples
photorealistic         → Photorealistic
oil painting           → Oil painting feel
watercolor             → Watercolor
3D render, Pixar style → Pixar 3D style
cyberpunk              → Cyberpunk futuristic city
Studio Ghibli style    → Studio Ghibli animation feel

Use negative prompts to remove unwanted elements.

Prompt: "portrait of a woman, natural lighting, professional photo"
Negative Prompt: "blurry, low quality, extra fingers, deformed hands, watermark"

3. Audio Prompting

3.1. Leveraging models that support audio input

GPT-4o and Gemini can directly receive audio files as input. They understand audio itself without converting speech to text.

Key use cases:

# Meeting recording summary
Analyze this meeting recording and organize it in the following format:

1. Attendees (if identifiable by voice)
2. Key Discussion Points (3 lines or less each)
3. Decided Items
4. Next Action Items (including assignees)

# Lecture content extraction
Extract just the key concepts from this lecture audio.
Organize each concept in "Term / Definition / Example" format.

3.2. Speech-to-Text (STT) then Prompting

When using a model that doesn't directly support audio, first convert it to text using an STT tool like Whisper, then prompt.

import openai

client = openai.OpenAI()

# Step 1: Audio → Text
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        language="ko"
    )

# Step 2: Text → Analysis
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Below is a meeting transcript. "
                       "Organize the key decisions and action items.\n\n"
                       + transcript.text
        }
    ]
)

4. Video Prompting

Video prompting is currently led by Gemini. It can understand and analyze entire video files.

4.1. Video Analysis Strategy

Video contains much more information than images. Therefore, you must specify more clearly what to focus on.

# ❌ Scope too broad
"Analyze this video"
→ Unclear where to start and end the analysis

# ✅ Purpose clear
"In this product demo video, list what happens each time the user clicks a button
 in chronological order. Highlight any UI errors or unexpected behaviors."

Timestamp-based analysis requests:

Analyze this lecture video and divide it into chapters in the following format:

[00:00] Chapter Title — One-line summary
[05:30] Chapter Title — One-line summary
...

Also mark the timestamps where important concepts first appear.
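If the model follows the `[MM:SS] Title — summary` format requested above, its reply can be parsed back into structured chapters for seeking into the video (a sketch; the regex assumes that exact format and under-an-hour timestamps):

```python
import re

CHAPTER_RE = re.compile(r"\[(\d{1,2}):(\d{2})\]\s*(.+?)\s*—\s*(.+)")

def parse_chapters(text: str) -> list:
    """Turn '[MM:SS] Title — summary' lines into dicts with the
    start time converted to seconds."""
    chapters = []
    for line in text.splitlines():
        m = CHAPTER_RE.match(line.strip())
        if m:
            minutes, seconds, title, summary = m.groups()
            chapters.append({
                "start_sec": int(minutes) * 60 + int(seconds),
                "title": title,
                "summary": summary,
            })
    return chapters

reply = "[00:00] Intro — What the course covers\n[05:30] Setup — Installing the tools"
print(parse_chapters(reply)[1]["start_sec"])  # → 330
```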

4.2. Combining Video + Text Context

Providing related text context along with the video yields much more accurate analysis than video alone.

# Product demo video analysis
[Video attached]

This video is a demo of our service's onboarding flow.
Currently, our user drop-off rate is 60% at the third step.

Watch the video and:
1. Find elements that could confuse users at step 3
2. Check if progress indication is sufficient
3. Propose 3 specific improvement approaches

5. Core Principles of Multimodal Prompting

Whether it's images, audio, or video, there are principles common to all multimodal prompting.

5.1. What the model sees and what I see can be different

Humans use context, experience, and reasoning simultaneously when viewing images. Models analyze pixel patterns. What seems "obvious" to me may be unclear to the model.

# What I think is obvious
"Don't you see something odd in this graph?"
→ The model might see something different from what I think is "odd"

# Specify clearly
"In this graph, the value suddenly drops to 0 in March 2023.
 How can we tell if this is a data error or a real phenomenon?"

5.2. Supplement visual information with text context

Providing background explanation along with an image yields much more accurate answers than throwing an image alone.

# Image only
[Screenshot attached]
"Why doesn't this work?"

# Image + context
[Screenshot attached]
"I'm getting this error in a Next.js 14 app.
 I'm calling fetch in a Server Component,
 it works locally but only after Vercel deployment do I get this error.
 Tell me the cause and solution."

5.3. Specify output format clearly

Multimodal analysis results should also be received in the desired format for easy use.

# Format not specified — different structure each time
"Analyze this dashboard"

# Format specified — consistent structure
"Analyze this dashboard and return it as the following JSON:
{
  'key_metrics': [{'metric_name': '', 'current_value': '', 'trend': 'up/down/stable'}],
  'anomalies_detected': ['list of anomalies'],
  'recommended_actions': ['list in priority order']
}"
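A fixed structure also makes the reply machine-checkable: you can validate that the requested keys are present before using the analysis (a sketch; the key names follow the dashboard example above, the function name is made up):

```python
import json

EXPECTED_KEYS = {"key_metrics", "anomalies_detected", "recommended_actions"}

def parse_dashboard_reply(reply: str) -> dict:
    """Parse the model's JSON reply and fail loudly if any requested
    top-level key is missing, instead of silently propagating a
    malformed analysis downstream."""
    data = json.loads(reply)
    missing = EXPECTED_KEYS - data.keys()
    if missing:
        raise ValueError(f"model reply missing keys: {sorted(missing)}")
    return data
```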

6. Multimodal Prompting Failure Patterns

6.1. Trusting hallucinated details

AI sometimes misreads text or numbers within images. This is particularly common with handwriting, small text, and low-quality images.

# Dangerous usage
Extract date and amount from contract image and save directly to DB

# Safe usage
Extract the date and amount from this contract image.
However, be sure to mark any sections that are unclear or difficult to confirm as [UNCLEAR].
→ Humans manually verify [UNCLEAR] items

6.2. Asking too many things from one image

# ❌ Too many questions
"Analyze trend in this dashboard,
 find anomalies, predict next month,
 propose improvements,
 and compare with competitors"

# ✅ Focus on one at a time
"From this dashboard, identify just 2 of the most noteworthy changes in the past 6 months"

6.3. Not considering image quality

Low-quality, blurry, or partially cut-off images reduce accuracy. Your prompts should account for this.

This image has low quality and some content may be unclear.
Don't guess for hard-to-read sections; mark them as "unclear".
Only extract parts that are clearly readable.

6.4. Not considering total video length

When analyzing very long videos, consider the model's context limitations and processing cost.

# Long video processing strategy
1. First request overall summary → identify which parts are important
2. Re-analyze important parts by specifying timestamps
3. If needed, process video in sections
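Step 3 above amounts to splitting the runtime into windows, ideally with a small overlap so context at the boundaries isn't lost. A minimal sketch (the window and overlap sizes are arbitrary choices):

```python
def video_segments(duration_sec: int, window_sec: int = 600, overlap_sec: int = 30):
    """Split a video's runtime into overlapping (start, end) windows,
    so each segment can be analyzed and re-prompted separately."""
    segments = []
    start = 0
    while start < duration_sec:
        end = min(start + window_sec, duration_sec)
        segments.append((start, end))
        if end == duration_sec:
            break
        start = end - overlap_sec
    return segments

print(video_segments(1500))  # 25-minute video → [(0, 600), (570, 1170), (1140, 1500)]
```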

7. Real-World Use Case Collection

7.1. Developer — Code Review + Error Analysis

[Error screenshot attached]

Analyze this error.
- Language/Framework: Python FastAPI
- Occurs at: POST /api/upload endpoint call
- Focus on the last 3 lines of the stack trace to find the cause
- Show fix code as an example

7.2. Designer — UI/UX Feedback

[Design draft attached]

Review this landing page draft from Nielsen's 10 usability heuristics perspective.
If there are principle violations, explain them with the element's location (upper left, center, etc.).
Provide specific improvement suggestions.

7.3. Marketer — Competitor Analysis

[Competitor website screenshot attached]

Analyze this competitor's landing page:
1. What is the main message?
2. Who is the target customer? (speculation)
3. 3 most emphasized features/benefits
4. What we can learn and our differentiation points

7.4. Data Analyst — Chart Interpretation

[Data chart attached]

Analyze this chart.

First, identify the chart type, X-axis, and Y-axis units.
Then answer these questions:
- What is the overall trend?
- Are there any anomalies (spikes, drops, outliers)?
- What are the business implications of this data?

Conclusion: Beyond Text — Making the World Your Input

From Episode 1 to 14, the focus was on how well to structure text. Multimodal breaks those boundaries.

Now AI can understand the screens we see, the conversations we hear, and the videos we watch together. This doesn't simply mean "we can input more things." It means the way we collaborate with AI has fundamentally changed.

What takes 30 minutes to explain in text can now be delivered with a single screenshot. An hour-long meeting can be transcribed and action items extracted in 5 minutes. We can extract just what we need from long, complex videos.

The key comes down to one thing: for any input, clarify the purpose, provide sufficient context, and specify the output format concretely.

Core Principles Summary

Principle             | Essence
Clarify Purpose       | Not "analyze this image" but specifically what you want to know
Specify Area          | Clearly indicate which part of the image to focus on
Supplement Context    | Don't just attach the image/audio/video; include background explanation
Specify Format        | Be explicit about the structure you want the analysis returned in
Prevent Hallucination | Instruct the model to mark unclear parts as such, not to guess

Toward Episode 16

With multimodal covered, one essential security issue remains that you'll inevitably face when putting AI to use.

"If someone tries to maliciously manipulate my AI, how do I stop them?"

[Episode 16: Prompt Injection Defense — Protecting Your AI from Attack] covers the principles of prompt injection attacks and defense strategies.
