I Showed Google Gemini a Blurry Photo and It Wrote Working Code From It
Took a photo of a whiteboard. Terrible lighting, half the text cut off, my thumb covering the corner. Uploaded it to Google’s Gemini with the prompt: “Build this.”
Two minutes later, I had functioning React code implementing exactly what the whiteboard sketch showed.
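For anyone who wants to reproduce that through the API instead of the web app, the request is about as simple as it sounds. Here's a minimal sketch using the google-generativeai Python SDK; the model name and file path are assumptions, not exactly what I ran.

```python
# Minimal sketch: one photo, one prompt, one request. No OCR step on my side.
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

model = genai.GenerativeModel("gemini-1.5-pro")  # assumed model name
whiteboard = Image.open("whiteboard.jpg")        # hypothetical photo

# Text and image go in the same content list.
response = model.generate_content(
    ["Build this as a React component. Return only the code.", whiteboard]
)
print(response.text)
```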
This is multimodal AI—models that understand images, video, audio, and text simultaneously. Not just “describe this image.” Actually reasoning across different types of data like humans do.
The demo that sold me: I recorded a 3-minute video walkthrough of a broken UI, explaining the bug while clicking through the interface. Gemini watched the video, read the error messages in the frames, understood my verbal explanation, and generated a patch.
No transcription step. No separate image analysis. It just… processed everything at once.
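Mechanically, "processed everything at once" looks like this: the video goes up through the Files API, then gets referenced in the same request as the text prompt. A hedged sketch; the file name is invented and the polling loop follows the pattern Google's docs recommend.

```python
# Sketch: analyze a screen-recorded bug walkthrough (video frames + my narration).
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

video = genai.upload_file(path="broken_ui_walkthrough.mp4")  # hypothetical clip

# Video uploads are processed asynchronously; wait until the file is ready.
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video,
    "Watch this walkthrough, read the on-screen error messages, and propose "
    "a patch for the bug I describe out loud.",
])
print(response.text)
```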
Google claims Gemini 1.5 Pro can handle up to 1 million tokens of mixed input: roughly 1,500 pages of text, or 11 hours of audio, or thousands of images. I tested it by feeding it an entire product documentation site, a stack of UI screenshots, and a batch of customer support call recordings.
Asked: “Why are users confused about the refund process?”
It identified three UI elements that contradicted the documentation, found two support calls where agents gave wrong information, and suggested specific copy changes. All from connecting dots across text, images, and audio.
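The long-context trick is less exotic than it sounds: everything gets uploaded and handed to the model as one big content list. A rough sketch of the shape, with invented file names and far less material than I actually fed it (the upload-processing wait loop from the video example is omitted for brevity):

```python
# Sketch: docs text + UI screenshots + support-call audio in one request.
import os
from pathlib import Path

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

docs_text = Path("docs/refunds.md").read_text()   # hypothetical docs export
screenshots = [genai.upload_file(path=p)
               for p in ["ui/refund_button.png", "ui/order_history.png"]]
calls = [genai.upload_file(path=p)
         for p in ["support/call_0142.mp3", "support/call_0178.mp3"]]

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    docs_text,
    *screenshots,
    *calls,
    "Why are users confused about the refund process? Point to the specific "
    "UI elements, doc passages, and call moments that contradict each other.",
])
print(response.text)
```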
The practical applications are wild. A radiologist colleague is testing Gemini on medical imaging. Feed it an MRI scan plus patient history plus relevant research papers. It highlights potential issues the radiologist might have missed—not replacing human judgment, but augmenting it.
Construction companies are using it for safety compliance. Workers film job sites with their phones, Gemini analyzes the video against safety regulations, and flags violations in real time. Hard hat missing? Railing unstable? It catches them.
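I don't have their pipelines, but the general shape is straightforward: footage plus a regulations checklist, with the model told to answer in JSON so the flags can feed a dashboard. A speculative sketch; the checklist and schema here are mine, not theirs.

```python
# Sketch: check job-site footage against a safety checklist, get JSON flags back.
import os
import time

import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

footage = genai.upload_file(path="site_walkthrough.mp4")  # hypothetical clip
while footage.state.name == "PROCESSING":
    time.sleep(5)
    footage = genai.get_file(footage.name)

checklist = """\
1. Hard hats worn in all active work areas.
2. Guardrails secured on elevated platforms.
3. No workers inside marked exclusion zones.
"""

model = genai.GenerativeModel(
    "gemini-1.5-pro",
    # Ask for JSON so flags can be piped straight into a compliance dashboard.
    generation_config={"response_mime_type": "application/json"},
)
response = model.generate_content([
    footage,
    "Check this footage against the checklist below. Return a JSON array of "
    "objects with fields: rule, timestamp_seconds, description.\n" + checklist,
])
print(response.text)
```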
But multimodal AI is still deeply weird.
I showed Gemini a picture of my desk and asked “What should I do next?” Based on visible coffee cup placement, open notebooks, and my laptop screen (which showed my calendar), it suggested I was probably procrastinating on a deadline and should start writing.
Creepy? Yes. Accurate? Also yes.
The hallucination problem is worse across modalities. I fed it a graph from a research paper. It confidently described trends that didn’t exist in the data. When I called it out, it apologized and gave a different wrong interpretation.
Text-only models hallucinate. Multimodal models hallucinate in multiple formats simultaneously.
OpenAI’s GPT-4V (Vision) and Anthropic’s Claude 3 Opus are in the same race. Everyone’s building multimodal capability. The question isn’t if AI will understand images and video—it’s who builds the most reliable version first.
I tested all three on the same task: analyze a complex diagram from an engineering textbook and explain it to a beginner. GPT-4V gave the most accurate technical explanation. Claude was best at the beginner-friendly part. Gemini found the sweet spot between the two.
The economics are tricky. Processing images uses way more compute than text. A single high-resolution image can cost as much as processing thousands of words. Video is even worse.
That construction safety company I mentioned? Their bill went from $400/month for text-only AI to $6,000/month when they switched to video analysis. Still worth it compared to potential safety violations, but not trivial.
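To make "video is even worse" concrete, here's the back-of-envelope math. Every rate below is an illustrative assumption; token accounting and pricing change, so substitute current numbers before trusting the totals.

```python
# Back-of-envelope cost comparison: text prompt vs. image vs. video.
# ALL rates below are illustrative assumptions, not quoted prices.

TOKENS_PER_IMAGE_TILE = 258            # rough per-tile image token count (assumed)
TOKENS_PER_VIDEO_SECOND = 300          # ~1 sampled frame/sec plus audio (assumed)
PRICE_PER_MILLION_INPUT_TOKENS = 3.50  # placeholder USD rate (assumed)

def input_cost(tokens: int) -> float:
    """Dollar cost of sending `tokens` of input at the assumed rate."""
    return tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS

text_prompt = input_cost(2_700)                    # ~2,000 words of text
one_image = input_cost(6 * TOKENS_PER_IMAGE_TILE)  # high-res image, ~6 tiles
ten_min_video = input_cost(10 * 60 * TOKENS_PER_VIDEO_SECOND)

print(f"2,000-word prompt:   ${text_prompt:.4f}")
print(f"one high-res image:  ${one_image:.4f}")
print(f"10 minutes of video: ${ten_min_video:.2f}")
# The gap per request looks small; it compounds across hundreds of clips a month.
```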
The future Google’s pitching: you’ll interact with AI using whatever modality makes sense. Snap a photo instead of typing a description. Record a video instead of writing instructions. Have a conversation instead of crafting perfect prompts.
I tried this workflow: photographed a hand-drawn flowchart, uploaded it with a voice memo explaining the business logic, and asked Gemini to build the system.
It generated a complete backend API, database schema, and even suggested edge cases I hadn’t considered. The code needed tweaking, but it was 80% there.
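For the record, the whole request was three parts: one photo, one audio file, one ask. A sketch with invented file names; the real output was a big chunk of code I still had to edit.

```python
# Sketch: hand-drawn flowchart photo + voice memo -> backend scaffold.
import os

import google.generativeai as genai
from PIL import Image

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

flowchart = Image.open("flowchart.jpg")                    # hypothetical photo
voice_memo = genai.upload_file(path="business_logic.mp3")  # hypothetical memo

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    flowchart,
    voice_memo,
    "Design the system in this flowchart, using the voice memo for the "
    "business rules. Produce a REST API outline, a database schema, and a "
    "list of edge cases I should handle.",
])
print(response.text)
```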
Five years ago, that would’ve required precise written specifications, multiple clarification rounds, and weeks of development. Now it’s a photo and a voice memo.
The catch: you have to trust the AI actually understood what you showed it. And sometimes it didn’t. I’ve had it confidently implement the wrong thing based on misinterpreting a blurry image.
Multimodal AI is powerful but unpredictable. Like hiring someone brilliant who occasionally misreads instructions in creative ways. You can’t just set it loose—you need to verify everything.
But when it works? It’s genuinely magical. The barrier between human thought and working software is getting very, very thin.