Giving AI coding assistants vision

AI generation disclosure: I wanted to make testing easier, so I asked ChatGPT and Claude how to give vision to Claude Code. They explained it and gave me a script to use. I ran Ollama to download LLaVA, then prompted Claude Code with the script to use LLaVA as its eyes for testing.

```python
import requests
import base64

def ask_local_vision(image_path, question):
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    response = requests.post('http://localhost:11434/api/generate',
        json={
            'model': 'llava',
            'prompt': question,
            'images': [image_data]
        })

    return response.json()['response']
```

The code presented here was generated by Claude Code. I told it to document how it used AI for visual feedback; that way the user (me) can spend time on more productive things.

While testing my UI system, I got tired of checking whether all the items were rendered, so in a nutshell, I needed Claude Code to have vision, and I gave it vision. I had it write up what I did, and what it did, to use Ollama with the LLaVA model for testing visual output. It's long, but it works great. Hope this helps folks.
![image|690x488](upload://f7nYShS7lLnpIMLskzQBNm80zPy.png)
A sample image of the library.

---

Note: Claude Code is really proud of its work.. lol
# AI Vision Integration Documentation

## 🤖 AI Vision System for Mojo GUI Analysis

This project includes a cutting-edge **AI vision-assisted debugging system** that uses **Ollama with LLaVA (Large Language and Vision Assistant)** to analyze GUI screenshots and provide human-like feedback about rendering, layout, and visual issues.

## 📋 Table of Contents

1. [Overview](#overview)
2. [System Architecture](#system-architecture)
3. [Installation and Setup](#installation-and-setup)
4. [Core Components](#core-components)
5. [Usage Examples](#usage-examples)
6. [API Reference](#api-reference)
7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)

## 🔍 Overview

### What is AI Vision?

The AI Vision system allows you to:
- **Take screenshots** of your MojoGUI applications automatically
- **Analyze GUI rendering** using AI that can "see" like a human
- **Get natural language feedback** about visual issues, layout problems, or rendering bugs
- **Verify GUI functionality** by asking specific questions about what's visible
- **Debug text rendering** and font issues through AI analysis
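
The capabilities above all boil down to one round trip: base64-encode the screenshot bytes and POST them with a question to Ollama's `/api/generate` endpoint. As a minimal sketch of just the request construction (the helper name `build_vision_payload` is illustrative, not part of the project):

```python
import base64
import json

def build_vision_payload(image_bytes: bytes, question: str, model: str = 'llava') -> str:
    """Build the JSON body for an Ollama /api/generate vision request."""
    encoded = base64.b64encode(image_bytes).decode('utf-8')
    return json.dumps({
        'model': model,
        'prompt': question,
        'images': [encoded],  # Ollama accepts base64-encoded images in this field
    })

# Placeholder bytes standing in for real PNG screenshot data
payload = build_vision_payload(b'\x89PNG...', 'What do you see?')
print(json.loads(payload)['model'])  # prints: llava
```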

### Why Use AI Vision?

Traditional debugging shows you code errors, but **AI Vision shows you what users actually see**:
- ✅ **Human-like analysis** - AI describes what it "sees" in natural language
- ✅ **Visual verification** - Confirm that GUI elements are actually visible and correctly rendered
- ✅ **Layout debugging** - Identify spacing, alignment, and color issues
- ✅ **Cross-platform testing** - Works regardless of OpenGL drivers or graphics issues
- ✅ **Automated testing** - Verify GUI appearance programmatically

## 🏗️ System Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   MojoGUI App   │    │   Screenshot    │    │   Local LLaVA   │
│                 │───▶│     System      │───▶│      Model      │
│   (Advanced     │    │                 │    │                 │
│    Widgets)     │    │ • pyautogui     │    │ • Ollama Server │
└─────────────────┘    │ • PIL/ImageGrab │    │ • Vision Model  │
                       │ • ImageMagick   │    │ • API Interface │
                       └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │   AI Analysis   │
                                              │                 │
                                              │ • Natural Lang. │
                                              │ • GUI Feedback  │
                                              │ • Issue Reports │
                                              └─────────────────┘
```


## 🚀 Installation and Setup

### Step 1: Install Ollama

```bash
# Install Ollama (AI model runner)
curl -fsSL https://ollama.ai/install.sh | sh
```

### Step 2: Download the LLaVA Model

```bash
# Download the vision model (this may take several minutes)
ollama pull llava
```

### Step 3: Start the Ollama Server

```bash
# Start the local AI server
ollama serve
# Server runs on http://localhost:11434
```

### Step 4: Install Python Dependencies

```bash
# Install screenshot and API libraries
pip install requests pillow pyautogui
```

### Step 5: Test the Setup

```bash
# Test that everything is working
python3 ollama_vision_setup.py
```

## :puzzle_piece: Core Components

### 1. `ollama_vision_setup.py` - Main Setup and Interface

The core class that manages Ollama and LLaVA integration:

```python
from ollama_vision_setup import OllamaVision

# Create vision interface
vision = OllamaVision(model_name='llava')

# Setup (downloads model if needed)
if vision.setup():
    # Analyze an image
    response = vision.ask_vision("screenshot.png", "What do you see?")
    print(response)
```

Key methods:

- `setup()` - Initialize Ollama and download LLaVA if needed
- `ask_vision(image_path, question)` - Analyze image with AI
- `check_ollama_running()` - Verify Ollama server status
- `test_vision()` - Test the vision model

### 2. `ai_vision_integration.py` - User-Provided Pattern

A simple, direct integration pattern provided by the user:

```python
import requests
import base64

def ask_local_vision(image_path, question):
    with open(image_path, 'rb') as f:
        image_data = base64.b64encode(f.read()).decode('utf-8')

    response = requests.post('http://localhost:11434/api/generate',
        json={
            'model': 'llava',
            'prompt': question,
            'images': [image_data]
        })

    return response.json()['response']
```

### 3. `ai_vision_debug.py` - Complete Debug Workflow

Comprehensive debugging that:

- Runs a MojoGUI test with high-contrast colors
- Takes a screenshot automatically
- Analyzes it with multiple diagnostic questions
- Provides detailed AI feedback
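
The steps above can be sketched as a small driver that wires the pieces together. The callables are injected here (an illustrative structure, not the project's actual code) so the real screenshot and Ollama calls can be swapped in:

```python
import time

def run_debug_workflow(launch_app, take_screenshot, ask,
                       questions, settle_seconds=3):
    """Launch the GUI, grab a screenshot, and collect AI feedback.

    launch_app:      callable that starts the GUI under test
    take_screenshot: callable returning the path to a saved screenshot
    ask:             callable(image_path, question) -> str, e.g. ask_local_vision
    """
    launch_app()
    time.sleep(settle_seconds)  # let the GUI finish rendering
    shot = take_screenshot()
    # One diagnostic question per pass keeps each answer focused
    return {q: ask(shot, q) for q in questions}

# Example with stand-in callables (no GUI or Ollama needed):
report = run_debug_workflow(
    launch_app=lambda: None,
    take_screenshot=lambda: "screenshot.png",
    ask=lambda path, q: f"analyzed {path}",
    questions=["Are all labels readable?"],
    settle_seconds=0,
)
print(report)
```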

### 4. Vision Test Scripts

- `simple_vision_test.py` - Basic screenshot and analysis
- `manual_vision_test.py` - Manual testing workflow
- `full_vision_analysis.py` - Comprehensive analysis

## :open_book: Usage Examples

### Basic Vision Analysis

```python
from ollama_vision_setup import OllamaVision

# Setup AI vision
vision = OllamaVision()
vision.setup()

# Take screenshot (your preferred method)
# screenshot_path = take_screenshot()

# Analyze GUI
response = vision.ask_vision("gui_screenshot.png",
    "What widgets and UI elements are visible in this GUI?")

print(f"AI sees: {response}")
```

### GUI Debugging Workflow

```python
# 1. Run your MojoGUI application
# mojo advanced_widgets_demo.mojo &

# 2. Take screenshot after GUI stabilizes
import time
time.sleep(3)  # Let GUI render

# 3. Analyze with specific questions
questions = [
    "Are all text labels clearly readable?",
    "Do you see any rendering issues or visual bugs?",
    "What colors are used in the interface?",
    "Are the buttons and panels properly aligned?"
]

for question in questions:
    answer = vision.ask_vision("screenshot.png", question)
    print(f"Q: {question}")
    print(f"A: {answer}\n")
```

### Professional GUI Verification

```python
# Verify specific widgets are working
verification_questions = [
    "Do you see a docking panel system with left, right, and bottom panels?",
    "Are there accordion sections that can expand and collapse?",
    "Is there a toolbar with buttons like New, Open, Save, Bold, Italic?",
    "Do you see a floating panel that can be dragged?",
    "Are all text elements using professional TTF fonts?"
]

for question in verification_questions:
    response = vision.ask_vision("screenshot.png", question)
    if "yes" in response.lower():
        print(f"✅ PASS: {question}")
    else:
        print(f"❌ ISSUE: {question}")
        print(f"   AI Response: {response}")
```
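
A plain `"yes" in response.lower()` check can misfire: it matches inside words like "yesterday", and an answer such as "No, but yes to the panels" would pass. A slightly more careful parser, sketched here as a hypothetical helper, matches only whole words and lets the first verdict win:

```python
import re

def parse_yes_no(response: str) -> bool:
    """Return True if the AI's answer reads as an affirmative.

    Matches 'yes'/'no' as whole words only, and treats the first
    occurrence as the verdict, so 'No, although...' is not a pass.
    """
    words = re.findall(r"\b(yes|no)\b", response.lower())
    return bool(words) and words[0] == "yes"

print(parse_yes_no("Yes, the toolbar is visible."))    # affirmative
print(parse_yes_no("No, but yes to the panels."))      # first verdict wins
print(parse_yes_no("Yesterday's build looked fine."))  # no whole-word match
```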

## :books: API Reference

### `OllamaVision` Class

```python
class OllamaVision:
    def __init__(self, model_name='llava', base_url='http://localhost:11434')
    def setup() -> bool
    def ask_vision(image_path: str, question: str) -> str
    def check_ollama_running() -> bool
    def list_models() -> list
    def download_model(model_name: str) -> bool
    def test_vision() -> bool
```

### Core Functions

```python
# User-provided simple pattern
def ask_local_vision(image_path: str, question: str) -> str

# Screenshot functions
def take_screenshot() -> str  # Returns path to screenshot
def take_screenshot_pyautogui() -> str
def take_screenshot_pil() -> str
```
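
The screenshot helpers above are listed as signatures only. One plausible shape for them (an assumption for illustration, not the project's actual code) is a fallback chain that tries each available backend in order:

```python
def take_screenshot_with(backends, out_path="screenshot.png"):
    """Try each (name, capture_fn) pair until one succeeds.

    capture_fn(out_path) should save an image, and may raise on failure
    (library not installed, no display, etc.).
    """
    for name, capture in backends:
        try:
            capture(out_path)
            return out_path
        except Exception as exc:
            print(f"{name} failed: {exc}")
    raise RuntimeError("No screenshot backend succeeded")

def _pyautogui_capture(path):
    import pyautogui              # may not be installed
    pyautogui.screenshot().save(path)

def _pil_capture(path):
    from PIL import ImageGrab     # X11/Windows/macOS only
    ImageGrab.grab().save(path)

# The real backends would be tried in this order:
# take_screenshot_with([("pyautogui", _pyautogui_capture), ("PIL", _pil_capture)])
```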

### Common Questions for GUI Analysis

```python
# Layout and structure
"What widgets and UI elements are visible?"
"Describe the overall layout and organization."
"Are there any panels, toolbars, or menus?"

# Visual quality
"Are all text elements clearly readable?"
"What colors are used in the interface?"
"Do you see any rendering issues or visual bugs?"

# Specific widget verification
"Do you see buttons labeled New, Open, Save?"
"Are there any accordion sections or collapsible panels?"
"Is there a docking system with moveable panels?"

# Professional assessment
"Does this look like a professional desktop application?"
"What desktop application does this interface remind you of?"
"Is the visual design modern and consistent?"
```

## :wrench: Advanced Usage

### Custom Screenshot Integration

```python
def custom_screenshot_analysis():
    # Your custom screenshot method
    screenshot_path = my_screenshot_function()

    # Analyze with AI
    vision = OllamaVision()

    # Multiple analysis passes
    layout_feedback = vision.ask_vision(screenshot_path,
        "Analyze the layout and organization of this GUI interface.")

    color_feedback = vision.ask_vision(screenshot_path,
        "Evaluate the color scheme and visual consistency.")

    usability_feedback = vision.ask_vision(screenshot_path,
        "From a usability perspective, how intuitive is this interface?")

    return {
        'layout': layout_feedback,
        'colors': color_feedback,
        'usability': usability_feedback
    }
```

### Automated Testing Integration

```python
def automated_gui_test():
    """Run automated GUI test with AI verification"""

    # Start your GUI application
    start_gui_application()

    # Wait for rendering
    time.sleep(3)

    # Take screenshot
    screenshot = take_screenshot()

    # Define test criteria
    test_cases = [
        ("Widget Visibility", "Are all expected widgets visible and properly rendered?"),
        ("Text Readability", "Is all text clear and readable?"),
        ("Layout Quality", "Is the layout professional and well-organized?"),
        ("Color Scheme", "Are colors consistent and appropriate?"),
        ("Interactive Elements", "Do buttons and controls look clickable and functional?")
    ]

    # Run AI analysis for each test case
    results = {}
    vision = OllamaVision()

    for test_name, question in test_cases:
        response = vision.ask_vision(screenshot, question)
        results[test_name] = {
            'question': question,
            'ai_response': response,
            'passed': 'yes' in response.lower() and 'no' not in response.lower()
        }

    return results
```

### Integration with Existing Test Frameworks

```python
import unittest

class AIVisionGUITests(unittest.TestCase):
    def setUp(self):
        self.vision = OllamaVision()
        self.vision.setup()

    def test_widget_rendering(self):
        """Test that all widgets render correctly"""
        screenshot = self.take_test_screenshot()
        response = self.vision.ask_vision(screenshot,
            "Are all GUI widgets visible and properly rendered?")

        self.assertIn("yes", response.lower())
        self.assertNotIn("missing", response.lower())

    def test_professional_appearance(self):
        """Test that GUI looks professional"""
        screenshot = self.take_test_screenshot()
        response = self.vision.ask_vision(screenshot,
            "Does this interface look professional and modern?")

        self.assertIn("professional", response.lower())
```

## :hammer_and_wrench: Troubleshooting

### Common Issues and Solutions

#### 1. Ollama Not Running

```bash
# Error: Connection refused
# Solution: Start Ollama server
ollama serve
```

#### 2. LLaVA Model Not Found

```bash
# Error: Model not found
# Solution: Download model
ollama pull llava
```

#### 3. Screenshot Failed

```bash
# Error: No screenshot library
# Solution: Install dependencies
pip install pyautogui pillow

# For Linux, may also need:
sudo apt install gnome-screenshot
```

#### 4. Image Analysis Failed

```python
# Check image file exists and is readable
import os
if not os.path.exists(image_path):
    print(f"Image not found: {image_path}")

# Check image format
from PIL import Image
try:
    img = Image.open(image_path)
    print(f"Image: {img.size}, {img.format}")
except Exception as e:
    print(f"Invalid image: {e}")
```

#### 5. API Connection Issues

```python
# Test Ollama connection
import requests
try:
    response = requests.get('http://localhost:11434/api/tags')
    print(f"Ollama status: {response.status_code}")
except Exception as e:
    print(f"Connection failed: {e}")
```

### Debugging AI Vision

```python
def debug_vision_system():
    """Debug the AI vision system step by step"""

    print("🔍 Debugging AI Vision System")

    # 1. Check Ollama
    vision = OllamaVision()
    if vision.check_ollama_running():
        print("✅ Ollama is running")
    else:
        print("❌ Ollama not running - start with: ollama serve")
        return

    # 2. Check model
    if vision.model_exists():
        print("✅ LLaVA model available")
    else:
        print("❌ LLaVA model missing - download with: ollama pull llava")
        return

    # 3. Test vision
    if vision.test_vision():
        print("✅ Vision system working")
    else:
        print("❌ Vision test failed")
        return

    print("🎯 AI Vision system is fully operational!")
```

### Performance Optimization

```python
# Optimize screenshot size for faster analysis
def optimize_screenshot(image_path, max_size=1024):
    from PIL import Image

    img = Image.open(image_path)
    if max(img.size) > max_size:
        # Resize while maintaining aspect ratio
        img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)
        optimized_path = image_path.replace('.png', '_optimized.png')
        img.save(optimized_path)
        return optimized_path
    return image_path

# Use optimized image for faster AI analysis
screenshot = take_screenshot()
optimized_screenshot = optimize_screenshot(screenshot)
response = vision.ask_vision(optimized_screenshot, question)
```

## :bullseye: Best Practices

### 1. Effective Questions

- Be specific about what you want to know
- Ask one question at a time for clarity
- Use descriptive language the AI can understand

### 2. Screenshot Quality

- Ensure the GUI is fully rendered before the screenshot
- Use high-contrast colors for better AI recognition
- Avoid overlapping windows or visual clutter

### 3. Error Handling

- Always check if the screenshot was successful
- Handle API timeouts gracefully
- Validate AI responses for consistency
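
The error-handling advice can be sketched by hardening the basic `ask_local_vision` pattern with a timeout and a graceful fallback. The wrapper name is hypothetical; the endpoint is the same local Ollama URL used throughout:

```python
import base64
import requests

def ask_local_vision_safe(image_path, question,
                          url='http://localhost:11434/api/generate',
                          timeout=60):
    """Like ask_local_vision, but returns None instead of raising."""
    try:
        with open(image_path, 'rb') as f:
            image_data = base64.b64encode(f.read()).decode('utf-8')
    except OSError as exc:
        print(f"Could not read screenshot: {exc}")
        return None

    try:
        response = requests.post(url, timeout=timeout, json={
            'model': 'llava',
            'prompt': question,
            'images': [image_data],
        })
        response.raise_for_status()   # surface HTTP errors as exceptions
        return response.json().get('response')
    except requests.RequestException as exc:
        print(f"Ollama request failed: {exc}")
        return None
```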

### 4. Performance

- Resize large screenshots for faster analysis
- Cache frequently used model responses
- Use appropriate timeout values

## :rocket: Future Enhancements

Potential improvements to the AI vision system:

1. **Multi-Model Support** - Support for other vision models beyond LLaVA
2. **Automated Testing** - Integration with CI/CD pipelines
3. **Visual Regression Testing** - Compare screenshots over time
4. **Performance Metrics** - Measure GUI rendering performance
5. **Accessibility Analysis** - AI-powered accessibility auditing

## :page_facing_up: Files in the AI Vision System

```
📁 AI Vision Files:
├── 🤖 ollama_vision_setup.py        # Main setup and interface
├── 👁️ ai_vision_debug.py           # Complete debug workflow
├── 🔧 ai_vision_integration.py     # User-provided pattern
├── 📸 simple_ai_vision.py          # Simple screenshot analysis
├── 🧪 vision_debug.py              # Vision-assisted debugging
├── 📋 manual_vision_test.py        # Manual testing workflow
├── 🎯 full_vision_analysis.py      # Comprehensive analysis
├── 📊 ai_vision_analysis.py        # Advanced widget analysis
├── 📚 AI_VISION_DOCUMENTATION.md   # This documentation
└── 🎮 vision_test_demo.mojo        # GUI test for vision analysis
```

## :tada: Conclusion

The AI Vision system represents a breakthrough in GUI debugging and testing. By leveraging the power of modern AI vision models like LLaVA, developers can now get human-like feedback about their graphical interfaces, identifying issues that traditional debugging might miss.

Key benefits:

- :magnifying_glass_tilted_left: Human-like analysis of GUI appearance and functionality
- :rocket: Automated visual testing capabilities
- :bullseye: Natural language feedback about rendering issues
- :hammer_and_wrench: Easy integration with existing development workflows
- :chart_increasing: Scalable testing for complex GUI applications

Start using AI Vision today to enhance your MojoGUI development process!


Per Rule 6, please disclose if some or all of the code is AI generated.


Done, and that's why I did it: to avoid repetitive tasks that are best left to machines.

What's the question here? Please note that posts of this sort, which have no relation to MAX and Mojo, will be closed.

Then I am sorry, I thought this would help. I will delete it.