
== <span style="color: #FFFFFF;">Applying</span> ==
'''Visual question answering with a vision-language model (Qwen2-VL):'''

<syntaxhighlight lang="python">
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from PIL import Image

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct",
    torch_dtype="auto",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# Load a local image ("chart.png" is a placeholder path) and normalize it to RGB
image = Image.open("chart.png").convert("RGB")

# Build multimodal conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What trend does this chart show? "
                                      "Which year had the highest value?"}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

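output_ids = model.generate(**inputs, max_new_tokens=512)

# generate() returns the prompt tokens followed by the completion;
# slice off the prompt so only the model's answer is decoded
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)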
</syntaxhighlight>

; Multimodal task → model mapping
: '''Image + text understanding''' → GPT-4o, Gemini 1.5, Claude 3, LLaVA-1.6
: '''Text-to-image generation''' → DALL-E 3, Stable Diffusion XL, Flux.1, Midjourney
: '''Speech recognition (ASR)''' → Whisper (OpenAI), SeamlessM4T
: '''Text-to-speech (TTS)''' → ElevenLabs, Bark, Kokoro
: '''Video understanding''' → Gemini 1.5 Pro, Video-LLaMA, LLaVA-Video
: '''Cross-modal retrieval''' → CLIP, SigLIP, OpenCLIP
: '''Document understanding (OCR + reasoning)''' → Qwen2-VL, GOT-OCR, Donut
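
'''Cross-modal retrieval, sketched with placeholder embeddings:'''

The cross-modal retrieval entry above relies on encoders such as CLIP mapping text and images into a shared vector space, where a text query can be ranked against images by similarity. The following is a minimal sketch of that ranking step only, assuming the embeddings have already been computed. The helper names (<code>cosine_similarity</code>, <code>rank_images</code>), the file names, and the three-dimensional vectors are hypothetical stand-ins for illustration; real encoders produce vectors with hundreds of dimensions.

<syntaxhighlight lang="python">
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_embedding, image_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scores = {
        name: cosine_similarity(text_embedding, emb)
        for name, emb in image_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy vectors standing in for image-encoder outputs (hypothetical values)
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "chart.png": [0.1, 0.9, 0.2],
    "street.jpg": [0.0, 0.2, 0.9],
}

# Toy vector standing in for the text-encoder output of a query like "a bar chart"
query_embedding = [0.2, 0.8, 0.1]

ranking = rank_images(query_embedding, image_embeddings)
best_match = ranking[0][0]
print(best_match)
</syntaxhighlight>

In practice the toy vectors would be replaced by the outputs of a model's text and image encoders, and a large collection would typically use a vector index rather than a full sort.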
</div>

<div style="background-color: #8B4500; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">