
== <span style="color: #FFFFFF;">Creating</span> ==
Designing a multimodal AI application:

'''1. Modality requirements'''
<syntaxhighlight lang="text">
What modalities are inputs? (text, images, audio, video, documents?)
    ↓
What modalities are outputs? (text, images, audio, structured data?)
    ↓
Are modalities synchronous (video + audio) or asynchronous (user uploads image, asks question)?
    ↓
What are latency requirements? (real-time: <100ms; interactive: <3s; batch: minutes)
</syntaxhighlight>
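The checklist above can be captured as a small record that the rest of the design refers back to. The following is a minimal sketch, not a prescribed schema: the names <code>ModalityRequirements</code> and <code>latency_tier</code> are hypothetical, and treating every budget of 3 seconds or more as "batch" is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ModalityRequirements:
    """Answers to the four design questions (hypothetical structure)."""
    inputs: FrozenSet[str]      # e.g. {"text", "image"}
    outputs: FrozenSet[str]     # e.g. {"text"}
    synchronous: bool           # True for aligned streams such as video + audio
    latency_budget_s: float     # end-to-end budget in seconds


def latency_tier(budget_s: float) -> str:
    """Map a latency budget to the tiers named in the checklist.

    Assumption: anything at or above 3 s is treated as batch.
    """
    if budget_s < 0.1:
        return "real-time"
    if budget_s < 3.0:
        return "interactive"
    return "batch"


# Example: a user uploads an image and then asks a question about it.
reqs = ModalityRequirements(
    inputs=frozenset({"image", "text"}),
    outputs=frozenset({"text"}),
    synchronous=False,
    latency_budget_s=2.0,
)
print(latency_tier(reqs.latency_budget_s))  # interactive
```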

'''2. Architecture decision for a vision-language model (VLM) application'''
<syntaxhighlight lang="text">
Input: Image(s) + Text Query
    ↓
[Vision Encoder: SigLIP or CLIP → visual tokens]
    ↓
[Projection layer: map visual tokens to LLM embedding dimension]
    ↓
[Interleave visual tokens with text tokens]
    ↓
[Language model forward pass: self-attention over the interleaved visual and text tokens]
    ↓
Text Response
</syntaxhighlight>
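The projection and interleaving steps in the pipeline above can be illustrated with plain Python lists. This is a toy sketch under stated assumptions: the dimensions are tiny, the weights are random, and the helper names <code>project</code> and <code>interleave</code> are hypothetical. A real system would take visual tokens from an encoder such as SigLIP or CLIP and use a learned projection.

```python
import random
from typing import List

Vector = List[float]


def project(visual_tokens: List[Vector], weight: List[Vector]) -> List[Vector]:
    """Linear projection of each visual token from d_vision to d_model.

    `weight` has shape [d_vision][d_model].
    """
    d_model = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(d_model)]
        for tok in visual_tokens
    ]


def interleave(text_before: List[Vector], visual: List[Vector],
               text_after: List[Vector]) -> List[Vector]:
    """Place the projected visual tokens between two runs of text embeddings."""
    return text_before + visual + text_after


rng = random.Random(0)
D_VISION, D_MODEL, N_PATCHES = 4, 6, 3  # toy sizes, chosen for illustration

# Stand-ins for encoder output and a learned projection matrix.
visual_tokens = [[rng.random() for _ in range(D_VISION)] for _ in range(N_PATCHES)]
weight = [[rng.random() for _ in range(D_MODEL)] for _ in range(D_VISION)]

# Stand-ins for text embeddings already in the language model's dimension.
text_before = [[0.0] * D_MODEL for _ in range(2)]
text_after = [[0.0] * D_MODEL for _ in range(5)]

sequence = interleave(text_before, project(visual_tokens, weight), text_after)
print(len(sequence), len(sequence[0]))  # 10 6
```

After projection every element of the sequence has the language model's embedding dimension, so the model can process visual and text positions in one forward pass.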

'''3. Production multimodal system'''
* Pre-process images server-side: resize and normalize them, and cache the encoded visual features so that repeated images are not re-encoded
* Use streaming (token-by-token) generation to reduce perceived latency
* Route text-only queries to a smaller, cheaper text model
* For text-heavy documents, run OCR first and pass the extracted text to the LLM; this is faster and cheaper than sending page images
* Rate-limit image inputs: an image can cost roughly 5–10× as many tokens as the equivalent text, depending on resolution and model
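Three of the practices above (feature caching, query routing, and limiting image inputs) can be sketched together. All names here (<code>FeatureCache</code>, <code>route</code>, <code>ImageBudget</code>, and the model labels <code>"text-small"</code> and <code>"vlm"</code>) are hypothetical, the encoder is a stand-in function, and the limiter deliberately omits window resets to stay short.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class FeatureCache:
    """Cache encoded visual features, keyed by a hash of the image bytes."""

    def __init__(self, encode: Callable[[bytes], List[float]]) -> None:
        self._encode = encode
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, image_bytes: bytes) -> List[float]:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._encode(image_bytes)
        return self._store[key]


def route(query: str, image: Optional[bytes]) -> str:
    """Send text-only queries to the cheaper model (labels are placeholders)."""
    return "text-small" if image is None else "vlm"


class ImageBudget:
    """Fixed per-user allowance of image requests (no time-window reset)."""

    def __init__(self, max_images: int) -> None:
        self.max_images = max_images
        self._used: Dict[str, int] = {}

    def allow(self, user_id: str) -> bool:
        used = self._used.get(user_id, 0)
        if used >= self.max_images:
            return False
        self._used[user_id] = used + 1
        return True


def fake_encode(image_bytes: bytes) -> List[float]:
    """Stand-in for a real vision encoder."""
    return [float(len(image_bytes))]


cache = FeatureCache(fake_encode)
cache.get(b"same image")
cache.get(b"same image")      # served from the cache
print(cache.misses)           # 1
print(route("What is shown here?", b"same image"))  # vlm
```

Hashing the raw bytes only catches exact duplicates; whether near-duplicate images should share a cache entry is a separate design decision.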

'''4. Domain-specific fine-tuning'''
* Collect domain image-text pairs (e.g. medical images with radiology reports, or product images with descriptions)
* Fine-tune the vision encoder and adapter on the domain data with a contrastive or VQA loss
* Apply LoRA to the language model component to adapt its instruction style
* Evaluate on a domain-specific held-out set before deployment
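To show why LoRA is a cheap way to adapt the language model component, the sketch below computes the low-rank update W + (alpha / r) · B·A on toy matrices and compares trainable parameter counts. It uses plain Python lists rather than a training framework; the function names and the example sizes (4096 × 4096 layer, rank 8) are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape d_out x d_in
    a: trainable, shape r x d_in
    b: trainable, shape d_out x r (commonly initialised to zeros)
    """
    rank = len(a)
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return rank * (d_in + d_out)


# With B at zero, the adapted layer starts out identical to the frozen one.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]       # rank 1
b_zero = [[0.0], [0.0]]
print(lora_effective_weight(w, a, b_zero, alpha=2.0) == w)  # True

# Parameter counts for one hypothetical 4096 x 4096 projection at rank 8.
print(lora_trainable_params(4096, 4096, 8))  # 65536
print(4096 * 4096)                           # 16777216
```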

[[Category:Artificial Intelligence]]
[[Category:Deep Learning]]
[[Category:Multimodal AI]]
</div>