
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Modality''' — A type or channel of information: text, image, audio, video, code, tabular data, sensor data.
* '''Multimodal model''' — A model that processes or generates data from two or more modalities.
* '''Visual Question Answering (VQA)''' — Answering natural language questions about images.
* '''Image captioning''' — Generating a natural language description of an image.
* '''Text-to-image generation''' — Generating an image from a text prompt. Examples: Stable Diffusion, DALL-E 3.
* '''Optical Character Recognition (OCR)''' — Detecting and extracting text from images.
* '''Speech-to-text (automatic speech recognition, ASR)''' — Converting spoken audio to written text. Example: Whisper.
* '''Text-to-speech (TTS)''' — Synthesizing natural-sounding speech from text.
* '''Video understanding''' — Analyzing video content for classification, captioning, action recognition, or temporal reasoning.
* '''CLIP (Contrastive Language-Image Pre-training)''' — A model by OpenAI trained to align text and image representations in a shared embedding space.
* '''Vision-Language Model (VLM)''' — A model combining a vision encoder and a language model. Examples: LLaVA, InternVL, Qwen-VL.
* '''Cross-modal attention''' — An attention mechanism that lets one modality attend to representations from another modality.
* '''Alignment (multimodal)''' — Training representations from different modalities so that semantically similar concepts (for example, a photo of a dog and the phrase "a dog") end up close together in a shared embedding space.
* '''Gemini''' — Google DeepMind's natively multimodal model series, trained on interleaved text, image, audio, and video from the start.
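
The CLIP and alignment entries above can be illustrated with a small sketch. It is a toy illustration only, not CLIP's actual implementation: the three-dimensional embeddings are hand-written, hypothetical values standing in for encoder outputs, the names <code>cosine_similarity</code>, <code>best_caption</code>, and <code>contrastive_loss</code> are invented for this example, and the temperature is fixed at an assumed 0.1 rather than learned. It shows how closeness in a shared embedding space can be measured, and how a symmetric contrastive loss is low when matching image-text pairs are close and high when they are not.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real system would get these from
# a trained image encoder and a trained text encoder.
image_embeddings = {
    "dog_photo": [0.9, 0.1, 0.0],
    "car_photo": [0.0, 0.2, 0.9],
}
text_embeddings = {
    "a photo of a dog": [0.8, 0.2, 0.1],
    "a photo of a car": [0.1, 0.1, 0.95],
}

def best_caption(image_name):
    """Return the caption whose embedding is most similar to the image's."""
    image_vec = image_embeddings[image_name]
    return max(
        text_embeddings,
        key=lambda caption: cosine_similarity(image_vec, text_embeddings[caption]),
    )

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    image_vecs[k] and text_vecs[k] are assumed to be a matching pair.
    The fixed temperature is an assumption made for this sketch.
    """
    n = len(image_vecs)
    logits = [
        [cosine_similarity(img, txt) / temperature for txt in text_vecs]
        for img in image_vecs
    ]

    def cross_entropy(row, target):
        # Negative log-softmax of the target entry.
        return math.log(sum(math.exp(v) for v in row)) - row[target]

    # Image-to-text direction uses rows; text-to-image uses columns.
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    columns = [list(col) for col in zip(*logits)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2

images = list(image_embeddings.values())
texts = list(text_embeddings.values())
aligned_loss = contrastive_loss(images, texts)
# Swapping the captions breaks the pairing, so the loss should rise.
misaligned_loss = contrastive_loss(images, list(reversed(texts)))

print(best_caption("dog_photo"))
print(aligned_loss < misaligned_loss)
```

The same similarity score supports retrieval in either direction (finding a caption for an image or an image for a caption), which is why a shared embedding space is useful for tasks such as zero-shot classification.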
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">