AI Video
How to read this page: This article maps the topic from beginner to expert across six levels: Remembering, Understanding, Applying, Analyzing, Evaluating, and Creating. Scan the headings to see the full scope, then read from wherever your knowledge starts to feel uncertain.
AI for video understanding applies deep learning to extract semantic meaning from video sequences — going beyond image classification to understand motion, actions, temporal relationships, and narrative structure. Video is the dominant format of online content (YouTube processes 500 hours of uploads per minute) and a rich sensor modality for robots, surveillance, sports analysis, and medical diagnosis. Key tasks include action recognition (what is happening?), temporal grounding (when does an action start/end?), video captioning (describe what's happening), video question answering, and video generation. Understanding video is fundamentally harder than understanding images — it adds the temporal dimension and the challenge of modeling motion and causality.
Remembering
- Action recognition — Classifying what action or activity is being performed in a video clip.
- Temporal modeling — Modeling how video content changes over time; the core challenge of video AI.
- Two-Stream Network — Early influential architecture combining spatial (RGB frames) and temporal (optical flow) streams for action recognition.
- 3D CNN — Convolution applied across both spatial and temporal dimensions; captures short-range motion patterns.
- I3D (Inflated 3D ConvNet) — Inflates 2D ImageNet-trained weights to 3D; seminal video understanding architecture.
- Video Transformer (ViViT, TimeSformer) — Transformer architectures for video; apply self-attention over space and time.
- Optical flow — A dense field of per-pixel motion vectors between consecutive frames; the classical representation of video motion (see the sketch after this list).
- Temporal grounding — Locating the start and end time of a described event in a video.
- Video captioning — Generating natural language descriptions of video content.
- Video QA — Answering natural language questions about video content.
- Kinetics dataset — A large-scale action recognition benchmark with 400-700 action classes and 240K–650K video clips.
- ActivityNet — A large benchmark for dense video captioning and activity recognition.
- SlowFast Networks — Two-pathway network: a slow pathway (low frame rate, high channel capacity) plus a fast pathway (high frame rate, lightweight); models different temporal granularities.
- Video diffusion models — Applying diffusion model framework to generate realistic video sequences; Sora, Runway, Pika.
- Long-form video understanding — Reasoning about events across minutes or hours of video; challenging for models with limited temporal context.
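A quick look at what optical flow actually computes, using OpenCV's classical (non-learned) Farneback estimator. This is a minimal sketch; the video filename is a placeholder. <syntaxhighlight lang="python">
import cv2

# Dense optical flow between two consecutive frames (Farneback method).
cap = cv2.VideoCapture("soccer_dribbling.mp4")  # placeholder path
ok1, frame1 = cap.read()
ok2, frame2 = cap.read()
cap.release()

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# flow[y, x] = (dx, dy): per-pixel motion vector between the two frames.
flow = cv2.calcOpticalFlowFarneback(
    prev_gray, next_gray, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0,
)
magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print(f"mean motion magnitude: {magnitude.mean():.2f} px")
</syntaxhighlight>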
Understanding
Video understanding adds two fundamental challenges over images:
- Temporal modeling — understanding how the scene changes over time, capturing motion, causality, and narrative.
- Computational cost — video is orders of magnitude more data-dense than images; processing every frame with image-level models is prohibitive.
The evolution of video architectures: Two-Stream networks (2014) processed optical flow separately to explicitly model motion. 3D CNNs (C3D, I3D) applied convolution in time and space simultaneously. SlowFast Networks (2019) introduced dual pathways at different frame rates. Video Transformers (TimeSformer, ViViT, 2021) apply attention across space and time — but naive attention over all space-time tokens costs O((T·H·W)²) compute, requiring factorized approaches (see the sketch below).
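To see why, count the tokens. A back-of-envelope sketch assuming a ViT-style tokenizer (224×224 frames, 16×16 patches, an 8-frame clip; these numbers are illustrative, not tied to any specific model): <syntaxhighlight lang="python">
# Back-of-envelope cost of full vs. factorized space-time attention.
frames = 8
patches_per_frame = (224 // 16) ** 2       # 14 * 14 = 196 spatial tokens per frame
tokens = frames * patches_per_frame        # 1,568 space-time tokens

full_attention_pairs = tokens ** 2         # every token attends to every token
factorized_pairs = (tokens * patches_per_frame   # spatial attention within each frame
                    + tokens * frames)           # temporal attention across frames
print(full_attention_pairs, factorized_pairs)    # 2458624 vs. 319872 (~7.7x fewer pairs)
</syntaxhighlight>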
Efficient video transformers: TimeSformer factorizes attention into separate temporal and spatial steps — in its divided space-time scheme, each block first applies temporal attention across frames at each spatial position, then spatial attention within each frame. ViViT uses a similar factorization. Video Swin applies 3D shifted-window attention. These reduce the quadratic cost of full space-time attention to manageable complexity.
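A minimal PyTorch sketch of divided space-time attention in the spirit of TimeSformer. This illustrates only the factorization, not the official implementation; layer norms, MLPs, and the classification token are omitted. <syntaxhighlight lang="python">
import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Illustrative TimeSformer-style block: temporal attention, then spatial.

    Input x has shape (batch, frames, patches, dim).
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, p, d = x.shape
        # Temporal attention: each spatial position attends across the T frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)          # (B*P, T, D)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        x = xt.reshape(b, p, t, d).permute(0, 2, 1, 3)
        # Spatial attention: each frame attends over its P patches.
        xs = x.reshape(b * t, p, d)                              # (B*T, P, D)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        return xs.reshape(b, t, p, d)

x = torch.randn(2, 8, 196, 768)   # 8 frames, 14x14 patches, ViT-B width
print(DividedSpaceTimeAttention(768, 12)(x).shape)  # torch.Size([2, 8, 196, 768])
</syntaxhighlight>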
Video-language models: CLIP's vision-language pre-training extends to video by representing a video as a sequence of frame embeddings. VideoCLIP, CLIP4Clip, and InternVideo2 learn joint video-text representations, enabling zero-shot action recognition and text-video retrieval. Video-LLMs (Video-LLaMA, Video-ChatGPT, Qwen2-VL) enable conversational understanding of video content.
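The simplest CLIP4Clip variant scores a video against text by mean-pooling per-frame CLIP embeddings. A sketch using Hugging Face's CLIP; the random frames are a stand-in for frames actually sampled from a clip (e.g. with the helper shown in the Applying section below). <syntaxhighlight lang="python">
import numpy as np
import torch
from transformers import CLIPModel, CLIPProcessor

# Mean-pool per-frame CLIP embeddings into one video embedding.
video_frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
                for _ in range(8)]  # stand-in for real sampled frames
queries = ["a player dribbling a soccer ball", "a chef cooking pasta"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

with torch.no_grad():
    frame_emb = model.get_image_features(
        **proc(images=video_frames, return_tensors="pt"))          # (T, D)
    video_emb = frame_emb.mean(dim=0, keepdim=True)                # temporal mean pool
    text_emb = model.get_text_features(
        **proc(text=queries, return_tensors="pt", padding=True))   # (N, D)

video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print((video_emb @ text_emb.T).softmax(dim=-1))  # clip-to-query similarity
</syntaxhighlight>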
Video generation: Stable Diffusion extends to video through temporal attention (AnimateDiff); Sora (OpenAI) uses a DiT (Diffusion Transformer) over video latent tokens, generating remarkably coherent videos from text prompts. Runway Gen-3, Pika, and Kling are commercial products.
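A minimal text-to-video sketch with AnimateDiff via the diffusers library; the model and adapter IDs follow diffusers' documented examples and may change. <syntaxhighlight lang="python">
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Pair a Stable Diffusion 1.5 base with AnimateDiff's motion adapter.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
)
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)
pipe.to("cuda")

result = pipe(
    prompt="a corgi running on a beach, golden hour, film grain",
    num_frames=16, num_inference_steps=25, guidance_scale=7.5,
)
export_to_gif(result.frames[0], "corgi.gif")  # frames[0] is a list of PIL images
</syntaxhighlight>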
Applying
Video action recognition with TimeSformer: <syntaxhighlight lang="python">
from transformers import AutoImageProcessor, TimesformerForVideoClassification
import torch
from decord import VideoReader, cpu
import numpy as np

# Load TimeSformer (trained on Kinetics-400 action recognition)
processor = AutoImageProcessor.from_pretrained("facebook/timesformer-base-finetuned-k400")
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
)

def load_video_frames(video_path: str, num_frames: int = 8) -> list:
    """Sample num_frames evenly spaced frames from a video."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total_frames = len(vr)
    indices = np.linspace(0, total_frames - 1, num_frames, dtype=int)
    frames = vr.get_batch(indices).asnumpy()  # (T, H, W, C)
    return [frame for frame in frames]

# Classify the action in a video clip
video_frames = load_video_frames("soccer_dribbling.mp4", num_frames=8)
inputs = processor(images=video_frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
predicted_class = model.config.id2label[logits.argmax().item()]
confidence = torch.softmax(logits, dim=1).max().item()
print(f"Action: {predicted_class} ({confidence:.1%} confidence)")

# Zero-shot baseline with image CLIP on a single representative frame.
# Note: this ignores motion entirely; dedicated video-text models such as
# VideoCLIP or CLIP4Clip pool information across frames instead.
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

action_candidates = ["dribbling a soccer ball", "shooting a basketball",
                     "swimming", "cooking"]
frame = video_frames[4]  # single representative frame
inputs = clip_proc(text=action_candidates, images=frame,
                   return_tensors="pt", padding=True)
outputs = clip(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
for action, prob in zip(action_candidates, probs[0]):
    print(f"{action}: {prob:.2%}")
</syntaxhighlight>
Video AI tools and models:
- Action recognition → TimeSformer, SlowFast, Video Swin Transformer
- Video-language → InternVideo2, Video-LLaMA, Qwen2-VL, VideoChat
- Video retrieval → CLIP4Clip, X-CLIP, VideoCLIP
- Video generation → Sora (OpenAI), Runway Gen-3, Pika, Kling, AnimateDiff
- Video segmentation → SAM2 (Meta) — zero-shot video object segmentation
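A sketch of prompted video segmentation with SAM2, following the API of Meta's segment-anything-2 repository; the config path, checkpoint, frame directory, and click coordinates are placeholders. <syntaxhighlight lang="python">
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

# Build the video predictor from a config + checkpoint (placeholder paths).
predictor = build_sam2_video_predictor("sam2_hiera_l.yaml", "sam2_hiera_large.pt")

with torch.inference_mode():
    # The repo's demo loads a directory of extracted JPEG frames.
    state = predictor.init_state(video_path="./video_frames_dir")
    # One foreground click on the object in frame 0 (x, y in pixels, label 1).
    predictor.add_new_points_or_box(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[320, 240]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )
    # Propagate the mask through the rest of the video.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks = (mask_logits > 0.0).cpu()  # boolean mask per tracked object
</syntaxhighlight>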
Analyzing
Action recognition on Kinetics-400 (representative published results):
| Model | Kinetics-400 Top-1 | Params | Relative inference speed |
|---|---|---|---|
| I3D (2017) | 72.2% | 25M | Fast |
| SlowFast R101 (2019) | 79.8% | 54M | Moderate |
| TimeSformer-L (2021) | 80.7% | 121M | Moderate |
| Video Swin-L (2022) | 83.1% | 197M | Slow |
| InternVideo2 (2024) | 93.0% | 6B | Very slow |
Failure modes:
- Short clip bias — many models handle only 8–16-frame clips, missing longer temporal context.
- Optical flow dependency — two-stream models degrade when optical flow is unavailable or poor quality.
- Temporal reasoning failure — models identify objects correctly but miss temporal ordering and causality.
- Domain shift — between training data (Kinetics clips) and deployment domains (surveillance, medical, sports).
Evaluating
Video understanding evaluation:
- Action recognition: top-1 and top-5 accuracy on Kinetics-400/600, ActivityNet, Something-Something (tests temporal reasoning specifically).
- Temporal grounding: R@1 at IoU=0.5/0.7 on ActivityNet Captions (see the metric sketch after this list).
- Video QA: accuracy on NExT-QA, EgoSchema (long-form egocentric QA).
- Efficiency: GFLOPs per clip for deployment feasibility.
- Long-form: separate evaluation on long videos (>1 minute); most models degrade significantly.
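A minimal sketch of the temporal-grounding metric referenced above: IoU between predicted and ground-truth (start, end) segments, and R@1 at a given IoU threshold. The segment values are illustrative. <syntaxhighlight lang="python">
def temporal_iou(pred: tuple, gt: tuple) -> float:
    """IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions: list, ground_truths: list,
                iou_thresh: float = 0.5) -> float:
    """R@1: fraction of queries whose top-ranked segment meets the threshold."""
    hits = sum(temporal_iou(p, g) >= iou_thresh
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

preds = [(12.0, 18.5), (40.0, 44.0)]  # top-1 predicted segment per query
gts = [(11.0, 19.0), (50.0, 60.0)]
print(recall_at_1(preds, gts, 0.5))   # 0.5: first query hits, second misses
</syntaxhighlight>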
Creating
Building a video understanding system:
- Use case defines architecture: short clip action recognition → TimeSformer/SlowFast; long video QA → Video-LLM; zero-shot → CLIP4Clip.
- Data: fine-tune on domain-specific video if Kinetics domain differs significantly.
- Efficient inference: sample 8 frames per clip; use Video Swin-S or TimeSformer-B in production (a balance of accuracy and speed).
- SAM2 for video object tracking and segmentation with prompts.
- Edge deployment: quantize to INT8; use SlowFast-8x8-R50 (~25M params) for real-time on NVIDIA Jetson.
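As a first pass at INT8, PyTorch dynamic quantization can shrink a transformer checkpoint's Linear layers. A minimal sketch; real Jetson deployments typically export to TensorRT instead, which also handles conv-heavy models like SlowFast. <syntaxhighlight lang="python">
import os
import torch
from transformers import TimesformerForVideoClassification

# Dynamic INT8 quantization of TimeSformer's Linear layers (CPU inference).
model = TimesformerForVideoClassification.from_pretrained(
    "facebook/timesformer-base-finetuned-k400"
).eval()

quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def size_mb(m: torch.nn.Module, path: str = "/tmp/model.pt") -> float:
    """Serialized state-dict size as a rough proxy for on-disk footprint."""
    torch.save(m.state_dict(), path)
    return os.path.getsize(path) / 1e6

print(f"fp32: {size_mb(model):.0f} MB -> int8: {size_mb(quantized):.0f} MB")
</syntaxhighlight>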