== Understanding ==

Video understanding adds two fundamental challenges over images:

# '''Temporal modeling''': understanding how the scene changes over time, capturing motion, causality, and narrative.
# '''Computational cost''': video is orders of magnitude more data-dense than images; processing every frame with image-level models is prohibitive.

'''The evolution of video architectures''': Two-Stream networks (2014) processed optical flow in a separate pathway to model motion explicitly. 3D CNNs (C3D, I3D) applied convolution across space and time simultaneously. SlowFast networks (2019) introduced dual pathways operating at different frame rates. Video Transformers (TimeSformer, ViViT, 2021) apply attention across space and time, but naive attention over all space-time tokens costs O((T×H×W)²), which forces factorized approaches.

'''Efficient video transformers''': TimeSformer factorizes attention into two cheaper steps: spatial attention within each frame, then temporal attention across frames at each spatial position, reducing the cost from O((T×H×W)²) to O(T×(H×W)²) + O(H×W×T²). ViViT uses a similar factorization, and Video Swin applies 3D shifted-window attention. A code sketch of the divided attention pattern follows this section.

'''Video-language models''': CLIP-style vision-language pre-training extends to video by representing a video as a sequence of frame embeddings. VideoCLIP, CLIP4Clip, and InternVideo2 learn joint video-text representations, enabling zero-shot action recognition and text-video retrieval. Video-LLMs (VideoLLaMA, Video-ChatGPT, Qwen2-VL) enable conversational understanding of video content.

'''Video generation''': Stable Diffusion has been extended to video by inserting temporal attention layers (AnimateDiff); Sora (OpenAI) uses a DiT (Diffusion Transformer) over video latent tokens, generating remarkably coherent videos from text prompts. Runway Gen-3, Pika, and Kling are commercial products in this space.
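To make the factorization concrete, here is a minimal PyTorch sketch of divided space-time attention; the class and tensor names are illustrative, not TimeSformer's actual implementation. The savings are substantial: for a clip of T = 8 frames with N = 196 patches each, full attention compares 1568² ≈ 2.5M token pairs per head, while the divided form compares 8×196² + 196×8² ≈ 320K.

<syntaxhighlight lang="python">
# Sketch of TimeSformer-style divided space-time attention.
# Assumes PyTorch; names are illustrative, not the original code.
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Spatial attention within each frame, then temporal attention
    across frames at each spatial position."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, patch tokens N, channels D)
        B, T, N, D = x.shape

        # Step 1, spatial attention: each frame is its own sequence of
        # N tokens, so the score matrix is N x N per frame: O(T * N^2).
        xs = x.reshape(B * T, N, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]

        # Step 2, temporal attention: each spatial position attends
        # across the T frames: O(N * T^2).
        xt = xs.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]

        return xt.reshape(B, N, T, D).permute(0, 2, 1, 3)


# 2 clips, 8 frames, 14x14 = 196 patches, 512-dim tokens.
tokens = torch.randn(2, 8, 196, 512)
print(DividedSpaceTimeAttention(512)(tokens).shape)  # (2, 8, 196, 512)
</syntaxhighlight>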
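The frame-embedding approach to video-language models can be sketched as follows. The encoders below are hypothetical placeholders standing in for a pretrained CLIP image and text tower; mean pooling over time is the simplest temporal aggregator (the "meanP" variant studied in CLIP4Clip), with a small transformer over frames as a common alternative.

<syntaxhighlight lang="python">
# Sketch of zero-shot text-video retrieval via mean pooling of
# per-frame embeddings. frame_encoder / text_encoder are hypothetical
# stand-ins for CLIP towers, not a real checkpoint.
import torch
import torch.nn.functional as F

D = 512
frame_encoder = torch.nn.Linear(2048, D)  # placeholder image tower
text_encoder = torch.nn.Linear(300, D)    # placeholder text tower


def embed_video(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (T, 2048) pre-extracted per-frame features.
    per_frame = F.normalize(frame_encoder(frame_features), dim=-1)
    # Average over time, then re-normalize to unit length.
    return F.normalize(per_frame.mean(dim=0), dim=-1)


def retrieve(video_emb: torch.Tensor, text_feats: torch.Tensor) -> int:
    text_embs = F.normalize(text_encoder(text_feats), dim=-1)
    sims = text_embs @ video_emb   # cosine similarity per caption
    return int(sims.argmax())      # index of best-matching caption


video = embed_video(torch.randn(16, 2048))  # a 16-frame clip
captions = torch.randn(3, 300)              # 3 candidate captions
print(retrieve(video, captions))
</syntaxhighlight>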
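Finally, the AnimateDiff idea of turning an image diffusion model into a video model amounts to inserting attention layers that operate only along the frame axis, leaving the pretrained spatial layers untouched. A minimal sketch, with illustrative names rather than the actual AnimateDiff code: the output projection is zero-initialized so the module starts as an identity and the image model's behavior is preserved at the beginning of training.

<syntaxhighlight lang="python">
# Sketch of an AnimateDiff-style motion module: temporal self-attention
# over the frame axis, zero-initialized so it is an identity at start.
# Names are illustrative, not the actual AnimateDiff implementation.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        # Zero-init so the pretrained image model is unchanged at start.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, spatial positions N, channels D)
        B, T, N, D = x.shape
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # sequences over time
        h = self.norm(h)
        h = self.attn(h, h, h, need_weights=False)[0]
        h = self.proj(h).reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + h  # residual: exact identity at initialization


latents = torch.randn(1, 16, 64, 320)   # 16-frame latent video
out = TemporalAttention(320)(latents)
assert torch.allclose(out, latents)     # identity before training
</syntaxhighlight>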