
== <span style="color: #FFFFFF;">Remembering</span> ==
* '''Action recognition''' — Classifying what action or activity is being performed in a video clip.
* '''Temporal modeling''' — Modeling how video content changes over time; the core challenge of video AI.
* '''Two-Stream Network''' — Early influential architecture combining spatial (RGB frames) and temporal (optical flow) streams for action recognition.
* '''3D CNN''' — Convolution applied across both spatial and temporal dimensions; captures short-range motion patterns.
* '''I3D (Inflated 3D ConvNet)''' — Inflates 2D ImageNet-pretrained convolution filters into 3D by repeating them along the time axis; a seminal video understanding architecture.
* '''Video Transformer (ViViT, TimeSformer)''' — Transformer architectures for video; apply self-attention over space and time.
* '''Optical flow''' — A dense field of pixel motion vectors between consecutive frames; classical representation of video motion.
* '''Temporal grounding''' — Locating the start and end time of a described event in a video.
* '''Video captioning''' — Generating natural language descriptions of video content.
* '''Video QA''' — Answering natural language questions about video content.
* '''Kinetics dataset''' — A large-scale action recognition benchmark with 400–700 action classes and roughly 240K–650K video clips, depending on the release (Kinetics-400, -600, -700).
* '''ActivityNet''' — A large benchmark for dense video captioning and activity recognition.
* '''SlowFast Networks''' — Two-pathway network: a slow pathway (low frame rate, high channel capacity) captures spatial semantics, while a fast pathway (high frame rate, low channel capacity) captures fine-grained motion; together they model different temporal granularities.
* '''Video diffusion models''' — Diffusion models applied to generating realistic video sequences; examples include Sora, Runway, and Pika.
* '''Long-form video understanding''' — Reasoning about events across minutes or hours of video; challenging for models with limited temporal context.
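
The two-pathway idea behind SlowFast Networks can be illustrated with frame sampling alone. The sketch below is a minimal illustration, not the reference implementation: the function name <code>slowfast_indices</code> and the default values (a slow stride of 16 and a speed ratio <code>alpha</code> of 8) are assumptions chosen for the example. It shows only how the two pathways read the same clip at different frame rates, and leaves out the networks themselves.

```python
def slowfast_indices(num_frames, slow_stride=16, alpha=8):
    """Return (slow, fast) frame indices for one clip.

    slow_stride: gap between frames fed to the slow pathway (assumed value).
    alpha: how many times denser the fast pathway samples (assumed value).
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    # Slow pathway: sparse frames, intended for spatial semantics.
    slow = list(range(0, num_frames, slow_stride))
    # Fast pathway: dense frames, intended for fine-grained motion.
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


slow, fast = slowfast_indices(64)
print(slow)       # [0, 16, 32, 48]
print(len(fast))  # 32
```

With these assumed settings, a 64-frame clip yields 4 frames for the slow pathway and 32 for the fast pathway, so the fast pathway sees motion at eight times the temporal resolution.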
</div>

<div style="background-color: #006400; color: #FFFFFF; padding: 20px; border-radius: 8px; margin-bottom: 15px;">