== Understanding ==

Video understanding adds two fundamental challenges over images:

# '''Temporal modeling''': understanding how the scene changes over time, capturing motion, causality, and narrative.
# '''Computational cost''': video is orders of magnitude more data-dense than images; processing every frame with image-level models is prohibitive.

'''The evolution of video architectures''': Two-Stream networks (2014) processed optical flow in a separate pathway to model motion explicitly. 3D CNNs (C3D, I3D) applied convolution across space and time simultaneously. SlowFast networks (2019) introduced dual pathways operating at different frame rates. Video Transformers (TimeSformer, ViViT, 2021) apply attention across space and time, but naive attention over all space-time tokens costs O((T×H×W)²), which forces factorized approaches.

'''Efficient video transformers''': TimeSformer factorizes attention into two cheaper steps: spatial attention within each frame, then temporal attention across frames at each spatial position, reducing the cost from O((T×H×W)²) to O(T×(H×W)²) + O(H×W×T²). ViViT uses a similar factorization, and Video Swin applies 3D shifted-window attention. A code sketch of the divided attention pattern follows this section.

'''Video-language models''': CLIP-style vision-language pre-training extends to video by representing a video as a sequence of frame embeddings. VideoCLIP, CLIP4Clip, and InternVideo2 learn joint video-text representations, enabling zero-shot action recognition and text-video retrieval. Video-LLMs (VideoLLaMA, Video-ChatGPT, Qwen2-VL) enable conversational understanding of video content.

'''Video generation''': Stable Diffusion has been extended to video by inserting temporal attention layers (AnimateDiff); Sora (OpenAI) uses a DiT (Diffusion Transformer) over video latent tokens, generating remarkably coherent videos from text prompts. Runway Gen-3, Pika, and Kling are commercial products in this space.
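To make the factorization concrete, here is a minimal PyTorch sketch of divided space-time attention; the class and tensor names are illustrative, not TimeSformer's actual implementation. The savings are substantial: for a clip of T = 8 frames with N = 196 patches each, full attention compares 1568² ≈ 2.5M token pairs per head, while the divided form compares 8×196² + 196×8² ≈ 320K.

<syntaxhighlight lang="python">
# Sketch of TimeSformer-style divided space-time attention.
# Assumes PyTorch; names are illustrative, not the original code.
import torch
import torch.nn as nn


class DividedSpaceTimeAttention(nn.Module):
    """Spatial attention within each frame, then temporal attention
    across frames at each spatial position."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, patch tokens N, channels D)
        B, T, N, D = x.shape

        # Step 1, spatial attention: each frame is its own sequence of
        # N tokens, so the score matrix is N x N per frame: O(T * N^2).
        xs = x.reshape(B * T, N, D)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]

        # Step 2, temporal attention: each spatial position attends
        # across the T frames: O(N * T^2).
        xt = xs.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]

        return xt.reshape(B, N, T, D).permute(0, 2, 1, 3)


# 2 clips, 8 frames, 14x14 = 196 patches, 512-dim tokens.
tokens = torch.randn(2, 8, 196, 512)
print(DividedSpaceTimeAttention(512)(tokens).shape)  # (2, 8, 196, 512)
</syntaxhighlight>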
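The frame-embedding approach to video-language models can be sketched as follows. The encoders below are hypothetical placeholders standing in for a pretrained CLIP image and text tower; mean pooling over time is the simplest temporal aggregator (the "meanP" variant studied in CLIP4Clip), with a small transformer over frames as a common alternative.

<syntaxhighlight lang="python">
# Sketch of zero-shot text-video retrieval via mean pooling of
# per-frame embeddings. frame_encoder / text_encoder are hypothetical
# stand-ins for CLIP towers, not a real checkpoint.
import torch
import torch.nn.functional as F

D = 512
frame_encoder = torch.nn.Linear(2048, D)  # placeholder image tower
text_encoder = torch.nn.Linear(300, D)    # placeholder text tower


def embed_video(frame_features: torch.Tensor) -> torch.Tensor:
    # frame_features: (T, 2048) pre-extracted per-frame features.
    per_frame = F.normalize(frame_encoder(frame_features), dim=-1)
    # Average over time, then re-normalize to unit length.
    return F.normalize(per_frame.mean(dim=0), dim=-1)


def retrieve(video_emb: torch.Tensor, text_feats: torch.Tensor) -> int:
    text_embs = F.normalize(text_encoder(text_feats), dim=-1)
    sims = text_embs @ video_emb   # cosine similarity per caption
    return int(sims.argmax())      # index of best-matching caption


video = embed_video(torch.randn(16, 2048))  # a 16-frame clip
captions = torch.randn(3, 300)              # 3 candidate captions
print(retrieve(video, captions))
</syntaxhighlight>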
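Finally, the AnimateDiff idea of turning an image diffusion model into a video model amounts to inserting attention layers that operate only along the frame axis, leaving the pretrained spatial layers untouched. A minimal sketch, with illustrative names rather than the actual AnimateDiff code: the output projection is zero-initialized so the module starts as an identity and the image model's behavior is preserved at the beginning of training.

<syntaxhighlight lang="python">
# Sketch of an AnimateDiff-style motion module: temporal self-attention
# over the frame axis, zero-initialized so it is an identity at start.
# Names are illustrative, not the actual AnimateDiff implementation.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        # Zero-init so the pretrained image model is unchanged at start.
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch B, frames T, spatial positions N, channels D)
        B, T, N, D = x.shape
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)  # sequences over time
        h = self.norm(h)
        h = self.attn(h, h, h, need_weights=False)[0]
        h = self.proj(h).reshape(B, N, T, D).permute(0, 2, 1, 3)
        return x + h  # residual: exact identity at initialization


latents = torch.randn(1, 16, 64, 320)   # 16-frame latent video
out = TemporalAttention(320)(latents)
assert torch.allclose(out, latents)     # identity before training
</syntaxhighlight>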