Lecture 06 — Video–Text Multimodal Intelligence
~4–5 hours (advanced + practical lecture)
🎥 Why Video Is the Hardest Modality
Video = Images + Time + Motion + Audio + Semantics
Compared to text or images, video introduces:
- ⏱ Temporal dependency
- 🎞 Motion understanding
- 🔊 Optional audio synchronization
- 🧠 Long-range reasoning
- 💾 Massive compute & memory cost
If you master Video–Text, you understand true multimodal intelligence.
🧠 What Is Video–Text-to-Text?
Video–Text models map:
- 📹 Video → 📝 Text
- 📹 + ❓ Question → ✍️ Answer
- 📹 + 🗣 Prompt → 📖 Explanation / Summary / Reasoning
Common Tasks
| Task | Example |
|---|---|
| Video Captioning | “A man is cooking pasta in a kitchen.” |
| Video QA | “What did the dog do after jumping?” |
| Video Summarization | “This video shows a traffic accident…” |
| Temporal Reasoning | “What happened before the explosion?” |
| Instruction Following | “Explain this experiment step-by-step.” |
🧩 Why Video Is Fundamentally Different
Image vs Video
| Image | Video |
|---|---|
| Static | Temporal |
| Single embedding | Sequence of embeddings |
| Local reasoning | Long-range reasoning |
| Easy batching | Memory explosion |
Key idea:
Video understanding = sequence modeling + vision + alignment
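The "single embedding vs sequence of embeddings" row of the table is easiest to see in tensor shapes. A tiny illustrative sketch (all dimensions are made up):

```python
import torch

# An image model hands one vector per image to the language model ...
image_emb = torch.randn(768)          # shape (D,)

# ... while a video model hands over one vector per sampled frame, so
# every downstream module must reason over a sequence along time
frame_embs = torch.randn(32, 768)     # shape (T, D), T = 32 sampled frames
print(image_emb.shape, frame_embs.shape)
```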
🧠 Thinking Like a Video–Text Architect
The Core Questions
- What is the unit of time?
- How many frames matter?
- Do we need motion, or just key frames?
- Can we compress time?
- Where does language interact?
🧱 Canonical Video–Text Architecture
Video Frames → Vision Encoder → Temporal Encoder → LLM → Text Output
Components
| Component | Examples |
|---|---|
| Vision Encoder | ViT, ConvNet, Swin |
| Temporal Encoder | Transformer, LSTM, Mamba |
| Fusion | Cross-attention |
| Language Model | LLaMA, GPT, T5 |
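One way to make this flow concrete is a toy PyTorch sketch. Every module here is a placeholder (a real system would use a pretrained ViT and an actual LLM), so treat it as shapes-and-wiring, not an implementation; the conceptual pipeline later in this lecture follows the same flow.

```python
import torch
import torch.nn as nn

class VideoTextSketch(nn.Module):
    """Toy version of: frames -> vision encoder -> temporal encoder -> LLM."""
    def __init__(self, dim=512):
        super().__init__()
        # Placeholder vision encoder: one embedding per frame
        self.frame_encoder = nn.Sequential(nn.Flatten(1), nn.LazyLinear(dim))
        # Temporal encoder: self-attention across the frame axis
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # Stand-in for the LLM interface: project into its hidden space
        self.to_llm = nn.Linear(dim, dim)

    def forward(self, frames):                            # frames: (T, C, H, W)
        per_frame = self.frame_encoder(frames)            # (T, dim)
        temporal = self.temporal(per_frame.unsqueeze(0))  # (1, T, dim)
        return self.to_llm(temporal)                      # video tokens for the LLM

video_tokens = VideoTextSketch()(torch.randn(16, 3, 224, 224))
print(video_tokens.shape)  # torch.Size([1, 16, 512])
```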
⏱ Temporal Modeling Strategies
1️⃣ Uniform Sampling
- Sample every n-th frame (or k evenly spaced frames)
- Cheap but may miss key events
2️⃣ Keyframe Extraction
- Shot detection
- Scene change detection
3️⃣ Learned Temporal Attention
- Model decides which frames matter
Rule:
Not all frames are equally informative (see the sampling sketch below).
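A minimal sketch of strategies 1 and 2, assuming OpenCV and NumPy are available. The histogram heuristic is a deliberately crude stand-in for real shot-detection algorithms:

```python
import cv2
import numpy as np

def uniform_indices(num_frames, k=16):
    """Strategy 1: k evenly spaced frame indices across the whole clip."""
    return np.linspace(0, num_frames - 1, k).astype(int)

def is_scene_change(prev_frame, frame, threshold=0.5):
    """Strategy 2 (crude): flag a keyframe when the grayscale histogram
    shifts sharply relative to the previous frame."""
    h1 = cv2.calcHist([cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)],
                      [0], None, [64], [0, 256])
    h2 = cv2.calcHist([cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)],
                      [0], None, [64], [0, 256])
    # Correlation near 1.0 means "similar"; below threshold means "new shot"
    return cv2.compareHist(h1, h2, cv2.HISTCMP_CORREL) < threshold
```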
🧠 Temporal Reasoning ≠ Frame Understanding
Examples:
- “What happened before the fall?”
- “Why did the crowd start running?”
- “What caused the explosion?”
This requires:
- Event ordering
- Causal reasoning
- Memory across time
🔗 Video–Text Alignment
Alignment Objectives
- Frame ↔ Word
- Segment ↔ Sentence
- Event ↔ Explanation
Losses commonly used:
- Contrastive loss (CLIP-style)
- Cross-entropy on generated text
- Temporal grounding loss
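As a concrete example of the first objective, here is a minimal CLIP-style contrastive loss adapted to pooled video embeddings: matched (video, text) pairs sit on the diagonal of a batch similarity matrix. Shapes and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """CLIP-style loss over a batch of B (video, text) pairs."""
    video_emb = F.normalize(video_emb, dim=-1)        # (B, D)
    text_emb = F.normalize(text_emb, dim=-1)          # (B, D)
    logits = video_emb @ text_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric: video->text and text->video both pull the diagonal up
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```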
🧪 Popular Video–Text Models
| Model | Key Idea |
|---|---|
| VideoBERT | Treat frames as tokens |
| Flamingo | Perceiver Resampler + gated cross-attention |
| InternVideo | Unified video representation |
| Video-LLaMA | Video + LLM alignment |
| GPT-4V | Proprietary multimodal reasoning |
🐍 Python: Video–Text Pipeline (Conceptual)
Step 1: Load Video Frames
```python
import cv2

def load_frames(video_path, max_frames=32):
    """Read up to max_frames frames from the start of the video.
    (Production code would spread indices across the whole clip.)"""
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:  # end of video reached early
            break
        frames.append(frame)
    cap.release()
    return frames
```
Step 2: Encode Frames
```python
import torch

# Frames must first be preprocessed (resized, normalized, stacked) into a
# tensor; vision_encoder is a placeholder, e.g. a frozen ViT returning (T, D)
with torch.no_grad():
    frame_embeddings = vision_encoder(frames)
```
Step 3: Temporal Encoding
```python
# temporal_transformer is a placeholder that mixes information across the
# T frame embeddings and returns a video-level representation
video_embedding = temporal_transformer(frame_embeddings)
```
Step 4: Language Generation
```python
# Conceptual interface: real LLM APIs accept token embeddings rather than
# a video_embedding argument (see the prefix-conditioning sketch below)
output = llm.generate(
    video_embedding=video_embedding,
    prompt="Describe what is happening in this video",
)
```
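The `llm.generate(video_embedding=...)` call above is conceptual. One common pattern in practice (used by LLaVA-style models) is prefix conditioning: project the video tokens into the LLM's embedding space and prepend them to the prompt embeddings. A hedged sketch with placeholder modules and made-up sizes:

```python
import torch
import torch.nn as nn

video_tokens = torch.randn(1, 16, 512)         # (B, T, video_dim), illustrative
projector = nn.Linear(512, 4096)               # video_dim -> LLM hidden size
llm_embed = nn.Embedding(32000, 4096)          # stand-in for the LLM's embedding table

prompt_ids = torch.randint(0, 32000, (1, 12))  # tokenized prompt (placeholder)
prompt_embs = llm_embed(prompt_ids)            # (B, 12, 4096)

# Prepend projected video tokens so the LLM attends to them as a prefix
inputs_embeds = torch.cat([projector(video_tokens), prompt_embs], dim=1)
# An HF-style model would then be called with inputs_embeds=inputs_embeds
```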
🧠 Memory & Efficiency Tricks (VERY IMPORTANT)
Problem
- Video = huge memory cost
Solutions
- Frame pooling
- Token pruning
- Sliding windows
- Temporal compression
- Hierarchical attention
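Two of these tricks, frame pooling and token pruning, in a minimal sketch. Sizes are illustrative, and the norm-based importance score is a crude stand-in for learned pruning:

```python
import torch

frame_tokens = torch.randn(1, 64, 196, 768)  # (B, T, tokens_per_frame, D)

# Frame pooling: average each frame's patch tokens down to one token,
# shrinking the sequence from T*196 tokens to T tokens
pooled = frame_tokens.mean(dim=2)            # (1, 64, 768)

# Token pruning: keep only the top-k tokens per frame by L2 norm
k = 16
scores = frame_tokens.norm(dim=-1)           # (1, 64, 196)
topk = scores.topk(k, dim=-1).indices        # (1, 64, 16)
pruned = torch.gather(frame_tokens, 2,
                      topk.unsqueeze(-1).expand(-1, -1, -1, 768))
print(pooled.shape, pruned.shape)            # (1, 64, 768) (1, 64, 16, 768)
```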
Engineering intelligence matters as much as model size
🧪 Evaluation of Video–Text Models
Automatic Metrics
- BLEU
- METEOR
- CIDEr
- ROUGE
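These n-gram metrics are cheap to run but correlate weakly with temporal correctness. A minimal BLEU example with NLTK; the reference caption reuses the example from the task table above, and the candidate is made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["a man is cooking pasta in a kitchen".split()]
candidate = "a man cooks pasta".split()

# Smoothing avoids zero scores when higher-order n-grams never match
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```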
Human Evaluation (BEST)
- Temporal correctness
- Causal reasoning
- Hallucination rate
⚠️ Failure Modes
- ❌ Hallucinating events
- ❌ Mixing timelines
- ❌ Ignoring small but critical actions
- ❌ Overconfidence
Video hallucination is more dangerous than image hallucination: a single invented event can corrupt the entire timeline and every answer built on it.
🧠 Research Insight (Important)
Most “video understanding” models are actually image models with memory.
True video intelligence requires:
- Event abstraction
- Temporal causality
- Long-horizon planning
🧪 Student Self-Test
Q1 — Objective
What makes video harder than images?
Answer
Temporal dependency, motion, memory, and causal reasoning.
Q2 — MCQ
Which component models time?
A. Vision Encoder B. Tokenizer C. Temporal Encoder D. Loss Function
Answer
C. Temporal Encoder
Q3 — MCQ
Which is NOT a video–text task?
A. Video captioning B. Video QA C. Video super-resolution D. Video summarization
Answer
C. Video super-resolution
Q4 — Objective
Why is uniform frame sampling risky?
Answer
It may miss critical events or actions.
Q5 — Objective
What is temporal hallucination?
Answer
Inventing events that never occurred in the video timeline.
🌱 Final Reflection
If AI understands video perfectly, what responsibility do humans still hold?
Interpretation, ethics, judgment, and accountability.
✅ Key Takeaways
- Video is the hardest modality
- Time is the real challenge
- Compression = intelligence
- Reasoning > perception
- Human evaluation is critical