Lecture 06 — Video–Text Multimodal Intelligence

~4–5 hours (advanced + practical lecture)


🎥 Why Video Is the Hardest Modality

Video = Images + Time + Motion + Audio + Semantics

Compared to text or images, video introduces:

  • ⏱ Temporal dependency
  • 🎞 Motion understanding
  • 🔊 Optional audio synchronization
  • 🧠 Long-range reasoning
  • 💾 Massive compute & memory cost

If you master Video–Text, you understand true multimodal intelligence.


🧠 What Is Video–Text-to-Text?

Video–Text models map:

  • 📹 Video → 📝 Text
  • 📹 + ❓ Question → ✍️ Answer
  • 📹 + 🗣 Prompt → 📖 Explanation / Summary / Reasoning

Common Tasks

Task | Example
Video Captioning | “A man is cooking pasta in a kitchen.”
Video QA | “What did the dog do after jumping?”
Video Summarization | “This video shows a traffic accident…”
Temporal Reasoning | “What happened before the explosion?”
Instruction Following | “Explain this experiment step-by-step.”

🧩 Why Video Is Fundamentally Different

Image vs Video

Image | Video
Static | Temporal
Single embedding | Sequence of embeddings
Local reasoning | Long-range reasoning
Easy batching | Memory explosion
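
To make the “memory explosion” row concrete, here is a rough token count; a minimal sketch assuming a ViT that emits 256 patch tokens per frame and a clip sampled at 32 frames (both numbers are illustrative).

# Rough arithmetic: video token counts grow linearly with the number of frames
tokens_per_frame = 256     # e.g., a ViT over a 224x224 frame (illustrative)
frames_sampled = 32
video_tokens = tokens_per_frame * frames_sampled
print(video_tokens)        # 8192 tokens for one clip, vs 256 for a single image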

Key idea:
Video understanding = sequence modeling + vision + alignment


🧠 Thinking Like a Video–Text Architect

The Core Questions

  1. What is the unit of time?
  2. How many frames matter?
  3. Do we need motion, or just key frames?
  4. Can we compress time?
  5. Where does language interact?

🧱 Canonical Video–Text Architecture


Video Frames → Vision Encoder → Temporal Encoder → LLM → Text Output

Components

Component | Examples
Vision Encoder | ViT, ConvNet, Swin
Temporal Encoder | Transformer, LSTM, Mamba
Fusion | Cross-attention
Language Model | LLaMA, GPT, T5
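
Below is a minimal PyTorch sketch of the pipeline above, assuming frame features from a frozen vision encoder (dimension d_vision) and an LLM that accepts soft prefix tokens (dimension d_llm); the module names and sizes are illustrative placeholders, not any specific model's API.

import torch
import torch.nn as nn

class VideoTextBackbone(nn.Module):
    def __init__(self, d_vision=768, d_llm=4096, n_layers=4):
        super().__init__()
        # Temporal encoder: a small Transformer over the frame sequence
        layer = nn.TransformerEncoderLayer(d_model=d_vision, nhead=8, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Fusion/projection: map video tokens into the LLM embedding space
        self.to_llm = nn.Linear(d_vision, d_llm)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, d_vision) from the vision encoder
        video_tokens = self.temporal_encoder(frame_features)
        return self.to_llm(video_tokens)  # fed to the LLM as prefix tokens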

⏱ Temporal Modeling Strategies

1️⃣ Uniform Sampling

  • Sample every n-th frame
  • Cheap, but may miss key events (see the sampling sketch below)

2️⃣ Keyframe Extraction

  • Shot detection
  • Scene change detection

3️⃣ Learned Temporal Attention

  • Model decides which frames matter

Rule:
Not all frames are equally informative.
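
A minimal sketch of strategy 1 (uniform sampling) with OpenCV, assuming the video file is seekable; it reads evenly spaced frames so that long clips reduce to a fixed-size frame set.

import cv2
import numpy as np

def sample_frames_uniform(video_path, num_frames=16):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames  # evenly spaced BGR frames; key events can still fall between samples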


🧠 Temporal Reasoning ≠ Frame Understanding

Examples:

  • “What happened before the fall?”
  • “Why did the crowd start running?”
  • “What caused the explosion?”

This requires:

  • Event ordering
  • Causal reasoning
  • Memory across time

🔗 Video–Text Alignment

Alignment Objectives

  • Frame ↔ Word
  • Segment ↔ Sentence
  • Event ↔ Explanation

Losses commonly used:

  • Contrastive loss (CLIP-style; sketched below)
  • Cross-entropy on generated text
  • Temporal grounding loss
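
As an illustration of the first loss, here is a minimal sketch of a CLIP-style symmetric contrastive objective between pooled video embeddings and text embeddings; the exact batch construction and temperature handling vary across models.

import torch
import torch.nn.functional as F

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    # video_emb, text_emb: (batch, dim), row i of each is a matching pair
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(video_emb.size(0), device=video_emb.device)
    # Symmetric cross-entropy: video-to-text and text-to-video directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2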

Representative models and their key ideas:

Model | Key Idea
VideoBERT | Treat frames as tokens
Flamingo | Perceiver-style attention
InternVideo | Unified video representation
Video-LLaMA | Video + LLM alignment
GPT-4V | Proprietary multimodal reasoning

🐍 Python: Video–Text Pipeline (Conceptual)

Step 1: Load Video Frames

import cv2

def load_frames(video_path, max_frames=32):
    # Read up to max_frames consecutive frames from the start of the clip.
    # A real pipeline would usually sample across the whole video instead
    # (see the uniform-sampling sketch above).
    cap = cv2.VideoCapture(video_path)
    frames = []
    while len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
    cap.release()
    return frames

Step 2: Encode Frames

import torch

# vision_encoder is a placeholder for a pretrained image backbone (e.g., a ViT);
# frames must first be preprocessed into a tensor of shape (num_frames, C, H, W).
with torch.no_grad():
    frame_embeddings = vision_encoder(frames)  # (num_frames, d_vision)

Step 3: Temporal Encoding

# temporal_transformer is a placeholder sequence model over the frame embeddings
video_embedding = temporal_transformer(frame_embeddings)  # (num_frames, d_model)

Step 4: Language Generation

# llm stands in for a multimodal language model that accepts video tokens
# alongside a text prompt; the exact generate() interface varies by model.
output = llm.generate(
    video_embedding=video_embedding,
    prompt="Describe what is happening in this video"
)

🧠 Memory & Efficiency Tricks (VERY IMPORTANT)

Problem

  • Video = huge memory cost

Solutions

  • Frame pooling (sketched below)
  • Token pruning
  • Sliding windows
  • Temporal compression
  • Hierarchical attention
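
A minimal sketch of frame pooling / temporal compression: consecutive frame tokens are averaged so the LLM attends over a much shorter sequence. This is a generic illustration, not a specific model's recipe.

import torch
import torch.nn.functional as F

def pool_frame_tokens(frame_tokens, stride=4):
    # frame_tokens: (batch, num_frames, dim)
    x = frame_tokens.transpose(1, 2)                        # (batch, dim, num_frames)
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)  # average every `stride` frames
    return x.transpose(1, 2)                                # (batch, num_frames // stride, dim)

tokens = torch.randn(1, 256, 768)
print(pool_frame_tokens(tokens).shape)  # torch.Size([1, 64, 768]) -- 4x fewer tokens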

Engineering intelligence matters as much as model size


🧪 Evaluation of Video–Text Models

Automatic Metrics

  • BLEU (see the sketch below)
  • METEOR
  • CIDEr
  • ROUGE
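
These metrics score generated captions against reference captions by n-gram overlap. A minimal BLEU example using NLTK (assuming nltk is installed); METEOR and CIDEr need dedicated tooling such as pycocoevalcap and are not shown.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a man is cooking pasta in a kitchen".split()
candidate = "a man cooks pasta in the kitchen".split()

# Smoothing avoids zero scores on short captions with missing higher-order n-grams
score = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")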

Human Evaluation (BEST)

  • Temporal correctness
  • Causal reasoning
  • Hallucination rate

⚠️ Failure Modes

  • ❌ Hallucinating events
  • ❌ Mixing timelines
  • ❌ Ignoring small but critical actions
  • ❌ Overconfidence

Video hallucination is more dangerous than image hallucination: one invented event can distort the entire timeline, not just a single scene.


🧠 Research Insight (Important)

Most “video understanding” models are actually image models with memory.

True video intelligence requires:

  • Event abstraction
  • Temporal causality
  • Long-horizon planning

🧪 Student Self-Test (Hidden Answers)

Q1 — Objective

What makes video harder than images?

Answer

Temporal dependency, motion, memory, and causal reasoning.


Q2 — MCQ

Which component models time?

A. Vision Encoder
B. Tokenizer
C. Temporal Encoder
D. Loss Function

Answer

C. Temporal Encoder


Q3 — MCQ

Which is NOT a video–text task?

A. Video captioning
B. Video QA
C. Video super-resolution
D. Video summarization

Answer

C. Video super-resolution


Q4 — Objective

Why is uniform frame sampling risky?

Answer

It may miss critical events or actions.


Q5 — Objective

What is temporal hallucination?

Answer

Inventing events that never occurred in the video timeline.


🌱 Final Reflection

If AI understands video perfectly, what responsibility do humans still hold?

Interpretation, ethics, judgment, and accountability.


✅ Key Takeaways

  • Video is the hardest modality
  • Time is the real challenge
  • Compression = intelligence
  • Reasoning > perception
  • Human evaluation is critical
