Lecture 09 — Evaluation of Multimodal & Agentic AI Systems
~4–6 hours (advanced, critical-thinking lecture)
🧠 Why Evaluation Is the Hardest Problem in AI
If you cannot evaluate it, you do not understand it.
Modern AI systems:
- Generate free-form text
- Reason over images, videos, documents
- Use tools
- Act autonomously
❓ So how do we measure correctness, reasoning, safety, and usefulness?
Evaluation is harder than training.
⚠️ The Evaluation Crisis
Common mistakes:
- Using only BLEU / ROUGE
- Evaluating language but not reasoning
- Ignoring hallucination
- No human evaluation
- No failure analysis
High benchmark scores ≠ trustworthy intelligence
🧩 What Are We Actually Evaluating?
Evaluation must make explicit what kind of intelligence we actually care about.
| Dimension | Question |
|---|---|
| Accuracy | Is the answer correct? |
| Grounding | Is it supported by evidence? |
| Reasoning | Are the steps valid? |
| Robustness | Does it fail gracefully? |
| Safety | Is it harmful or biased? |
| Usefulness | Does it help a human? |
🧠 Evaluation by Task Type
📝 Text-only LLMs
- Fluency
- Factuality
- Reasoning
- Consistency
🖼 Image–Text
- Visual grounding
- Hallucination
- Spatial correctness
🎥 Video–Text
- Temporal reasoning
- Event ordering
- Causal understanding
📄 DocQA
- Exact match
- Numerical accuracy
- Layout grounding
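Exact match and numerical accuracy are easy to operationalize. A minimal sketch of DocQA-style scoring; the normalization rules here are illustrative assumptions, not a standard:

```python
import re

def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace (illustrative rules)."""
    text = re.sub(r"[^\w\s.%-]", "", text.lower().strip())
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, answer: str) -> bool:
    """Normalized string equality."""
    return normalize(prediction) == normalize(answer)

def numeric_match(prediction: str, answer: str, tol: float = 1e-6) -> bool:
    """Compare as numbers when both sides parse as floats."""
    try:
        return abs(float(normalize(prediction)) - float(normalize(answer))) <= tol
    except ValueError:
        return False

print(exact_match(" $12,000 ", "12000"))   # True: currency symbol and comma are stripped
print(numeric_match("12,000.0", "12000"))  # True: same numeric value
```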
🤖 Agents
- Task success
- Tool correctness
- Efficiency
- Safety
📏 Automatic Metrics (Know Their Limits)
Text Metrics
| Metric | Measures | Limitation |
|---|---|---|
| BLEU | N-gram precision overlap | Blind to meaning and reasoning |
| ROUGE | N-gram recall overlap | Rewards surface similarity, not correctness |
| METEOR | Overlap with stems and synonyms | Still largely surface-level |
| Perplexity | How well the model predicts text (fluency proxy) | Says nothing about correctness |
Text similarity ≠ truth
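To make the limitation concrete, here is a minimal unigram-precision score, a simplified stand-in for BLEU (no higher-order n-grams, no brevity penalty). A correct paraphrase scores lower than a fluent but factually wrong sentence:

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference (BLEU-1-like)."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(tok in ref for tok in cand) / len(cand)

reference = "the patient should take 50 mg twice daily"

# Correct paraphrase -> low overlap score (~0.44)
print(unigram_precision("give the patient 50 mg two times a day", reference))
# Wrong dosage -> high overlap score (~0.88)
print(unigram_precision("the patient should take 500 mg twice daily", reference))
```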
📏 Vision-Language Metrics
| Metric | Task |
|---|---|
| Accuracy | VQA |
| CIDEr | Captioning |
| IoU | Grounding |
| Recall@K | Retrieval |
Problems:
- Sensitive to wording
- Miss reasoning errors
- Encourage shortcut learning
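IoU itself is trivial to compute; the hard part is deciding what threshold counts as "grounded". A minimal sketch with boxes in (x1, y1, x2, y2) pixel coordinates; the 0.5 threshold in the comment is a common convention, not a universal rule:

```python
def iou(box_a, box_b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Predicted vs ground-truth box; a common acceptance threshold is IoU >= 0.5
print(iou((10, 10, 60, 60), (20, 20, 70, 70)))  # ~0.47 -> would fail at 0.5
```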
🧠 Faithfulness & Grounding Evaluation
Key question:
Did the model use the provided evidence?
Techniques:
- Attribution checks
- Citation verification
- Evidence overlap
- Counterfactual prompts
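A crude but useful automatic signal is lexical evidence coverage: how much of the answer's content actually appears in the provided evidence. A minimal sketch; the stop-word list and any flagging threshold you apply are illustrative assumptions:

```python
def evidence_coverage(answer: str, evidence: str) -> float:
    """Fraction of content words in the answer that also appear in the evidence."""
    stop = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}
    ans = [t for t in answer.lower().split() if t not in stop]
    ev = set(evidence.lower().split())
    return sum(t in ev for t in ans) / max(len(ans), 1)

evidence = "revenue grew 12% in 2023 driven by cloud services"
print(evidence_coverage("revenue grew 12% in 2023", evidence))          # 1.0 -> grounded
print(evidence_coverage("revenue grew 25% due to hardware", evidence))  # 0.4 -> flag for review
```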
🧪 Hallucination Evaluation (CRITICAL)
Hallucination types:
- Factual hallucination
- Visual hallucination
- Temporal hallucination
- Tool hallucination
Detection:
- Human labeling
- Rule-based checks
- Retrieval consistency
- Self-verification prompts
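Rule-based checks are the cheapest detector: verify that concrete claims in the output (numbers, names, dates) actually occur in the source. A minimal sketch for numbers only; real pipelines combine this with retrieval consistency and human labels:

```python
import re

def unsupported_numbers(output: str, source: str) -> list[str]:
    """Numbers stated in the output that never appear in the source text."""
    out_nums = set(re.findall(r"\d+(?:\.\d+)?", output))
    src_nums = set(re.findall(r"\d+(?:\.\d+)?", source))
    return sorted(out_nums - src_nums)

source = "The study enrolled 120 patients over 6 months."
output = "The study enrolled 210 patients over 6 months."
print(unsupported_numbers(output, source))  # ['210'] -> likely hallucination
```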
👥 Human Evaluation (Gold Standard)
Humans evaluate:
- Correctness
- Clarity
- Trustworthiness
- Helpfulness
- Harm
Best practices:
- Multiple annotators
- Clear rubrics
- Inter-annotator agreement
- Blind comparison
Humans evaluate meaning, not tokens.
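Inter-annotator agreement should be measured, not assumed. A minimal Cohen's kappa for two annotators labeling the same items (a sketch; a stats library is preferable in practice):

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators over the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# 1 = acceptable answer, 0 = not acceptable
annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 1]
print(cohens_kappa(annotator_1, annotator_2))  # ~0.47: moderate agreement
```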
🧠 Evaluation of Reasoning
Bad evaluation asks only:
“Is the final answer correct?”
Good evaluation:
- Are intermediate steps valid?
- Are assumptions reasonable?
- Is reasoning grounded?
Techniques:
- Chain-of-thought review
- Step-by-step scoring
- Error categorization
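One way to operationalize this: label each intermediate step as valid or invalid and report step accuracy alongside final-answer accuracy, so the two can disagree. A minimal sketch with hand-labeled traces; the data structure is an assumption, not a standard format:

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    steps_valid: list[bool]   # human or judge label for each intermediate step
    final_correct: bool

traces = [
    ReasoningTrace([True, True, True], final_correct=True),
    ReasoningTrace([True, False, True], final_correct=True),   # right answer, broken reasoning
    ReasoningTrace([True, True, False], final_correct=False),
]

step_accuracy = sum(sum(t.steps_valid) for t in traces) / sum(len(t.steps_valid) for t in traces)
answer_accuracy = sum(t.final_correct for t in traces) / len(traces)
print(f"step accuracy   {step_accuracy:.2f}")    # 0.78
print(f"answer accuracy {answer_accuracy:.2f}")  # 0.67
```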
🤖 Evaluating Agents Is Different
Agents are:
- Non-deterministic
- Multi-step
- Tool-dependent
Metrics:
- Task completion rate
- Number of steps
- Cost
- Error recovery
- Safety violations
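These metrics are computed over logged episodes rather than single outputs. A minimal sketch over hypothetical episode records; the field names are assumptions about what your agent framework logs:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    completed: bool          # did the agent finish the task?
    steps: int               # tool calls / actions taken
    cost_usd: float          # API and tool spend
    safety_violations: int   # e.g. blocked actions, policy hits

episodes = [
    Episode(True, 6, 0.04, 0),
    Episode(False, 14, 0.11, 1),
    Episode(True, 9, 0.07, 0),
]

n = len(episodes)
print(f"task completion rate  {sum(e.completed for e in episodes) / n:.2f}")
print(f"avg steps per episode {sum(e.steps for e in episodes) / n:.1f}")
print(f"avg cost per episode  ${sum(e.cost_usd for e in episodes) / n:.3f}")
print(f"safety violations     {sum(e.safety_violations for e in episodes)}")
```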
🐍 Python: Simple Evaluation Loop
```python
# dataset, model, and evaluate are placeholders for your data, system under test, and scorer
results = []
for example in dataset:
    prediction = model(example.input)             # run the system on one input
    score = evaluate(prediction, example.answer)  # compare against the reference
    results.append(score)

print(sum(results) / len(results))  # mean score over the evaluation set
```
Simple code, deep thinking required.
🧠 Benchmark vs Reality
Benchmarks:
- Controlled
- Clean
- Known distribution
Reality:
- Messy
- Ambiguous
- Adversarial
- High stakes
Always test on your own data.
⚠️ Overfitting to Benchmarks
Symptoms:
- SOTA on paper
- Poor real-world behavior
- Fragile prompts
- Dataset leakage
Cure:
- Diverse evaluation
- Stress testing
- Out-of-distribution tests
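Stress testing can start very simply: perturb the inputs (typos, paraphrases, distractors) and measure how often the answer changes. A minimal sketch; the character-dropping perturbation is an illustrative choice and model is a placeholder callable:

```python
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop characters to simulate noisy user input."""
    rng = random.Random(seed)
    return "".join(c for c in text if rng.random() > rate)

def consistency_under_noise(model, questions, n_variants: int = 5) -> float:
    """Fraction of questions whose answer stays identical across noisy variants."""
    stable = 0
    for q in questions:
        original = model(q)
        variants = [model(add_typos(q, seed=i)) for i in range(n_variants)]
        stable += all(v == original for v in variants)
    return stable / len(questions)
```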
🧠 Research Insight
The future of evaluation is interactive, human-centered, and continuous.
Trends:
- LLM-as-judge (with caution)
- Hybrid human–AI evaluation
- Online evaluation in deployment
- Value-aligned metrics
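LLM-as-judge in its simplest form: wrap the candidate answer in a rubric prompt, ask a separate (ideally stronger) model for a score, and parse it. In the sketch below, call_judge_model is a hypothetical "prompt in, text out" function standing in for whatever LLM API you use, and the rubric wording is an assumption; judge scores should always be spot-checked against human labels.

```python
import re

JUDGE_PROMPT = """You are grading an answer for factual correctness and grounding.
Question: {question}
Evidence: {evidence}
Answer: {answer}
Reply with a single integer score from 1 (wrong or ungrounded) to 5 (correct and fully grounded)."""

def judge_score(question: str, evidence: str, answer: str, call_judge_model) -> int | None:
    """Build the rubric prompt, query the judge model, and parse the first digit it returns."""
    prompt = JUDGE_PROMPT.format(question=question, evidence=evidence, answer=answer)
    reply = call_judge_model(prompt)  # hypothetical: any function that maps a prompt to text
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else None
```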
🧪 Student Knowledge Check (Hidden)
Q1 — Objective
Why are BLEU/ROUGE insufficient for modern AI?
Answer
They measure surface similarity, not reasoning or truth.
Q2 — MCQ
Which is the gold standard for evaluation?
A. Automatic metrics B. Benchmarks C. Human evaluation D. Leaderboards
Answer
C. Human evaluation
Q3 — MCQ
Which is most important for DocQA?
A. Fluency B. Creativity C. Exact Match D. Perplexity
Answer
C. Exact Match
Q4 — Objective
What is hallucination?
Answer
Producing confident but unsupported or false information.
Q5 — Objective
Why is agent evaluation harder?
Answer
Because agents are multi-step, non-deterministic, and tool-dependent.
🌱 Final Reflection
If an AI scores high but harms people, is it a good model?
No — evaluation must include human values and impact.
✅ Key Takeaways
- Evaluation defines intelligence
- Automatic metrics are tools, not truth
- Grounding matters more than fluency
- Human judgment is essential
- Ethics begins at evaluation