Lecture 07 — Visual Question Answering (VQA) & Document Question Answering (DocQA)
~4–5 hours (applied multimodal reasoning)
👁️📄 Why VQA & DocQA Matter to Humanity
VQA & DocQA turn perception into understanding.
They allow machines to:
- 👁️ See
- 📄 Read
- ❓ Understand questions
- 🧠 Reason
- ✍️ Answer grounded in evidence
This is the foundation of:
- Assistive AI
- Education AI
- Legal & medical AI
- Scientific discovery
- Accessibility technology
🧠 What Is Visual Question Answering (VQA)?
Input
- 🖼 Image
- ❓ Natural language question
Output
- ✍️ Text answer grounded in the image
Example:
Q: “How many people are wearing helmets?”
A: “Two.”
🧠 What Is Document Question Answering (DocQA)?
Input
- 📄 Document image / PDF
- ❓ Question
Output
- ✍️ Answer extracted or reasoned from the document
Example:
Q: “What is the invoice total?”
A: “$1,245.50”
🔍 VQA vs DocQA (Key Differences)
| Aspect | VQA | DocQA |
|---|---|---|
| Main Challenge | Visual reasoning | Text + layout understanding |
| Answer precision | Often coarse | Extremely precise |
| OCR Required | Optional | Mandatory |
| Hallucination Risk | Medium | Very high |
| Evaluation | Semantic | Exact match critical |
DocQA punishes mistakes much harder than VQA.
🧠 Cognitive Skills Required
Both tasks require:
- Visual grounding
- Cross-modal alignment
- Reasoning
- Attention control
But DocQA adds:
- Layout reasoning
- Reading order
- Table understanding
- Key-value extraction
🧱 Canonical Architecture (Unified View)
Image / Document
↓
Vision Encoder (ViT / CNN)
↓
OCR (DocQA only)
↓
Layout / Spatial Encoder
↓
Cross-Attention Fusion
↓
LLM Reasoning
↓
Text Answer
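Here is a minimal PyTorch sketch of that flow (hypothetical: `FusionVQA`, the dimensions, the toy answer vocabulary, and the random features are illustrative, not a real model):

```python
import torch
import torch.nn as nn

class FusionVQA(nn.Module):
    """Toy model: question tokens attend over visual tokens via cross-attention."""
    def __init__(self, dim=256, num_answers=1000):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.answer_head = nn.Linear(dim, num_answers)  # toy answer vocabulary

    def forward(self, text_tokens, visual_tokens):
        # Queries come from the question; keys/values from the image (or OCR+layout).
        fused, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        return self.answer_head(fused.mean(dim=1))  # pool tokens, classify an answer

model = FusionVQA()
question = torch.randn(1, 12, 256)    # 12 question-token embeddings
vision = torch.randn(1, 196, 256)     # 196 ViT patch features (14x14 grid)
print(model(question, vision).shape)  # torch.Size([1, 1000])
```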
👁️ Vision Encoding
Common backbones:
- ViT
- Swin Transformer
- CNN + FPN
Key requirement:
Preserve spatial information
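A quick sketch of why ViT-style encoders preserve layout: each 16×16 patch becomes one token at a known grid position (pure-tensor illustration, no pretrained weights):

```python
import torch

# ViT-style patchify: a 224x224 image becomes a 14x14 grid of patch tokens,
# so every token keeps a known spatial location (needed for grounding).
image = torch.randn(1, 3, 224, 224)
patches = image.unfold(2, 16, 16).unfold(3, 16, 16)             # (1, 3, 14, 14, 16, 16)
tokens = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, -1)  # (1, 196, 768)
print(tokens.shape)
```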
🔤 OCR Is NOT Optional for DocQA
OCR extracts:
- Text
- Bounding boxes
- Confidence scores
Popular OCR tools:
- Tesseract
- PaddleOCR
- EasyOCR
- TrOCR (Transformer-based)
Bad OCR = Impossible DocQA
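As a concrete example, `pytesseract` (the Python wrapper around Tesseract) returns exactly these three things per word:

```python
import pytesseract
from PIL import Image

# image_to_data returns per-word text, bounding boxes, and confidences.
data = pytesseract.image_to_data(Image.open("invoice.png"),
                                 output_type=pytesseract.Output.DICT)

for word, x, y, w, h, conf in zip(data["text"], data["left"], data["top"],
                                  data["width"], data["height"], data["conf"]):
    if word.strip() and float(conf) > 0:  # skip empty and layout-only rows
        print(f"{word!r} box=({x},{y},{w},{h}) conf={conf}")
```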
🧭 Layout Understanding (CRITICAL)
Documents are not sentences — they are spatial graphs.
Layout Features
- X, Y coordinates
- Width / height
- Reading order
- Font size / style
Models:
- LayoutLM
- LayoutLMv3
- DocFormer
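The LayoutLM family expects each word's box rescaled to a 0–1000 grid regardless of page size; a minimal helper (function name is illustrative):

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

print(normalize_box((120, 45, 480, 90), page_width=1240, page_height=1754))
# [96, 25, 387, 51]
```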
🧠 Reasoning Types in VQA & DocQA
VQA Reasoning
- Counting
- Spatial (“left of”, “behind”)
- Attribute recognition
- Commonsense
DocQA Reasoning
- Lookup
- Aggregation
- Comparison
- Multi-hop reasoning
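Aggregation, for example, is arithmetic over extracted key-value pairs rather than lookup of a single span (toy values below; a real system would extract them via OCR):

```python
# Toy DocQA aggregation: the answer is computed, not read off a single line.
line_items = {"item_1": 400.00, "item_2": 645.50, "shipping": 200.00}
print(f"Total: ${sum(line_items.values()):,.2f}")  # Total: $1,245.50
```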
🐍 Python: Simple VQA Pipeline
A runnable version of this sketch, using the Hugging Face `transformers` visual-question-answering pipeline (BLIP is chosen here as one common open checkpoint; any VQA model works):

```python
from transformers import pipeline  # assumes transformers + PyTorch installed
from PIL import Image

# BLIP is one widely used open VQA checkpoint; swap in any VQA model.
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")

image = Image.open("scene.jpg")
question = "How many cars are parked?"

result = vqa(image=image, question=question)
print(result[0]["answer"])  # e.g. "2"
```
🐍 Python: DocQA Pipeline
A runnable counterpart using the `transformers` document-question-answering pipeline (`impira/layoutlm-document-qa` is one example checkpoint; it runs Tesseract OCR internally, so `pytesseract` must be installed):

```python
from transformers import pipeline  # requires transformers + pytesseract
from PIL import Image

# LayoutLM-based checkpoint: OCR and layout encoding happen inside the pipeline.
docqa = pipeline("document-question-answering",
                 model="impira/layoutlm-document-qa")

result = docqa(image=Image.open("invoice.png"),
               question="What is the total amount?")
print(result[0]["answer"])  # e.g. "$1,245.50"
```
🧠 Prompt Engineering for VQA & DocQA
Bad prompt:
“Answer the question.”
Good prompt:
“Answer using only visible evidence. If uncertain, say ‘Not found’.”
Prompt discipline reduces hallucination.
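The same discipline as a reusable template (hypothetical wording; tune per model):

```python
# Hypothetical evidence-constrained template; adjust wording per model.
DOCQA_PROMPT = (
    "Answer using only text visible in the document.\n"
    "If the answer is not present, reply exactly: Not found.\n"
    "Question: {question}"
)

print(DOCQA_PROMPT.format(question="What is the invoice total?"))
```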
🧪 Datasets (Must-Know)
VQA
- VQA v2
- GQA
- CLEVR (synthetic reasoning)
DocQA
- DocVQA
- FUNSD
- RVL-CDIP
- CORD
📏 Evaluation Metrics
VQA
- Accuracy
- Consensus-based scoring
DocQA
- Exact Match (EM)
- F1 score
- String normalization critical
One wrong digit = wrong answer
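A sketch of the SQuAD-style normalization commonly applied before exact-match comparison (exact rules vary by benchmark):

```python
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style: lowercase, drop punctuation and articles, squeeze whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize_answer(pred) == normalize_answer(gold)

print(exact_match("$1,245.50", "1245.50"))    # True once normalized
print(exact_match("$1,245.50", "$1,245.60"))  # False: one wrong digit
```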
⚠️ Hallucination: The Silent Killer
Common failure cases:
- Answering from prior knowledge
- Guessing missing fields
- Confusing similar layouts
Mitigation:
- Evidence-aware prompting
- Answer verification (sketched below)
- Human-in-the-loop (later lecture)
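One simple form of answer verification: only return answers whose words actually appear in the OCR output (a minimal sketch; real systems match spans and bounding boxes):

```python
def verify_answer(answer, ocr_tokens):
    """Abstain unless every word of the answer appears in the OCR output."""
    ocr_text = " ".join(ocr_tokens).lower()
    if all(word in ocr_text for word in answer.lower().split()):
        return answer
    return "Not found"  # prefer abstention over an unsupported guess

tokens = ["Invoice", "Total:", "$1,245.50"]
print(verify_answer("$1,245.50", tokens))  # "$1,245.50"
print(verify_answer("$9,999.00", tokens))  # "Not found"
```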
🧠 Research Insight
DocQA is closer to symbolic reasoning than to vision.
Strong DocQA models:
- Behave more like databases
- Require constraint enforcement
- Prefer abstention over guessing
🧪 Student Knowledge Check (Hidden Answers)
Q1 — Objective
What extra component does DocQA require compared to VQA?
Answer
OCR and layout understanding.
Q2 — MCQ
Which model explicitly encodes layout?
A. CLIP B. ViT C. LayoutLM D. ResNet
Answer
C. LayoutLM
Q3 — MCQ
Which metric is most important for DocQA?
A. BLEU B. ROUGE C. Exact Match D. Perplexity
Answer
C. Exact Match
Q4 — Objective
Why is hallucination more dangerous in DocQA?
Answer
Because answers must be exact and legally or financially correct.
Q5 — Objective
What is layout reasoning?
Answer
Understanding text based on spatial structure, not just content.
🌱 Final Reflection
If AI can read documents perfectly, what must humans still verify?
Truth, intent, context, and ethical consequences.
✅ Key Takeaways
- VQA = visual reasoning
- DocQA = precision reasoning
- OCR quality defines upper bound
- Layout is intelligence
- Abstaining is better than hallucinating