Lecture 07 — Visual Question Answering (VQA) & Document Question Answering (DocQA)

~4–5 hours (applied multimodal reasoning)


👁️📄 Why VQA & DocQA Matter to Humanity

VQA & DocQA turn perception into understanding.

They allow machines to:

  • 👁️ See
  • 📄 Read
  • ❓ Understand questions
  • 🧠 Reason
  • ✍️ Answer grounded in evidence

This is the foundation of:

  • Assistive AI
  • Education AI
  • Legal & medical AI
  • Scientific discovery
  • Accessibility technology

🧠 What Is Visual Question Answering (VQA)?

Input

  • 🖼 Image
  • ❓ Natural language question

Output

  • ✍️ Text answer grounded in the image

Example:

Q: “How many people are wearing helmets?”
A: “Two.”


🧠 What Is Document Question Answering (DocQA)?

Input

  • 📄 Document image / PDF
  • ❓ Question

Output

  • ✍️ Answer extracted or reasoned from the document

Example:

Q: “What is the invoice total?”
A: “$1,245.50”


🔍 VQA vs DocQA (Key Differences)

Aspect              | VQA               | DocQA
Main Challenge      | Visual reasoning  | Text + layout understanding
Precision           | Often coarse      | Extremely precise
OCR Required        | Optional          | Mandatory
Hallucination Risk  | Medium            | Very high
Evaluation          | Semantic          | Exact match critical

DocQA punishes mistakes far more severely than VQA.


🧠 Cognitive Skills Required

Both tasks require:

  • Visual grounding
  • Cross-modal alignment
  • Reasoning
  • Attention control

But DocQA adds:

  • Layout reasoning
  • Reading order
  • Table understanding
  • Key-value extraction

🧱 Canonical Architecture (Unified View)


Image / Document
↓
Vision Encoder (ViT / CNN)
↓
OCR (DocQA only)
↓
Layout / Spatial Encoder
↓
Cross-Attention Fusion
↓
LLM Reasoning
↓
Text Answer


👁️ Vision Encoding

Common backbones:

  • ViT
  • Swin Transformer
  • CNN + FPN

Key requirement:

Preserve spatial information
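
A minimal sketch of patch-level feature extraction with a pretrained ViT from Hugging Face transformers (the google/vit-base-patch16-224-in21k checkpoint is an assumption; any ViT backbone behaves the same way). The point is that the encoder returns one token per image patch, so spatial information survives into the fusion stage.

import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Assumed checkpoint; any ViT backbone that exposes patch tokens works.
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
encoder = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

image = Image.open("scene.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Shape (1, 197, 768): a [CLS] token plus 196 patch tokens for a 224x224 input,
# one token per 16x16 patch. This grid is the spatial information we must keep.
print(outputs.last_hidden_state.shape)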


🔤 OCR Is NOT Optional for DocQA

OCR extracts:

  • Text
  • Bounding boxes
  • Confidence scores

Popular OCR tools:

  • Tesseract
  • PaddleOCR
  • EasyOCR
  • TrOCR (Transformer-based)

Bad OCR = Impossible DocQA
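
A minimal OCR sketch with pytesseract (assumes the Tesseract binary is installed locally). image_to_data returns the text, bounding box, and confidence for every detected word, which is exactly what the layout encoder needs downstream.

from PIL import Image
import pytesseract
from pytesseract import Output

doc_image = Image.open("invoice.png")

# One entry per detected word: text, box coordinates, and confidence.
data = pytesseract.image_to_data(doc_image, output_type=Output.DICT)

words, boxes = [], []
for text, left, top, width, height, conf in zip(
    data["text"], data["left"], data["top"],
    data["width"], data["height"], data["conf"]
):
    if text.strip() and float(conf) > 0:  # drop empty tokens and non-word blocks
        words.append(text)
        boxes.append((left, top, left + width, top + height))

print(words[:10], boxes[:10])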


🧭 Layout Understanding (CRITICAL)

Documents are not sentences — they are spatial graphs.

Layout Features

  • X, Y coordinates
  • Width / height
  • Reading order
  • Font size / style

Models:

  • LayoutLM
  • LayoutLMv3
  • DocFormer
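
The LayoutLM family expects each word's bounding box rescaled to a 0–1000 grid, independent of page size. A minimal sketch of that convention (the helper name is ours):

def normalize_box(box, page_width, page_height):
    """Scale an absolute (x0, y0, x1, y1) pixel box to LayoutLM's 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: a word box on an A4 scan rendered at 1654 x 2339 pixels.
print(normalize_box((120, 400, 310, 440), 1654, 2339))  # -> [72, 171, 187, 188]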

🧠 Reasoning Types in VQA & DocQA

VQA Reasoning

  • Counting
  • Spatial (“left of”, “behind”)
  • Attribute recognition
  • Commonsense

DocQA Reasoning

  • Lookup
  • Aggregation
  • Comparison
  • Multi-hop reasoning

🐍 Python: Simple VQA Pipeline

# Conceptual pipeline: load_image, vision_encoder and llm are placeholders
# for a real image loader, vision backbone and multimodal language model.
image = load_image("scene.jpg")
question = "How many cars are parked?"

# 1) Encode the image into spatial visual features.
vision_features = vision_encoder(image)

# 2) Let the language model reason over those features, conditioned on the question.
answer = llm.generate(
    visual_features=vision_features,
    prompt=question
)

print(answer)
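
For comparison, a runnable sketch using the Hugging Face transformers visual-question-answering pipeline (the ViLT checkpoint below is one reasonable choice, not the only one):

from PIL import Image
from transformers import pipeline

# Assumed checkpoint; any model supported by this pipeline task works.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")

image = Image.open("scene.jpg").convert("RGB")
result = vqa(image=image, question="How many cars are parked?")

# The pipeline returns candidate answers with scores; take the top one.
print(result[0]["answer"])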

🐍 Python: DocQA Pipeline (Conceptual)

# Conceptual pipeline: ocr_engine, layout_encoder and llm are placeholders.
doc_image = load_image("invoice.png")

# 1) OCR: recover the words and their bounding boxes.
ocr_tokens, boxes = ocr_engine(doc_image)

# 2) Encode the text together with its spatial layout.
doc_embedding = layout_encoder(ocr_tokens, boxes)

# 3) Let the language model answer from the encoded document.
answer = llm.generate(
    document_embedding=doc_embedding,
    prompt="What is the total amount?"
)

print(answer)
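
A runnable counterpart using the transformers document-question-answering pipeline (the impira/layoutlm-document-qa checkpoint is an assumption; it runs OCR internally via Tesseract, so pytesseract must be installed):

from PIL import Image
from transformers import pipeline

# Assumed checkpoint; needs pytesseract for the built-in OCR step.
docqa = pipeline("document-question-answering", model="impira/layoutlm-document-qa")

doc_image = Image.open("invoice.png").convert("RGB")
result = docqa(image=doc_image, question="What is the total amount?")

# Each candidate comes with a score and the answer span extracted from the page.
print(result[0]["answer"])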

🧠 Prompt Engineering for VQA & DocQA

Bad prompt:

“Answer the question.”

Good prompt:

“Answer using only visible evidence. If uncertain, say ‘Not found’.”

Prompt discipline reduces hallucination.
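
A small sketch of how the evidence-aware instruction can be wrapped around any question before it reaches the model (the wording and helper name are illustrative, not a fixed API):

EVIDENCE_PROMPT = (
    "Answer using only visible evidence from the image or document. "
    "If the answer is not visible, reply exactly with 'Not found'.\n\n"
    "Question: {question}"
)

def build_prompt(question: str) -> str:
    # Wrap every user question in the same evidence-aware instruction.
    return EVIDENCE_PROMPT.format(question=question)

print(build_prompt("What is the invoice total?"))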


🧪 Datasets (Must-Know)

VQA

  • VQA v2
  • GQA
  • CLEVR (synthetic reasoning)

DocQA

  • DocVQA
  • FUNSD
  • RVL-CDIP
  • CORD

📏 Evaluation Metrics

VQA

  • Accuracy
  • Consensus-based scoring

DocQA

  • Exact Match (EM)
  • F1 score
  • String normalization critical

One wrong digit = wrong answer
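
A minimal sketch of SQuAD-style normalization plus Exact Match and token-level F1, the usual way DocQA answers are scored (helper names are ours):

import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def f1(pred: str, gold: str) -> float:
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("$1,245.50", "$1245.50"))  # True: normalization removes the comma
print(f1("total is 1245.50", "1245.50"))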


⚠️ Hallucination: The Silent Killer

Common failure cases:

  • Answering from prior knowledge
  • Guessing missing fields
  • Confusing similar layouts

Mitigation:

  • Evidence-aware prompting
  • Answer verification (see the sketch after this list)
  • Human-in-the-loop (later lecture)
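
One simple verification strategy, sketched below under the assumption that the OCR words are still available: accept the model's answer only if its tokens actually appear on the page, otherwise abstain.

import string

def _norm(text: str) -> str:
    # Minimal normalization: lowercase and strip punctuation.
    return "".join(ch for ch in text.lower() if ch not in string.punctuation)

def verify_answer(answer: str, ocr_words: list[str]) -> str:
    """Return the answer only if every token is supported by OCR text; else abstain."""
    page_tokens = {_norm(w) for w in ocr_words}
    answer_tokens = [_norm(t) for t in answer.split()]
    if answer_tokens and all(tok in page_tokens for tok in answer_tokens):
        return answer
    return "Not found"

print(verify_answer("$1,245.50", ["Invoice", "Total:", "$1,245.50"]))  # kept
print(verify_answer("$9,999.99", ["Invoice", "Total:", "$1,245.50"]))  # Not found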

🧠 Research Insight

DocQA is closer to symbolic reasoning than to vision.

Strong DocQA models:

  • Behave more like databases
  • Require constraint enforcement
  • Prefer abstention over guessing

🧪 Student Knowledge Check (Hidden Answers)

Q1 — Objective

What extra component does DocQA require compared to VQA?

Answer

OCR and layout understanding.


Q2 — MCQ

Which model explicitly encodes layout?

A. CLIP B. ViT C. LayoutLM D. ResNet

Answer

C. LayoutLM


Q3 — MCQ

Which metric is most important for DocQA?

A. BLEU B. ROUGE C. Exact Match D. Perplexity

Answer

C. Exact Match


Q4 — Objective

Why is hallucination more dangerous in DocQA?

Answer

Because answers must be exact and legally or financially correct.


Q5 — Objective

What is layout reasoning?

Answer

Understanding text based on spatial structure, not just content.


🌱 Final Reflection

If AI can read documents perfectly, what must humans still verify?

Truth, intent, context, and ethical consequences.


✅ Key Takeaways

  • VQA = visual reasoning
  • DocQA = precision reasoning
  • OCR quality defines upper bound
  • Layout is intelligence
  • Abstaining is better than hallucinating
