Lecture 05 — Image ↔ Text: Teaching Machines to See and Reason
~5 hours (core multimodal lecture)
🌍 Why Image–Text Is So Important
Vision is the dominant human sense.
Images contain:
- spatial structure
- objects
- relationships
- context
- ambiguity
Teaching machines to see and explain is one of the hardest problems in AI.
Seeing is easy. Understanding is hard.
🧠 What Is an Image–Text Multimodal System?
An Image–Text-to-Text system:
- 🖼 takes an image
- ✍️ takes optional text (a question or instruction)
- 🧠 performs reasoning over both
- 🗣 produces text
Examples:
- Image captioning
- Visual Question Answering (VQA)
- Image-based reasoning
- Medical imaging reports
- Autonomous perception explanations
🧩 High-Level Architecture
Image
↓
Vision Encoder (ViT / CNN)
↓
Projection / Alignment
↓
LLM (Reasoning)
↓
Text Output
Key idea:
Vision models perceive. LLMs reason.
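Before the labs, here is a minimal, runnable sketch of these four stages using toy PyTorch stand-ins. The layer shapes are illustrative assumptions; real systems replace each stage with large pretrained models, as the labs below do.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the pipeline stages; real systems use pretrained models.
vision_encoder = nn.Conv2d(3, 16, kernel_size=8, stride=8)  # pixels -> local features
projection = nn.Linear(16, 32)                              # vision space -> language space

image = torch.randn(1, 3, 64, 64)              # fake RGB image (batch, channels, H, W)
features = vision_encoder(image)               # (1, 16, 8, 8)
tokens = features.flatten(2).transpose(1, 2)   # (1, 64, 16): a sequence of patch vectors
aligned = projection(tokens)                   # (1, 64, 32): ready to feed an LLM
print(aligned.shape)
```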
🧠 Why Vision Alone Is Not Enough
Vision models are great at:
- detecting patterns
- recognizing objects
- learning spatial features
But they struggle with:
- logic
- explanation
- abstraction
- causality
Images do not reason. Language does.
🧪 Knowledge Check — Fundamentals
Q1 (True / False)
A vision model alone can perform logical reasoning.
Answer
False.
🧠 Core Image–Text Tasks
| Task | Input | Output |
|---|---|---|
| Image Captioning | Image | Text |
| VQA | Image + Question | Answer |
| Grounding | Image + Text | Region |
| Image Explanation | Image | Explanation |
| OCR + Reasoning | Image | Text + Answer |
This lecture focuses on captioning + VQA.
👁 Vision Encoders (The Eyes)
Popular encoders:
- ResNet (CNN-based)
- Vision Transformer (ViT)
- Swin Transformer
- ConvNeXt
They convert pixels → embeddings.
🧠 Vision Encoder Output
Image → Patch embeddings → Sequence of vectors
Each patch represents local visual semantics.
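A minimal sketch of extracting patch embeddings with a pretrained ViT via Hugging Face transformers; the checkpoint name and `example.jpg` are assumptions:

```python
from transformers import ViTImageProcessor, ViTModel
from PIL import Image

# Assumed checkpoint; any ViT with 16x16 patches behaves similarly.
vit_name = "google/vit-base-patch16-224-in21k"
processor = ViTImageProcessor.from_pretrained(vit_name)
vit = ViTModel.from_pretrained(vit_name)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
outputs = vit(**inputs)

# One vector per 16x16 patch, plus a [CLS] token at position 0.
print(outputs.last_hidden_state.shape)  # (1, 197, 768): 196 patches + [CLS]
```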
🧪 Knowledge Check — Vision
Q2 (MCQ)
Which model introduced patch-based vision processing?
A) ResNet
B) YOLO
C) ViT
D) AlexNet
Correct Answer
C) ViT
🧠 The Alignment Problem (CRITICAL)
Vision embeddings ≠ language embeddings.
Alignment answers:
How does a pixel become a word?
🧩 Alignment Strategies
| Strategy | Description |
|---|---|
| Linear projection | Simple, efficient |
| MLP | Nonlinear mapping |
| Cross-attention | Deep fusion |
| Q-Former | Learned query alignment |
Most failures come from poor alignment.
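As a sketch of the simplest strategy, a single linear layer can map vision embeddings into the LLM's embedding space. The dimensions below (768 for the vision encoder, 4096 for the LLM) are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Linear projection: vision embedding space -> LLM embedding space.
# 768 (a typical ViT hidden size) and 4096 (a typical LLM hidden size) are assumed.
projection = nn.Linear(768, 4096)

patch_embeddings = torch.randn(1, 197, 768)      # (batch, patches, vision_dim)
language_tokens = projection(patch_embeddings)   # (batch, patches, llm_dim)
print(language_tokens.shape)                     # torch.Size([1, 197, 4096])
```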
🧪 Knowledge Check — Alignment
Q3 (Objective)
What is the purpose of the projection layer?
Answer
To map vision embeddings into the language embedding space.
🧠 CLIP — A Foundational Breakthrough
CLIP learned:
Image ↔ Text similarity
via contrastive learning:
- pull matched image–caption pairs together
- push mismatched pairs apart
Result:
a shared semantic space for images and text
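A minimal sketch of that shared space in action, assuming the openai/clip-vit-base-patch32 checkpoint and a local `example.jpg`; CLIP scores the image against each candidate caption:

```python
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

clip_name = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(clip_name)
processor = CLIPProcessor.from_pretrained(clip_name)

image = Image.open("example.jpg").convert("RGB")
captions = ["a dog playing in a park",
            "a plate of food on a table",
            "a city street at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Softmax over image-text similarity: higher = closer in the shared space.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0]):
    print(f"{p:.2f}  {caption}")
```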
🧪 Knowledge Check — CLIP
Q4 (MCQ)
What learning paradigm does CLIP use?
A) Supervised classification
B) Reinforcement learning
C) Contrastive learning
D) Autoencoding
Correct Answer
C) Contrastive learning
🐍 Python Lab 1 — Image Captioning with BLIP
📦 Install Dependencies
```bash
pip install transformers pillow torch torchvision
```
🧠 Load Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

# BLIP base captioning checkpoint: vision encoder + text decoder in one model.
model_name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)
```
🖼 Load Image
```python
image = Image.open("example.jpg").convert("RGB")
```
✍️ Generate Caption
```python
inputs = processor(image, return_tensors="pt")     # pixels -> model-ready tensors
out = model.generate(**inputs, max_new_tokens=50)  # autoregressive caption generation
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```
🎉 You just taught a machine to describe what it sees.
🧪 Knowledge Check — Code
Q5 (Objective)
What role does the processor play in BLIP?
Answer
It preprocesses images and decodes generated tokens.
🧠 Adding Reasoning with an LLM
Captioning ≠ understanding.
We add an LLM to:
- answer questions
- infer relationships
- explain scenes
🏗 Full Image–Text Reasoning Pipeline
Image → Vision Encoder → Alignment → LLM → Answer
Optional: a question is appended to the LLM input.
🐍 Python Lab 2 — Visual Question Answering
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

question = "What is the person doing in the image?"

# Reuse the caption produced in Lab 1 as the image description.
prompt = f"""
You are a vision-language assistant.
Image description:
{caption}
Question:
{question}
"""

llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated model: requires Hugging Face access
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = llm.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the echoed prompt.
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                          skip_special_tokens=True)
print(answer)
```
🧪 Knowledge Check — Reasoning
Q6 (True / False)
Image captioning alone is sufficient for VQA.
Answer
False.
🧠 Common Failure Modes
- Hallucinated objects
- Missed small details
- Incorrect spatial relations
- Cultural bias
- Overconfidence
Seeing ≠ understanding ≠ truth
🧪 Knowledge Check — Failures
Q7 (MCQ)
Which failure is most dangerous?
A) Slow inference
B) Hallucinated objects
C) Low resolution
D) Long captions
Correct Answer
B) Hallucinated objects
🧠 Evaluation Metrics
| Metric | Use |
|---|---|
| BLEU / CIDEr | Caption quality |
| VQA Accuracy | QA correctness |
| Human Eval | Trust & clarity |
| Calibration | Confidence reliability |
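A minimal sketch of an exact-match accuracy check on toy data. Note that the official VQA metric is softer (it averages agreement across multiple human answers), so this is a simplification:

```python
# Simplified VQA accuracy: normalize strings and count exact matches.
# The official VQA metric instead averages agreement over several human answers.
def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def vqa_accuracy(predictions: list[str], references: list[str]) -> float:
    matches = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return matches / len(references)

print(vqa_accuracy(["Two dogs.", "blue"], ["two dogs", "red"]))  # 0.5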
🧪 Knowledge Check — Evaluation
Q8 (Objective)
Why is human evaluation important for vision-language tasks?
Answer
Because automatic metrics cannot fully capture semantic correctness; humans judge whether an answer is meaningful, trustworthy, and clear.
🌱 Ethics in Vision–Language AI
Risks:
- surveillance
- facial recognition abuse
- bias in datasets
- privacy violation
If a machine sees people, it must respect humanity.
🧪 Knowledge Check — Ethics
Q9 (True / False)
Image–text systems can be ethically neutral.
Answer
False.
🧠 Human-in-the-Loop (Best Practice)
- Human review for sensitive outputs
- Confidence thresholds
- Explainable responses
- Feedback-based refinement
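A minimal sketch of the confidence-threshold idea, assuming the model returns an answer together with a confidence score in [0, 1]; the threshold value is an illustrative assumption:

```python
REVIEW_THRESHOLD = 0.7  # illustrative value; tune per application and risk level

def route_output(answer: str, confidence: float) -> str:
    """Route low-confidence answers to human review instead of releasing them."""
    if confidence < REVIEW_THRESHOLD:
        return f"[NEEDS HUMAN REVIEW] {answer} (confidence {confidence:.2f})"
    return answer

print(route_output("The person is riding a bicycle.", 0.55))
print(route_output("The person is riding a bicycle.", 0.92))
```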
✅ Final Takeaways
- Vision provides perception
- Language provides reasoning
- Alignment is the hardest part
- Python pipelines are modular
- Ethics is not optional
🌍 Final Reflection
If a machine describes a human incorrectly, who is harmed?
The human — therefore responsibility lies with designers.