Lecture 05 — Image ↔ Text: Teaching Machines to See and Reason

~5 hours (core multimodal lecture)


🌍 Why Image–Text Is So Important

Vision is the dominant human sense.

Images contain:

  • spatial structure
  • objects
  • relationships
  • context
  • ambiguity

Teaching machines to see and explain is one of the hardest problems in AI.

Seeing is easy. Understanding is hard.


🧠 What Is an Image–Text Multimodal System?

An Image–Text-to-Text system:

  • 🖼 takes an image
  • ✍️ takes optional text (a question or instruction)
  • 🧠 reasons over both
  • 🗣 produces a text response

Examples:

  • Image captioning
  • Visual Question Answering (VQA)
  • Image-based reasoning
  • Medical imaging reports
  • Autonomous perception explanations

🧩 High-Level Architecture


Image
↓
Vision Encoder (ViT / CNN)
↓
Projection / Alignment
↓
LLM (Reasoning)
↓
Text Output

Key idea:

Vision models perceive. LLMs reason.


🧠 Why Vision Alone Is Not Enough

Vision models are great at:

  • detecting patterns
  • recognizing objects
  • learning spatial features

But they struggle with:

  • logic
  • explanation
  • abstraction
  • causality

Images do not reason. Language does.


🧪 Knowledge Check — Fundamentals

Q1 (True / False)

A vision model alone can perform logical reasoning.

Answer

False.


🧠 Core Image–Text Tasks

Task             | Input            | Output
Image Captioning | Image            | Text
VQA              | Image + Question | Answer
Grounding        | Image + Text     | Image region
Image QA         | Image            | Explanation
OCR + Reasoning  | Image            | Text + Answer

This lecture focuses on captioning + VQA.


👁 Vision Encoders (The Eyes)

Popular encoders:

  • ResNet (CNN-based)
  • Vision Transformer (ViT)
  • Swin Transformer
  • ConvNeXt

They convert pixels → embeddings.


🧠 Vision Encoder Output


Image → Patch embeddings → Sequence of vectors

Each patch represents local visual semantics.
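
A minimal sketch of this step, assuming the Hugging Face transformers library and the google/vit-base-patch16-224 checkpoint (any ViT variant works): the encoder splits the image into 16×16 patches and returns one embedding vector per patch.

from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import torch

# Standard ViT encoder (assumed checkpoint; swap in any ViT variant)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vision_encoder = ViTModel.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = vision_encoder(**inputs)

# Shape: (batch, num_patches + 1 CLS token, hidden_dim), one vector per 16x16 patch
patch_embeddings = outputs.last_hidden_state
print(patch_embeddings.shape)  # e.g. torch.Size([1, 197, 768])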


🧪 Knowledge Check — Vision

Q2 (MCQ)

Which architecture introduced processing images as sequences of patches?

A) ResNet
B) YOLO
C) ViT
D) AlexNet

Correct Answer

C) ViT


🧠 The Alignment Problem (CRITICAL)

Vision embeddings ≠ language embeddings.

Alignment answers:

How does a pixel become a word?


🧩 Alignment Strategies

Strategy          | Description
Linear projection | Simple, efficient
MLP               | Nonlinear mapping
Cross-attention   | Deep fusion
Q-Former          | Learned query alignment

Most failures come from poor alignment.
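
To make the simplest strategy concrete, here is a minimal sketch of a linear projection that maps vision embeddings into the LLM's embedding space; the 768 and 4096 dimensions are assumptions chosen to match ViT-Base and a Llama-style LLM.

import torch
import torch.nn as nn

VISION_DIM = 768   # e.g. ViT-Base hidden size (assumed)
LLM_DIM = 4096     # e.g. Llama-style hidden size (assumed)

# Linear projection: the simplest alignment strategy in the table above
projection = nn.Linear(VISION_DIM, LLM_DIM)

# Dummy batch of patch embeddings: (batch, num_patches, vision_dim)
patch_embeddings = torch.randn(1, 196, VISION_DIM)

# Projected "visual tokens" that can be fed to the LLM alongside text embeddings
visual_tokens = projection(patch_embeddings)
print(visual_tokens.shape)  # torch.Size([1, 196, 4096])

An MLP swaps the single linear layer for a small nonlinear network; cross-attention and Q-Former go further by letting the language side (or a set of learned queries) attend directly to the patch embeddings.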


🧪 Knowledge Check — Alignment

Q3 (Objective)

What is the purpose of the projection layer?

Answer

To map vision embeddings into the language embedding space.


🧠 CLIP — A Foundational Breakthrough

CLIP learned:


Image ↔ Text similarity

By contrastive learning:

  • match image with correct caption
  • separate mismatched pairs

Result:

Shared semantic space
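
A short sketch of that shared space in practice, assuming the openai/clip-vit-base-patch32 checkpoint: CLIP scores one image against several candidate captions, and the matching caption gets the highest probability.

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg").convert("RGB")
captions = ["a dog playing in a park", "a plate of food", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")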


🧪 Knowledge Check — CLIP

Q4 (MCQ)

What learning paradigm does CLIP use?

A) Supervised classification
B) Reinforcement learning
C) Contrastive learning
D) Autoencoding

Correct Answer

C) Contrastive learning


🐍 Python Lab 1 — Image Captioning with BLIP

📦 Install Dependencies

pip install transformers pillow torch torchvision

🧠 Load Model

from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image

model_name = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)

🖼 Load Image

image = Image.open("example.jpg").convert("RGB")

✍️ Generate Caption

inputs = processor(image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=50)

caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)

🎉 You just taught a machine to describe what it sees.
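
If a GPU is available, a small optional tweak speeds this up considerably; this is a sketch and an assumption about your hardware, and the CPU path above works unchanged.

import torch

# Move the model and inputs to the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

inputs = processor(image, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(out[0], skip_special_tokens=True))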


🧪 Knowledge Check — Code

Q5 (Objective)

What role does the processor play in BLIP?

Answer

It preprocesses images and decodes generated tokens.


🧠 Adding Reasoning with an LLM

Captioning ≠ understanding.

We add an LLM to:

  • answer questions
  • infer relationships
  • explain scenes

🏗 Full Image–Text Reasoning Pipeline

Image → Vision Encoder → Alignment → LLM → Answer

Optional:

+ Question

🐍 Python Lab 2 — Visual Question Answering

question = "What is the person doing in the image?"

prompt = f"""
You are a vision-language assistant.
Image description:
{caption}

Question:
{question}
"""

from transformers import AutoTokenizer, AutoModelForCausalLM

# Any instruction-tuned LLM works here; Llama 3 is gated on Hugging Face and
# requires accepting its license before download
llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

inputs = tokenizer(prompt, return_tensors="pt")
outputs = llm.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the echoed prompt
answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
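
For instruction-tuned models, wrapping the prompt in the model's chat template usually produces better answers. A sketch, assuming the loaded tokenizer ships a chat template (Llama 3 Instruct does):

# Build the chat-formatted input from the same prompt
messages = [{"role": "user", "content": prompt}]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

outputs = llm.generate(chat_inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0][chat_inputs.shape[1]:], skip_special_tokens=True))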

🧪 Knowledge Check — Reasoning

Q6 (True / False)

Image captioning alone is sufficient for VQA.

Answer

False.


🧠 Common Failure Modes

  • Hallucinated objects
  • Missed small details
  • Incorrect spatial relations
  • Cultural bias
  • Overconfidence

Seeing ≠ understanding ≠ truth


🧪 Knowledge Check — Failures

Q7 (MCQ)

Which failure is most dangerous?

A) Slow inference
B) Hallucinated objects
C) Low resolution
D) Long captions

Correct Answer

B) Hallucinated objects


🧠 Evaluation Metrics

Metric        | Use
BLEU / CIDEr  | Caption quality
VQA Accuracy  | QA correctness
Human Eval    | Trust & clarity
Calibration   | Confidence reliability
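
As a small illustration of the first two rows, here is a sketch assuming the Hugging Face evaluate library (pip install evaluate) and toy data; the numbers are only illustrative, and real VQA accuracy uses a more forgiving consensus formula than exact match.

import evaluate

# Caption quality: BLEU compares generated captions against reference captions
bleu = evaluate.load("bleu")
predictions = ["a dog runs across the grass"]
references = [["a dog is running across the grass"]]
print(bleu.compute(predictions=predictions, references=references))

# Simplified VQA-style accuracy: exact match over normalized answers
def simple_vqa_accuracy(predicted, gold):
    matches = sum(p.strip().lower() == g.strip().lower() for p, g in zip(predicted, gold))
    return matches / len(gold)

print(simple_vqa_accuracy(["two", "red"], ["two", "blue"]))  # 0.5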

🧪 Knowledge Check — Evaluation

Q8 (Objective)

Why is human evaluation important for vision-language tasks?

Answer

Because correctness depends on semantics and context that automatic metrics cannot fully capture; human judgment is needed to assess clarity and trust.


🌱 Ethics in Vision–Language AI

Risks:

  • surveillance
  • facial recognition abuse
  • bias in datasets
  • privacy violation

If a machine sees people, it must respect humanity.


🧪 Knowledge Check — Ethics

Q9 (True / False)

Image–text systems can be ethically neutral.

Answer

False.


🧠 Human-in-the-Loop (Best Practice)

  • Human review for sensitive outputs
  • Confidence thresholds
  • Explainable responses
  • Feedback-based refinement
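
A minimal sketch of a confidence threshold gate, assuming your pipeline exposes some confidence score (for example an average token probability or a calibrated estimate); the 0.7 threshold and the function name are placeholders.

def answer_with_review(question: str, model_answer: str, confidence: float,
                       threshold: float = 0.7) -> str:
    """Route low-confidence vision-language answers to a human reviewer."""
    if confidence >= threshold:
        return model_answer
    # Below the threshold: flag for human review instead of answering directly
    return f"[NEEDS HUMAN REVIEW] question={question!r} draft={model_answer!r}"

print(answer_with_review("What is the person doing?", "Riding a bicycle", confidence=0.55))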

✅ Final Takeaways

  • Vision provides perception
  • Language provides reasoning
  • Alignment is the hardest part
  • Python pipelines are modular
  • Ethics is not optional

🌍 Final Reflection

If a machine describes a human incorrectly, who is harmed?

The human — therefore responsibility lies with designers.

