Lecture 04 — Audio ↔ Text ↔ Reasoning: Teaching Machines to Listen

~4–5 hours (theory + code)


🌍 Why Audio–Text Matters

Sound is the first interface of intelligence.

Before writing:

  • humans listened
  • learned tone
  • sensed emotion
  • reacted in real time

Audio–Text systems allow machines to:

  • transcribe speech
  • understand intent
  • reason about meaning
  • respond naturally

If vision gives machines eyes, audio gives them ears.


🧠 What Is an Audio–Text Multimodal System?

An Audio–Text-to-Text system:

  • 🎧 takes audio input (speech, sound)
  • 🧠 converts it into internal representations
  • ✍️ produces text output

Examples:

  • Speech-to-text (ASR)
  • Audio question answering
  • Voice assistants
  • Meeting summarization
  • Emotion-aware chatbots

🧩 High-Level Architecture


Raw Audio
↓
Audio Encoder (Whisper / Wav2Vec)
↓
Projection / Alignment
↓
LLM (Reasoning)
↓
Text Output

Audio models perceive. LLMs reason.
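
The same flow can be sketched as a single PyTorch module. This is wiring only, not a specific library's API: the encoder, projector, and LLM are placeholders you would swap for real components.

import torch.nn as nn

class AudioToTextPipeline(nn.Module):
    # Illustrative skeleton: perceive (encoder) → align (projector) → reason (LLM)
    def __init__(self, audio_encoder, projector, llm):
        super().__init__()
        self.audio_encoder = audio_encoder   # e.g. a Whisper or Wav2Vec 2.0 encoder
        self.projector = projector           # maps audio features into the LLM's space
        self.llm = llm                       # any text-generating language model

    def forward(self, waveform):
        audio_features = self.audio_encoder(waveform)   # perception
        aligned = self.projector(audio_features)        # alignment
        # Assumes an LLM that accepts embeddings directly (e.g. inputs_embeds in Hugging Face)
        return self.llm(inputs_embeds=aligned)          # reasoning over audio "tokens"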


🎧 Understanding Audio (First Principles)

Audio is a time-domain signal: a stream of amplitude values sampled over time.

[Plot: amplitude vs. time]
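
A quick way to see this in code, assuming torchaudio (installed in the lab below) and an example file sample.wav:

import torchaudio

# Load a clip: a 2-D tensor of amplitudes, shape (channels, num_samples)
waveform, sample_rate = torchaudio.load("sample.wav")

num_samples = waveform.shape[1]
print(f"Sample rate: {sample_rate} Hz")
print(f"Duration:    {num_samples / sample_rate:.2f} s")
print(f"Amplitude range: {waveform.min():.3f} to {waveform.max():.3f}")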

Key challenges:

  • noise
  • accents
  • speed variation
  • emotion
  • silence

🧠 Why Audio Is Harder Than Text

Aspect      Audio   Text
Noise       High    Low
Ambiguity   High    Medium
Temporal    Yes     No
Emotion     Yes     Rare

Audio carries meaning beyond words.


🧪 Knowledge Check — Fundamentals

Q1 (Objective)

Why is audio considered a high-entropy modality?

Answer

Because it varies continuously in time, tone, speed, and noise.


🧠 Core Audio Tasks

Task                 Description
ASR                  Speech → Text
Audio QA             Audio + Question → Answer
Emotion Recognition  Audio → Emotion
Speaker ID           Audio → Identity
Reasoning            Audio → Meaning

This lecture focuses on ASR + reasoning.


🎧 Audio Encoders (The Ears)

Popular models:

  • Whisper (OpenAI)
  • Wav2Vec 2.0
  • HuBERT

They convert waveform → embeddings.
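
A minimal sketch of waveform → embeddings with Wav2Vec 2.0. The model id and the silent dummy waveform are illustrative; real audio must be 16 kHz mono for this checkpoint.

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

# 1 second of silence as a stand-in for real 16 kHz mono audio
waveform = torch.zeros(16000).numpy()

inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frame_embeddings = encoder(**inputs).last_hidden_state

print(frame_embeddings.shape)   # (1, num_frames, 768) for the base model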


🧠 Encoder Output


Audio waveform → Frame-level embeddings → Sequence

Then:

  • pooled
  • projected
  • aligned to language space
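
A toy sketch of the pooling and projection step. The 768-dim audio features and the 4096-dim language space are illustrative sizes, not fixed requirements.

import torch
import torch.nn as nn

# Pretend encoder output: 1 clip, 200 frames, 768-dim embeddings
frame_embeddings = torch.randn(1, 200, 768)

# Pool frame-level features into one clip-level vector
pooled = frame_embeddings.mean(dim=1)     # (1, 768)

# Project into the (assumed) embedding space of the language model
projector = nn.Linear(768, 4096)
audio_token = projector(pooled)           # (1, 4096) — ready to align with text tokens

print(audio_token.shape)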

🧪 Knowledge Check — Encoders

Q2 (MCQ)

Which model is widely used for multilingual ASR?

A) ResNet
B) Whisper
C) CLIP
D) BERT

Correct Answer

B) Whisper


🐍 Python Lab 1 — Speech-to-Text with Whisper

📦 Install Dependencies

pip install transformers accelerate torch torchaudio

🧠 Load Model

from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

model_name = "openai/whisper-small"

processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

🎧 Load Audio

waveform, sample_rate = torchaudio.load("sample.wav")

# Whisper expects 16 kHz mono audio: mix down and resample if needed
waveform = waveform.mean(dim=0)
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000

inputs = processor(
    waveform,
    sampling_rate=sample_rate,
    return_tensors="pt"
)

✍️ Generate Text

with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

text = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)

print(text)

🎉 You just built a speech-to-text system.


🧪 Knowledge Check — Code Understanding

Q3 (Objective)

What is the role of the processor in Whisper?

Answer

It converts raw audio into model-ready features and decodes outputs.


🧠 Adding Reasoning with an LLM

ASR alone is not intelligence.

We add an LLM to:

  • summarize
  • answer questions
  • extract intent
  • reason

🏗 Full Audio–Text Reasoning Pipeline

Audio → ASR → Transcript → Prompt → LLM → Answer

This separation is architecturally clean: the ASR component and the reasoning component can be swapped, scaled, or upgraded independently.


🐍 Python Lab 2 — Audio QA with LLM

from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated repo: requires accepting the Llama 3 license on Hugging Face
llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(llm_name)

🧠 Prompting with Transcript

transcript = text[0]

prompt = f"""
You are an assistant.
Audio transcript:
{transcript}

Question:
What is the main topic of this audio?
"""

inputs = tokenizer(prompt, return_tensors="pt")

outputs = llm.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(answer)
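
Putting the two labs together: a small helper that goes from an audio file to an answer. It assumes the Whisper processor/model and the LLM loaded above are already in memory; the function name and the example question are our own.

def answer_from_audio(audio_path, question):
    # 1. Perceive: transcribe the audio with Whisper
    wav, sr = torchaudio.load(audio_path)
    wav = wav.mean(dim=0)                                   # mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    features = processor(wav, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(features["input_features"])
    transcript = processor.batch_decode(ids, skip_special_tokens=True)[0]

    # 2. Reason: ask the LLM about the transcript
    prompt = f"Audio transcript:\n{transcript}\n\nQuestion: {question}\nAnswer:"
    llm_inputs = tokenizer(prompt, return_tensors="pt")
    out = llm.generate(**llm_inputs, max_new_tokens=100)
    new = out[0][llm_inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new, skip_special_tokens=True)

print(answer_from_audio("sample.wav", "What is the main topic of this audio?"))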

🧪 Knowledge Check — System Design

Q4 (True / False)

ASR models alone can reason about audio content.

Answer

False.


🧠 Audio–Text Alignment Strategies

Strategy             Use Case
Transcript-only      Simple QA
Timestamp alignment  Video/audio sync
Embedding fusion     Emotion + content
Multitask training   Robust systems

⚠️ Common Failure Modes

  • Background noise
  • Code-switching
  • Accents
  • Domain-specific terms
  • Emotional nuance loss

Architects design for failure, not perfection.


🧪 Knowledge Check — Failures

Q5 (MCQ)

Which issue is hardest for ASR?

A) Silence
B) Clear speech
C) Accents
D) Typed text

Correct Answer

C) Accents


🧠 Evaluation Metrics

Metric       Meaning
WER          Word Error Rate
CER          Character Error Rate
QA Accuracy  Reasoning quality
Latency      Real-time ability
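
WER is the word-level edit distance divided by the number of reference words. A small self-contained sketch (libraries such as jiwer provide the same metric ready-made):

def word_error_rate(reference, hypothesis):
    # WER = (substitutions + insertions + deletions) / reference word count
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ≈ 0.167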

🧪 Knowledge Check — Evaluation

Q6 (Objective)

Why is low WER not sufficient for good QA?

Answer

Because reasoning depends on semantic correctness, not just word accuracy.


🌱 Ethics & Audio AI

Risks:

  • surveillance
  • consent violation
  • accent discrimination
  • voice cloning

Listening without permission is not intelligence.


🧪 Knowledge Check — Ethics

Q7 (True / False)

Audio AI systems should always inform users they are listening.

Answer

True.


🧠 Human-in-the-Loop (HITL)

Best practice:

  • humans review transcripts
  • corrections feed back
  • confidence thresholds trigger review
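
A minimal sketch of the confidence-threshold idea; the cut-off value and routing function are illustrative choices, not a standard API.

REVIEW_THRESHOLD = 0.80   # illustrative cut-off, tuned per deployment

def route_transcript(transcript, confidence):
    # Low-confidence transcripts go to a human reviewer; the rest are auto-accepted
    if confidence < REVIEW_THRESHOLD:
        return {"transcript": transcript, "status": "needs_human_review"}
    return {"transcript": transcript, "status": "auto_accepted"}

print(route_transcript("book a table for two at 7pm", confidence=0.65))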

✅ Final Takeaways

  • Audio is rich but noisy
  • Encoders perceive, LLMs reason
  • Separation of ASR and reasoning is powerful
  • Python pipelines are modular
  • Ethics matter deeply for audio

🌍 Final Reflection

If a machine understands your voice, what responsibility does it have?

To respect consent, privacy, and human dignity.

