Lecture 04 — Audio ↔ Text ↔ Reasoning: Teaching Machines to Listen
~4–5 hours (theory + code)
🌍 Why Audio–Text Matters
Sound is the first interface of intelligence.
Long before writing existed, humans:
- listened
- read tone of voice
- sensed emotion
- reacted in real time
Audio–Text systems allow machines to:
- transcribe speech
- understand intent
- reason about meaning
- respond naturally
If vision gives machines eyes, audio gives them ears.
🧠 What Is an Audio–Text Multimodal System?
An Audio–Text-to-Text system:
- 🎧 takes audio input (speech, sound)
- 🧠 converts it into internal representations
- ✍️ produces text output
Examples:
- Speech-to-text (ASR)
- Audio question answering
- Voice assistants
- Meeting summarization
- Emotion-aware chatbots
🧩 High-Level Architecture
Raw Audio
↓
Audio Encoder (Whisper / Wav2Vec)
↓
Projection / Alignment
↓
LLM (Reasoning)
↓
Text Output
Audio models perceive. LLMs reason.
🎧 Understanding Audio (First Principles)
Audio is a time-domain signal: amplitude sampled over time.
Key challenges:
- noise
- accents
- speed variation
- emotion
- silence
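Before tackling these challenges, it helps to see what the raw signal looks like in code. Below is a minimal sketch that loads a clip with torchaudio and inspects its shape, sampling rate, and duration; "sample.wav" is a placeholder for any local recording.

import torchaudio

# Load a local clip; returns a tensor of shape [channels, samples]
waveform, sample_rate = torchaudio.load("sample.wav")

num_channels, num_samples = waveform.shape
duration_seconds = num_samples / sample_rate

print(f"channels: {num_channels}")
print(f"sampling rate: {sample_rate} Hz")
print(f"duration: {duration_seconds:.2f} s")
print(f"amplitude range: {waveform.min().item():.3f} to {waveform.max().item():.3f}")

Every sample is just an amplitude value at an instant in time; tone, emotion, and noise all have to be inferred from this one long sequence.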
🧠 Why Audio Is Harder Than Text
| Aspect | Audio | Text |
|---|---|---|
| Noise | High | Low |
| Ambiguity | High | Medium |
| Continuous in time | Yes | No |
| Emotion cues | Rich | Rare |
Audio carries meaning beyond words.
🧪 Knowledge Check — Fundamentals
Q1 (Objective)
Why is audio considered a high-entropy modality?
Answer
Because it varies continuously in time, tone, speed, and noise.
🧠 Core Audio Tasks
| Task | Description |
|---|---|
| ASR | Speech → Text |
| Audio QA | Audio + Question → Answer |
| Emotion Recognition | Audio → Emotion |
| Speaker ID | Audio → Identity |
| Reasoning | Audio → Meaning |
This lecture focuses on ASR + reasoning.
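The other tasks in the table map onto the same tooling. As a hedged aside, a Hugging Face audio-classification pipeline can prototype emotion recognition in a few lines; the checkpoint name below is an assumed public example, not part of this lecture's core stack.

from transformers import pipeline

# Audio-classification pipeline; the checkpoint is an assumed example
# trained for speech emotion recognition (SUPERB benchmark).
emotion_clf = pipeline(
    "audio-classification",
    model="superb/hubert-large-superb-er",
)

# "sample.wav" is a placeholder file; decoding the path requires ffmpeg
results = emotion_clf("sample.wav")
for r in results:
    print(r["label"], round(r["score"], 3))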
🎧 Audio Encoders (The Ears)
Popular models:
- Whisper (OpenAI)
- Wav2Vec 2.0
- HuBERT
They convert waveform → embeddings.
🧠 Encoder Output
Audio waveform → Frame-level embeddings → Sequence
Then:
- pooled
- projected
- aligned to the language model's embedding space (sketched below)
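Here is a minimal sketch of that flow using Wav2Vec 2.0 as the encoder: extract frame-level embeddings, mean-pool them into one vector, and project that vector with a linear layer into a hypothetical LLM hidden size. The projection layer is untrained and purely illustrative; in a real audio-LLM it is learned during alignment.

import torch
import torch.nn as nn
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

waveform, sample_rate = torchaudio.load("sample.wav")  # placeholder file
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.mean(dim=0).numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    frames = encoder(inputs.input_values).last_hidden_state   # [1, num_frames, 768]

pooled = frames.mean(dim=1)                                    # [1, 768] — one vector per clip

# Hypothetical projection into an LLM embedding space (e.g. 4096 dims);
# untrained here, learned jointly in real systems.
project = nn.Linear(pooled.shape[-1], 4096)
aligned = project(pooled)                                      # [1, 4096]
print(frames.shape, pooled.shape, aligned.shape)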
🧪 Knowledge Check — Encoders
Q2 (MCQ)
Which model is widely used for multilingual ASR?
A) ResNet
B) Whisper
C) CLIP
D) BERT
Correct Answer
B) Whisper
🐍 Python Lab 1 — Speech-to-Text with Whisper
📦 Install Dependencies
pip install transformers accelerate torch torchaudio
🧠 Load Model
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torch
import torchaudio

model_name = "openai/whisper-small"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)
🎧 Load Audio
waveform, sample_rate = torchaudio.load("sample.wav")

# Whisper expects mono 16 kHz audio; resample if needed
if sample_rate != 16000:
    waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)
    sample_rate = 16000

inputs = processor(
    waveform.mean(dim=0),   # average channels to mono
    sampling_rate=sample_rate,
    return_tensors="pt"
)
✍️ Generate Text
with torch.no_grad():
    predicted_ids = model.generate(inputs["input_features"])

text = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)
print(text)
🎉 You just built a speech-to-text system.
🧪 Knowledge Check — Code Understanding
Q3 (Objective)
What is the role of the processor in Whisper?
Answer
It converts raw audio into model-ready features and decodes outputs.
🧠 Adding Reasoning with an LLM
ASR alone is not intelligence.
We add an LLM to:
- summarize
- answer questions
- extract intent
- reason
🏗 Full Audio–Text Reasoning Pipeline
Audio → ASR → Transcript → Prompt → LLM → Answer
This separation is architecturally clean.
🐍 Python Lab 2 — Audio QA with LLM
from transformers import AutoTokenizer, AutoModelForCausalLM

# Gated model: requires accepting the Llama 3 license on Hugging Face
llm_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(llm_name)
llm = AutoModelForCausalLM.from_pretrained(
    llm_name,
    torch_dtype="auto",
    device_map="auto"
)
🧠 Prompting with Transcript
transcript = text[0]
prompt = f"""
You are an assistant.
Audio transcript:
{transcript}
Question:
What is the main topic of this audio?
"""
inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
outputs = llm.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, not the echoed prompt
answer = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print(answer)
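Putting the two labs together, the whole pipeline fits in one small function. This sketch assumes the Whisper objects from Lab 1 (processor, model) and the LLM objects from Lab 2 (tokenizer, llm) are already loaded in the same session; audio_qa is a name introduced here purely for illustration.

def audio_qa(audio_path, question):
    # 1. Perceive: Whisper turns the waveform into a transcript
    waveform, sr = torchaudio.load(audio_path)
    if sr != 16000:
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    features = processor(waveform.mean(dim=0), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(features["input_features"])
    transcript = processor.batch_decode(ids, skip_special_tokens=True)[0]

    # 2. Reason: the LLM answers a question about the transcript
    prompt = f"Audio transcript:\n{transcript}\n\nQuestion: {question}\nAnswer:"
    llm_inputs = tokenizer(prompt, return_tensors="pt").to(llm.device)
    out = llm.generate(**llm_inputs, max_new_tokens=100)
    return tokenizer.decode(out[0][llm_inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(audio_qa("sample.wav", "What is the main topic of this audio?"))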
🧪 Knowledge Check — System Design
Q4 (True / False)
ASR models alone can reason about audio content.
Answer
False.
🧠 Audio–Text Alignment Strategies
| Strategy | Use Case |
|---|---|
| Transcript-only | Simple QA |
| Timestamp alignment | Video/audio sync |
| Embedding fusion | Emotion + content |
| Multitask training | Robust systems |
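To illustrate the "embedding fusion" row, here is a minimal sketch that concatenates a pooled audio embedding with a pooled text embedding and feeds the result to a small classifier. All dimensions and the classifier head are hypothetical and untrained; a real system would learn them jointly.

import torch
import torch.nn as nn

# Hypothetical dimensions: 768-d audio embedding (e.g. pooled Wav2Vec 2.0 output)
# and 768-d text embedding (e.g. a pooled transcript encoding).
audio_emb = torch.randn(1, 768)   # stand-in for a pooled audio embedding
text_emb = torch.randn(1, 768)    # stand-in for a pooled transcript embedding

fused = torch.cat([audio_emb, text_emb], dim=-1)   # [1, 1536]

# Untrained head mapping the fused vector to, say, 4 emotion classes
head = nn.Sequential(
    nn.Linear(1536, 256),
    nn.ReLU(),
    nn.Linear(256, 4),
)
logits = head(fused)
print(logits.shape)   # torch.Size([1, 4])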
⚠️ Common Failure Modes
- Background noise
- Code-switching
- Accents
- Domain-specific terms
- Emotional nuance loss
Architects design for failure, not perfection.
🧪 Knowledge Check — Failures
Q5 (MCQ)
Which issue is hardest for ASR?
A) Silence
B) Clear speech
C) Accents
D) Typed text
Correct Answer
C) Accents
🧠 Evaluation Metrics
| Metric | Meaning |
|---|---|
| WER | Word Error Rate (lower is better) |
| CER | Character Error Rate (lower is better) |
| QA Accuracy | Correctness of downstream answers |
| Latency | Suitability for real-time use |
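WER is a word-level edit distance: substitutions, deletions, and insertions divided by the number of reference words. The helper below is a self-contained sketch of that computation (libraries such as jiwer provide the same metric).

def wer(reference: str, hypothesis: str) -> float:
    # WER = (substitutions + deletions + insertions) / reference length,
    # computed via Levenshtein distance over word sequences.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Low WER does not guarantee preserved meaning:
print(wer("i do not agree", "i do agree"))   # 0.25 — one deleted word, but the meaning is inverted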
🧪 Knowledge Check — Evaluation
Q6 (Objective)
Why is low WER not sufficient for good QA?
Answer
Because reasoning depends on semantic correctness, not just word accuracy.
🌱 Ethics & Audio AI
Risks:
- surveillance
- consent violation
- accent discrimination
- voice cloning
Listening without permission is not intelligence.
🧪 Knowledge Check — Ethics
Q7 (True / False)
Audio AI systems should always inform users they are listening.
Answer
True.
🧠 Human-in-the-Loop (HITL)
Best practice:
- humans review transcripts
- corrections feed back
- confidence thresholds trigger review
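A minimal sketch of confidence-gated review, reusing the Whisper objects (model, processor, inputs) from Lab 1: recent versions of transformers can return per-token generation scores, and a mean log-probability below a chosen cutoff routes the transcript to a human. The threshold value is an illustrative assumption, not a recommended setting.

import torch

REVIEW_THRESHOLD = -0.5   # assumed cutoff on mean token log-probability

with torch.no_grad():
    out = model.generate(
        inputs["input_features"],
        return_dict_in_generate=True,
        output_scores=True,
    )

# Per-token log-probabilities of the generated transcript
scores = model.compute_transition_scores(out.sequences, out.scores, normalize_logits=True)
confidence = scores.mean().item()

transcript = processor.batch_decode(out.sequences, skip_special_tokens=True)[0]

if confidence < REVIEW_THRESHOLD:
    print(f"Low confidence ({confidence:.2f}) — send to human review:", transcript)
else:
    print(f"Accepted ({confidence:.2f}):", transcript)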
✅ Final Takeaways
- Audio is rich but noisy
- Encoders perceive, LLMs reason
- Separation of ASR and reasoning is powerful
- Python pipelines are modular
- Ethics matter deeply for audio
🌍 Final Reflection
If a machine understands your voice, what responsibility does it have?
To respect consent, privacy, and human dignity.