Multimodal LLMs: Text, Vision, Audio, Video
Why this course exists
The world is entering the multimodal intelligence era.
AI can now:
- see images
- hear audio
- understand video
- reason over documents
- think across modalities
But power without understanding is dangerous.
This course exists to:
- rebuild multimodal LLMs from first principles
- empower learners to build, not just use
- place ethics and humanity at the center
Learning outcomes
By the end of this course, learners will be able to:
- Explain what multimodal LLMs really are
- Design multimodal pipelines from scratch
- Choose between pretraining, fine-tuning, RAG, and agents
- Implement basic multimodal models in code
- Evaluate models correctly (not just accuracy)
- Share models responsibly with the global community
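One outcome above, evaluating beyond accuracy, is worth a quick illustration. The toy numbers below are invented for this sketch: on imbalanced labels, a model that always predicts the majority class scores high accuracy while finding none of the cases that matter.

```python
# Toy sketch (hypothetical data): why accuracy alone misleads on
# imbalanced labels, and why precision/recall are also needed.

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# 95 negatives, 5 positives; a "model" that always predicts 0.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))          # 0.95 -- looks great
print(precision_recall(y_true, y_pred))  # (0.0, 0.0) -- finds no positives
```

The same reasoning carries over to multimodal evaluation, where a single headline metric can hide failures on rare but important inputs.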
Modalities covered
- Audio → Text
- Image → Text
- Image → Video
- Documents & QA
- Agents & RAG
- Human-in-the-loop systems
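As a rough sketch of how these modalities plug into one system (all names and interfaces below are illustrative stubs, not code from the course): a common design encodes each modality into a shared embedding space, fuses the embeddings, and lets a language model reason over the fused sequence.

```python
# Minimal sketch of a multimodal pipeline skeleton. Every encoder here
# is a stub; a real system would use a tokenizer + transformer for text,
# a vision encoder for images, and a speech encoder for audio.
from dataclasses import dataclass

EMBED_DIM = 4  # toy embedding dimension


@dataclass
class Input:
    modality: str   # "text" | "image" | "audio"
    payload: object


def encode(item: Input) -> list:
    """Map one input into the shared embedding space (stubbed)."""
    if item.modality == "text":
        return [float(len(str(item.payload)))] * EMBED_DIM
    if item.modality == "image":
        return [1.0] * EMBED_DIM
    if item.modality == "audio":
        return [2.0] * EMBED_DIM
    raise ValueError(f"unknown modality: {item.modality}")


def pipeline(inputs: list) -> str:
    # 1. Encode every input into the shared space.
    embeddings = [encode(x) for x in inputs]
    # 2. Fuse (here: trivially concatenate; real systems interleave
    #    projected embeddings with text tokens).
    fused = [v for e in embeddings for v in e]
    # 3. "Reason" -- stand-in for an LLM decoding step.
    return f"answer over {len(inputs)} inputs, {len(fused)} fused dims"


print(pipeline([Input("image", "cat.png"), Input("text", "What is shown?")]))
```

The point of the sketch is the shape of the pipeline (encode per modality, fuse, decode), which the lectures below develop with real models.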
Course philosophy
We do not build AI to replace humans.
We build AI to help humans become better humans.
Course structure
- Lecture 01 – What Is a Multimodal LLM, Really?
- Lecture 02 – How to Think Like a Multimodal System Designer
- Lecture 03 – Training Paradigms: Pretraining, Fine-tuning, and Training from Scratch
- Lecture 04 – Audio → Text → Reasoning: Teaching Machines to Listen
- Lecture 05 – Image → Text: Teaching Machines to See and Reason
- Lecture 06 – Video → Text Multimodal Intelligence
- Lecture 07 – Visual Question Answering (VQA) & Document Question Answering (DocQA)
- Lecture 08 – RAG, AI Agents & Agentic Multimodal Systems
- Lecture 09 – Evaluation of Multimodal & Agentic AI Systems
- Lecture 10 – Bias, Ethics & Human-in-the-Loop (HITL) in Multimodal AI
- Lecture 11 – Sharing Your Multimodal Model with the World (Hugging Face)
- Lecture 12 – Encoder, Decoder, and the Truth About How LLMs Are Trained
- Lecture 13 – Real-World LLM Engineer & Research Scientist Interview (Top Tech Level)
- Lecture 14 – Deep Learning Foundations & Modern AI (Final Mastery)