Multimodal LLMs: Text, Vision, Audio, and Video

Table of Contents

  • 🌍 Why this course exists
  • 🎯 Learning outcomes
  • 🧩 Modalities covered
  • 🧠 Course philosophy
  • 🧩 Course structure
  • 👨‍🏫 Instructor

🌍 Why this course exists

The world is entering the multimodal intelligence era.

AI can now:

  • 👀 see images
  • 👂 hear audio
  • 🎥 understand video
  • 📄 reason over documents
  • 🧠 think across modalities

But power without understanding is dangerous.

This course exists to:

  • rebuild multimodal LLMs from first principles
  • empower learners to build, not just use
  • place ethics and humanity at the center

🎯 Learning outcomes

By the end of this course, learners will be able to:

  • Explain what multimodal LLMs really are
  • Design multimodal pipelines from scratch
  • Choose between pretraining, fine-tuning, RAG, and agents
  • Implement basic multimodal models in code
  • Evaluate models rigorously (beyond accuracy alone)
  • Share models responsibly with the global community

🧩 Modalities covered

  • Audio ↔ Text
  • Image ↔ Text
  • Image ↔ Video
  • Documents & QA
  • Agents & RAG
  • Human-in-the-loop systems
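
To make the first two pairings concrete, here is a minimal sketch of audio → text and image → text inference. It assumes the Hugging Face transformers library and two public checkpoints (openai/whisper-small for speech recognition, Salesforce/blip-image-captioning-base for captioning); the course does not prescribe these tools, so treat this as one possible starting point rather than a reference implementation.

```python
# A minimal sketch of two multimodal directions, assuming:
#   pip install transformers torch pillow
# plus ffmpeg available on the system for audio decoding.
from transformers import pipeline

# Audio -> Text: automatic speech recognition with a Whisper checkpoint.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("meeting.wav")  # hypothetical local audio file
print(result["text"])

# Image -> Text: image captioning with a BLIP checkpoint.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captions = captioner("photo.jpg")  # hypothetical local image file
print(captions[0]["generated_text"])
```

The later items in the list (documents, agents, RAG, human-in-the-loop) build on the same core idea: mapping each modality into representations a language model can reason over.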

🧠 Course philosophy

We do not build AI to replace humans.
We build AI to help humans become better humans.

🧩 Course structure

👨‍🏫 Instructor

Teerapong Panboonyuen