CS588 — Deep Multimodal Retrieval and Agentic AI (Spring 2026)

Course Overview

Figure: **Retrieval-Augmented Generation (RAG)** (left) enhances LLMs by retrieving relevant documents from vector databases. **An agentic AI system** (right) features a generalist agent that plans and reasons across internet-scale multimodal knowledge.

Thanks to rapid advances in digital sensors, large-scale data collection, and foundation models, we can now generate, process, and understand not only images and videos, but also text, audio, and other modalities at unprecedented scale. As a result, massive multimodal datasets are continuously created across the web, industry, and personal devices. For example, modern platforms host billions of multimodal items, with new content being generated and updated in real time.

These large-scale multimodal databases pose fundamental technical challenges, including representation learning, retrieval, reasoning, and efficient interaction between perception and language. Beyond classical image search, modern systems must retrieve and reason over heterogeneous data, align vision with language, and support downstream decision-making and generation.

In this course, we study scalable techniques for deep multimodal retrieval and their extension toward agentic AI systems. Topics include multimodal foundation models, advanced retrieval and Retrieval-Augmented Generation (RAG), and language models acting as agents that can plan, reason, and interact with retrieved knowledge. The course emphasizes both core algorithms and emerging applications built on web-scale multimodal data.

In summary, what you will gain by the end of the course:

A broad understanding of multimodal retrieval across vision, language, and other modalities
In-depth knowledge of modern retrieval and RAG techniques for web-scale data
An understanding of how language models function as agents that leverage retrieval for reasoning and decision-making
Exposure to emerging applications built on large-scale multimodal and agentic systems

What you will do:

Paper presentations: choose and present a few papers from recent conferences.
Final project: propose your own idea related to the course topics and, optionally, implement it to improve on state-of-the-art techniques.
Midterm exam: covers basic multimodal search and RAG methods.

Lecture Schedule (subject to change)

Date	Topics and slides	Related material(s)
Mar. 3 (Tue)
Mar. 5 (Thu)	Deep learning-based image retrieval	Programming Assignment 1 Due: Thu, Mar. 19
Mar. 10 (Tue)
Mar. 12 (Thu)
Mar. 17 (Tue)
Mar. 19 (Thu)
Mar. 24 (Tue)
Mar. 26 (Thu)
Mar. 31 (Tue)
Apr. 2 (Thu)
Apr. 7 (Tue)
Apr. 9 (Thu)
Apr. 14 (Tue)
Apr. 16 (Thu)
Apr. 21 (Tue)	Midterm Week
Apr. 23 (Thu)	Midterm Week
Apr. 28 (Tue)
Apr. 30 (Thu)
May. 5 (Tue)	Children's Day
May. 7 (Thu)
May. 12 (Tue)
May. 14 (Thu)
May. 19 (Tue)
May. 21 (Thu)
May. 26 (Tue)
May. 28 (Thu)
Jun. 2 (Tue)
Jun. 4 (Thu)
Jun. 9 (Tue)
Jun. 11 (Thu)
Jun. 16 (Tue) / Jun. 18 (Thu)	Final exam period

Acknowledgements & Copyright

Acknowledgements: The course materials are based on those of Prof. Fei-Fei Li, Stanford. Thank you so much!

Copyright 2024. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the author.

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
