CS588: Deep Multimodal Retrieval and Agentic AI (Spring 2026)

(formerly: “Deep Learning based Image Search”)
• Focus on deep multimodal retrieval, RAG, and agentic AI systems.

Instructor: Sung-eui Yoon

When: 10:30–12:00 on Tue. and Thu.

Where: Lecture room 2443, Information Science and Electronics Bldg (E3)

First class: Mar. 3 (Tue)

Class KLMS: KLMS

Course Overview

[Figure: RAG diagram (left); agentic AI diagram (right)]
Retrieval-Augmented Generation (RAG) (left) enhances LLMs by retrieving relevant documents from vector databases. An agentic AI system (right) features a generalist agent that plans and reasons across internet-scale multimodal knowledge.

Thanks to rapid advances in digital sensors, large-scale data collection, and foundation models, we can now generate, process, and understand not only images and videos, but also text, audio, and other modalities at unprecedented scale. As a result, massive multimodal datasets are continuously created across the web, industry, and personal devices. For example, modern platforms host billions of multimodal items, with new content being generated and updated in real time.

These large-scale multimodal databases pose fundamental technical challenges, including representation learning, retrieval, reasoning, and efficient interaction between perception and language. Beyond classical image search, modern systems must retrieve and reason over heterogeneous data, align vision with language, and support downstream decision-making and generation.

In this course, we study scalable techniques for deep multimodal retrieval and their extension toward agentic AI systems. Topics include multimodal foundation models, advanced retrieval and Retrieval-Augmented Generation (RAG), and language models acting as agents that can plan, reason, and interact with retrieved knowledge. The course emphasizes both core algorithms and emerging applications built on web-scale multimodal data.
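As a rough illustration of the retrieve-then-generate idea behind RAG, the following minimal sketch ranks a few documents by cosine similarity to a query and assembles a prompt from the top matches. It is not taken from the course materials: the bag-of-words `embed` function is a toy stand-in for a learned multimodal encoder, the in-memory list stands in for a vector database, and all names (`embed`, `retrieve`, `build_prompt`, `DOCS`) are hypothetical. The final step of sending the prompt to a language model is omitted.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a learned encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query (brute-force search)."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(query, docs, k=2):
    """Prepend the retrieved documents to the question as context."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical mini-corpus standing in for a vector database.
DOCS = [
    "a photo of a cat sleeping on a sofa",
    "vector databases index embeddings for nearest neighbor search",
    "agents plan and call tools to answer questions",
]

if __name__ == "__main__":
    print(build_prompt("how do vector databases search embeddings", DOCS))
```

A real system would replace the brute-force ranking with an approximate nearest-neighbor index and the count vectors with dense embeddings; the overall flow of embedding, retrieving, and prompting stays the same.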

In summary, what you will gain by the end of the course:

What you will do:

Textbook

In-class handouts and an ongoing draft on image search (web, pdf)

Lecture Schedule (subject to change)

Date | Topics and slides | Related material(s)
Mar. 3 (Tue)
Mar. 5 (Thu) | Deep learning based image retrieval | Programming Assignment 1 (due: Thu, Mar. 19)
Mar. 10 (Tue)
Mar. 12 (Thu)
Mar. 17 (Tue)
Mar. 19 (Thu)
Mar. 24 (Tue)
Mar. 26 (Thu)
Mar. 31 (Tue)
Apr. 2 (Thu)
Apr. 7 (Tue)
Apr. 9 (Thu)
Apr. 14 (Tue)
Apr. 16 (Thu)
Apr. 21 (Tue) Midterm Week
Apr. 23 (Thu) Midterm Week
Apr. 28 (Tue)
Apr. 30 (Thu)
May. 5 (Tue) Children's Day
May. 7 (Thu)
May. 12 (Tue)
May. 14 (Thu)
May. 19 (Tue)
May. 21 (Thu)
May. 26 (Tue)
May. 28 (Thu)
Jun. 2 (Tue)
Jun. 4 (Thu)
Jun. 9 (Tue)
Jun. 11 (Thu)
Jun. 16 (Tue) / Jun. 18 (Thu) Final exam period

Student Presentations and Reports

For your presentations, please use this PowerPoint template; a paper presentation guideline is also available.
For your final report, please use this LaTeX template.

Additional Reference Materials and Links

Previous course homepages

Computer vision resources

Paper search

Acknowledgements & Copyright

Acknowledgements: The course materials are based on those of Prof. Fei-Fei Li, Stanford. Thank you so much!

Copyright 2024. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the author.

This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.