On Monday, June 9, 2025

Speaker

Hyunwoo J. Kim


Title

Efficient Deep Video Understanding Towards AGI


Abstract

Video has become one of the most popular modalities that modern individuals consume and produce. However, developing AI systems that deeply understand videos remains a challenging goal due to the difficulty of annotation, the sheer volume of data, and the substantial computational burden required for training and inference of video models. To address these problems, I introduce new strategies for pre-training and fine-tuning video foundation models, including parameter-efficient fine-tuning (PEFT). Additionally, to deploy video models to users, I present training-free, cost-efficient inference techniques for video transformers. To demonstrate the generalizability of video foundation models, I highlight our recent work on Video Question Answering, which implicitly requires tackling various subtasks and achieving a deeper understanding of videos. Lastly, I discuss how Video QA and Multimodal QA systems can serve as stepping stones towards artificial general intelligence, and outline future research directions.


Bio

Hyunwoo J. Kim is an associate professor at Korea University, where he leads the Machine Learning and Vision Lab (MLV). His lab focuses on developing techniques for general-purpose AI systems, including multimodal foundation models, multimodal question answering, efficient inference, and new neural network architectures. Prior to this position, he worked at Amazon Lab126 in Sunnyvale, California. He obtained a Ph.D. in Computer Sciences at the University of Wisconsin-Madison, with a Ph.D. minor in Statistics. He has served (or is serving) as an Area Chair for ICLR 2025, ICCV 2025, and CVPR 2024 and 2025, and co-organized the 1st and 2nd MICCAI workshops on Foundation Models for General Medical AI in 2023 and 2024.


Language

English