📋 Main Topics
Introduction to Vision Language Models - How AI systems combine visual and textual information to achieve multimodal understanding
Core Components and Architecture - Exploring the fundamental building blocks that enable models to process and connect different modalities
Training and Learning Approaches - Methods and strategies for teaching models to understand and relate visual and language data
Applications and Use Cases - Real-world applications of VLMs across various domains and their practical capabilities
Challenges and Future Directions - Current limitations, open research questions, and emerging trends in multimodal AI
🧠 Class Activity - Labs
- Building a simple VLM application for image understanding
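As a starting point for the lab, a minimal image-captioning application can be built with an off-the-shelf VLM from Hugging Face. The sketch below assumes the `transformers` and `Pillow` packages are installed and uses the publicly available BLIP captioning checkpoint; the file path `photo.jpg` is a placeholder.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

def caption_image(path: str,
                  model_name: str = "Salesforce/blip-image-captioning-base") -> str:
    """Generate a natural-language caption for the image at `path`.

    Downloads the model weights on first use (~1 GB).
    """
    # The processor handles image resizing/normalization and text tokenization.
    processor = BlipProcessor.from_pretrained(model_name)
    model = BlipForConditionalGeneration.from_pretrained(model_name)

    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    # Autoregressively generate caption tokens, then decode them to a string.
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)

if __name__ == "__main__":
    print(caption_image("photo.jpg"))  # e.g. a short description of the photo
```

The same pattern extends to visual question answering by swapping in a VQA checkpoint and passing a text prompt alongside the image to the processor.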
📚 Recommended Readings
🎥 Recommended Videos
- What Are Vision Language Models? How AI Sees & Understands Images - IBM (10 min) Watch on YouTube
- [CVPR24 Vision Foundation Models Tutorial] Image Generation - Zhengyuan Yang (58 min) Watch on YouTube
- [CVPR24 Vision Foundation Model Tutorial] Vision in LMMs - Jianwei Yang (56 min) Watch on YouTube
- [CVPR24 Vision Foundation Model Tutorial] Large Multimodal Models - Chunyuan Li Watch on YouTube