
This module explores state-of-the-art deep learning architectures for text, vision, and multimodal understanding. Students build an in-depth grasp of transformer mechanisms, embeddings, and prompting, then progress to instruction tuning, parameter-efficient fine-tuning, retrieval-augmented generation (RAG), and autonomous agent orchestration. The syllabus covers Vision Transformers (ViT), CLIP contrastive encoders, vision-language models (VLMs), and diffusion models. Through hands-on labs aligned with the Hugging Face LLM course and NVIDIA’s rapid-development framework, learners acquire both theoretical insight and practical skills for the safe, optimized deployment of advanced generative-AI applications.
- Teacher: Khouloud Chelbi