📋 Main Topics
Introduction & Motivation
- Operational challenges with LLMs
- Differences between traditional ML and LLM pipelines
Model Optimization & Inference Acceleration
- Techniques: quantization, speculative sampling, and Mixture of Experts (MoE)
- Cold start mitigation strategies
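The quantization technique listed above can be illustrated with a minimal symmetric per-tensor int8 scheme: store weights as 8-bit integers plus one float scale, and dequantize on the fly. This is a sketch of the idea only, not any specific library's API:

```python
import numpy as np

def quantize_int8(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = float(np.abs(w - w_hat).max())  # bounded by 0.5 * scale
```

The memory saving is 4x over float32; production schemes typically add per-channel scales and calibration, which are omitted here.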
Infrastructure, Observability, Continuous Deployment & Automation
- Hardware selection, scaling, and CI/CD with Kubeflow
- Load balancing and redundancy
- Performance monitoring and canary deployments
🧠 Class Activity - Labs
- Lab 1: Optimized inference with quantization, speculative sampling, and MoE
- Lab 2: Build an LLM inference pipeline with Kubeflow and serve it with FastAPI
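Lab 1's speculative sampling rests on a simple accept/reject rule: a cheap draft model proposes several tokens, and the target model accepts each with probability min(1, p_target/q_draft). A minimal sketch of that acceptance loop (function name and toy probabilities are illustrative; a full implementation would also resample a replacement token from the adjusted target distribution on rejection, which is omitted here):

```python
import random

def speculative_accept(p_target, q_draft, tokens, rng):
    """Accept each drafted token with prob min(1, p/q);
    stop at the first rejection. p_target[i] and q_draft[i]
    are the two models' probabilities for tokens[i]."""
    accepted = []
    for t, p, q in zip(tokens, p_target, q_draft):
        if rng.random() < min(1.0, p / q):
            accepted.append(t)
        else:
            break
    return accepted

# When target and draft agree, every token is accepted:
all_kept = speculative_accept([0.5, 0.5, 0.5], [0.5, 0.5, 0.5],
                              [1, 2, 3], random.Random(0))
# When the target assigns zero probability, the token is rejected:
none_kept = speculative_accept([0.0, 0.5], [0.5, 0.5],
                               [1, 2], random.Random(0))
```

The speedup comes from verifying several drafted tokens in one target-model forward pass instead of generating them one at a time.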
📚 Recommended Readings
- Understanding LLMOps: Large Language Model Operations, by Weights & Biases
- Speculative Decoding for 2x Faster Whisper Inference
- FastAPI Tutorial