
Model Distillation — Making Giant Models Small

Visual guide to knowledge distillation for LLMs. Understand how teacher-student training creates smaller models that retain 90% of the capability at 10% of the cost.

You’ve fine-tuned a 70B parameter model and it’s amazing. It’s also $0.03 per request and takes 2 seconds to respond. For production with 100,000 requests per day, that’s $3,000/day and unacceptable latency. Distillation creates a smaller model that captures most of the large model’s knowledge at a fraction of the cost and latency.

The idea: train a small “student” model to mimic the outputs of a large “teacher” model. The student never sees the original training data — it learns by studying how the teacher answers questions.

How Distillation Works

The key insight is that the teacher’s probability distribution over all possible outputs contains more information than just the correct answer. When the teacher says “the answer is Paris” with 90% confidence, “Lyon” with 5%, and “Marseille” with 3%, those secondary probabilities encode meaningful relationships between concepts.
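To make this concrete, here is a minimal sketch comparing how much information a hard label carries versus a soft label. The distributions are illustrative numbers taken from the example above, not real model outputs; a real LLM's distribution spans tens of thousands of vocabulary tokens.

```python
import math

# Hypothetical teacher distribution over candidate answers (illustrative
# numbers from the text above, not real model probabilities).
teacher = {"Paris": 0.90, "Lyon": 0.05, "Marseille": 0.03, "Tokyo": 0.02}
hard = {"Paris": 1.0, "Lyon": 0.0, "Marseille": 0.0, "Tokyo": 0.0}

def entropy(dist):
    """Shannon entropy in bits: how much information the label carries
    beyond naming a single answer."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

print(entropy(hard))     # 0.0 bits: the hard label says only "Paris"
print(entropy(teacher))  # ≈0.62 bits: the soft label also ranks alternatives
```

The extra bits in the soft label are exactly the "meaningful relationships between concepts" the student gets to learn from.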

[Figure: Model Distillation — Teacher to Student. A 70B-parameter teacher (high accuracy, slow, expensive) produces soft labels (probability distributions) that train a 7B-parameter student (roughly 90% of the accuracy at 10x the speed and a tenth of the cost).]
💡 The student learns from the teacher's probability distribution (soft labels), not just the correct answer (hard labels). This transfers "dark knowledge" — the teacher's uncertainty and nuance.

Hard labels say “the answer is Paris.” Soft labels say “the answer is probably Paris, but Lyon and Marseille are reasonable too, while Tokyo is definitely not.” That nuance — which Hinton called “dark knowledge” — helps the student model learn faster and generalize better than training on hard labels alone.
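The classic way to train on soft labels is the temperature-scaled KL-divergence loss from Hinton's distillation work. The sketch below uses plain Python and made-up logits for the ["Paris", "Lyon", "Marseille", "Tokyo"] example; in practice you would compute this with tensor operations over full vocabulary logits.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T flattens the distribution,
    exposing more of the teacher's 'dark knowledge' about alternatives."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on temperature-softened distributions.
    The T^2 factor keeps gradient magnitudes comparable to a hard-label
    loss when the two are combined."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl

# Illustrative logits over ["Paris", "Lyon", "Marseille", "Tokyo"]
teacher_logits = [5.0, 2.1, 1.6, -2.0]
close_student  = [4.5, 2.0, 1.8, -1.0]   # roughly matches the teacher
far_student    = [1.0, 4.0, 0.0,  0.0]   # confidently wrong about Lyon

print(distillation_loss(teacher_logits, close_student))  # small loss
print(distillation_loss(teacher_logits, far_student))    # much larger loss
```

A student that merely picks "Paris" is not enough: the loss also penalizes getting the ranking of Lyon, Marseille, and Tokyo wrong, which is where the nuance transfers.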

In practice, distillation for LLMs follows a simple pipeline: generate a large dataset of prompt-completion pairs using the teacher model, then fine-tune the student model on these pairs. The student learns to produce similar outputs for similar inputs, inheriting the teacher’s behavior patterns without needing the teacher’s parameter count.
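The pipeline above can be sketched in a few lines. `query_teacher` and `fine_tune` are hypothetical stand-ins for your inference API and training loop, not a real library:

```python
# Sketch of the sequence-level distillation pipeline described above.
# query_teacher(prompt) -> completion and fine_tune(model, dataset) -> model
# are placeholders you would wire to your own stack.

def build_distillation_dataset(prompts, query_teacher):
    """Generate prompt-completion pairs by sampling the teacher."""
    return [{"prompt": p, "completion": query_teacher(p)} for p in prompts]

def distill(student, prompts, query_teacher, fine_tune):
    dataset = build_distillation_dataset(prompts, query_teacher)
    # Standard supervised fine-tuning on the teacher's outputs: the student
    # imitates the teacher's behavior without ever seeing the original data.
    return fine_tune(student, dataset)
```

Note that this sequence-level variant trains only on the teacher's sampled text; the soft-label loss shown earlier additionally requires access to the teacher's logits.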

The accuracy-cost tradeoff is remarkably favorable. A well-distilled 7B model typically retains 85-95% of the teacher's quality on the target task at roughly a tenth of the inference cost. For many production use cases, that 5-15% quality gap is invisible to users, while the cost and latency improvements are dramatic.