Few-Shot Knowledge Distillation
A good idea from this blog post by Steve Krenzel: Few-Shot Knowledge Distillation (FSKD)
A retrieval-based LLM routing idea where we:
- Store input/output task embeddings from a high-quality model.
- For new examples, compute a novelty score by measuring how similar the new example is to the stored examples.
- Send novel examples to `gpt4o` (or some big, expensive model).
- Send other examples to `gpt4o-mini` (or some cheaper, lower-performance model), but include similar `gpt4o` examples in the prompt (few-shot knowledge distillation).
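The routing loop above can be sketched roughly as follows. This is a minimal, self-contained sketch under my own assumptions: the `Router` class, in-memory store, and cosine-similarity matching are my illustration, not code from the post, and `big_model`/`small_model` stand in for API calls to `gpt4o`/`gpt4o-mini`.

```python
from dataclasses import dataclass, field

@dataclass
class Example:
    embedding: tuple
    task: str
    output: str

@dataclass
class Router:
    theta: float = 0.85   # similarity threshold (assumed default)
    m: int = 3            # matches needed to count as "low novelty"
    store: list = field(default_factory=list)

    @staticmethod
    def similarity(a, b):
        # Cosine similarity between two embedding vectors.
        dot = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(x * x for x in b) ** 0.5
        return dot / (na * nb)

    def route(self, task, embedding, big_model, small_model):
        # Stored big-model examples at least theta-similar to this task.
        matches = [ex for ex in self.store
                   if self.similarity(embedding, ex.embedding) >= self.theta]
        if len(matches) < self.m:
            # Novel: ask the expensive model and remember its answer.
            output = big_model(task)
            self.store.append(Example(tuple(embedding), task, output))
        else:
            # Low novelty: cheap model, prompted with similar big-model
            # examples as few-shot demonstrations.
            few_shot = [(ex.task, ex.output) for ex in matches[: self.m]]
            output = small_model(task, few_shot)
        return output
```

Note that only big-model answers are stored, so the few-shot examples the small model sees are always distilled from the high-quality model.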
The novelty score is determined by:
- The task embedding (T).
- The similarity threshold (θ), which defines how similar a stored entry needs to be to count as a match.
- The matches threshold (m), which specifies how many matches are needed for an example to be considered "low novelty."
The defaults for θ and m (e.g. m = 3) can be tuned for cost/performance trade-offs.
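A minimal sketch of the novelty check as defined above, assuming cosine similarity over the stored embeddings; the function name, `theta` default, and NumPy implementation are my assumptions, not from the post.

```python
import numpy as np

def is_novel(task_embedding, stored_embeddings, theta=0.85, m=3):
    """An example is 'low novelty' if at least m stored embeddings
    have cosine similarity >= theta with it; otherwise it's novel."""
    if len(stored_embeddings) == 0:
        return True
    store = np.asarray(stored_embeddings)
    # Cosine similarity between the new embedding and every stored one.
    sims = store @ task_embedding / (
        np.linalg.norm(store, axis=1) * np.linalg.norm(task_embedding)
    )
    matches = int((sims >= theta).sum())
    return matches < m
```

With an empty store everything is novel, which is what drives the self-adapting behavior described next.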
It's cool because it's self-adapting: if there's domain shift, new examples get routed to the larger model until the store builds up enough similar examples.
Results on a real inventory-moderation task:
- 63.4% cost reduction
- Slight accuracy improvement over the large model alone: 90.9% vs. 87.6%
- ~69% of tasks handled by the smaller model