Thought Preference Optimisation (TPO)

A fine-tuning technique based on Direct Preference Optimisation (DPO). A judge model scores candidate outputs based solely on the final responses, without access to the thought process that produced them; this lets the model learn and refine its "thinking" without relying on supervised thought data.
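The preference-pair construction can be sketched as follows. This is a minimal toy illustration under stated assumptions, not the paper's implementation: `generate_with_thought` and `judge_score` are hypothetical stand-ins for an LLM sampler and a judge model, and the `<thought>` tag format is an assumption. The key point is that the judge sees only the response, while the hidden thought is kept in the chosen/rejected completions used for the DPO update.

```python
import random

def generate_with_thought(prompt: str) -> tuple[str, str]:
    """Toy stand-in for an LLM: sample a hidden 'thought' and a response."""
    thought = f"reasoning about: {prompt}"
    response = f"answer {'!' * random.randint(1, 5)}"
    return thought, response

def judge_score(prompt: str, response: str) -> float:
    """Toy stand-in for a judge model. It sees ONLY the response,
    never the thought that produced it."""
    return float(len(response))

def build_preference_pair(prompt: str, num_samples: int = 4) -> tuple[str, str]:
    """Sample several (thought, response) pairs, score the responses alone,
    and return (chosen, rejected) full completions for a DPO update.
    The thought stays in the training target even though the judge never
    saw it -- responses are judged, but thoughts are what get trained."""
    samples = [generate_with_thought(prompt) for _ in range(num_samples)]
    scored = sorted(samples, key=lambda tr: judge_score(prompt, tr[1]))
    rejected_thought, rejected_resp = scored[0]
    chosen_thought, chosen_resp = scored[-1]
    chosen = f"<thought>{chosen_thought}</thought>\n{chosen_resp}"
    rejected = f"<thought>{rejected_thought}</thought>\n{rejected_resp}"
    return chosen, rejected

pair = build_preference_pair("What is 2+2?")
```

The resulting pairs feed a standard DPO loss; because preferences come only from response quality, the model is free to discover whatever thought style best improves its answers.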

See Thinking LLMs: General Instruction Following with Thought Generation (Oct 2024).