Thought Preference Optimisation (TPO)
A fine-tuning technique based on Direct Preference Optimisation (DPO) that uses a judge model to evaluate outputs based solely on the final responses, without access to the intermediate thought process; this lets the model learn and refine its "thinking" without relying on supervised thought data.
See Thinking LLMs: General Instruction Following with Thought Generation (Oct 2024).
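A minimal sketch of one TPO-style preference-building round, to illustrate the key point that the judge scores only the response while DPO trains on the full thought+response output. The function and parameter names (`build_preference_pairs`, `sample`, `split`, `judge`) are illustrative assumptions, not from the paper, which additionally iterates this loop with DPO training between rounds.

```python
from typing import Callable

def build_preference_pairs(
    sample: Callable[[str], str],             # prompt -> "thought + response" text
    split: Callable[[str], tuple[str, str]],  # output -> (thought, response)
    judge: Callable[[str, str], float],       # (prompt, response) -> score; never sees the thought
    prompts: list[str],
    num_samples: int = 4,
) -> list[dict[str, str]]:
    """For each prompt, sample several thought+response outputs, score only
    the responses with the judge, and keep the best/worst as a DPO pair."""
    pairs = []
    for prompt in prompts:
        scored = []
        for _ in range(num_samples):
            output = sample(prompt)
            _thought, response = split(output)
            # The judge sees the prompt and response only, so no
            # supervised thought annotations are required.
            scored.append((judge(prompt, response), output))
        scored.sort(key=lambda item: item[0])
        # DPO trains on the full outputs (thought included), so thoughts
        # that led to higher-scoring responses are reinforced indirectly.
        pairs.append({
            "prompt": prompt,
            "chosen": scored[-1][1],
            "rejected": scored[0][1],
        })
    return pairs
```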