Thought Preference Optimisation (TPO)
A fine-tuning technique based on Direct Preference Optimisation (DPO) that uses a judge model to evaluate outputs based solely on the final responses, without access to the intermediate thought process; this lets the model learn and refine its "thinking" without relying on supervised thought data.
See Thinking LLMs: General Instruction Following with Thought Generation (Oct 2024).
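A minimal sketch of one TPO-style preference-building round, to illustrate the key point that the judge scores only the response while DPO trains on the full thought+response output. The function and parameter names (`build_preference_pairs`, `sample`, `split`, `judge`) are illustrative assumptions, not from the paper, which additionally iterates this loop with DPO training between rounds.

```python
from typing import Callable

def build_preference_pairs(
    sample: Callable[[str], str],             # prompt -> "thought + response" text
    split: Callable[[str], tuple[str, str]],  # output -> (thought, response)
    judge: Callable[[str, str], float],       # (prompt, response) -> score; never sees the thought
    prompts: list[str],
    num_samples: int = 4,
) -> list[dict[str, str]]:
    """For each prompt, sample several thought+response outputs, score only
    the responses with the judge, and keep the best/worst as a DPO pair."""
    pairs = []
    for prompt in prompts:
        scored = []
        for _ in range(num_samples):
            output = sample(prompt)
            _thought, response = split(output)
            # The judge sees the prompt and response only, so no
            # supervised thought annotations are required.
            scored.append((judge(prompt, response), output))
        scored.sort(key=lambda item: item[0])
        # DPO trains on the full outputs (thought included), so thoughts
        # that led to higher-scoring responses are reinforced indirectly.
        pairs.append({
            "prompt": prompt,
            "chosen": scored[-1][1],
            "rejected": scored[0][1],
        })
    return pairs
```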