

Thought Policy Optimisation

Oct 16, 2024

A fine-tuning technique based on Direct Preference Optimisation (DPO). The model generates a thought before each response, and a Judge model scores candidate outputs based solely on the responses, without ever seeing the thought process. Preference pairs built from the best- and worst-judged responses are then used for DPO training, which lets the model learn and refine its "thinking" abilities without relying on supervised thought data.

See Thinking LLMs: General Instruction Following with Thought Generation (Oct 2024).
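A rough sketch of the data-construction step only, not the paper's implementation: `sample_fn` and `judge_fn` below are hypothetical stand-ins for the policy model and the Judge. The helper builds chosen/rejected pairs from the best- and worst-judged responses (with the hidden thoughts attached), which would then be handed to an ordinary DPO trainer (not shown).

```python
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # thought + response of the highest-judged sample
    rejected: str  # thought + response of the lowest-judged sample


def build_thought_preference_pairs(prompts, sample_fn, judge_fn, num_samples=4):
    """Build DPO-style preference pairs where the judge scores only the
    visible response, never the hidden thought that produced it."""
    pairs = []
    for prompt in prompts:
        # Each sample is a (thought, response) tuple drawn from the current policy.
        samples = [sample_fn(prompt) for _ in range(num_samples)]
        scores = [judge_fn(prompt, response) for _thought, response in samples]

        best = samples[scores.index(max(scores))]
        worst = samples[scores.index(min(scores))]
        if best is worst:
            continue  # all samples judged equally; no preference signal here

        # The full thought + response text is what gets preferred/rejected,
        # so DPO optimises the thoughts indirectly, via response quality alone.
        pairs.append(PreferencePair(
            prompt=prompt,
            chosen=f"<thought>{best[0]}</thought>\n{best[1]}",
            rejected=f"<thought>{worst[0]}</thought>\n{worst[1]}",
        ))
    return pairs


if __name__ == "__main__":
    # Toy stand-ins: a real setup would sample from the LLM being trained
    # and score responses with an external judge or reward model.
    toy_sample = lambda p: (f"(reasoning about '{p}')", f"draft answer {random.randint(1, 100)}")
    toy_judge = lambda p, response: int(response.split()[-1])  # pretend a higher number is better

    for pair in build_thought_preference_pairs(["What causes ocean tides?"], toy_sample, toy_judge):
        print(pair.chosen, "| preferred over |", pair.rejected)
```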