Absolute Zero: Reinforced Self-play Reasoning with Zero Data

My summary of Absolute Zero: Reinforced Self-play Reasoning with Zero Data by Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Yang Yue, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang.

This paper introduces the Absolute Zero Reasoner.

The idea is to train a reasoning model (starting from a base Qwen2.5-7B model) without needing any human-annotated data. The earlier DeepSeek-R1-Zero approach learned CoT reasoning just from curated input/output pairs; this goes a step further and has the model itself propose the inputs and outputs it then solves.

absolute-zero-comparison.png

The proposing and solving steps each have their own reward function: the proposer is rewarded for generating tasks at the right level of difficulty for the current solver (neither trivially easy nor impossible), and wrong answers are penalised in the solving step.
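As a concrete (and hedged) sketch of that shaping, based on my reading of the paper: the solver gets a binary correctness reward, while the proposer gets a "learnability"-style reward that is zero when the current solver always succeeds or always fails on the task, and grows with the solver's failure rate in between. The function names here are my own.

```python
def solver_reward(predicted_answer, gold_answer) -> float:
    # Solving step: only verified, exactly-correct answers are rewarded.
    return 1.0 if predicted_answer == gold_answer else 0.0


def proposer_reward(solver_rewards: list[float]) -> float:
    """Learnability-style reward for the proposing step (my paraphrase).

    `solver_rewards` are binary outcomes from several solver rollouts on the
    proposed task. Tasks the solver always gets right (too easy) or never
    gets right (currently unlearnable) earn nothing; moderately hard tasks
    earn the most.
    """
    mean_solve_rate = sum(solver_rewards) / len(solver_rewards)
    if mean_solve_rate in (0.0, 1.0):
        return 0.0
    return 1.0 - mean_solve_rate


# e.g. a task the solver cracks 2 times out of 8 is a good one to propose:
# proposer_reward([1, 1, 0, 0, 0, 0, 0, 0]) == 0.75
```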

The tasks it proposes and solves fall into one of three modes (each machine-checkable just by running the program, as sketched after the figure below):

  • deduction (predict the output, given the program and input)
  • abduction (predict the input, given the program and output)
  • induction (predict/synthesise the program, given inputs and outputs)

propose-solve.png
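What makes all three modes workable without any labels is that each one can be verified mechanically by executing the program. A minimal sketch of those checks, assuming a toy `run` helper as a stand-in for the paper's sandboxed Python executor (the `f` entry point and the helper names are my own assumptions):

```python
def run(program_src, inp):
    # Toy executor: exec the proposed program in a fresh namespace and call
    # its `f` entry point on the input. The real system sandboxes this;
    # don't exec untrusted code like this in practice.
    namespace = {}
    exec(program_src, namespace)
    return namespace["f"](inp)


def check_deduction(program_src, inp, predicted_output):
    # Deduction: the model predicts the output; verify by executing.
    return run(program_src, inp) == predicted_output


def check_abduction(program_src, predicted_input, output):
    # Abduction: the model predicts an input; any input that reproduces the
    # target output counts, since the mapping need not be one-to-one.
    return run(program_src, predicted_input) == output


def check_induction(predicted_program_src, io_pairs):
    # Induction: the model predicts a program; verify it against the
    # held-out input/output pairs.
    return all(run(predicted_program_src, i) == o for i, o in io_pairs)


# e.g. check_deduction("def f(x):\n    return x * 2\n", 3, 6) -> True
```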

Of note is that they also observe some potentially unsafe reasoning chains generated during training, which they call an "uh-oh moment".

uh-oh-moment.png

It's pretty wild that this works. In theory, we can improve the reasoning capability of models just by training them for longer, with no additional data needed, although I'm a bit dubious about how far this can be pushed, given that the proposing step is still limited by the distribution of the foundation model's training data.


