Heavy Thinking: A Test-Time Scaling Pattern for Hard Problems
Now we have GPT Pro at home
A large-scale study on long-horizon document tasks.
An agentic framework for end-to-end game creation
An approach to agentic software development that I use
Self-generated agent context files don't help.
Curated skills boost agent performance by 16 points; self-generated ones don't help at all.
An approach in which the LLM maintains a spec file alongside the project
Using evolutionary algorithms with LLM-coding agents
A prompting and fine-tuning method that enables LLMs to engage in a "thinking" process before generating responses
A comprehensive evaluation of o1-preview across many tasks and domains.