It grades its own homework.
Here’s the curve.
Measured every night against federal rulings it has never seen. Every number on this page clicks through to the Phoenix experiment that produced it.
Exact-match accuracy, round over round
Each point is one nightly Phoenix experiment — click it to open the run, its traces, and the prompt version it tested. The curve only counts rulings published after the model’s training cutoff.
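As a rough illustration of how one point on the curve could be computed, here is a minimal sketch. The `Ruling` record, its field names, and the whitespace-trimming comparison are assumptions made for this example, not the agent's actual scoring code.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class Ruling:
    """Hypothetical record for one graded ruling."""
    published: date   # date the ruling was published
    expected: str     # the outcome the court actually reached
    predicted: str    # the outcome the agent predicted


def exact_match_accuracy(
    rulings: Iterable[Ruling], training_cutoff: date
) -> Optional[float]:
    """Share of post-cutoff rulings where the prediction matches exactly.

    Rulings published on or before the cutoff are ignored, so the model
    cannot be credited for cases it may have seen in training.
    Returns None when no ruling is eligible.
    """
    eligible = [r for r in rulings if r.published > training_cutoff]
    if not eligible:
        return None
    # Assumption: surrounding whitespace is not meaningful, so it is trimmed.
    hits = sum(r.predicted.strip() == r.expected.strip() for r in eligible)
    return hits / len(eligible)
```

Returning `None` rather than `0.0` for an empty round keeps "nothing was graded" distinct from "everything was wrong".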
Its own named weaknesses
The agent clusters every miss into a family, builds a Phoenix dataset for it, and keeps re-testing it forever. A weakness counts as fixed only once the measurements say it is.
No graded rounds yet.
Watch it improve, right now
A full self-improvement round, scoped to a minute.
It grades itself on ~20 held-out rulings, reads its misses over Phoenix MCP, writes its own fix, and ships it only if the score goes up. Nobody touches anything.
Round history
What it taught itself
Prompt lineage
No graded rounds yet.
Solid = promoted by the gate. Dashed = written, tested, and refused: a prompt ships only if it beats the incumbent on held-out rulings. Every version is in Phoenix prompt history.
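The promotion rule can be sketched in a few lines. This is a hedged illustration only: `GateResult`, `run_gate`, and the `score` callback are hypothetical names, and treating a tie as a refusal is an assumption drawn from the wording that a prompt must beat the incumbent.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GateResult:
    """Outcome of comparing a candidate prompt with the incumbent."""
    winner: str             # the prompt that is live after the gate
    promoted: bool          # True = solid line, False = dashed line
    incumbent_score: float
    candidate_score: float


def run_gate(
    incumbent: str, candidate: str, score: Callable[[str], float]
) -> GateResult:
    """Promote the candidate only if it strictly beats the incumbent.

    `score` is assumed to grade a prompt on the held-out rulings and
    return its accuracy. Ties keep the incumbent in place.
    """
    incumbent_score = score(incumbent)
    candidate_score = score(candidate)
    promoted = candidate_score > incumbent_score
    return GateResult(
        winner=candidate if promoted else incumbent,
        promoted=promoted,
        incumbent_score=incumbent_score,
        candidate_score=candidate_score,
    )
```

Both prompts are scored in the same call so that they are compared on the same held-out set; a refused candidate still yields a full `GateResult`, which is what lets it appear in the lineage as a dashed entry.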