the report card

It grades its own homework.
Here’s the curve.

Measured every night against federal rulings it has never seen. Every number on this page clicks through to the Phoenix experiment that produced it.

10-digit exact match
held-out rulings, all post-dating the model
live prompt version

Exact-match accuracy, round over round

10-digit 6-digit 4-digit

Each point is one nightly Phoenix experiment — click it to open the run, its traces, and the prompt version it tested. The curve only counts rulings published after the model’s training cutoff.

Its own named weaknesses

The agent clusters every miss into a family, builds a Phoenix dataset for it, and keeps re-testing it forever — fixed means measured-fixed.

Watch it improve, right now

A full self-improvement round, scoped to a minute.

Grades itself on ~20 held-out rulings, reads its misses over Phoenix MCP, writes its own fix, and only ships it if the score goes up. Nobody touches anything.

grade introspect fix gate

Round history

roundrulings4-digit6-digit10-digitjudgegate

What it taught itself

Prompt lineage

Solid = promoted by the gate. Dashed = written, tested, and refused — a prompt only ships if it beats the incumbent on held-out rulings. Every version is in Phoenix prompt history ↗.