the report card

It grades its own homework.
Here’s the curve.

Measured every night against federal rulings it has never seen. Every number on this page clicks through to the Phoenix experiment that produced it.

—10-digit exact match

—held-out rulings, all post-dating the model

—live prompt version

Exact-match accuracy, round over round

10-digit 6-digit 4-digit

Each point is one nightly Phoenix experiment — click it to open the run, its traces, and the prompt version it tested. The curve only counts rulings published after the model’s training cutoff.

Its own named weaknesses

The agent clusters every miss into a family, builds a Phoenix dataset for it, and keeps re-testing it forever — fixed means measured-fixed.

Watch it improve, right now

A full self-improvement round, scoped to a minute.

Grades itself on ~20 held-out rulings, reads its misses over Phoenix MCP, writes its own fix, and only ships it if the score goes up. Nobody touches anything.

grade→ introspect→ fix→ gate

Round history

round	rulings	4-digit	6-digit	10-digit	judge	gate

What it taught itself

Prompt lineage

Solid = promoted by the gate. Dashed = written, tested, and refused — a prompt only ships if it beats the incumbent on held-out rulings. Every version is in Phoenix prompt history ↗.

It grades its own homework.Here’s the curve.

Exact-match accuracy, round over round

Its own named weaknesses

Watch it improve, right now

A full self-improvement round, scoped to a minute.

Round history

What it taught itself

Prompt lineage

It grades its own homework.
Here’s the curve.