Make capability measurable — with real data-science discipline.
01
Who it is for
For people who can measure what others assume — benchmarking, agent and model diagnosis, judgement pipelines. You work close to frontier AI, use Arc's evaluation substrate (Lancet), and run a full professional data-science process end to end — collect, validate, preprocess, EDA, pipeline, test, eval — often producing a corpus, benchmark, or dataset that resists gaming.
02
What it trains
- Running a full data-science process: collect → validate → preprocess → EDA → pipeline → test → eval
- Designing benchmarks that resist gaming
- Diagnosing where a system actually fails
- Working with Arc's evaluation substrate (Lancet)
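As a rough illustration of the process named above (collect → validate → preprocess → EDA → pipeline → test → eval), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the function names, and the exact-match scoring are hypothetical and do not describe Lancet or any Arc tooling.

```python
from statistics import mean

def collect():
    # Hypothetical raw records: (prompt, model_answer, reference_answer).
    return [
        ("2+2", "4", "4"),
        ("capital of France", "Paris", "Paris"),
        ("3*3", "6", "9"),
        ("2+2", "4", "4"),   # duplicate prompt
        ("", "x", "y"),      # invalid: empty prompt
    ]

def validate(records):
    # Keep only records whose fields are all non-empty strings.
    return [r for r in records
            if all(isinstance(f, str) and f.strip() for f in r)]

def preprocess(records):
    # Normalise whitespace and case, and drop duplicate prompts.
    seen, cleaned = set(), []
    for prompt, answer, reference in records:
        key = prompt.strip().lower()
        if key not in seen:
            seen.add(key)
            cleaned.append((prompt.strip(),
                            answer.strip().lower(),
                            reference.strip().lower()))
    return cleaned

def explore(records):
    # Minimal EDA: dataset size and mean prompt length.
    return {"n": len(records),
            "mean_prompt_len": mean(len(p) for p, _, _ in records)}

def evaluate(records):
    # Exact-match accuracy against the reference answers.
    return mean(1.0 if a == r else 0.0 for _, a, r in records)

def run_pipeline():
    raw = collect()
    valid = validate(raw)
    clean = preprocess(valid)
    summary = explore(clean)
    # Test step: refuse to report a score on an empty dataset.
    assert summary["n"] > 0, "no usable records after cleaning"
    return {"raw": len(raw), "valid": len(valid),
            "clean": len(clean), "accuracy": evaluate(clean)}

result = run_pipeline()
```

Reporting the record counts at each stage alongside the score makes it visible how much data was dropped before evaluation, which is part of measuring honestly.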
03
Example missions
- Design a benchmark for a capability that lacks one
- Build an agent-failure diagnosis harness
- Stand up a judgement pipeline with quality controls
- Audit a system's claims against measured evidence
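One possible quality control for a judgement pipeline is to check how far two judges agree beyond chance. The sketch below computes Cohen's kappa in plain Python. Choosing this statistic, along with the function name and the example labels, is an assumption made for illustration, not a description of how Arc or Lancet score judgements.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two judges on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both judges match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each judge labelled independently
    # according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both judges used a single identical label; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts from two judges on four items.
judge_a = ["pass", "pass", "fail", "fail"]
judge_b = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(judge_a, judge_b)
```

A low kappa suggests the rubric or the judges need attention before any score built on those judgements is reported.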
04
What you leave with
- A benchmark, corpus, or dataset
- A diagnosis others can trust
- An evaluation pipeline
05
How a mission works
Arc shows you a few real projects it judges you ready for, and you choose the one that draws you. From there the work is mission-based and asynchronous: a clear brief in, a concrete artifact out. You investigate, decide, and return with evidence, and Arc evaluates the outcome, not the motion. Expect the start to be hard, with unfamiliar tools and an unfamiliar problem space; that crossing is the point.
06
What it is not
- Not a course or a bootcamp — the work is real, and harder
- Not employment, salary, a title, or a guaranteed role — a cultivation path, not a job
- Not metric theatre
07
Selection
- Recognised through real work, by invitation — not an application
- Rigour under ambiguity; you measure honestly, including failure