feat(evals): enrich Braintrust upload with granular scores and tracing

Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document --force flag
for cached mise tasks and add Braintrust env vars to AGENTS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pedro Rodrigues
2026-02-24 13:26:48 +00:00
parent 3c3d1f55ca
commit 2da5cae2ac
6 changed files with 185 additions and 54 deletions
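The per-test pass/fail parsing and `test_pass_rate` score described in the message above might look roughly like this. This is a minimal TypeScript sketch, not the repo's actual implementation; the `✓`/`✗` markers and line shape are assumptions about vitest's verbose reporter output, which varies across versions.

```typescript
// Hypothetical shape for one parsed test result.
interface TestResult {
  name: string;
  passed: boolean;
}

// Parse vitest verbose output into per-test pass/fail results.
// Assumes each test line starts with "✓" (pass) or "✗"/"×" (fail),
// optionally ending in a duration like "12ms" — a sketch, not the
// repo's parser.
function parseVitestOutput(output: string): TestResult[] {
  const results: TestResult[] = [];
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    const match = line.match(/^([✓✗×])\s+(.*?)(?:\s+\d+ms)?$/);
    if (!match) continue;
    results.push({ name: match[2], passed: match[1] === "✓" });
  }
  return results;
}

// Fraction of tests that passed, used as a granular score.
const passRate = (rs: TestResult[]): number =>
  rs.length === 0 ? 0 : rs.filter((r) => r.passed).length / rs.length;
```

From these results, a `pass` score (all tests green) and a `test_pass_rate` score (the fraction above) can both be derived for upload.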


@@ -45,10 +45,18 @@ This prevents the agent from "teaching to the test."
## Running Evals
Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
source files haven't changed. Use `--force` to bypass caching when you need
a re-run anyway (e.g., after changing environment variables, or to repeat
the same scenario):
```bash
# Run all scenarios with skills (default)
mise run eval
# Force re-run (bypass source caching)
mise run --force eval
# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval
@@ -63,9 +71,12 @@ EVAL_SKILL=supabase mise run eval
# Upload results to Braintrust
mise run eval:upload
# Force upload (bypass cache)
mise run --force eval:upload
```
Or directly (no caching, always runs):
```bash
cd packages/evals
@@ -99,12 +110,15 @@ Compare the results to measure how much skills improve agent output.
## Environment
```
ANTHROPIC_API_KEY=sk-ant-... # Required: Claude Code authentication
EVAL_MODEL=... # Optional: override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO=... # Optional: run single scenario
EVAL_SKILL=... # Optional: install only this skill (e.g., "supabase")
EVAL_BASELINE=true # Optional: run without skills (baseline mode)
BRAINTRUST_UPLOAD=true # Optional: upload results to Braintrust
BRAINTRUST_API_KEY=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_PROJECT_ID=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_BASE_EXPERIMENT=... # Optional: compare against a named experiment
```
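Since three of the variables above are only meaningful together, a small preflight check can fail fast before an upload attempt. The variable names come from the Environment section; the validation helper itself is a sketch, not code from this commit:

```typescript
// Return the names of Braintrust variables that are required but unset.
// Only enforced when BRAINTRUST_UPLOAD=true, matching the Environment
// section above; checkBraintrustEnv is a hypothetical helper name.
function checkBraintrustEnv(
  env: Record<string, string | undefined>,
): string[] {
  const missing: string[] = [];
  if (env.BRAINTRUST_UPLOAD === "true") {
    for (const key of ["BRAINTRUST_API_KEY", "BRAINTRUST_PROJECT_ID"]) {
      if (!env[key]) missing.push(key);
    }
  }
  return missing;
}
```

A runner could call this with `process.env` at startup and print the missing names instead of failing mid-upload.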
## Key Files