feat(evals): enrich Braintrust upload with granular scores and tracing

Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document --force flag
for cached mise tasks and add Braintrust env vars to AGENTS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pedro Rodrigues
2026-02-24 13:26:48 +00:00
parent 3c3d1f55ca
commit 2da5cae2ac
6 changed files with 185 additions and 54 deletions
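The per-test pass/fail parsing and `test_pass_rate` score described in the message above might look roughly like this. This is a minimal TypeScript sketch, not the repo's actual implementation; the `✓`/`✗` markers and line shape are assumptions about vitest's verbose reporter output, which varies across versions.

```typescript
// Hypothetical shape for one parsed test result.
interface TestResult {
  name: string;
  passed: boolean;
}

// Parse vitest verbose output into per-test pass/fail results.
// Assumes each test line starts with "✓" (pass) or "✗"/"×" (fail),
// optionally ending in a duration like "12ms" — a sketch, not the
// repo's parser.
function parseVitestOutput(output: string): TestResult[] {
  const results: TestResult[] = [];
  for (const raw of output.split("\n")) {
    const line = raw.trim();
    const match = line.match(/^([✓✗×])\s+(.*?)(?:\s+\d+ms)?$/);
    if (!match) continue;
    results.push({ name: match[2], passed: match[1] === "✓" });
  }
  return results;
}

// Fraction of tests that passed, used as a granular score.
const passRate = (rs: TestResult[]): number =>
  rs.length === 0 ? 0 : rs.filter((r) => r.passed).length / rs.length;
```

From these results, a `pass` score (all tests green) and a `test_pass_rate` score (the fraction above) can both be derived for upload.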


@@ -45,10 +45,18 @@ This prevents the agent from "teaching to the test."
## Running Evals
Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
source files haven't changed. Use `--force` to bypass caching when you need
a re-run anyway (e.g., after changing environment variables, or to repeat
the same scenario):
```bash
# Run all scenarios with skills (default)
mise run eval
# Force re-run (bypass source caching)
mise run --force eval
# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval
@@ -63,9 +71,12 @@ EVAL_SKILL=supabase mise run eval
# Upload results to Braintrust
mise run eval:upload
# Force upload (bypass cache)
mise run --force eval:upload
```
Or directly (no caching, always runs):
```bash
cd packages/evals
@@ -99,12 +110,15 @@ Compare the results to measure how much skills improve agent output.
## Environment
```
ANTHROPIC_API_KEY=sk-ant-... # Required: Claude Code authentication
EVAL_MODEL=... # Optional: override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO=... # Optional: run single scenario
EVAL_SKILL=... # Optional: install only this skill (e.g., "supabase")
EVAL_BASELINE=true # Optional: run without skills (baseline mode)
BRAINTRUST_UPLOAD=true # Optional: upload results to Braintrust
BRAINTRUST_API_KEY=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_PROJECT_ID=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_BASE_EXPERIMENT=... # Optional: compare against a named experiment
```
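Since three of the variables above are only meaningful together, a small preflight check can fail fast before an upload attempt. The variable names come from the Environment section; the validation helper itself is a sketch, not code from this commit:

```typescript
// Return the names of Braintrust variables that are required but unset.
// Only enforced when BRAINTRUST_UPLOAD=true, matching the Environment
// section above; checkBraintrustEnv is a hypothetical helper name.
function checkBraintrustEnv(
  env: Record<string, string | undefined>,
): string[] {
  const missing: string[] = [];
  if (env.BRAINTRUST_UPLOAD === "true") {
    for (const key of ["BRAINTRUST_API_KEY", "BRAINTRUST_PROJECT_ID"]) {
      if (!env[key]) missing.push(key);
    }
  }
  return missing;
}
```

A runner could call this with `process.env` at startup and print the missing names instead of failing mid-upload.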
## Key Files