feat(evals): enrich Braintrust upload with granular scores and tracing

Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document the --force flag
for cached mise tasks and add the Braintrust env vars to AGENTS.md.
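
The parsing and scoring described above might look roughly like the sketch below. It is not the code from this commit: the `TestResult` shape, the function names, the per-test score key format, and the timestamp format are all hypothetical. It also assumes that verbose reporter lines start with a status glyph ("✓" for pass, "×" or "✗" for fail) and that ANSI colour codes have already been stripped.

```typescript
// Hypothetical shape; the commit's real types are not shown in this excerpt.
interface TestResult {
  name: string;
  passed: boolean;
}

// Assumption: each per-test line is "<glyph> <test name> [<duration>ms]",
// with colours disabled. Summary lines such as "Test Files ..." do not match.
const TEST_LINE = /^\s*([✓×✗])\s+(.+?)(?:\s+\d+(?:\.\d+)?\s*ms)?\s*$/u;

function parseVerboseOutput(output: string): TestResult[] {
  const results: TestResult[] = [];
  for (const line of output.split(/\r?\n/)) {
    const match = TEST_LINE.exec(line);
    if (!match) continue;
    const [, glyph = "", name = ""] = match;
    results.push({ name, passed: glyph === "✓" });
  }
  return results;
}

// Builds the three score families named in the commit message:
// an overall pass flag, the fraction of passing tests, and one score per test.
function buildScores(results: TestResult[]): Record<string, number> {
  const total = results.length;
  const passedCount = results.filter((r) => r.passed).length;
  const scores: Record<string, number> = {
    // Assumption: "pass" is 1 only when at least one test ran and all passed.
    pass: total > 0 && passedCount === total ? 1 : 0,
    test_pass_rate: total > 0 ? passedCount / total : 0,
  };
  for (const r of results) {
    // Hypothetical key format for per-test scores.
    scores[`test: ${r.name}`] = r.passed ? 1 : 0;
  }
  return scores;
}

// model-variant-timestamp naming; the timestamp format is an assumption
// (ISO 8601 with ":" and "." replaced so the name is filesystem- and URL-safe).
function experimentName(model: string, variant: string, at: Date): string {
  const stamp = at.toISOString().replace(/[:.]/g, "-");
  return `${model}-${variant}-${stamp}`;
}
```

A caller would pass the captured vitest stdout to `parseVerboseOutput`, feed the result to `buildScores`, and attach the scores to the experiment row. Keeping these as pure functions makes them testable without a Braintrust API key.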

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pedro Rodrigues
2026-02-24 13:26:48 +00:00
parent 3c3d1f55ca
commit 2da5cae2ac
6 changed files with 185 additions and 54 deletions

@@ -1,5 +1,5 @@
-import { mkdirSync, readdirSync, statSync, writeFileSync } from "node:fs";
-import { join, resolve } from "node:path";
+import { readdirSync, statSync } from "node:fs";
+import { join } from "node:path";
 import type { EvalRunResult } from "../types.js";
 /**