Commit Graph

4 Commits

Author SHA1 Message Date
Pedro Rodrigues
2da5cae2ac feat(evals): enrich Braintrust upload with granular scores and tracing
Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document --force flag
for cached mise tasks and add Braintrust env vars to AGENTS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 13:26:48 +00:00
Pedro Rodrigues
baf94b04e3 load skills through skills CLI 2026-02-20 17:41:41 +00:00
Pedro Rodrigues
e03bc99ebb more two scenarios and claude code cli is now a dependency 2026-02-20 15:02:59 +00:00
Pedro Rodrigues
e06a567846 workflow evals with one scenario 2026-02-19 17:06:17 +00:00
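
Note on 2da5cae2ac: the message names three score kinds (pass, test_pass_rate, per-test) and an experiment name of the form model-variant-timestamp. The sketch below shows one way such values could be derived from per-test results. It is not the code from the commit: the `TestResult` shape, both function names, the `test:` key prefix, and the timestamp format are all assumptions made for illustration, and no Braintrust or vitest API is called.

```typescript
// Hypothetical shape for one parsed test result; the commit does not show the real type.
interface TestResult {
  name: string;
  passed: boolean;
}

// Build an experiment name following the model-variant-timestamp pattern.
// Assumption: a compact UTC timestamp; the commit does not specify the format.
function experimentName(model: string, variant: string, at: Date): string {
  const stamp = at
    .toISOString()              // e.g. 2026-02-24T13:26:48.000Z
    .replace(/[-:]/g, "")       // drop date and time separators
    .replace(/\.\d{3}Z$/, "Z"); // drop milliseconds
  return `${model}-${variant}-${stamp}`;
}

// Derive the three score kinds named in the commit message.
// Assumption: every score is a number in [0, 1], and an empty run scores 0.
function granularScores(results: TestResult[]): Record<string, number> {
  const total = results.length;
  const passedCount = results.filter((r) => r.passed).length;

  const scores: Record<string, number> = {
    // 1 only when there is at least one test and all of them passed.
    pass: total > 0 && passedCount === total ? 1 : 0,
    // Fraction of tests that passed.
    test_pass_rate: total > 0 ? passedCount / total : 0,
  };

  // One score per test, keyed by a hypothetical "test:" prefix.
  for (const r of results) {
    scores[`test:${r.name}`] = r.passed ? 1 : 0;
  }
  return scores;
}
```

Keeping the aggregate `pass` score separate from `test_pass_rate` lets a run that fails a single test still show partial credit, which is presumably the point of the more granular scoring.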