13 Commits

Author SHA1 Message Date
Pedro Rodrigues
e65642b752 remove some braintrust headers 2026-02-25 19:11:56 +00:00
Pedro Rodrigues
9b08864e94 feat(evals): replace mock CLIs with real Supabase instance per eval run
Start a shared local Supabase stack once before all scenarios and reset
the database (drop/recreate public schema + clear migration history) between
each run. This lets agents apply migrations via `supabase db push` against a
real Postgres instance instead of mock shell scripts.
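
A minimal sketch of the reset step described above, not the actual contents of supabase-setup.ts. `buildResetStatements` is a hypothetical helper, and the migration-history table name `supabase_migrations.schema_migrations` is an assumption about where the Supabase CLI records applied migrations. It only builds the SQL; running it against the local Postgres instance is left to whatever client the runner uses.

```typescript
// Hypothetical helper: SQL to run between eval scenarios.
// Assumption: migration history lives in supabase_migrations.schema_migrations
// and the table already exists when the reset runs.
function buildResetStatements(
  historyTable: string = "supabase_migrations.schema_migrations",
): string[] {
  return [
    // Drop everything a previous scenario created in the public schema.
    "DROP SCHEMA IF EXISTS public CASCADE;",
    // Recreate an empty public schema (role grants are omitted in this sketch).
    "CREATE SCHEMA public;",
    // Clear migration history so the next run starts with no applied migrations.
    `DELETE FROM ${historyTable};`,
  ];
}
```

A real reset would probably also need to restore grants on `public` for the roles the local stack uses; that part is left out because the commit message does not describe it.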

- Add supabase-setup.ts: startSupabase / stopSupabase / resetDB / getKeys
- Update runner.ts to start/stop Supabase and inject keys into process.env
- Update agent.ts to point MCP config at the local Supabase HTTP endpoint
- Update preflight.ts to check supabase CLI availability and Docker socket
- Update scaffold.ts to seed workspace with supabase/config.toml
- Add passThreshold support (test.ts / results.ts / types.ts) for partial pass
- Delete mock shell scripts (mocks/docker, mocks/psql, mocks/supabase)
- Update Dockerfile/docker-compose to mount Docker socket for supabase CLI
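
One possible reading of the passThreshold support listed above, as a sketch only. The commit does not say whether the threshold is a fraction or a count; this example assumes a fraction between 0 and 1, and `TestResult` and `scenarioPassed` are hypothetical names, not the types in types.ts.

```typescript
// Hypothetical shape of a single test outcome.
interface TestResult {
  name: string;
  passed: boolean;
}

// Assumption: passThreshold is the minimum fraction of tests (0..1) that must
// pass. Without a threshold, every test must pass.
function scenarioPassed(results: TestResult[], passThreshold?: number): boolean {
  if (results.length === 0) return false; // nothing ran, so nothing passed
  const passedCount = results.filter((r) => r.passed).length;
  if (passThreshold === undefined) return passedCount === results.length;
  return passedCount / results.length >= passThreshold;
}
```

Under this reading, a scenario with 3 of 4 tests passing fails by default but passes with `passThreshold` set to 0.75.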

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-25 14:39:54 +00:00
Pedro Rodrigues
2da5cae2ac feat(evals): enrich Braintrust upload with granular scores and tracing
Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document --force flag
for cached mise tasks and add Braintrust env vars to AGENTS.md.
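
A rough sketch of the parsing and scoring described above, under stated assumptions: vitest verbose output is assumed to mark passing tests with `✓` and failing tests with `×` (or `✗`), optionally followed by a duration. The function names, the per-test score key format, and the timestamp format are hypothetical. No Braintrust SDK calls are shown.

```typescript
interface ParsedTest {
  name: string;
  passed: boolean;
}

// Extract per-test results from verbose reporter output.
// Assumption: one test per line, prefixed by a pass or fail marker.
function parseVerboseOutput(output: string): ParsedTest[] {
  const tests: ParsedTest[] = [];
  for (const line of output.split("\n")) {
    const match = /^\s*([✓×✗])\s+(.+?)(?:\s+\d+(?:\.\d+)?\s*ms)?\s*$/.exec(line);
    if (!match) continue; // skip summary lines, blank lines, logs
    tests.push({ name: match[2], passed: match[1] === "✓" });
  }
  return tests;
}

// Build the granular scores: overall pass, pass rate, and one entry per test.
function buildScores(tests: ParsedTest[]): Record<string, number> {
  const passedCount = tests.filter((t) => t.passed).length;
  const scores: Record<string, number> = {
    pass: tests.length > 0 && passedCount === tests.length ? 1 : 0,
    test_pass_rate: tests.length === 0 ? 0 : passedCount / tests.length,
  };
  for (const t of tests) {
    scores[`test: ${t.name}`] = t.passed ? 1 : 0; // hypothetical key format
  }
  return scores;
}

// Experiment name in model-variant-timestamp form.
function experimentName(model: string, variant: string, at: Date): string {
  const stamp = at.toISOString().replace(/[:.]/g, "-");
  return `${model}-${variant}-${stamp}`;
}
```

The resulting scores object and name could then be handed to the upload step; how uploadToBraintrust consumes them is not shown in this log.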

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 13:26:48 +00:00
Pedro Rodrigues
3c3d1f55ca containerize eval environment with Docker and mock CLIs
Host now only needs Docker + ANTHROPIC_API_KEY to run evals. Adds
multi-stage Dockerfile, mock supabase/docker/psql scripts, entrypoint,
docker-compose for local use, and switches CI to Docker-based execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:22:47 +00:00
Pedro Rodrigues
93a49374de realtime scenario 2026-02-23 10:25:50 +00:00
Pedro Rodrigues
baf94b04e3 load skills through skills CLI 2026-02-20 17:41:41 +00:00
Pedro Rodrigues
ce7eb8b28b simple edge function creation example 2026-02-20 16:54:01 +00:00
Pedro Rodrigues
386b9fbb05 storage workflow 2026-02-20 15:11:07 +00:00
Pedro Rodrigues
e03bc99ebb two more scenarios; claude code cli is now a dependency 2026-02-20 15:02:59 +00:00
Pedro Rodrigues
9a23c6b021 upgrade braintrust to ~v3.0.0 2026-02-19 17:14:27 +00:00
Pedro Rodrigues
e06a567846 workflow evals with one scenario 2026-02-19 17:06:17 +00:00
Pedro Rodrigues
082eac2a01 multi model testing 2026-02-18 13:28:42 +00:00
Pedro Rodrigues
27d7af255d initial skills evals 2026-02-18 12:02:28 +00:00