7 Commits

Author SHA1 Message Date
Pedro Rodrigues
9c6fd293eb use agent-evals package 2026-02-27 15:32:55 +00:00
Pedro Rodrigues
e65642b752 remove some braintrust headers 2026-02-25 19:11:56 +00:00
Pedro Rodrigues
3c3d1f55ca containerize eval environment with Docker and mock CLIs
Host now only needs Docker + ANTHROPIC_API_KEY to run evals. Adds
multi-stage Dockerfile, mock supabase/docker/psql scripts, entrypoint,
docker-compose for local use, and switches CI to Docker-based execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 19:22:47 +00:00
Pedro Rodrigues
e06a567846 workflow evals with one scenario 2026-02-19 17:06:17 +00:00
Pedro Rodrigues
082eac2a01 multi model testing 2026-02-18 13:28:42 +00:00
Pedro Rodrigues
27d7af255d initial skills evals 2026-02-18 12:02:28 +00:00
Pedro Rodrigues
760460c221 chore: using mise to manage node versions and tasks (#44)
* add mise for Node.js version management

Replace ad-hoc Node version pinning with mise (mise.toml + mise.lock).
This ensures all contributors and CI use the same Node LTS version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* expand mise with env loading, PATH management, and task runner

- [env] _.file loads .env files for local API keys (replaces dotenv)
- [env] _.path adds node_modules/.bin to PATH for direct tool access
- [tasks] mirrors npm scripts with sources/outputs for file-based caching
- eval tasks for running LLM evaluations via mise run
- update AGENTS.md and CONTRIBUTING.md to use mise run commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 12:44:44 +00:00
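
A minimal sketch of what the mise.toml described in commit 760460c221 might look like. The commit message names only the features used ([env] _.file, [env] _.path, and [tasks] with sources/outputs); the tool version, file paths, task names, commands, and globs below are illustrative assumptions, not the repository's actual configuration.

```toml
# Hypothetical mise.toml; all values are illustrative, not taken from the repository.

[tools]
node = "lts"                      # assumption: track the current Node LTS release

[env]
_.file = ".env"                   # load local API keys from a .env file
_.path = ["node_modules/.bin"]    # put locally installed CLIs on PATH

[tasks.build]                     # hypothetical task mirroring an npm script
run = "npm run build"
sources = ["src/**/*"]            # inputs checked for changes
outputs = ["dist/**/*"]           # task is skipped when these are newer than sources

[tasks.eval]                      # hypothetical eval task
run = "npm run eval"
```

With a file along these lines, `mise run build` would skip the task when its outputs are newer than its sources, and `mise run eval` would run with the variables from .env already loaded.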