# Evals — Agent Guide

This package evaluates whether AI agents correctly implement Supabase tasks when using skill documentation.

Built on [@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a self-contained scenario with a task prompt, the agent works in a Docker sandbox, and hidden vitest assertions check the result. Binary pass/fail.

## Architecture

```
1. eval.sh starts Supabase, exports keys
2. agent-eval reads experiments/experiment.ts
3. For each scenario:
   a. setup() resets DB, writes config + skills into Docker sandbox
   b. Agent (Claude Code) runs PROMPT.md in the sandbox
   c. EVAL.ts (vitest) asserts against agent output
4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
5. Optional: upload.ts pushes results to Braintrust
```

The agent is **Claude Code** running inside a Docker sandbox managed by `@vercel/agent-eval`. It operates on a real filesystem and can read/write files freely.

## File Structure

```
packages/evals/
  experiments/
    experiment.ts        # ExperimentConfig — agent, sandbox, setup() hook
  scripts/
    eval.sh              # Supabase lifecycle wrapper (start → eval → stop)
  src/
    upload.ts            # Standalone Braintrust result uploader
  evals/
    eval-utils.ts        # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
    {scenario}/
      PROMPT.md          # Task description (visible to agent)
      EVAL.ts            # Vitest assertions (hidden from agent during run)
      meta.ts            # expectedReferenceFiles for scoring
      package.json       # Minimal manifest with vitest devDep
  project/
    supabase/
      config.toml        # Shared Supabase config seeded into each sandbox
  scenarios/             # Workflow scenario proposals
  results/               # Output from eval runs (gitignored)
```

## Running Evals

```bash
# Run all scenarios with skills
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Dry run (no API calls)
mise run eval:dry

# Upload results to Braintrust
mise run eval:upload
```

## Baseline Mode

Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By default, skill files from `skills/supabase/` are written into the sandbox.

Compare with-skill vs baseline:

```bash
mise run eval                      # with skills
EVAL_BASELINE=true mise run eval   # without skills (baseline)
```

## Adding Scenarios

1. Create `evals/{scenario-name}/` with:
   - `PROMPT.md` — task description for the agent
   - `EVAL.ts` — vitest assertions checking agent output
   - `meta.ts` — export `expectedReferenceFiles` array for scoring
   - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
2. Add any starter files the agent should see (they get copied via `setup()`)
3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)

## Environment

```
ANTHROPIC_API_KEY=sk-ant-...   # Required: Claude Code authentication
EVAL_MODEL=...                 # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=...              # Optional: run single scenario
EVAL_BASELINE=true             # Optional: run without skills
BRAINTRUST_API_KEY=...         # Required for eval:upload
BRAINTRUST_PROJECT_ID=...      # Required for eval:upload
```
## Docker Evals

Build and run evals inside Docker (e.g., for CI):

```bash
mise run eval:docker:build   # Build the eval Docker image
mise run eval:docker         # Run evals in Docker
mise run eval:docker:shell   # Debug shell in eval container
```
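`meta.ts` exports an `expectedReferenceFiles` array, but this guide doesn't spell out how that array is scored. One plausible sketch, purely an assumption (the function name `referenceFileScore`, the example path, and the suffix-matching rule are invented here for illustration, not taken from the real scorer):

```typescript
// Hypothetical example of what a scenario's meta.ts might export
// (the path is illustrative, not from a real scenario):
export const expectedReferenceFiles = [
  "supabase/migrations/0001_enable_rls.sql",
];

// Assumed scoring sketch: the fraction of expected reference files the agent
// actually produced, matched by path suffix so sandbox prefixes
// (e.g. a /workspace/project/... root) don't matter.
export function referenceFileScore(
  expected: string[],
  produced: string[],
): number {
  if (expected.length === 0) return 1;
  const matched = expected.filter((ref) =>
    produced.some((path) => path.endsWith(ref)),
  );
  return matched.length / expected.length;
}
```

A suffix match is a deliberately loose design choice for a sketch like this: the agent's files live under an arbitrary sandbox root, so exact path equality would fail even for correct solutions.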