supabase-postgres-best-prac…/packages/evals/AGENTS.md
Pedro Rodrigues 2da5cae2ac feat(evals): enrich Braintrust upload with granular scores and tracing
Add per-test pass/fail parsing from vitest verbose output, thread prompt
content and individual test results through the runner, and rewrite
uploadToBraintrust with experiment naming (model-variant-timestamp),
granular scores (pass, test_pass_rate, per-test), rich metadata, and
tool-call tracing via experiment.traced(). Also document --force flag
for cached mise tasks and add Braintrust env vars to AGENTS.md.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-02-24 13:26:48 +00:00

# Evals — Agent Guide
This package evaluates whether AI agents correctly implement Supabase tasks
when using skill documentation. It is modeled after
[Vercel's next-evals-oss](https://github.com/vercel-labs/next-evals-oss): each
eval is a self-contained project with a task prompt; the agent works on the
task, and hidden tests check the result. Binary pass/fail.
## Architecture
```
1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
2. Install skills via `skills add` CLI (or skip for baseline)
3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
4. Agent reads skill, creates migrations/code in the workspace
5. Copy hidden EVAL.ts into workspace, run vitest
6. Capture pass/fail
```
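Steps 1 and 3 above can be sketched in TypeScript. This is illustrative only (the real orchestration lives in `src/runner.ts`); the function names and the exact argv shape are assumptions, but the flags are the ones this guide describes:

```typescript
import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Step 1: create a temp workspace containing the eval skeleton.
export function scaffoldWorkspace(prompt: string): string {
  const workspace = mkdtempSync(join(tmpdir(), "eval-"));
  writeFileSync(join(workspace, "PROMPT.md"), prompt);
  mkdirSync(join(workspace, "supabase", "migrations"), { recursive: true });
  return workspace;
}

// Step 3: the argv for invoking Claude Code in print mode against the workspace.
export function agentCommand(workspace: string, prompt: string): string[] {
  return ["claude", "-p", prompt, "--cwd", workspace];
}
```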
The agent is **Claude Code** invoked via `claude -p` (print mode). It operates
on a real filesystem in a temp directory and can read/write files freely.
**Important**: MCP servers are disabled via `--strict-mcp-config` with an empty
config. This ensures the agent uses only local tools (Bash, Edit, Write, Read,
Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All
work must happen on the local filesystem — e.g., creating migration files in
`supabase/migrations/`, not applying them to a remote project.
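A minimal sketch of how the MCP lockdown might be wired up: write an empty MCP config and pass it alongside `--strict-mcp-config` (the flag named above). The config file name, its JSON shape, and the `--mcp-config` pairing are assumptions here, not confirmed details of the runner:

```typescript
import { mkdtempSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// Write an empty MCP config to disk and return the flags that pin the
// agent to it, so no MCP servers are available during the run.
export function mcpDisableArgs(): string[] {
  const configPath = join(mkdtempSync(join(tmpdir(), "mcp-")), "empty-mcp.json");
  writeFileSync(configPath, JSON.stringify({ mcpServers: {} }));
  return ["--mcp-config", configPath, "--strict-mcp-config"];
}
```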
## Eval Structure
Each eval lives in `evals/{scenario-name}/`:
```
evals/auth-rls-new-project/
  PROMPT.md            # Task description (visible to agent)
  EVAL.ts              # Vitest assertions (hidden from agent during run)
  package.json         # Minimal project manifest
  supabase/
    config.toml        # Pre-initialized supabase config
    migrations/        # Empty — agent creates files here
```
**EVAL.ts** is only copied into the workspace after the agent finishes.
This prevents the agent from "teaching to the test."
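The kind of check a hidden `EVAL.ts` might run, sketched as a standalone helper (real `EVAL.ts` files use vitest assertions; the function name here is hypothetical): did the agent actually create SQL migrations in the workspace?

```typescript
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";

// Return the SQL migration files the agent created, if any. An empty
// result would fail the eval's "at least one migration exists" assertion.
export function listMigrations(workspace: string): string[] {
  const dir = join(workspace, "supabase", "migrations");
  if (!existsSync(dir)) return [];
  return readdirSync(dir).filter((name) => name.endsWith(".sql"));
}
```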
## Running Evals
Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
source files haven't changed. Use `--force` to bypass this caching when you
need a re-run anyway (e.g., after changing environment variables, or to
repeat the same scenario):
```bash
# Run all scenarios with skills (default)
mise run eval
# Force re-run (bypass source caching)
mise run --force eval
# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval
# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval
# Run without skills (baseline)
EVAL_BASELINE=true mise run eval
# Install only a specific skill
EVAL_SKILL=supabase mise run eval
# Upload results to Braintrust
mise run eval:upload
# Force upload (bypass cache)
mise run --force eval:upload
```
Or directly (no caching, always runs):
```bash
cd packages/evals
npx tsx src/runner.ts
# Single scenario, baseline mode
EVAL_BASELINE=true EVAL_SCENARIO=auth-rls-new-project npx tsx src/runner.ts
```
## Baseline Mode
Set `EVAL_BASELINE=true` to run scenarios **without** skills. By default,
scenarios run with skills installed via the `skills` CLI.
To compare with-skill vs baseline, run evals twice:
```bash
mise run eval # with skills
EVAL_BASELINE=true mise run eval # without skills (baseline)
```
Compare the results to measure how much skills improve agent output.
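The comparison can be reduced to a pass-rate delta. The result shape below is hypothetical (the runner's real types live in `src/types.ts` and may differ); it just shows the arithmetic:

```typescript
// Hypothetical per-scenario result from one eval run.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
}

// Fraction of scenarios that passed.
export function passRate(results: ScenarioResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.passed).length / results.length;
}

// Positive lift means skills helped.
export function skillLift(
  withSkills: ScenarioResult[],
  baseline: ScenarioResult[],
): number {
  return passRate(withSkills) - passRate(baseline);
}
```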
## Adding Scenarios
1. Create `evals/{scenario-name}/` with `PROMPT.md`, `EVAL.ts`, `package.json`
2. Add any starter files the agent should see (e.g., `supabase/config.toml`)
3. Write vitest assertions in `EVAL.ts` that check the agent's output files
4. Document the scenario in `scenarios/SCENARIOS.md`
## Environment
```
ANTHROPIC_API_KEY=sk-ant-...      # Required: Claude Code authentication
EVAL_MODEL=...                    # Optional: override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO=...                 # Optional: run single scenario
EVAL_SKILL=...                    # Optional: install only this skill (e.g., "supabase")
EVAL_BASELINE=true                # Optional: run without skills (baseline mode)
BRAINTRUST_UPLOAD=true            # Optional: upload results to Braintrust
BRAINTRUST_API_KEY=...            # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_PROJECT_ID=...         # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_BASE_EXPERIMENT=...    # Optional: compare against a named experiment
```
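A sketch of how a runner might read this configuration. The variable names and the default model come from the table above; the returned shape, function name, and the fail-fast check on the Braintrust keys are assumptions:

```typescript
export interface EvalConfig {
  model: string;
  scenario?: string;
  skill?: string;
  baseline: boolean;
  upload: boolean;
}

export function readConfig(env: Record<string, string | undefined>): EvalConfig {
  const upload = env.BRAINTRUST_UPLOAD === "true";
  // Both Braintrust keys are documented as required when uploading.
  if (upload && (!env.BRAINTRUST_API_KEY || !env.BRAINTRUST_PROJECT_ID)) {
    throw new Error(
      "BRAINTRUST_API_KEY and BRAINTRUST_PROJECT_ID are required when BRAINTRUST_UPLOAD=true",
    );
  }
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-5-20250929",
    scenario: env.EVAL_SCENARIO,
    skill: env.EVAL_SKILL,
    baseline: env.EVAL_BASELINE === "true",
    upload,
  };
}
```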
## Key Files
```
src/
  runner.ts              # Main orchestrator
  types.ts               # Core interfaces
  runner/
    scaffold.ts          # Creates temp workspace from eval template
    agent.ts             # Invokes claude -p as subprocess
    test.ts              # Runs vitest EVAL.ts against workspace
    results.ts           # Collects results and prints summary
evals/
  auth-rls-new-project/  # Scenario 1
scenarios/
  SCENARIOS.md           # Scenario descriptions
```