# Evals — Agent Guide
This package evaluates whether AI agents correctly implement Supabase tasks
when using skill documentation. Built on
[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
self-contained scenario defined by a task prompt; the agent works in a Docker
sandbox, and hidden vitest assertions check the result. Scoring is binary
pass/fail.
## Architecture
```
1. eval.sh starts Supabase, exports keys
2. agent-eval reads experiments/experiment.ts
3. For each scenario:
   a. setup() resets DB, writes config + skills into Docker sandbox
   b. Agent (Claude Code) runs PROMPT.md in the sandbox
   c. EVAL.ts (vitest) asserts against agent output
4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
5. Optional: upload.ts pushes results to Braintrust
```
The agent is **Claude Code** running inside a Docker sandbox managed by
`@vercel/agent-eval`. It operates on a real filesystem and can read/write files
freely.
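The results layout from pipeline step 4 can be sketched from the shell. The timestamp format, scenario name, and run count below are illustrative assumptions, not values `@vercel/agent-eval` guarantees:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Illustrative only: mimic the results/ tree written in step 4.
# Timestamp format, scenario name, and run count are assumptions.
ts="$(date -u +%Y-%m-%dT%H-%M-%S)"
scenario="auth-rls-new-project"
for n in 1 2 3; do
  mkdir -p "results/experiment/${ts}/${scenario}/run-${n}"
done
find results -type d | sort
```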
## File Structure
```
packages/evals/
  experiments/
    experiment.ts    # ExperimentConfig — agent, sandbox, setup() hook
  scripts/
    eval.sh          # Supabase lifecycle wrapper (start → eval → stop)
  src/
    upload.ts        # Standalone Braintrust result uploader
  evals/
    eval-utils.ts    # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
    {scenario}/
      PROMPT.md      # Task description (visible to agent)
      EVAL.ts        # Vitest assertions (hidden from agent during run)
      meta.ts        # expectedReferenceFiles for scoring
      package.json   # Minimal manifest with vitest devDep
  project/
    supabase/
      config.toml    # Shared Supabase config seeded into each sandbox
  scenarios/         # Workflow scenario proposals
  results/           # Output from eval runs (gitignored)
```
## Running Evals
```bash
# Run all scenarios with skills
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Dry run (no API calls)
mise run eval:dry

# Upload results to Braintrust
mise run eval:upload
```
## Baseline Mode
By default, skill files from `skills/supabase/` are written into the sandbox.
Set `EVAL_BASELINE=true` to run scenarios **without** skills injected, then
compare with-skills vs. baseline results:
```bash
mise run eval # with skills
EVAL_BASELINE=true mise run eval # without skills (baseline)
```
## Adding Scenarios
1. Create `evals/{scenario-name}/` with:
   - `PROMPT.md` — task description for the agent
   - `EVAL.ts` — vitest assertions checking agent output
   - `meta.ts` — export an `expectedReferenceFiles` array for scoring
   - `package.json` — minimal manifest: `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
2. Add any starter files the agent should see (they get copied via `setup()`)
3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
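The steps above can be sketched as a scaffold script. The scenario name `my-new-scenario` and all file contents here are placeholders, not repo conventions:

```bash
#!/usr/bin/env bash
set -euo pipefail
# Hypothetical scaffold for a new scenario; adapt names and contents.
scenario="my-new-scenario"
mkdir -p "evals/${scenario}"

# Task description the agent will see
cat > "evals/${scenario}/PROMPT.md" <<'EOF'
Describe the Supabase task for the agent here.
EOF

# Reference files used for scoring (fill in real paths)
cat > "evals/${scenario}/meta.ts" <<'EOF'
export const expectedReferenceFiles: string[] = [];
EOF

# Minimal manifest so vitest resolves inside the scenario
cat > "evals/${scenario}/package.json" <<'EOF'
{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }
EOF

# Hidden vitest assertions — start by copying an existing scenario's EVAL.ts
touch "evals/${scenario}/EVAL.ts"
```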
## Environment
```
ANTHROPIC_API_KEY=sk-ant-... # Required: Claude Code authentication
EVAL_MODEL=... # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=... # Optional: run single scenario
EVAL_BASELINE=true # Optional: run without skills
BRAINTRUST_API_KEY=... # Required for eval:upload
BRAINTRUST_PROJECT_ID=... # Required for eval:upload
```
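A preflight sketch (not a repo script) that checks these variables before a run; the helper name `require` is a placeholder:

```bash
#!/usr/bin/env bash
# Hypothetical preflight check, not part of the repo.
require() {
  # ${!1} is bash indirect expansion: read the variable named by $1
  if [ -z "${!1:-}" ]; then
    echo "missing required env var: $1" >&2
    return 1
  fi
}

# Demo with a dummy value; a real run would export actual secrets.
export ANTHROPIC_API_KEY="dummy-for-demo"
require ANTHROPIC_API_KEY && echo "ANTHROPIC_API_KEY is set"
```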
## Docker Evals
Build and run evals inside Docker (e.g., for CI):
```bash
mise run eval:docker:build # Build the eval Docker image
mise run eval:docker # Run evals in Docker
mise run eval:docker:shell # Debug shell in eval container
```