# Evals
Agent evaluation system for Supabase skills. Tests whether AI agents (starting
with Claude Code) correctly implement Supabase tasks when given access to skill
documentation.
## How It Works
Each eval is a self-contained project directory with a task prompt. The agent
works on it autonomously, then hidden vitest assertions check the result.
Binary pass/fail.
```
1. Create temp workspace from eval template
2. Agent (claude -p) reads prompt and creates files
3. Hidden EVAL.ts runs vitest assertions against the output
4. Pass/fail
```
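The loop above can be sketched in TypeScript. This is a simplified illustration, not the actual harness: the function names (`buildAgentCommand`, `runScenario`) and directory layout are assumptions.

```typescript
import { mkdtempSync, cpSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { execSync } from "node:child_process";

// Build the headless Claude Code invocation for a scenario prompt.
function buildAgentCommand(model: string, promptPath: string): string {
  return `claude -p "$(cat ${promptPath})" --model ${model}`;
}

// Hypothetical runner: one scenario in, binary pass/fail out.
function runScenario(name: string, model: string): boolean {
  // 1. Copy the eval template into a fresh temp workspace
  const workspace = mkdtempSync(join(tmpdir(), `eval-${name}-`));
  cpSync(join("evals", name), workspace, { recursive: true });

  // 2. Let the agent work on the prompt autonomously
  execSync(buildAgentCommand(model, join(workspace, "PROMPT.md")), {
    cwd: workspace,
  });

  // 3–4. Run the hidden vitest assertions; the exit code decides pass/fail
  try {
    execSync("npx vitest run EVAL.ts", { cwd: workspace });
    return true;
  } catch {
    return false;
  }
}
```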
## Usage
```bash
# Run all scenarios
mise run eval
# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval
# Run with baseline comparison (with-skill vs without-skill)
EVAL_BASELINE=true mise run eval
# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval
```
### Environment Variables
```
ANTHROPIC_API_KEY Required: Claude Code authentication
EVAL_MODEL Override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO Run single scenario by name
EVAL_BASELINE=true Run baseline comparison (no skill)
```
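A config loader mirroring these variables might look like the sketch below; the `EvalConfig` shape and `loadConfig` name are assumptions, but the defaults match the table above.

```typescript
interface EvalConfig {
  model: string;
  scenario?: string;
  baseline: boolean;
}

// Hypothetical loader: fail fast on the required key, default the rest.
function loadConfig(
  env: Record<string, string | undefined> = process.env
): EvalConfig {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is required for Claude Code authentication");
  }
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-5-20250929",
    scenario: env.EVAL_SCENARIO,
    baseline: env.EVAL_BASELINE === "true",
  };
}
```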
## Adding Scenarios
1. Create `evals/{name}/` with `PROMPT.md`, `EVAL.ts`, and starter files
2. Write vitest assertions in `EVAL.ts`
3. Document in `scenarios/SCENARIOS.md`
See [AGENTS.md](AGENTS.md) for full details.