# Evals

Agent evaluation system for Supabase skills. Tests whether AI agents (starting with Claude Code) correctly implement Supabase tasks when given access to skill documentation.

## How It Works

Each eval is a self-contained project directory with a task prompt. The agent works on it autonomously, then hidden vitest assertions check the result. Binary pass/fail.

```
1. Create temp workspace from eval template
2. Agent (claude -p) reads prompt and creates files
3. Hidden EVAL.ts runs vitest assertions against the output
4. Pass/fail
```

## Usage

```bash
# Run all scenarios
mise run eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Run with baseline comparison (with-skill vs without-skill)
EVAL_BASELINE=true mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval
```

### Environment Variables

```
ANTHROPIC_API_KEY    Required: Claude Code authentication
EVAL_MODEL           Override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO        Run single scenario by name
EVAL_BASELINE=true   Run baseline comparison (no skill)
```

## Adding Scenarios

1. Create `evals/{name}/` with `PROMPT.md`, `EVAL.ts`, and starter files
2. Write vitest assertions in `EVAL.ts`
3. Document in `scenarios/SCENARIOS.md`

See [AGENTS.md](AGENTS.md) for full details.
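
As a sketch of what the hidden assertions in step 2 might check, here is a minimal helper an `EVAL.ts` could use to verify agent output. The function name, table name, and SQL pattern are illustrative assumptions, not the repo's actual checks; in a real `EVAL.ts` this logic would run inside vitest's `describe`/`it`/`expect`.

```typescript
// Hypothetical assertion helper for an RLS scenario (a sketch, not the
// repo's actual eval). Checks whether the agent's migration SQL enables
// row level security on a given table.
export function enablesRls(sql: string, table: string): boolean {
  // Match e.g. "alter table public.profiles enable row level security",
  // tolerating an optional schema prefix and variable whitespace.
  const pattern = new RegExp(
    `alter\\s+table\\s+\\S*${table}\\s+enable\\s+row\\s+level\\s+security`,
    "i",
  );
  return pattern.test(sql);
}
```

Inside `EVAL.ts`, such a helper would typically be paired with `node:fs` reads of the agent's workspace, e.g. `expect(enablesRls(readFileSync(migrationPath, "utf8"), "profiles")).toBe(true)`, where `migrationPath` is whatever file the scenario expects the agent to create.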