Agent evaluation system for Supabase skills. Tests whether AI agents (starting with Claude Code) correctly implement Supabase tasks when given access to skill documentation.

How It Works

Each eval is a self-contained project directory with a task prompt. The agent works on it autonomously, then hidden vitest assertions check the result. The outcome is binary: pass or fail.

1. Create a temporary workspace from the eval template
2. The agent (claude -p) reads the prompt and creates files
3. The hidden EVAL.ts runs vitest assertions against the output
4. Report the result as pass or fail
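
The four steps can be sketched as a small orchestration function. This is only an illustration of the flow, not the project's actual runner: EvalSteps, runEval, and the fake step implementations are hypothetical names, and treating a thrown error as a fail is an assumption. A real runner would presumably copy the template to a temp directory, invoke claude -p, and run vitest in place of the stubs.

```typescript
// Hypothetical sketch of the eval flow; none of these names come from the project.
type EvalResult = { scenario: string; passed: boolean; error?: string };

interface EvalSteps {
  createWorkspace(scenario: string): string; // step 1: returns the workspace path
  runAgent(workspace: string): void;         // step 2: agent reads prompt, writes files
  runAssertions(workspace: string): boolean; // step 3: hidden assertions, true if all pass
}

function runEval(scenario: string, steps: EvalSteps): EvalResult {
  try {
    const workspace = steps.createWorkspace(scenario);
    steps.runAgent(workspace);
    const passed = steps.runAssertions(workspace);
    return { scenario, passed }; // step 4: binary result
  } catch (err) {
    // Assumption: any failure along the way counts as a fail.
    const message = err instanceof Error ? err.message : String(err);
    return { scenario, passed: false, error: message };
  }
}

// Usage with stubbed steps that only record the order in which they run.
const calls: string[] = [];
const fakeSteps: EvalSteps = {
  createWorkspace: (scenario) => {
    calls.push("workspace");
    return `/tmp/${scenario}`;
  },
  runAgent: () => {
    calls.push("agent");
  },
  runAssertions: () => {
    calls.push("assertions");
    return true;
  },
};

const result = runEval("auth-rls-new-project", fakeSteps);
console.log(result.passed ? "PASS" : "FAIL");
```

Passing the steps in as an interface keeps the sketch runnable without the Claude Code CLI or vitest installed.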

Usage

# Run all scenarios
mise run eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Run with baseline comparison (with-skill vs without-skill)
EVAL_BASELINE=true mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

Environment Variables

ANTHROPIC_API_KEY         Required: Claude Code authentication
EVAL_MODEL                Override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO             Run single scenario by name
EVAL_BASELINE=true        Also run each scenario without the skill for a baseline comparison
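
As a rough illustration of how these variables might be resolved into a config object, here is a minimal sketch. The variable names and the default model come from the table above. The resolveConfig function, the EvalConfig shape, and the rule that only the exact string "true" enables the baseline are assumptions, not the project's actual code.

```typescript
// Hypothetical config resolution; resolveConfig and EvalConfig are illustrative names.
type Env = Record<string, string | undefined>;

interface EvalConfig {
  apiKey: string;
  model: string;
  scenario: string | undefined; // undefined means "run all scenarios"
  baseline: boolean;
}

const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";

function resolveConfig(env: Env): EvalConfig {
  const apiKey = env["ANTHROPIC_API_KEY"];
  if (!apiKey) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }
  return {
    apiKey,
    model: env["EVAL_MODEL"] || DEFAULT_MODEL,
    scenario: env["EVAL_SCENARIO"] || undefined,
    // Assumption: only the literal string "true" turns the baseline run on.
    baseline: env["EVAL_BASELINE"] === "true",
  };
}

// Usage: pass a plain object here; a real runner would likely pass process.env.
const config = resolveConfig({
  ANTHROPIC_API_KEY: "sk-example",
  EVAL_SCENARIO: "auth-rls-new-project",
});
console.log(config.model); // claude-sonnet-4-5-20250929
```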

Adding Scenarios

1. Create evals/{name}/ with PROMPT.md, EVAL.ts, and starter files
2. Write vitest assertions in EVAL.ts
3. Document in scenarios/SCENARIOS.md
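
Before running a new scenario, a quick layout check can catch a forgotten file. The helper below is a hypothetical sketch, not part of the eval system. It only assumes that PROMPT.md and EVAL.ts are the two required files named above, and leaves starter files unchecked because they vary per scenario.

```typescript
// Hypothetical pre-flight check for a scenario directory listing.
const REQUIRED_FILES = ["PROMPT.md", "EVAL.ts"];

// Returns the required files that are absent from the given file names.
function missingScenarioFiles(fileNames: string[]): string[] {
  const present = new Set(fileNames);
  return REQUIRED_FILES.filter((name) => !present.has(name));
}

// Usage: the listing would normally come from reading evals/{name}/.
const missing = missingScenarioFiles(["PROMPT.md", "package.json"]);
console.log(missing.join(", ")); // EVAL.ts
```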

See AGENTS.md for full details.