Evals — Agent Guide

This package evaluates whether AI agents correctly implement Supabase tasks when using skill documentation. It is modeled after Vercel's next-evals-oss: each eval is a self-contained project with a task prompt, the agent works on the task, and hidden tests then check the result. Scoring is binary pass/fail.

Architecture

1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
2. Install skills via `skills add` CLI (or skip for baseline)
3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
4. Agent reads skill, creates migrations/code in the workspace
5. Copy hidden EVAL.ts into workspace, run vitest
6. Capture pass/fail

The agent is Claude Code invoked via claude -p (print mode). It operates on a real filesystem in a temp directory and can read/write files freely.

Important: MCP servers are disabled via --strict-mcp-config with an empty config. This ensures the agent uses only local tools (Bash, Edit, Write, Read, Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All work must happen on the local filesystem — e.g., creating migration files in supabase/migrations/, not applying them to a remote project.
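As a rough illustration of the invocation described above, the argument list for a claude -p run could be assembled as follows. This is a minimal sketch, not the contents of src/runner/agent.ts: the helper name buildAgentArgs is hypothetical, and passing the empty MCP config as an inline JSON string to --mcp-config is an assumption about how the empty config is supplied.

```typescript
// Hypothetical helper: builds the argument list for a `claude -p` run.
// The real flags used by src/runner/agent.ts are not shown in this guide.
function buildAgentArgs(prompt: string, model?: string): string[] {
  // Assumption: an empty server map means no MCP servers are available.
  const emptyMcpConfig = JSON.stringify({ mcpServers: {} });

  const args = [
    "-p",
    prompt,
    "--strict-mcp-config", // only use servers from --mcp-config
    "--mcp-config",
    emptyMcpConfig,
  ];

  // Only pass --model when an override (e.g. EVAL_MODEL) was provided.
  if (model) {
    args.push("--model", model);
  }
  return args;
}
```

The resulting array would typically be handed to a subprocess call such as spawn("claude", args, { cwd: workspaceDir }) from node:child_process, so that the agent works inside the temp workspace.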

Eval Structure

Each eval lives in evals/{scenario-name}/:

evals/auth-rls-new-project/
  PROMPT.md          # Task description (visible to agent)
  EVAL.ts            # Vitest assertions (hidden from agent during run)
  package.json       # Minimal project manifest
  supabase/
    config.toml      # Pre-initialized supabase config
    migrations/      # Empty — agent creates files here

EVAL.ts is copied into the workspace only after the agent finishes, so the agent cannot see the assertions and tailor its output to the tests.

Running Evals

Eval tasks in mise.toml have sources defined, so mise skips them when source files haven't changed. Use --force to bypass caching when you need to re-run evals regardless (e.g., after changing environment variables or re-running the same scenario):

# Run all scenarios with skills (default)
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Install only a specific skill
EVAL_SKILL=supabase mise run eval

# Upload results to Braintrust
mise run eval:upload

# Force upload (bypass cache)
mise run --force eval:upload

Baseline Mode

Set EVAL_BASELINE=true to run scenarios without skills. By default, scenarios run with skills installed via the skills CLI.

To compare with-skill vs baseline, run evals twice:

mise run eval                        # with skills
EVAL_BASELINE=true mise run eval     # without skills (baseline)

Compare the results to measure how much skills improve agent output.
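One way to make that comparison concrete is to compute the pass rate of each run and list the scenarios that pass only with skills. The sketch below assumes a simple per-scenario result shape; ScenarioResult, passRate, and fixedBySkills are hypothetical names, and the real interfaces live in src/types.ts.

```typescript
// Hypothetical result shape; the real interfaces live in src/types.ts.
interface ScenarioResult {
  scenario: string;
  passed: boolean;
}

// Fraction of scenarios that passed (0 when there are no results).
function passRate(results: ScenarioResult[]): number {
  if (results.length === 0) return 0;
  const passedCount = results.filter((r) => r.passed).length;
  return passedCount / results.length;
}

// Scenarios that pass with skills but did not pass in the baseline run.
function fixedBySkills(
  withSkills: ScenarioResult[],
  baseline: ScenarioResult[],
): string[] {
  const baselinePassed = baseline
    .filter((r) => r.passed)
    .map((r) => r.scenario);
  return withSkills
    .filter((r) => r.passed && baselinePassed.indexOf(r.scenario) === -1)
    .map((r) => r.scenario);
}
```

Because each scenario is binary pass/fail, the difference between the two pass rates is the simplest summary of how much the skills helped.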

Adding Scenarios

1. Create evals/{scenario-name}/ with PROMPT.md, EVAL.ts, package.json
2. Add any starter files the agent should see (e.g., supabase/config.toml)
3. Write vitest assertions in EVAL.ts that check the agent's output files
4. Document the scenario in scenarios/SCENARIOS.md
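The kind of check an EVAL.ts might make on the agent's output can be sketched as a small predicate over migration SQL. This is an illustration under assumptions, not the actual assertions of any scenario: the function name enablesRls and the tasks table used in the usage note are hypothetical.

```typescript
// Hypothetical check: does the migration SQL enable row level security
// on the given table? Matching is case-insensitive and tolerates an
// optional "if exists", "public." schema prefix, and quoted table name.
function enablesRls(sql: string, table: string): boolean {
  // Escape regex metacharacters in the table name.
  const escaped = table.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const pattern = new RegExp(
    `alter\\s+table\\s+(?:if\\s+exists\\s+)?(?:public\\.)?"?${escaped}"?` +
      `\\s+enable\\s+row\\s+level\\s+security`,
    "i",
  );
  return pattern.test(sql);
}
```

In an EVAL.ts, a test would typically read the files under supabase/migrations/, concatenate their contents, and wrap the predicate in a vitest assertion such as expect(enablesRls(sql, "tasks")).toBe(true).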

Environment

ANTHROPIC_API_KEY=sk-ant-...       # Required: Claude Code authentication
EVAL_MODEL=...                     # Optional: override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO=...                  # Optional: run single scenario
EVAL_SKILL=...                     # Optional: install only this skill (e.g., "supabase")
EVAL_BASELINE=true                 # Optional: run without skills (baseline mode)
BRAINTRUST_UPLOAD=true             # Optional: upload results to Braintrust
BRAINTRUST_API_KEY=...             # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_PROJECT_ID=...          # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_BASE_EXPERIMENT=...     # Optional: compare against a named experiment
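
The variables above could be read and validated along these lines. This is a minimal sketch assuming the rules stated in the comments of the list; EvalConfig and readEvalConfig are hypothetical names, not the runner's actual API.

```typescript
// Hypothetical config shape derived from the environment variables above.
interface EvalConfig {
  model: string;
  scenario?: string;
  skill?: string;
  baseline: boolean;
  braintrustUpload: boolean;
}

// Default model as documented for EVAL_MODEL.
const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";

function readEvalConfig(env: Record<string, string | undefined>): EvalConfig {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }

  // The Braintrust credentials are only required when uploading.
  const braintrustUpload = env.BRAINTRUST_UPLOAD === "true";
  if (
    braintrustUpload &&
    (!env.BRAINTRUST_API_KEY || !env.BRAINTRUST_PROJECT_ID)
  ) {
    throw new Error(
      "BRAINTRUST_API_KEY and BRAINTRUST_PROJECT_ID are required when BRAINTRUST_UPLOAD=true",
    );
  }

  return {
    model: env.EVAL_MODEL || DEFAULT_MODEL,
    scenario: env.EVAL_SCENARIO || undefined,
    skill: env.EVAL_SKILL || undefined,
    baseline: env.EVAL_BASELINE === "true",
    braintrustUpload,
  };
}
```

In a Node process this would be called as readEvalConfig(process.env), failing fast before any workspace is created.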

Key Files

src/
  runner.ts              # Main orchestrator
  types.ts               # Core interfaces
  runner/
    scaffold.ts          # Creates temp workspace from eval template
    agent.ts             # Invokes claude -p as subprocess
    test.ts              # Runs vitest EVAL.ts against workspace
    results.ts           # Collects results and prints summary
evals/
  auth-rls-new-project/  # Scenario 1
scenarios/
  SCENARIOS.md           # Scenario descriptions
