mirror of https://github.com/supabase/agent-skills.git (synced 2026-03-27 10:09:26 +08:00)
# Evals — Agent Guide
This package evaluates whether AI agents correctly implement Supabase tasks when using skill documentation. Built on `@vercel/agent-eval`: each eval is a self-contained scenario with a task prompt, the agent works in a Docker sandbox, and hidden vitest assertions check the result. Scoring is binary pass/fail.
## Architecture

1. `eval.sh` starts Supabase and exports keys
2. `agent-eval` reads `experiments/experiment.ts`
3. For each scenario:
   a. `setup()` resets the DB and writes config + skills into the Docker sandbox
   b. The agent (Claude Code) runs `PROMPT.md` in the sandbox
   c. `EVAL.ts` (vitest) asserts against the agent's output
4. Results are saved to `results/experiment/{timestamp}/{scenario}/run-{N}/`
5. Optional: `upload.ts` pushes results to Braintrust
The agent is Claude Code running inside a Docker sandbox managed by `@vercel/agent-eval`. It operates on a real filesystem and can read/write files freely.
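The per-run results layout in step 4 can be sketched as a small path helper. The directory template comes from this guide; the timestamp format and function name are assumptions for illustration only:

```typescript
import * as path from "node:path";

// Sketch: derive results/experiment/{timestamp}/{scenario}/run-{N}/
// (hypothetical helper — the real layout is produced by agent-eval).
function resultsDir(scenario: string, run: number, when: Date): string {
  // e.g. "2026-03-27T02-09-26": colons/dots replaced for filesystem safety
  const stamp = when.toISOString().replace(/[:.]/g, "-").slice(0, 19);
  return path.join("results", "experiment", stamp, scenario, `run-${run}`);
}
```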
## File Structure

```
packages/evals/
├── experiments/
│   └── experiment.ts     # ExperimentConfig — agent, sandbox, setup() hook
├── scripts/
│   └── eval.sh           # Supabase lifecycle wrapper (start → eval → stop)
├── src/
│   └── upload.ts         # Standalone Braintrust result uploader
├── evals/
│   ├── eval-utils.ts     # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
│   └── {scenario}/
│       ├── PROMPT.md     # Task description (visible to agent)
│       ├── EVAL.ts       # Vitest assertions (hidden from agent during run)
│       ├── meta.ts       # expectedReferenceFiles for scoring
│       └── package.json  # Minimal manifest with vitest devDep
├── project/
│   └── supabase/
│       └── config.toml   # Shared Supabase config seeded into each sandbox
├── scenarios/            # Workflow scenario proposals
└── results/              # Output from eval runs (gitignored)
```
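For a sense of the scoring metadata, a scenario's `meta.ts` might look like the sketch below. Only the exported `expectedReferenceFiles` array is implied by this guide; the file paths are hypothetical:

```typescript
// meta.ts — hypothetical example; real scenarios list the reference files
// the agent is expected to produce, used for scoring.
export const expectedReferenceFiles = [
  "supabase/migrations/20260327000000_create_profiles.sql",
  "supabase/config.toml",
];
```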
## Running Evals

```sh
# Run all scenarios with skills
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Dry run (no API calls)
mise run eval:dry

# Upload results to Braintrust
mise run eval:upload
```
## Baseline Mode

Set `EVAL_BASELINE=true` to run scenarios without skills injected. By default, skill files from `skills/supabase/` are written into the sandbox. Compare with-skills vs. baseline:

```sh
mise run eval                      # with skills
EVAL_BASELINE=true mise run eval   # without skills (baseline)
```
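The baseline toggle can be pictured as a simple guard inside the `setup()` hook. This is a sketch; the helper name is hypothetical and the real logic lives in `experiments/experiment.ts`:

```typescript
// Sketch: decide whether setup() should copy skill files into the sandbox.
// EVAL_BASELINE is assumed to be the literal string "true" in baseline mode.
function shouldInjectSkills(env: Record<string, string | undefined>): boolean {
  return env.EVAL_BASELINE !== "true";
}
```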
## Adding Scenarios

- Create `evals/{scenario-name}/` with:
  - `PROMPT.md` — task description for the agent
  - `EVAL.ts` — vitest assertions checking agent output
  - `meta.ts` — export `expectedReferenceFiles` array for scoring
  - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
- Add any starter files the agent should see (they get copied via `setup()`)
- Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
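This guide doesn't show the helper implementations; as an illustration only, a `findMigrationFiles`-style helper might look roughly like this (signature and behavior assumed, not the actual `eval-utils.ts`):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: list .sql migration files under a project's
// supabase/migrations directory, sorted by filename.
function findMigrationFiles(projectDir: string): string[] {
  const dir = path.join(projectDir, "supabase", "migrations");
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .sort()
    .map((f) => path.join(dir, f));
}
```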
## Environment

```sh
ANTHROPIC_API_KEY=sk-ant-...   # Required: Claude Code authentication
EVAL_MODEL=...                 # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=...              # Optional: run single scenario
EVAL_BASELINE=true             # Optional: run without skills
BRAINTRUST_API_KEY=...         # Required for eval:upload
BRAINTRUST_PROJECT_ID=...      # Required for eval:upload
```
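A quick sanity check of these variables before a run could look like the sketch below. The variable names and the default model come from the table above; the function and interface names are hypothetical:

```typescript
interface EvalSettings {
  model: string;
  scenario?: string;
  baseline: boolean;
}

// Sketch: resolve eval settings from the environment.
// ANTHROPIC_API_KEY is the one hard requirement for a normal run.
function resolveEvalSettings(env: Record<string, string | undefined>): EvalSettings {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is required for Claude Code authentication");
  }
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-6",
    scenario: env.EVAL_SCENARIO,
    baseline: env.EVAL_BASELINE === "true",
  };
}
```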
## Docker Evals

Build and run evals inside Docker (e.g., for CI):

```sh
mise run eval:docker:build   # Build the eval Docker image
mise run eval:docker         # Run evals in Docker
mise run eval:docker:shell   # Debug shell in eval container
```