mirror of https://github.com/supabase/agent-skills.git (synced 2026-03-27 10:09:26 +08:00)
# Evals — Agent Guide
This package evaluates whether AI agents correctly implement Supabase tasks when using skill documentation. Built on `@vercel/agent-eval`: each eval is a self-contained scenario with a task prompt, the agent works in a Docker sandbox, and hidden vitest assertions check the result. Scoring is binary pass/fail.
## Architecture

1. `eval.sh` starts Supabase and exports keys
2. `agent-eval` reads `experiments/experiment.ts`
3. For each scenario:
   a. `setup()` resets the DB and writes config + skills into the Docker sandbox
   b. The agent (Claude Code) runs `PROMPT.md` in the sandbox
   c. `EVAL.ts` (vitest) asserts against the agent's output
4. Results are saved to `results/experiment/{timestamp}/{scenario}/run-{N}/`
5. Optional: `upload.ts` pushes results to Braintrust
The agent is Claude Code running inside a Docker sandbox managed by `@vercel/agent-eval`. It operates on a real filesystem and can read/write files freely.
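The per-run results layout in step 4 can be sketched as a small path helper. The directory template comes from this guide; the timestamp format and function name are assumptions for illustration only:

```typescript
import * as path from "node:path";

// Sketch: derive results/experiment/{timestamp}/{scenario}/run-{N}/
// (hypothetical helper — the real layout is produced by agent-eval).
function resultsDir(scenario: string, run: number, when: Date): string {
  // e.g. "2026-03-27T02-09-26": colons/dots replaced for filesystem safety
  const stamp = when.toISOString().replace(/[:.]/g, "-").slice(0, 19);
  return path.join("results", "experiment", stamp, scenario, `run-${run}`);
}
```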
## File Structure

```
packages/evals/
├── experiments/
│   └── experiment.ts     # ExperimentConfig — agent, sandbox, setup() hook
├── scripts/
│   └── eval.sh           # Supabase lifecycle wrapper (start → eval → stop)
├── src/
│   └── upload.ts         # Standalone Braintrust result uploader
├── evals/
│   ├── eval-utils.ts     # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
│   └── {scenario}/
│       ├── PROMPT.md     # Task description (visible to agent)
│       ├── EVAL.ts       # Vitest assertions (hidden from agent during run)
│       ├── meta.ts       # expectedReferenceFiles for scoring
│       └── package.json  # Minimal manifest with vitest devDep
├── project/
│   └── supabase/
│       └── config.toml   # Shared Supabase config seeded into each sandbox
├── scenarios/            # Workflow scenario proposals
└── results/              # Output from eval runs (gitignored)
```
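For a sense of the scoring metadata, a scenario's `meta.ts` might look like the sketch below. Only the exported `expectedReferenceFiles` array is implied by this guide; the file paths are hypothetical:

```typescript
// meta.ts — hypothetical example; real scenarios list the reference files
// the agent is expected to produce, used for scoring.
export const expectedReferenceFiles = [
  "supabase/migrations/20260327000000_create_profiles.sql",
  "supabase/config.toml",
];
```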
## Running Evals

```sh
# Run all scenarios with skills
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Dry run (no API calls)
mise run eval:dry

# Upload results to Braintrust
mise run eval:upload
```
## Baseline Mode

Set `EVAL_BASELINE=true` to run scenarios without skills injected. By default, skill files from `skills/supabase/` are written into the sandbox. Compare with-skills vs. baseline:

```sh
mise run eval                      # with skills
EVAL_BASELINE=true mise run eval   # without skills (baseline)
```
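The baseline toggle can be pictured as a simple guard inside the `setup()` hook. This is a sketch; the helper name is hypothetical and the real logic lives in `experiments/experiment.ts`:

```typescript
// Sketch: decide whether setup() should copy skill files into the sandbox.
// EVAL_BASELINE is assumed to be the literal string "true" in baseline mode.
function shouldInjectSkills(env: Record<string, string | undefined>): boolean {
  return env.EVAL_BASELINE !== "true";
}
```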
## Adding Scenarios

- Create `evals/{scenario-name}/` with:
  - `PROMPT.md` — task description for the agent
  - `EVAL.ts` — vitest assertions checking agent output
  - `meta.ts` — export `expectedReferenceFiles` array for scoring
  - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
- Add any starter files the agent should see (they get copied via `setup()`)
- Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
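This guide doesn't show the helper implementations; as an illustration only, a `findMigrationFiles`-style helper might look roughly like this (signature and behavior assumed, not the actual `eval-utils.ts`):

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical sketch: list .sql migration files under a project's
// supabase/migrations directory, sorted by filename.
function findMigrationFiles(projectDir: string): string[] {
  const dir = path.join(projectDir, "supabase", "migrations");
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .sort()
    .map((f) => path.join(dir, f));
}
```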
## Environment

```sh
ANTHROPIC_API_KEY=sk-ant-...   # Required: Claude Code authentication
EVAL_MODEL=...                 # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=...              # Optional: run single scenario
EVAL_BASELINE=true             # Optional: run without skills
BRAINTRUST_API_KEY=...         # Required for eval:upload
BRAINTRUST_PROJECT_ID=...      # Required for eval:upload
```
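A quick sanity check of these variables before a run could look like the sketch below. The variable names and the default model come from the table above; the function and interface names are hypothetical:

```typescript
interface EvalSettings {
  model: string;
  scenario?: string;
  baseline: boolean;
}

// Sketch: resolve eval settings from the environment.
// ANTHROPIC_API_KEY is the one hard requirement for a normal run.
function resolveEvalSettings(env: Record<string, string | undefined>): EvalSettings {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is required for Claude Code authentication");
  }
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-6",
    scenario: env.EVAL_SCENARIO,
    baseline: env.EVAL_BASELINE === "true",
  };
}
```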
## Docker Evals

Build and run evals inside Docker (e.g., for CI):

```sh
mise run eval:docker:build   # Build the eval Docker image
mise run eval:docker         # Run evals in Docker
mise run eval:docker:shell   # Debug shell in eval container
```