*Mirror of https://github.com/supabase/agent-skills.git, synced 2026-03-27 10:09:26 +08:00.*
# Evals — Agent Guide

This package evaluates whether AI agents correctly implement Supabase tasks
when using skill documentation. It is built on
[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
self-contained scenario with a task prompt, the agent works in a Docker sandbox,
and hidden vitest assertions check the result. Scoring is binary pass/fail.

## Architecture

```
1. eval.sh starts Supabase, exports keys
2. agent-eval reads experiments/experiment.ts
3. For each scenario:
   a. setup() resets DB, writes config + skills into Docker sandbox
   b. Agent (Claude Code) runs PROMPT.md in the sandbox
   c. EVAL.ts (vitest) asserts against agent output
4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
5. Optional: upload.ts pushes results to Braintrust
```
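
The config that drives step 2 lives in `experiments/experiment.ts`. As a rough sketch of the shape such a config might take — the field names and types below are illustrative assumptions, not the actual `@vercel/agent-eval` API:

```typescript
// Illustrative sketch only — the real @vercel/agent-eval config types
// may differ; see the library's documentation for the actual API.
interface ExperimentConfig {
  agent: string; // which agent harness to run, e.g. "claude-code"
  model: string; // overridable via EVAL_MODEL
  setup: (sandboxDir: string) => Promise<void>; // per-scenario reset hook
}

const config: ExperimentConfig = {
  agent: "claude-code",
  model: process.env.EVAL_MODEL ?? "claude-sonnet-4-6",
  // setup() resets the DB and seeds config + skills into the sandbox
  setup: async (sandboxDir: string) => {
    console.log(`seeding sandbox at ${sandboxDir}`);
  },
};
```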

The agent is **Claude Code** running inside a Docker sandbox managed by
`@vercel/agent-eval`. It operates on a real filesystem and can read and write
files freely.

## File Structure

```
packages/evals/
  experiments/
    experiment.ts      # ExperimentConfig — agent, sandbox, setup() hook
  scripts/
    eval.sh            # Supabase lifecycle wrapper (start → eval → stop)
  src/
    upload.ts          # Standalone Braintrust result uploader
  evals/
    eval-utils.ts      # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
    {scenario}/
      PROMPT.md        # Task description (visible to agent)
      EVAL.ts          # Vitest assertions (hidden from agent during run)
      meta.ts          # expectedReferenceFiles for scoring
      package.json     # Minimal manifest with vitest devDep
  project/
    supabase/
      config.toml      # Shared Supabase config seeded into each sandbox
  scenarios/           # Workflow scenario proposals
  results/             # Output from eval runs (gitignored)
```

## Running Evals

```bash
# Run all scenarios with skills
mise run eval

# Force re-run (bypass source caching)
mise run --force eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

# Run without skills (baseline)
EVAL_BASELINE=true mise run eval

# Dry run (no API calls)
mise run eval:dry

# Upload results to Braintrust
mise run eval:upload
```

## Baseline Mode

Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By
default, skill files from `skills/supabase/` are written into the sandbox.

Compare with-skills vs. baseline:

```bash
mise run eval                      # with skills
EVAL_BASELINE=true mise run eval   # without skills (baseline)
```

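Because each scenario is scored binary pass/fail, comparing a with-skills run against a baseline reduces to comparing pass rates. A minimal sketch — the `RunResult` shape and the `storage-upload` scenario name are made up for illustration; real run output lives under `results/`:

```typescript
// Illustrative only: aggregate binary pass/fail results and compare
// a with-skills run against a baseline run.
type RunResult = { scenario: string; pass: boolean };

function passRate(results: RunResult[]): number {
  if (results.length === 0) return 0;
  return results.filter((r) => r.pass).length / results.length;
}

const withSkills: RunResult[] = [
  { scenario: "auth-rls-new-project", pass: true },
  { scenario: "storage-upload", pass: true }, // hypothetical scenario
];
const baseline: RunResult[] = [
  { scenario: "auth-rls-new-project", pass: false },
  { scenario: "storage-upload", pass: true },
];

console.log(`skills: ${passRate(withSkills)}, baseline: ${passRate(baseline)}`);
```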
## Adding Scenarios

1. Create `evals/{scenario-name}/` with:
   - `PROMPT.md` — task description for the agent
   - `EVAL.ts` — vitest assertions checking agent output
   - `meta.ts` — export an `expectedReferenceFiles` array for scoring
   - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
2. Add any starter files the agent should see (they are copied in via `setup()`).
3. Write assertions using helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`).

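For orientation, here is a hypothetical re-implementation of the two helpers named above — the real signatures in `eval-utils.ts` may differ, so treat this as a sketch of the idea, not the actual code:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch: assumes the standard Supabase layout, where migrations live
// under supabase/migrations/ in the agent's project directory.
function findMigrationFiles(projectDir: string): string[] {
  const dir = path.join(projectDir, "supabase", "migrations");
  if (!fs.existsSync(dir)) return [];
  return fs
    .readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .sort()
    .map((f) => path.join(dir, f));
}

// Concatenate all migration SQL so assertions can match against it.
function getMigrationSQL(projectDir: string): string {
  return findMigrationFiles(projectDir)
    .map((f) => fs.readFileSync(f, "utf8"))
    .join("\n");
}

// In EVAL.ts these would sit inside a vitest test, e.g.
// expect(getMigrationSQL(dir)).toMatch(/enable row level security/i);
```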
## Environment

```
ANTHROPIC_API_KEY=sk-ant-...   # Required: Claude Code authentication
EVAL_MODEL=...                 # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=...              # Optional: run a single scenario
EVAL_BASELINE=true             # Optional: run without skills
BRAINTRUST_API_KEY=...         # Required for eval:upload
BRAINTRUST_PROJECT_ID=...      # Required for eval:upload
```

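A preflight check along these lines can catch a missing variable before a long run — a sketch using the variable names above, not part of the package:

```typescript
// Sketch: fail fast when required env vars are missing.
function missingEnvVars(
  env: Record<string, string | undefined>,
  upload = false
): string[] {
  const required = ["ANTHROPIC_API_KEY"];
  if (upload) required.push("BRAINTRUST_API_KEY", "BRAINTRUST_PROJECT_ID");
  return required.filter((name) => !env[name]);
}

const missing = missingEnvVars(process.env, false);
if (missing.length > 0) {
  console.error(`Missing required env vars: ${missing.join(", ")}`);
}
```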
## Docker Evals

Build and run evals inside Docker (e.g., for CI):

```bash
mise run eval:docker:build   # Build the eval Docker image
mise run eval:docker         # Run evals in Docker
mise run eval:docker:shell   # Open a debug shell in the eval container
```