mirror of https://github.com/supabase/agent-skills.git synced 2026-03-27 10:09:26 +08:00

Files

Pedro Rodrigues 3c3d1f55ca containerize eval environment with Docker and mock CLIs

Host now only needs Docker + ANTHROPIC_API_KEY to run evals. Adds
multi-stage Dockerfile, mock supabase/docker/psql scripts, entrypoint,
docker-compose for local use, and switches CI to Docker-based execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

2026-02-23 19:22:47 +00:00

Name                  Last commit                                              Date
evals                 realtime scenario                                        2026-02-23 10:25:50 +00:00
mocks                 containerize eval environment with Docker and mock CLIs  2026-02-23 19:22:47 +00:00
scenarios             realtime scenario                                        2026-02-23 10:25:50 +00:00
src                   containerize eval environment with Docker and mock CLIs  2026-02-23 19:22:47 +00:00
.env.example          workflow evals with one scenario                         2026-02-19 17:06:17 +00:00
AGENTS.md             load skills through skills CLI                           2026-02-20 17:41:41 +00:00
CLAUDE.md             initial skills evals                                     2026-02-18 12:02:28 +00:00
docker-compose.yml    containerize eval environment with Docker and mock CLIs  2026-02-23 19:22:47 +00:00
docker-entrypoint.sh  containerize eval environment with Docker and mock CLIs  2026-02-23 19:22:47 +00:00
Dockerfile            containerize eval environment with Docker and mock CLIs  2026-02-23 19:22:47 +00:00
package-lock.json     load skills through skills CLI                           2026-02-20 17:41:41 +00:00
package.json          load skills through skills CLI                           2026-02-20 17:41:41 +00:00
README.md             workflow evals with one scenario                         2026-02-19 17:06:17 +00:00
tsconfig.json         workflow evals with one scenario                         2026-02-19 17:06:17 +00:00

README.md

Evals

Agent evaluation system for Supabase skills. Tests whether AI agents (starting with Claude Code) correctly implement Supabase tasks when given access to skill documentation.

How It Works

Each eval is a self-contained project directory with a task prompt. The agent works on it autonomously, then hidden vitest assertions check the result. The outcome is binary: pass or fail.

1. Create temp workspace from eval template
2. Agent (claude -p) reads prompt and creates files
3. Hidden EVAL.ts runs vitest assertions against the output
4. Pass/fail
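
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: `prepareWorkspace` and `runScenario` are hypothetical names, and the sketch assumes that the template directory holds PROMPT.md and EVAL.ts side by side, that `claude` and `npx` are on the PATH, and that the workspace's vitest configuration picks up EVAL.ts.

```typescript
import { spawnSync } from "node:child_process";
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Step 1 (hypothetical helper): copy the eval template into a fresh temp
// directory, leaving out EVAL.ts so the agent never sees the assertions.
export function prepareWorkspace(templateDir: string): string {
  const workspace = fs.mkdtempSync(path.join(os.tmpdir(), "eval-"));
  fs.cpSync(templateDir, workspace, {
    recursive: true,
    filter: (src) => path.basename(src) !== "EVAL.ts",
  });
  return workspace;
}

// Steps 2-4 (hypothetical helper): run the agent, reveal the hidden
// assertions, and reduce the vitest exit status to pass/fail.
export function runScenario(templateDir: string): boolean {
  const workspace = prepareWorkspace(templateDir);
  const prompt = fs.readFileSync(path.join(templateDir, "PROMPT.md"), "utf8");

  // Step 2: the agent works inside the workspace in non-interactive mode.
  spawnSync("claude", ["-p", prompt], { cwd: workspace, stdio: "inherit" });

  // Step 3: copy EVAL.ts in only after the agent has finished.
  fs.copyFileSync(
    path.join(templateDir, "EVAL.ts"),
    path.join(workspace, "EVAL.ts"),
  );
  // Assumption: the workspace's vitest config includes EVAL.ts.
  const result = spawnSync("npx", ["vitest", "run"], {
    cwd: workspace,
    stdio: "inherit",
  });

  // Step 4: binary outcome.
  return result.status === 0;
}
```

Keeping EVAL.ts out of the workspace until the agent exits is what makes the assertions "hidden" in this sketch; the real harness may achieve that differently.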

Usage

# Run all scenarios
mise run eval

# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval

# Run with baseline comparison (with-skill vs without-skill)
EVAL_BASELINE=true mise run eval

# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval

Environment Variables

ANTHROPIC_API_KEY         Required: Claude Code authentication
EVAL_MODEL                Override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO             Run single scenario by name
EVAL_BASELINE=true        Also run each scenario without the skill, for comparison
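
One way these variables might be read into a typed configuration is sketched below. The `EvalConfig` shape and the `loadConfig` function are hypothetical and not taken from the repository; only the variable names and the default model come from the table above.

```typescript
// Hypothetical config shape; field names are illustrative only.
interface EvalConfig {
  apiKey: string;
  model: string;
  scenario?: string; // undefined means "run all scenarios"
  baseline: boolean;
}

// Default model as listed in the table above.
const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";

// Takes the environment as a parameter so it can be exercised without
// touching the real process environment.
export function loadConfig(
  env: Record<string, string | undefined>,
): EvalConfig {
  const apiKey = env.ANTHROPIC_API_KEY;
  if (!apiKey) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }
  return {
    apiKey,
    model: env.EVAL_MODEL || DEFAULT_MODEL,
    scenario: env.EVAL_SCENARIO || undefined,
    // Assumption: only the literal string "true" enables baseline runs.
    baseline: env.EVAL_BASELINE === "true",
  };
}
```

In a Node script this would typically be called as `loadConfig(process.env)`.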

Adding Scenarios

1. Create evals/{name}/ with PROMPT.md, EVAL.ts, and starter files
2. Write vitest assertions in EVAL.ts
3. Document in scenarios/SCENARIOS.md
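
Before running a new scenario, a quick check that its directory contains the required files can catch mistakes early. The helper below is a hypothetical sketch, not part of the repository; it only checks for the two files named above and says nothing about starter files.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Files every scenario directory is expected to contain, per the steps above.
const REQUIRED_FILES = ["PROMPT.md", "EVAL.ts"];

// Hypothetical helper: returns the names of required files that are missing
// from the given scenario directory (an empty array means it looks complete).
export function missingScenarioFiles(scenarioDir: string): string[] {
  return REQUIRED_FILES.filter(
    (name) => !fs.existsSync(path.join(scenarioDir, name)),
  );
}
```

For example, `missingScenarioFiles("evals/my-scenario")` would return `["EVAL.ts"]` if only PROMPT.md had been written so far (`my-scenario` being a placeholder name).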

See AGENTS.md for full details.