Evals
Agent evaluation system for Supabase skills. Tests whether AI agents (starting with Claude Code) correctly implement Supabase tasks when given access to skill documentation.
How It Works
Each eval is a self-contained project directory with a task prompt. The agent works on it autonomously, then hidden vitest assertions check the result. Binary pass/fail.
1. Create temp workspace from eval template
2. Agent (claude -p) reads prompt and creates files
3. Hidden EVAL.ts runs vitest assertions against the output
4. Pass/fail
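The hidden assertions in step 3 are ultimately file checks against the agent's output. A minimal sketch of that kind of check, using plain node:fs instead of vitest's expect (the passes helper and its arguments are hypothetical, not the repo's actual API):

```typescript
import { existsSync, readFileSync } from "node:fs";

// Returns true when the agent produced the file and it contains every
// required snippet -- the binary pass/fail described above.
export function passes(file: string, required: string[]): boolean {
  if (!existsSync(file)) return false;
  const src = readFileSync(file, "utf8");
  return required.every((snippet) => src.includes(snippet));
}
```

A real EVAL.ts would wrap checks like this in vitest's describe/it blocks so each assertion reports individually.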
Usage
# Run all scenarios
mise run eval
# Run a specific scenario
EVAL_SCENARIO=auth-rls-new-project mise run eval
# Run with baseline comparison (with-skill vs without-skill)
EVAL_BASELINE=true mise run eval
# Override model
EVAL_MODEL=claude-opus-4-6 mise run eval
Environment Variables
ANTHROPIC_API_KEY Required: Claude Code authentication
EVAL_MODEL Override model (default: claude-sonnet-4-5-20250929)
EVAL_SCENARIO Run single scenario by name
EVAL_BASELINE=true Run baseline comparison (no skill)
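A sketch of how a runner might fold these variables into a single config object (the readConfig helper and EvalConfig shape are assumptions for illustration, not the actual runner's API):

```typescript
export interface EvalConfig {
  model: string;
  scenario?: string;
  baseline: boolean;
}

// Resolve runner configuration from the environment variables above.
// Defaults mirror the table: EVAL_MODEL falls back to the pinned model,
// EVAL_SCENARIO is optional, EVAL_BASELINE must be the literal "true".
export function readConfig(
  env: Record<string, string | undefined> = process.env,
): EvalConfig {
  if (!env.ANTHROPIC_API_KEY) {
    throw new Error("ANTHROPIC_API_KEY is required");
  }
  return {
    model: env.EVAL_MODEL ?? "claude-sonnet-4-5-20250929",
    scenario: env.EVAL_SCENARIO,
    baseline: env.EVAL_BASELINE === "true",
  };
}
```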
Adding Scenarios
- Create evals/{name}/ with PROMPT.md, EVAL.ts, and starter files
- Write vitest assertions in EVAL.ts
- Document the scenario in scenarios/SCENARIOS.md
See AGENTS.md for full details.