# Evals — Agent Guide
This package evaluates whether LLMs correctly apply Supabase best practices using skill documentation as context. It uses Braintrust for eval orchestration and the Vercel AI SDK for LLM calls.
## Architecture

Two-step LLM-as-judge pattern powered by Braintrust's `Eval()`:

- The eval model (default: `claude-sonnet-4-5-20250929`) receives a prompt with skill context and produces a code fix.
- Three independent judge scorers (default: `claude-opus-4-6`) evaluate the fix via structured output (Zod schemas via the AI SDK's `Output.object()`).
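The structured verdict each judge returns can be pictured roughly like this. This is an illustrative sketch only: the `JudgeVerdict` interface and `clampScore` helper are assumptions for exposition, not the actual types defined in `scorer.ts`.

```typescript
// Hypothetical shape of one judge's structured verdict (assumption:
// the real Zod schema in scorer.ts may carry different field names).
interface JudgeVerdict {
  score: number;     // expected to land in the 0-1 range
  rationale: string; // judge's explanation for the score
}

// Clamp a raw model-produced number into the 0-1 range the scorers use.
function clampScore(raw: number): number {
  return Math.min(1, Math.max(0, raw));
}

const verdict: JudgeVerdict = {
  score: clampScore(1.3),
  rationale: "Fix addresses the core issue.",
};
console.log(verdict.score); // 1
```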
Key files:

```
src/
  code-fix.eval.ts   # Braintrust Eval() entry point
  dataset.ts         # Maps extracted test cases to EvalCase format
  scorer.ts          # Three AI SDK-based scorers (Correctness, Completeness, Best Practice)
  models.ts          # Model provider factory (Anthropic / OpenAI)
  dataset/
    types.ts         # CodeFixTestCase interface
    extract.ts       # Auto-extracts test cases from skill references
  prompts/
    code-fix.ts      # System + user prompts for the eval model
```
## How It Works

Test cases are auto-extracted from `skills/*/references/*.md`. The extractor (`dataset/extract.ts`) finds consecutive `**Incorrect:**` / `**Correct:**` code block pairs under `##` sections. Each pair becomes one test case.
Three independent scorers evaluate each fix (0–1 scale):
- Correctness — does the fix address the core issue?
- Completeness — does the fix include all necessary changes?
- Best Practice — does the fix follow Supabase conventions?
Braintrust aggregates the scores and provides a dashboard for tracking regressions over time.
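Conceptually, each test case ends up with one score per dimension. A plain average is one way to summarize them (shown here as an illustrative sketch; it is not necessarily how Braintrust aggregates internally):

```typescript
// Three independent dimension scores for one test case. The dimension
// names come from this guide; the averaging is illustrative only.
const scores: Record<string, number> = {
  Correctness: 1.0,
  Completeness: 0.5,
  BestPractice: 1.0,
};

const values = Object.values(scores);
const mean = values.reduce((sum, s) => sum + s, 0) / values.length;
console.log(mean.toFixed(2)); // "0.83"
```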
## Adding Test Cases
No code changes needed. Add paired Incorrect/Correct blocks to any skill reference file. The extractor picks them up automatically.
Required format in a reference `.md` file:

````markdown
## Section Title

Explanation of the issue.

**Incorrect:**

```sql
-- bad code
```

**Correct:**

```sql
-- good code
```
````
Rules:

- Pairs must be consecutive — an Incorrect block immediately followed by a Correct block
- Labels are matched case-insensitively. Labels that mark the incorrect block: `Incorrect`, `Wrong`, `Bad`. Labels that mark the correct block: `Correct`, `Good`, `Usage`, `Implementation`, `Example`, `Recommended`
- An optional parenthetical in the label becomes the `description` field: `**Incorrect (missing RLS):**`
- Files prefixed with `_` (like `_sections.md`, `_template.md`) are skipped
- Each pair gets an ID like `supabase/db-rls-mandatory#0` (`skill/filename#index`)
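The pairing rules above can be sketched as a small extractor. This is an illustrative approximation, not the actual code in `dataset/extract.ts` — the regex, label lists, and ID format below simply restate the rules listed here:

```typescript
interface ExtractedCase {
  id: string;           // skill/filename#index
  description?: string; // optional parenthetical from the label
  incorrect: string;
  correct: string;
}

// Labels that mark the incorrect vs. correct block (case-insensitive).
const BAD = /^(incorrect|wrong|bad)$/i;
const GOOD = /^(correct|good|usage|implementation|example|recommended)$/i;
// A labeled fenced code block, e.g. **Incorrect (missing RLS):** ```sql ... ```
const BLOCK = /\*\*(\w+)(?:\s*\(([^)]+)\))?:\*\*\s*```\w*\n([\s\S]*?)```/g;

function extractCases(markdown: string, skill: string, file: string): ExtractedCase[] {
  const blocks = [...markdown.matchAll(BLOCK)].map((m) => ({
    label: m[1],
    description: m[2],
    code: m[3].trim(),
  }));
  const cases: ExtractedCase[] = [];
  // Pairs must be consecutive: a "bad" block immediately followed by a "good" one.
  for (let i = 0; i + 1 < blocks.length; i++) {
    if (BAD.test(blocks[i].label) && GOOD.test(blocks[i + 1].label)) {
      cases.push({
        id: `${skill}/${file}#${cases.length}`,
        description: blocks[i].description,
        incorrect: blocks[i].code,
        correct: blocks[i + 1].code,
      });
    }
  }
  return cases;
}
```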
## Modifying Prompts

- `src/prompts/code-fix.ts` — what the eval model sees
- `src/scorer.ts` — judge prompts for each scorer dimension

Temperature settings:

- Eval model: `0.2` (in `code-fix.eval.ts`)
- Judge model: `0.1` (in `scorer.ts`)
## Modifying Scoring

Each scorer in `src/scorer.ts` is independent. To add a new dimension:

1. Create a new `EvalScorer` function in `scorer.ts`
2. Add it to the `scores` array in `code-fix.eval.ts`
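A new dimension follows the same shape as the existing three. As a hedged sketch: the exact `EvalScorer` type comes from Braintrust and the real scorers call a judge model, so the argument shape here is an assumption and the LLM call is replaced by a trivial heuristic:

```typescript
// Assumed argument shape for a scorer (the real scorers in scorer.ts
// call a judge model; this stand-in uses a simple length heuristic).
interface ScorerArgs {
  input: string;    // the code to fix
  output: string;   // the model-produced fix
  expected: string; // the reference "Correct" block
}

function conciseness({ output, expected }: ScorerArgs): { name: string; score: number } {
  // Trivial heuristic: penalize fixes much longer than the reference.
  const ratio = expected.length / Math.max(output.length, 1);
  return { name: "Conciseness", score: Math.min(1, ratio) };
}

console.log(conciseness({ input: "", output: "SELECT 1;", expected: "SELECT 1;" }).score); // 1
```

The new function would then be appended to the `scores` array in `code-fix.eval.ts` alongside the existing three.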
## Running Evals

```sh
# Run locally (no Braintrust upload)
mise run eval

# Run and upload to Braintrust dashboard
mise run eval:upload
```

Or directly:

```sh
cd packages/evals

# Local run
npx braintrust eval --no-send-logs src/code-fix.eval.ts

# Upload to Braintrust
npx braintrust eval src/code-fix.eval.ts
```
In CI, evals run via `braintrustdata/eval-action@v1` and are gated by the `run-evals` PR label.
## Environment

API keys are loaded by mise from `packages/evals/.env` (configured in the root `mise.toml`). Copy `.env.example` to `.env` and fill in the keys.
```sh
ANTHROPIC_API_KEY=sk-ant-...   # Required: eval model + judge model
BRAINTRUST_API_KEY=...         # Required for upload to the Braintrust dashboard
BRAINTRUST_PROJECT_ID=...      # Required for upload to the Braintrust dashboard
```

Optional overrides:

```sh
EVAL_MODEL=claude-sonnet-4-5-20250929   # Model under test
EVAL_JUDGE_MODEL=claude-opus-4-6        # Judge model for scorers
```