use agent-evals package

This commit is contained in:
Pedro Rodrigues
2026-02-27 15:32:55 +00:00
parent 0894f5683e
commit 9c6fd293eb
61 changed files with 4208 additions and 4652 deletions

View File

@@ -46,14 +46,19 @@ sources = ["test/**", "skills/**"]
# ── Eval tasks ────────────────────────────────────────────────────────
[tasks.eval]
description = "Run workflow evals"
run = "npm --prefix packages/evals run eval"
sources = ["packages/evals/src/**", "packages/evals/evals/**"]
description = "Run workflow evals (use -- to pass args, e.g. mise run eval -- --skill supabase --scenario rls-update-needs-select)"
run = "bash packages/evals/scripts/eval.sh"
sources = ["packages/evals/evals/**", "packages/evals/experiments/**"]
[tasks."eval:dry"]
description = "Dry run workflow evals (no API calls)"
run = "npm --prefix packages/evals run eval:dry"
sources = ["packages/evals/evals/**", "packages/evals/experiments/**"]
[tasks."eval:upload"]
description = "Run workflow evals and upload to Braintrust"
description = "Upload eval results to Braintrust"
run = "npm --prefix packages/evals run eval:upload"
sources = ["packages/evals/src/**", "packages/evals/evals/**"]
sources = ["packages/evals/results/**"]
# ── Docker eval tasks ────────────────────────────────────────────────
@@ -71,7 +76,6 @@ docker run --rm \
-e EVAL_SCENARIO \
-e EVAL_BASELINE \
-e EVAL_SKILL \
-e BRAINTRUST_UPLOAD \
-e BRAINTRUST_API_KEY \
-e BRAINTRUST_PROJECT_ID \
-e EVAL_RESULTS_DIR=/app/results \

View File

@@ -1,57 +1,56 @@
# Evals — Agent Guide
This package evaluates whether AI agents correctly implement Supabase tasks
when using skill documentation. Modeled after
[Vercel's next-evals-oss](https://github.com/vercel-labs/next-evals-oss): each
eval is a self-contained project with a task prompt, the agent works on it, and
hidden tests check the result. Binary pass/fail.
when using skill documentation. Built on
[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
self-contained scenario with a task prompt, the agent works in a Docker sandbox,
and hidden vitest assertions check the result. Binary pass/fail.
## Architecture
```
1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
2. Install skills via `skills add` CLI (or skip for baseline)
3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
4. Agent reads skill, creates migrations/code in the workspace
5. Copy hidden EVAL.ts into workspace, run vitest
6. Capture pass/fail
1. eval.sh starts Supabase, exports keys
2. agent-eval reads experiments/experiment.ts
3. For each scenario:
a. setup() resets DB, writes config + skills into Docker sandbox
b. Agent (Claude Code) runs PROMPT.md in the sandbox
c. EVAL.ts (vitest) asserts against agent output
4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
5. Optional: upload.ts pushes results to Braintrust
```
The agent is **Claude Code** invoked via `claude -p` (print mode). It operates
on a real filesystem in a temp directory and can read/write files freely.
The agent is **Claude Code** running inside a Docker sandbox managed by
`@vercel/agent-eval`. It operates on a real filesystem and can read/write files
freely.
**Important**: MCP servers are disabled via `--strict-mcp-config` with an empty
config. This ensures the agent uses only local tools (Bash, Edit, Write, Read,
Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All
work must happen on the local filesystem — e.g., creating migration files in
`supabase/migrations/`, not applying them to a remote project.
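Assuming Claude Code's standard MCP config shape (a `mcpServers` map), the empty config passed alongside `--strict-mcp-config` amounts to:

```json
{
  "mcpServers": {}
}
```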
## Eval Structure
Each eval lives in `evals/{scenario-name}/`:
## File Structure
```
evals/auth-rls-new-project/
packages/evals/
experiments/
experiment.ts # ExperimentConfig — agent, sandbox, setup() hook
scripts/
eval.sh # Supabase lifecycle wrapper (start → eval → stop)
src/
upload.ts # Standalone Braintrust result uploader
evals/
eval-utils.ts # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
{scenario}/
PROMPT.md # Task description (visible to agent)
EVAL.ts # Vitest assertions (hidden from agent during run)
package.json # Minimal project manifest
meta.ts # expectedReferenceFiles for scoring
package.json # Minimal manifest with vitest devDep
project/
supabase/
config.toml # Pre-initialized supabase config
migrations/ # Empty — agent creates files here
config.toml # Shared Supabase config seeded into each sandbox
scenarios/ # Workflow scenario proposals
results/ # Output from eval runs (gitignored)
```
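The shared helpers boil down to small filesystem utilities. A plausible sketch of the two most-used ones, assuming migrations land in `supabase/migrations/` under the sandbox cwd (the real `eval-utils.ts` may differ in detail):

```typescript
// Hypothetical sketch of two eval-utils.ts helpers; not the actual source.
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// All .sql files the agent wrote, sorted so timestamp-prefixed
// migrations come back in apply order.
export function findMigrationFiles(): string[] {
  const dir = join(process.cwd(), "supabase", "migrations");
  if (!existsSync(dir)) return [];
  return readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .sort()
    .map((f) => join(dir, f));
}

// Concatenated migration SQL, so assertions can regex-match one string.
export function getMigrationSQL(): string {
  return findMigrationFiles()
    .map((f) => readFileSync(f, "utf-8"))
    .join("\n");
}
```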
**EVAL.ts** is not copied into the workspace until after the agent finishes.
This prevents the agent from reading the assertions and gaming the test.
## Running Evals
Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
source files haven't changed. Use `--force` to bypass this caching when you
need to re-run anyway (e.g., after changing environment variables, or to
repeat the same scenario):
```bash
# Run all scenarios with skills (default)
# Run all scenarios with skills
mise run eval
# Force re-run (bypass source caching)
@@ -66,64 +65,52 @@ EVAL_MODEL=claude-opus-4-6 mise run eval
# Run without skills (baseline)
EVAL_BASELINE=true mise run eval
# Install only a specific skill
EVAL_SKILL=supabase mise run eval
# Dry run (no API calls)
mise run eval:dry
# Upload results to Braintrust
mise run eval:upload
# Force upload (bypass cache)
mise run --force eval:upload
```
## Baseline Mode
Set `EVAL_BASELINE=true` to run scenarios **without** skills. By default,
scenarios run with skills installed via the `skills` CLI.
Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By
default, skill files from `skills/supabase/` are written into the sandbox.
To compare with-skill vs baseline, run evals twice:
Compare with-skill vs baseline:
```bash
mise run eval # with skills
EVAL_BASELINE=true mise run eval # without skills (baseline)
```
Compare the results to measure how much skills improve agent output.
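The baseline switch can be sketched as a plain env check inside the `setup()` hook. A hypothetical sketch (`skillsToInstall` is an illustrative name, not a real export of `experiment.ts`):

```typescript
// Hypothetical: how setup() might decide which skills to inject.
// EVAL_BASELINE and EVAL_SKILL are the env vars documented in this guide.
type Env = Record<string, string | undefined>;

export function skillsToInstall(env: Env): string[] {
  if (env.EVAL_BASELINE === "true") return []; // baseline: no skills injected
  if (env.EVAL_SKILL) return [env.EVAL_SKILL]; // e.g. EVAL_SKILL=supabase
  return ["supabase"];                         // default: full skill set
}
```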
## Adding Scenarios
1. Create `evals/{scenario-name}/` with `PROMPT.md`, `EVAL.ts`, `package.json`
2. Add any starter files the agent should see (e.g., `supabase/config.toml`)
3. Write vitest assertions in `EVAL.ts` that check the agent's output files
4. Document the scenario in `scenarios/SCENARIOS.md`
1. Create `evals/{scenario-name}/` with:
- `PROMPT.md` — task description for the agent
- `EVAL.ts` — vitest assertions checking agent output
- `meta.ts` — export `expectedReferenceFiles` array for scoring
- `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
2. Add any starter files the agent should see (they get copied via `setup()`)
3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
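Each assertion is ultimately a regex over the concatenated migration SQL. A standalone sketch of the RLS check several scenarios use (the parameterized helper is illustrative, not part of `eval-utils.ts`):

```typescript
// Standalone version of the RLS assertion pattern the EVAL.ts files use;
// the table name is a parameter here purely for illustration.
export function rlsEnabledOn(table: string, sql: string): boolean {
  const re = new RegExp(
    `alter\\s+table[\\s\\S]*?${table}[\\s\\S]*?enable\\s+row\\s+level\\s+security`,
  );
  return re.test(sql.toLowerCase());
}
```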
## Environment
```
ANTHROPIC_API_KEY=sk-ant-... # Required: Claude Code authentication
EVAL_MODEL=... # Optional: override model (default: claude-sonnet-4-5-20250929)
EVAL_MODEL=... # Optional: override model (default: claude-sonnet-4-6)
EVAL_SCENARIO=... # Optional: run single scenario
EVAL_SKILL=... # Optional: install only this skill (e.g., "supabase")
EVAL_BASELINE=true # Optional: run without skills (baseline mode)
BRAINTRUST_UPLOAD=true # Optional: upload results to Braintrust
BRAINTRUST_API_KEY=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_PROJECT_ID=... # Required when BRAINTRUST_UPLOAD=true
BRAINTRUST_BASE_EXPERIMENT=... # Optional: compare against a named experiment
EVAL_BASELINE=true # Optional: run without skills
BRAINTRUST_API_KEY=... # Required for eval:upload
BRAINTRUST_PROJECT_ID=... # Required for eval:upload
```
## Key Files
## Docker Evals
```
src/
runner.ts # Main orchestrator
types.ts # Core interfaces
runner/
scaffold.ts # Creates temp workspace from eval template
agent.ts # Invokes claude -p as subprocess
test.ts # Runs vitest EVAL.ts against workspace
results.ts # Collects results and prints summary
evals/
auth-rls-new-project/ # Scenario 1
scenarios/
SCENARIOS.md # Scenario descriptions
Build and run evals inside Docker (e.g., for CI):
```bash
mise run eval:docker:build # Build the eval Docker image
mise run eval:docker # Run evals in Docker
mise run eval:docker:shell # Debug shell in eval container
```

View File

@@ -1,77 +1,68 @@
export const expectedReferenceFiles = [
"db-schema-auth-fk.md",
"db-security-functions.md",
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
];
import { expect, test } from "vitest";
import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [
{
name: "migration file exists",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates profiles table",
check: () => {
test("creates profiles table", () => {
const sql = getMigrationSQL().toLowerCase();
return /create\s+table/.test(sql) && /profiles/.test(sql);
},
},
{
name: "FK references auth.users",
check: () =>
/references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
},
{
name: "ON DELETE CASCADE present",
check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
},
{
name: "RLS enabled on profiles",
check: () =>
expect(/create\s+table/.test(sql) && /profiles/.test(sql)).toBe(true);
});
test("FK references auth.users", () => {
expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
true,
);
});
test("ON DELETE CASCADE present", () => {
expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
true,
);
});
test("RLS enabled on profiles", () => {
expect(
/alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "trigger function uses SECURITY DEFINER",
check: () => /security\s+definer/.test(getMigrationSQL().toLowerCase()),
},
{
name: "trigger function sets search_path",
check: () =>
).toBe(true);
});
test("trigger function uses SECURITY DEFINER", () => {
expect(/security\s+definer/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("trigger function sets search_path", () => {
expect(
/set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
},
{
name: "trigger created on auth.users",
check: () =>
).toBe(true);
});
test("trigger created on auth.users", () => {
expect(
/create\s+trigger[\s\S]*?on\s+auth\.users/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "policies scoped to authenticated",
check: () => {
).toBe(true);
});
test("policies scoped to authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return (
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "overall quality: demonstrates Supabase best practices",
check: () => {
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("overall quality: demonstrates Supabase best practices", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
/alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
/security\s+definer/.test(sql),
/set\s+search_path\s*=\s*''/.test(sql),
@@ -79,7 +70,5 @@ export const assertions: EvalAssertion[] = [
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
];
return signals.filter(Boolean).length >= 5;
},
},
];
expect(signals.filter(Boolean).length >= 5).toBe(true);
});

View File

@@ -0,0 +1,6 @@
export const expectedReferenceFiles = [
"db-schema-auth-fk.md",
"db-security-functions.md",
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "auth-fk-cascade-delete",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -1,16 +1,6 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"db-rls-mandatory.md",
"db-rls-policy-types.md",
"db-rls-common-mistakes.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-migrations-idempotent.md",
];
import { existsSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
import {
anonSeeesNoRows,
@@ -19,43 +9,42 @@ import {
getSupabaseDir,
queryTable,
tableExists,
} from "../eval-utils.ts";
} from "./eval-utils.ts";
export const assertions: EvalAssertion[] = [
{
name: "supabase project initialized (config.toml exists)",
check: () => existsSync(join(getSupabaseDir(), "config.toml")),
},
{
name: "migration file exists in supabase/migrations/",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates tasks table",
check: () => {
test("supabase project initialized (config.toml exists)", () => {
expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
});
test("migration file exists in supabase/migrations/", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
test("creates tasks table", () => {
const sql = getMigrationSQL().toLowerCase();
return /create\s+table/.test(sql) && /tasks/.test(sql);
},
},
{
name: "enables RLS on tasks table",
check: () =>
expect(/create\s+table/.test(sql) && /tasks/.test(sql)).toBe(true);
});
test("enables RLS on tasks table", () => {
expect(
/alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "has foreign key to auth.users",
check: () =>
/references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
},
{
name: "uses ON DELETE CASCADE for auth FK",
check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
},
{
name: "uses (select auth.uid()) not bare auth.uid() in policies",
check: () => {
).toBe(true);
});
test("has foreign key to auth.users", () => {
expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
true,
);
});
test("uses ON DELETE CASCADE for auth FK", () => {
expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
true,
);
});
test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
for (const policy of policyBlocks) {
@@ -63,61 +52,53 @@ export const assertions: EvalAssertion[] = [
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
return false;
expect(false).toBe(true);
return;
}
}
return true;
},
},
{
name: "policies use TO authenticated",
check: () => {
expect(true).toBe(true);
});
test("policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return (
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "uses timestamptz not plain timestamp for time columns",
check: () => {
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("uses timestamptz not plain timestamp for time columns", () => {
const rawSql = getMigrationSQL().toLowerCase();
const sql = rawSql.replace(/--[^\n]*/g, "");
const hasPlainTimestamp =
/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
if (
sql.includes("created_at") ||
sql.includes("updated_at") ||
sql.includes("due_date")
) {
return !hasPlainTimestamp.test(sql);
expect(hasPlainTimestamp.test(sql)).toBe(false);
} else {
expect(true).toBe(true);
}
return true;
},
},
{
name: "creates index on user_id column",
check: () => {
});
test("creates index on user_id column", () => {
const sql = getMigrationSQL().toLowerCase();
return /create\s+index/.test(sql) && /user_id/.test(sql);
},
},
{
name: "does not use SERIAL or BIGSERIAL for primary key",
check: () => {
expect(/create\s+index/.test(sql) && /user_id/.test(sql)).toBe(true);
});
test("does not use SERIAL or BIGSERIAL for primary key", () => {
const sql = getMigrationSQL().toLowerCase();
return !/\bserial\b/.test(sql) && !/\bbigserial\b/.test(sql);
},
},
{
name: "migration is idempotent (uses IF NOT EXISTS)",
check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
},
{
name: "overall quality: demonstrates Supabase best practices",
check: () => {
expect(/\bserial\b/.test(sql)).toBe(false);
expect(/\bbigserial\b/.test(sql)).toBe(false);
});
test("migration is idempotent (uses IF NOT EXISTS)", () => {
expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("overall quality: demonstrates Supabase best practices", () => {
const sql = getMigrationSQL().toLowerCase();
const signals = [
/enable\s+row\s+level\s+security/,
@@ -126,25 +107,18 @@ export const assertions: EvalAssertion[] = [
/on\s+delete\s+cascade/,
/create\s+index/,
];
return signals.filter((r) => r.test(sql)).length >= 4;
},
},
{
name: "tasks table exists in the database after migration",
check: () => tableExists("tasks"),
timeout: 10_000,
},
{
name: "tasks table is queryable with service role",
check: async () => {
expect(signals.filter((r) => r.test(sql)).length >= 4).toBe(true);
});
test("tasks table exists in the database after migration", async () => {
expect(await tableExists("tasks")).toBe(true);
}, 10_000);
test("tasks table is queryable with service role", async () => {
const { error } = await queryTable("tasks", "service_role");
return error === null;
},
timeout: 10_000,
},
{
name: "tasks table returns no rows for anon (RLS is active)",
check: () => anonSeeesNoRows("tasks"),
timeout: 10_000,
},
];
expect(error === null).toBe(true);
}, 10_000);
test("tasks table returns no rows for anon (RLS is active)", async () => {
expect(await anonSeeesNoRows("tasks")).toBe(true);
}, 10_000);

View File

@@ -0,0 +1,9 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"db-rls-mandatory.md",
"db-rls-policy-types.md",
"db-rls-common-mistakes.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-migrations-idempotent.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "auth-rls-new-project",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -1,11 +1,6 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"edge-fun-quickstart.md",
];
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
const cwd = process.cwd();
@@ -27,79 +22,72 @@ function getReferenceContent(): string {
return readFileSync(file, "utf-8");
}
export const assertions: EvalAssertion[] = [
{
name: "CLI_REFERENCE.md exists in project root",
check: () => findReferenceFile() !== null,
},
{
name: "no hallucinated functions log command",
check: () => {
test("CLI_REFERENCE.md exists in project root", () => {
expect(findReferenceFile() !== null).toBe(true);
});
test("no hallucinated functions log command", () => {
const content = getReferenceContent();
return (
!/`supabase\s+functions\s+log`/.test(content) &&
!/^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) &&
!/^\s*supabase\s+functions\s+log\b/m.test(content)
);
},
},
{
name: "no hallucinated db query command",
check: () => {
expect(
/`supabase\s+functions\s+log`/.test(content) ||
/^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) ||
/^\s*supabase\s+functions\s+log\b/m.test(content),
).toBe(false);
});
test("no hallucinated db query command", () => {
const content = getReferenceContent();
return (
!/`supabase\s+db\s+query`/.test(content) &&
!/^\s*npx\s+supabase\s+db\s+query\b/m.test(content) &&
!/^\s*supabase\s+db\s+query\b/m.test(content)
);
},
},
{
name: "mentions supabase functions serve for local development",
check: () =>
expect(
/`supabase\s+db\s+query`/.test(content) ||
/^\s*npx\s+supabase\s+db\s+query\b/m.test(content) ||
/^\s*supabase\s+db\s+query\b/m.test(content),
).toBe(false);
});
test("mentions supabase functions serve for local development", () => {
expect(
/supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
},
{
name: "mentions supabase functions deploy",
check: () =>
).toBe(true);
});
test("mentions supabase functions deploy", () => {
expect(
/supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
},
{
name: "mentions psql or SQL Editor or connection string for ad-hoc SQL",
check: () => {
).toBe(true);
});
test("mentions psql or SQL Editor or connection string for ad-hoc SQL", () => {
const content = getReferenceContent().toLowerCase();
return (
expect(
/\bpsql\b/.test(content) ||
/sql\s+editor/.test(content) ||
/connection\s+string/.test(content) ||
/supabase\s+db\s+dump/.test(content)
);
},
},
{
name: "mentions supabase db push or supabase db reset for migrations",
check: () => {
/supabase\s+db\s+dump/.test(content),
).toBe(true);
});
test("mentions supabase db push or supabase db reset for migrations", () => {
const content = getReferenceContent().toLowerCase();
return (
expect(
/supabase\s+db\s+push/.test(content) ||
/supabase\s+db\s+reset/.test(content)
/supabase\s+db\s+reset/.test(content),
).toBe(true);
});
test("mentions supabase start for local stack", () => {
expect(/supabase\s+start/.test(getReferenceContent().toLowerCase())).toBe(
true,
);
},
},
{
name: "mentions supabase start for local stack",
check: () => /supabase\s+start/.test(getReferenceContent().toLowerCase()),
},
{
name: "mentions Dashboard or Logs Explorer for production log viewing",
check: () => {
});
test("mentions Dashboard or Logs Explorer for production log viewing", () => {
const content = getReferenceContent().toLowerCase();
return /\bdashboard\b/.test(content) || /logs\s+explorer/.test(content);
},
},
{
name: "overall quality: uses real CLI commands throughout",
check: () => {
expect(/\bdashboard\b/.test(content) || /logs\s+explorer/.test(content)).toBe(
true,
);
});
test("overall quality: uses real CLI commands throughout", () => {
const content = getReferenceContent().toLowerCase();
const signals = [
/supabase\s+start/,
@@ -122,7 +110,5 @@ export const assertions: EvalAssertion[] = [
const hallucinationMatches = hallucinations.filter((r) =>
r.test(content),
).length;
return positiveMatches >= 5 && hallucinationMatches === 0;
},
},
];
expect(positiveMatches >= 5 && hallucinationMatches === 0).toBe(true);
});

View File

@@ -0,0 +1,4 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"edge-fun-quickstart.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "cli-hallucinated-commands",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -1,71 +1,48 @@
export const expectedReferenceFiles = [
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-security-functions.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-schema-realtime.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
"realtime-setup-auth.md",
"realtime-broadcast-database.md",
"realtime-setup-channels.md",
];
import { expect, test } from "vitest";
import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [
{
name: "migration file exists",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates rooms table",
check: () =>
test("creates rooms table", () => {
expect(
/create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
},
{
name: "creates room_members table",
check: () => {
).toBe(true);
});
test("creates room_members table", () => {
const sql = getMigrationSQL().toLowerCase();
return (
expect(
/create\s+table[\s\S]*?room_members/.test(sql) ||
/create\s+table[\s\S]*?room_users/.test(sql) ||
/create\s+table[\s\S]*?memberships/.test(sql)
);
},
},
{
name: "creates content table",
check: () => {
/create\s+table[\s\S]*?memberships/.test(sql),
).toBe(true);
});
test("creates content table", () => {
const sql = getMigrationSQL().toLowerCase();
return (
expect(
/create\s+table[\s\S]*?content/.test(sql) ||
/create\s+table[\s\S]*?items/.test(sql) ||
/create\s+table[\s\S]*?documents/.test(sql) ||
/create\s+table[\s\S]*?posts/.test(sql) ||
/create\s+table[\s\S]*?messages/.test(sql)
);
},
},
{
name: "room_members has role column with owner/editor/viewer",
check: () => {
/create\s+table[\s\S]*?messages/.test(sql),
).toBe(true);
});
test("room_members has role column with owner/editor/viewer", () => {
const sql = getMigrationSQL().toLowerCase();
return (
expect(
/role/.test(sql) &&
/owner/.test(sql) &&
/editor/.test(sql) &&
/viewer/.test(sql)
);
},
},
{
name: "enables RLS on all application tables",
check: () => {
/viewer/.test(sql),
).toBe(true);
});
test("enables RLS on all application tables", () => {
const sql = getMigrationSQL().toLowerCase();
const roomsRls =
/alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
@@ -97,72 +74,66 @@ export const assertions: EvalAssertion[] = [
/alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
);
return roomsRls && membershipRls && contentRls;
},
},
{
name: "FK to auth.users with ON DELETE CASCADE",
check: () => {
expect(roomsRls && membershipRls && contentRls).toBe(true);
});
test("FK to auth.users with ON DELETE CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
return (
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql)
);
},
},
{
name: "content has room_id FK referencing rooms",
check: () =>
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("content has room_id FK referencing rooms", () => {
expect(
/room_id[\s\S]*?references[\s\S]*?rooms/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "policies use (select auth.uid())",
check: () => {
).toBe(true);
});
test("policies use (select auth.uid())", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
if (policyBlocks.length === 0) return false;
if (policyBlocks.length === 0) {
expect(false).toBe(true);
return;
}
for (const policy of policyBlocks) {
if (
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
return false;
expect(false).toBe(true);
return;
}
}
return true;
},
},
{
name: "policies use TO authenticated",
check: () => {
expect(true).toBe(true);
});
test("policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const appPolicies = policyBlocks.filter(
(p) => !p.includes("realtime.messages"),
);
return (
expect(
appPolicies.length > 0 &&
appPolicies.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "private schema with security_definer helper function",
check: () => {
appPolicies.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("private schema with security_definer helper function", () => {
const sql = getMigrationSQL().toLowerCase();
return (
expect(
/create\s+schema[\s\S]*?private/.test(sql) &&
/private\./.test(sql) &&
/security\s+definer/.test(sql) &&
/set\s+search_path\s*=\s*''/.test(sql)
);
},
},
{
name: "role-based write policies: content INSERT/UPDATE restricted to owner or editor",
check: () => {
/set\s+search_path\s*=\s*''/.test(sql),
).toBe(true);
});
test("role-based write policies: content INSERT/UPDATE restricted to owner or editor", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const writePolicies = policyBlocks.filter(
@@ -174,14 +145,12 @@ export const assertions: EvalAssertion[] = [
p.includes("posts") ||
p.includes("messages")),
);
return writePolicies.some(
(p) => p.includes("owner") || p.includes("editor"),
);
},
},
{
name: "viewer role is read-only (no write access to content)",
check: () => {
expect(
writePolicies.some((p) => p.includes("owner") || p.includes("editor")),
).toBe(true);
});
test("viewer role is read-only (no write access to content)", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const contentWritePolicies = policyBlocks.filter(
@@ -193,8 +162,11 @@ export const assertions: EvalAssertion[] = [
p.includes("posts") ||
p.includes("messages")),
);
if (contentWritePolicies.length === 0) return true;
return !contentWritePolicies.some((p) => {
if (contentWritePolicies.length === 0) {
expect(true).toBe(true);
return;
}
const result = !contentWritePolicies.some((p) => {
const mentionsRole =
p.includes("owner") || p.includes("editor") || p.includes("viewer");
if (!mentionsRole) return true;
@@ -202,26 +174,26 @@ export const assertions: EvalAssertion[] = [
p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
);
});
},
},
{
name: "indexes on membership lookup columns",
check: () => {
expect(result).toBe(true);
});
test("indexes on membership lookup columns", () => {
const sql = getMigrationSQL().toLowerCase();
if (!/create\s+index/.test(sql)) return false;
if (!/create\s+index/.test(sql)) {
expect(false).toBe(true);
return;
}
const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
return (
expect(
indexBlocks.filter(
(idx) =>
idx.toLowerCase().includes("user_id") ||
idx.toLowerCase().includes("room_id"),
).length >= 1
);
},
},
{
name: "uses timestamptz not plain timestamp",
check: () => {
).length >= 1,
).toBe(true);
});
test("uses timestamptz not plain timestamp", () => {
const rawSql = getMigrationSQL().toLowerCase();
const sql = rawSql.replace(/--[^\n]*/g, "");
const hasPlainTimestamp =
@@ -231,36 +203,33 @@ export const assertions: EvalAssertion[] = [
sql.includes("updated_at") ||
sql.includes("_at ")
) {
return !hasPlainTimestamp.test(sql);
expect(hasPlainTimestamp.test(sql)).toBe(false);
} else {
expect(true).toBe(true);
}
return true;
},
},
{
name: "idempotent DDL",
check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
},
{
name: "realtime publication enabled for content table",
check: () =>
});
test("idempotent DDL", () => {
expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("realtime publication enabled for content table", () => {
expect(
/alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "broadcast trigger for content changes",
check: () => {
).toBe(true);
});
test("broadcast trigger for content changes", () => {
const sql = getMigrationSQL().toLowerCase();
return (
(/realtime\.broadcast_changes/.test(sql) ||
/realtime\.send/.test(sql)) &&
/create\s+trigger/.test(sql)
);
},
},
{
name: "broadcast trigger function uses security definer",
check: () => {
expect(
(/realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql)) &&
/create\s+trigger/.test(sql),
).toBe(true);
});
test("broadcast trigger function uses security definer", () => {
const sql = getMigrationSQL().toLowerCase();
const functionBlocks =
sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
@@ -269,46 +238,52 @@ export const assertions: EvalAssertion[] = [
f.toLowerCase().includes("realtime.broadcast_changes") ||
f.toLowerCase().includes("realtime.send"),
);
if (realtimeFunctions.length === 0) return false;
return realtimeFunctions.some(
if (realtimeFunctions.length === 0) {
expect(false).toBe(true);
return;
}
expect(
realtimeFunctions.some(
(f) =>
/security\s+definer/.test(f.toLowerCase()) &&
/set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
);
},
},
{
name: "RLS policies on realtime.messages",
check: () => {
),
).toBe(true);
});
test("RLS policies on realtime.messages", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const realtimePolicies = policyBlocks.filter((p) =>
p.includes("realtime.messages"),
);
if (realtimePolicies.length === 0) return false;
return realtimePolicies.some(
if (realtimePolicies.length === 0) {
expect(false).toBe(true);
return;
}
expect(
realtimePolicies.some(
(p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
);
},
},
{
name: "realtime policy checks extension column",
check: () => {
),
).toBe(true);
});
test("realtime policy checks extension column", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const realtimePolicies = policyBlocks.filter((p) =>
p.includes("realtime.messages"),
);
return realtimePolicies.some(
expect(
realtimePolicies.some(
(p) =>
p.includes("extension") &&
(p.includes("broadcast") || p.includes("presence")),
);
},
},
{
name: "overall quality score",
check: () => {
),
).toBe(true);
});
test("overall quality score", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -321,24 +296,19 @@ export const assertions: EvalAssertion[] = [
/alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
),
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
/create\s+schema[\s\S]*?private/.test(sql),
/security\s+definer/.test(sql) &&
/set\s+search_path\s*=\s*''/.test(sql),
/security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
policyBlocks.length > 0 &&
policyBlocks.filter((p) => !p.includes("realtime.messages")).length >
0 &&
policyBlocks.filter((p) => !p.includes("realtime.messages")).length > 0 &&
policyBlocks
.filter((p) => !p.includes("realtime.messages"))
.every((p) => /to\s+authenticated/.test(p)),
/create\s+index/.test(sql),
/timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
/if\s+not\s+exists/.test(sql),
sql.includes("owner") &&
sql.includes("editor") &&
sql.includes("viewer"),
sql.includes("owner") && sql.includes("editor") && sql.includes("viewer"),
/alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
/realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
/create\s+trigger/.test(sql),
@@ -348,7 +318,5 @@ export const assertions: EvalAssertion[] = [
.some((p) => p.includes("extension")),
/room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
];
return signals.filter(Boolean).length >= 13;
},
},
];
expect(signals.filter(Boolean).length >= 13).toBe(true);
});


@@ -0,0 +1,14 @@
export const expectedReferenceFiles = [
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-security-functions.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-schema-realtime.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
"realtime-setup-auth.md",
"realtime-broadcast-database.md",
"realtime-setup-channels.md",
];


@@ -1,5 +1,8 @@
{
"name": "collaborative-rooms-realtime",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,12 +1,6 @@
export const expectedReferenceFiles = [
"db-conn-pooling.md",
"db-migrations-idempotent.md",
"db-schema-auth-fk.md",
];
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
const cwd = process.cwd();
@@ -65,59 +59,51 @@ function getAllOutputContent(): string {
return parts.join("\n");
}
export const assertions: EvalAssertion[] = [
{
name: "prisma schema file exists",
check: () => findPrismaSchema() !== null,
},
{
name: "prisma schema references pooler port 6543",
check: () => /6543/.test(getAllOutputContent()),
},
{
name: "pgbouncer=true param present",
check: () =>
/pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase()),
},
{
name: "DIRECT_URL provided for migrations",
check: () => {
test("prisma schema file exists", () => {
expect(findPrismaSchema() !== null).toBe(true);
});
test("prisma schema references pooler port 6543", () => {
expect(/6543/.test(getAllOutputContent())).toBe(true);
});
test("pgbouncer=true param present", () => {
expect(/pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase())).toBe(
true,
);
});
test("DIRECT_URL provided for migrations", () => {
const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
return /directurl/.test(allContent) || /direct_url/.test(allContent);
},
},
{
name: "datasource block references directUrl or DIRECT_URL env var",
check: () => {
expect(/directurl/.test(allContent) || /direct_url/.test(allContent)).toBe(
true,
);
});
test("datasource block references directUrl or DIRECT_URL env var", () => {
const schema = getPrismaSchema().toLowerCase();
const datasourceBlock =
schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
return (
/directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock)
);
},
},
{
name: "connection limit set to 1 for serverless",
check: () => {
expect(
/directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock),
).toBe(true);
});
test("connection limit set to 1 for serverless", () => {
const content = getAllOutputContent().toLowerCase();
return (
expect(
/connection_limit\s*=\s*1/.test(content) ||
/connection_limit:\s*1/.test(content) ||
/connectionlimit\s*=\s*1/.test(content)
);
},
},
{
name: "explanation distinguishes port 6543 vs 5432",
check: () => {
/connectionlimit\s*=\s*1/.test(content),
).toBe(true);
});
test("explanation distinguishes port 6543 vs 5432", () => {
const content = getAllOutputContent();
return /6543/.test(content) && /5432/.test(content);
},
},
{
name: "overall quality: demonstrates correct Prisma + Supabase pooler setup",
check: () => {
expect(/6543/.test(content) && /5432/.test(content)).toBe(true);
});
test("overall quality: demonstrates correct Prisma + Supabase pooler setup", () => {
const schema = getPrismaSchema().toLowerCase();
const envContent = getAllEnvContent().toLowerCase();
const allContent = `${schema}\n${envContent}`;
@@ -128,7 +114,5 @@ export const assertions: EvalAssertion[] = [
/connection_limit\s*=\s*1|connection_limit:\s*1/,
/5432/,
];
return signals.filter((r) => r.test(allContent)).length >= 4;
},
},
];
expect(signals.filter((r) => r.test(allContent)).length >= 4).toBe(true);
});


@@ -0,0 +1,5 @@
export const expectedReferenceFiles = [
"db-conn-pooling.md",
"db-migrations-idempotent.md",
"db-schema-auth-fk.md",
];


@@ -1,5 +1,8 @@
{
"name": "connection-pooling-prisma",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,14 +1,6 @@
export const expectedReferenceFiles = [
"edge-fun-quickstart.md",
"edge-fun-project-structure.md",
"edge-pat-cors.md",
"edge-pat-error-handling.md",
"dev-getting-started.md",
];
import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
import {
findFunctionFile,
@@ -17,7 +9,7 @@ import {
getFunctionsDir,
getSharedCode,
getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";
const FUNCTION_NAME = "hello-world";
@@ -33,61 +25,57 @@ function getCatchBlockCode(): string {
return code.slice(catchIndex);
}
export const assertions: EvalAssertion[] = [
{
name: "supabase project initialized",
check: () => existsSync(join(getSupabaseDir(), "config.toml")),
},
{
name: "function directory exists",
check: () => existsSync(join(getFunctionsDir(), FUNCTION_NAME)),
},
{
name: "function index file exists",
check: () => findFunctionFile(FUNCTION_NAME) !== null,
},
{
name: "uses Deno.serve",
check: () => /Deno\.serve/.test(getFunctionCode(FUNCTION_NAME)),
},
{
name: "returns JSON response",
check: () => {
test("supabase project initialized", () => {
expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
});
test("function directory exists", () => {
expect(existsSync(join(getFunctionsDir(), FUNCTION_NAME))).toBe(true);
});
test("function index file exists", () => {
expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
});
test("uses Deno.serve", () => {
expect(/Deno\.serve/.test(getFunctionCode(FUNCTION_NAME))).toBe(true);
});
test("returns JSON response", () => {
const allCode = getAllCode();
return (
expect(
/content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
/Response\.json/i.test(allCode) ||
/JSON\.stringify/i.test(allCode)
);
},
},
{
name: "handles OPTIONS preflight",
check: () => {
/JSON\.stringify/i.test(allCode),
).toBe(true);
});
test("handles OPTIONS preflight", () => {
const allCode = getAllCode();
return /['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode);
},
},
{
name: "defines CORS headers",
check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
},
{
name: "CORS allows required headers",
check: () => {
expect(/['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode)).toBe(
true,
);
});
test("defines CORS headers", () => {
expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
});
test("CORS allows required headers", () => {
const allCode = getAllCode().toLowerCase();
return (
expect(
/access-control-allow-headers/.test(allCode) &&
/authorization/.test(allCode) &&
/apikey/.test(allCode)
);
},
},
{
name: "error response has CORS headers",
check: () => {
/apikey/.test(allCode),
).toBe(true);
});
test("error response has CORS headers", () => {
const catchCode = getCatchBlockCode();
if (catchCode.length === 0) return false;
if (catchCode.length === 0) {
expect(false).toBe(true);
return;
}
const sharedCode = getSharedCode();
const directCors =
/corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
@@ -95,51 +83,45 @@ export const assertions: EvalAssertion[] = [
/errorResponse|jsonResponse|json_response|error_response/i.test(
catchCode,
) && /Access-Control-Allow-Origin/i.test(sharedCode);
return directCors || callsSharedHelper;
},
},
{
name: "has try-catch for error handling",
check: () => {
expect(directCors || callsSharedHelper).toBe(true);
});
test("has try-catch for error handling", () => {
const code = getFunctionCode(FUNCTION_NAME);
return /\btry\s*\{/.test(code) && /\bcatch\b/.test(code);
},
},
{
name: "returns proper error status code",
check: () => {
expect(/\btry\s*\{/.test(code) && /\bcatch\b/.test(code)).toBe(true);
});
test("returns proper error status code", () => {
const catchCode = getCatchBlockCode();
if (catchCode.length === 0) return false;
return (
if (catchCode.length === 0) {
expect(false).toBe(true);
return;
}
expect(
/status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
/[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode)
);
},
},
{
name: "shared CORS module exists",
check: () => findSharedCorsFile() !== null,
},
{
name: "function imports from shared",
check: () =>
/[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode),
).toBe(true);
});
test("shared CORS module exists", () => {
expect(findSharedCorsFile() !== null).toBe(true);
});
test("function imports from shared", () => {
expect(
/from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
},
{
name: "function uses hyphenated name",
check: () => {
).toBe(true);
});
test("function uses hyphenated name", () => {
const dirs = existsSync(getFunctionsDir())
? readdirSync(getFunctionsDir())
: [];
const helloDir = dirs.find(
(d) => d.includes("hello") && d.includes("world"),
);
return helloDir !== undefined && /^hello-world$/.test(helloDir);
},
},
{
name: "overall quality: demonstrates Edge Function best practices",
check: () => {
const helloDir = dirs.find((d) => d.includes("hello") && d.includes("world"));
expect(helloDir !== undefined && /^hello-world$/.test(helloDir)).toBe(true);
});
test("overall quality: demonstrates Edge Function best practices", () => {
const allCode = getAllCode().toLowerCase();
const signals = [
/deno\.serve/,
@@ -151,7 +133,5 @@ export const assertions: EvalAssertion[] = [
/authorization/,
/apikey/,
];
return signals.filter((r) => r.test(allCode)).length >= 6;
},
},
];
expect(signals.filter((r) => r.test(allCode)).length >= 6).toBe(true);
});


@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"edge-fun-quickstart.md",
"edge-fun-project-structure.md",
"edge-pat-cors.md",
"edge-pat-error-handling.md",
"dev-getting-started.md",
];


@@ -1,5 +1,8 @@
{
"name": "edge-function-hello-world",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,83 +1,70 @@
export const expectedReferenceFiles = [
"db-schema-extensions.md",
"db-rls-mandatory.md",
"db-migrations-idempotent.md",
"db-schema-auth-fk.md",
"db-rls-common-mistakes.md",
];
import { expect, test } from "vitest";
import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [
{
name: "migration file exists",
check: () => findMigrationFiles().length > 0,
},
{
name: "extension installed in extensions schema",
check: () =>
test("extension installed in extensions schema", () => {
expect(
/create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "IF NOT EXISTS on extension creation",
check: () =>
).toBe(true);
});
test("IF NOT EXISTS on extension creation", () => {
expect(
/create\s+extension\s+if\s+not\s+exists/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "vector column with correct dimensions",
check: () =>
).toBe(true);
});
test("vector column with correct dimensions", () => {
expect(
/(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "HNSW index used instead of IVFFlat",
check: () => /using\s+hnsw/.test(getMigrationSQL().toLowerCase()),
},
{
name: "RLS enabled on documents table",
check: () =>
).toBe(true);
});
test("HNSW index used instead of IVFFlat", () => {
expect(/using\s+hnsw/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("RLS enabled on documents table", () => {
expect(
/alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "FK to auth.users with ON DELETE CASCADE",
check: () => {
).toBe(true);
});
test("FK to auth.users with ON DELETE CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
return (
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql)
);
},
},
{
name: "policies use TO authenticated",
check: () => {
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return (
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "idempotent table creation (IF NOT EXISTS)",
check: () =>
/create\s+table\s+if\s+not\s+exists/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "overall quality: demonstrates pgvector best practices",
check: () => {
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("idempotent table creation (IF NOT EXISTS)", () => {
expect(
/create\s+table\s+if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
).toBe(true);
});
test("overall quality: demonstrates pgvector best practices", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -88,13 +75,10 @@ export const assertions: EvalAssertion[] = [
/alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
),
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
/if\s+not\s+exists/.test(sql),
];
return signals.filter(Boolean).length >= 6;
},
},
];
expect(signals.filter(Boolean).length >= 6).toBe(true);
});


@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-schema-extensions.md",
"db-rls-mandatory.md",
"db-migrations-idempotent.md",
"db-schema-auth-fk.md",
"db-rls-common-mistakes.md",
];


@@ -1,5 +1,8 @@
{
"name": "extension-wrong-schema",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,14 +1,6 @@
export const expectedReferenceFiles = [
"db-rls-views.md",
"db-migrations-idempotent.md",
"db-rls-mandatory.md",
"db-rls-performance.md",
"db-schema-timestamps.md",
];
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
const migrationsDir = join(process.cwd(), "supabase", "migrations");
const STARTER_MIGRATION = "20240101000000_create_products.sql";
@@ -29,71 +21,70 @@ function getAgentMigrationSQL(): string {
return files.map((f) => readFileSync(f, "utf-8")).join("\n");
}
export const assertions: EvalAssertion[] = [
{
name: "new migration file exists",
check: () => findAgentMigrationFiles().length > 0,
},
{
name: "ADD COLUMN IF NOT EXISTS for description",
check: () =>
test("new migration file exists", () => {
expect(findAgentMigrationFiles().length > 0).toBe(true);
});
test("ADD COLUMN IF NOT EXISTS for description", () => {
expect(
/add\s+column\s+if\s+not\s+exists\s+description/.test(
getAgentMigrationSQL().toLowerCase(),
),
},
{
name: "ADD COLUMN IF NOT EXISTS for published_at",
check: () =>
).toBe(true);
});
test("ADD COLUMN IF NOT EXISTS for published_at", () => {
expect(
/add\s+column\s+if\s+not\s+exists\s+published_at/.test(
getAgentMigrationSQL().toLowerCase(),
),
},
{
name: "published_at uses timestamptz not plain timestamp",
check: () => {
).toBe(true);
});
test("published_at uses timestamptz not plain timestamp", () => {
const sql = getAgentMigrationSQL().toLowerCase();
return (
expect(
/published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
sql,
) &&
!/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
sql,
)
);
},
},
{
name: "view public_products is created",
check: () =>
!/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(sql),
).toBe(true);
});
test("view public_products is created", () => {
expect(
/create\s+(or\s+replace\s+)?view\s+public_products/.test(
getAgentMigrationSQL().toLowerCase(),
),
},
{
name: "view uses security_invoker = true",
check: () =>
).toBe(true);
});
test("view uses security_invoker = true", () => {
expect(
/security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
},
{
name: "SELECT policy on products for authenticated role",
check: () => {
).toBe(true);
});
test("SELECT policy on products for authenticated role", () => {
const sql = getAgentMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return policyBlocks.some(
expect(
policyBlocks.some(
(p) =>
p.includes("select") &&
p.includes("products") &&
/to\s+authenticated/.test(p),
),
).toBe(true);
});
test("NOTIFY pgrst reload schema is present", () => {
expect(/notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase())).toBe(
true,
);
},
},
{
name: "NOTIFY pgrst reload schema is present",
check: () => /notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase()),
},
{
name: "overall quality: demonstrates PostgREST and schema best practices",
check: () => {
});
test("overall quality: demonstrates PostgREST and schema best practices", () => {
const sql = getAgentMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -108,7 +99,5 @@ export const assertions: EvalAssertion[] = [
),
/notify\s+pgrst/.test(sql),
];
return signals.filter(Boolean).length >= 5;
},
},
];
expect(signals.filter(Boolean).length >= 5).toBe(true);
});


@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-views.md",
"db-migrations-idempotent.md",
"db-rls-mandatory.md",
"db-rls-performance.md",
"db-schema-timestamps.md",
];


@@ -1,5 +1,8 @@
{
"name": "postgrest-schema-cache",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,65 +1,49 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-timestamps.md",
];
import { expect, test } from "vitest";
import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [
{
name: "migration file exists",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates orders table",
check: () => {
test("creates orders table", () => {
const sql = getMigrationSQL().toLowerCase();
return /create\s+table/.test(sql) && /orders/.test(sql);
},
},
{
name: "enables RLS on orders table",
check: () =>
expect(/create\s+table/.test(sql) && /orders/.test(sql)).toBe(true);
});
test("enables RLS on orders table", () => {
expect(
/alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "has SELECT policy on orders",
check: () => {
).toBe(true);
});
test("has SELECT policy on orders", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return policyBlocks.some((p) => p.includes("for select"));
},
},
{
name: "has UPDATE policy with WITH CHECK on orders",
check: () => {
expect(policyBlocks.some((p) => p.includes("for select"))).toBe(true);
});
test("has UPDATE policy with WITH CHECK on orders", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
return updatePolicy !== undefined && /with\s+check/.test(updatePolicy);
},
},
{
name: "all policies use TO authenticated",
check: () => {
expect(updatePolicy !== undefined && /with\s+check/.test(updatePolicy)).toBe(
true,
);
});
test("all policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return (
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "uses (select auth.uid()) not bare auth.uid() in policies",
check: () => {
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
for (const policy of policyBlocks) {
@@ -67,38 +51,32 @@ export const assertions: EvalAssertion[] = [
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
return false;
expect(false).toBe(true);
return;
}
}
return true;
},
},
{
name: "uses timestamptz not plain timestamp for created_at",
check: () => {
expect(true).toBe(true);
});
test("uses timestamptz not plain timestamp for created_at", () => {
const rawSql = getMigrationSQL().toLowerCase();
const sql = rawSql.replace(/--[^\n]*/g, "");
const hasPlainTimestamp =
/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
if (sql.includes("created_at")) {
return !hasPlainTimestamp.test(sql);
expect(hasPlainTimestamp.test(sql)).toBe(false);
} else {
expect(true).toBe(true);
}
return true;
},
},
{
name: "FK to auth.users with ON DELETE CASCADE",
check: () => {
});
test("FK to auth.users with ON DELETE CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
return (
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql)
);
},
},
{
name: "overall quality: demonstrates Supabase best practices",
check: () => {
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("overall quality: demonstrates Supabase best practices", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -110,13 +88,10 @@ export const assertions: EvalAssertion[] = [
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
!/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
sql.replace(/--[^\n]*/g, ""),
),
];
return signals.filter(Boolean).length >= 5;
},
},
];
expect(signals.filter(Boolean).length >= 5).toBe(true);
});


@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-timestamps.md",
];


@@ -1,5 +1,8 @@
{
"name": "rls-update-needs-select",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,78 +1,56 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-auth-fk.md",
];
import { expect, test } from "vitest";
import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists in supabase/migrations/", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [
{
name: "migration file exists in supabase/migrations/",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates documents table",
check: () => {
test("creates documents table", () => {
const sql = getMigrationSQL().toLowerCase();
return /create\s+table/.test(sql) && /documents/.test(sql);
},
},
{
name: "RLS enabled on documents table",
check: () =>
expect(/create\s+table/.test(sql) && /documents/.test(sql)).toBe(true);
});
test("RLS enabled on documents table", () => {
expect(
/alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
},
{
name: "uses app_metadata not user_metadata for role check",
check: () => /app_metadata/.test(getMigrationSQL().toLowerCase()),
},
{
name: "user_metadata does not appear in policy USING clauses",
check: () => {
).toBe(true);
});
test("uses app_metadata not user_metadata for role check", () => {
expect(/app_metadata/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("user_metadata does not appear in policy USING clauses", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return policyBlocks.every((p) => !p.includes("user_metadata"));
},
},
{
name: "has at least two SELECT policies (owner and admin)",
check: () => {
expect(policyBlocks.every((p) => !p.includes("user_metadata"))).toBe(true);
});
test("has at least two SELECT policies (owner and admin)", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const hasOwnerPolicy = policyBlocks.some(
(p) =>
(p.includes("select") || !p.includes("insert")) &&
(p.includes("user_id") ||
p.includes("owner") ||
p.includes("auth.uid")),
(p.includes("user_id") || p.includes("owner") || p.includes("auth.uid")),
);
const hasAdminPolicy = policyBlocks.some((p) =>
p.includes("app_metadata"),
);
return hasOwnerPolicy && hasAdminPolicy;
},
},
{
name: "policies use TO authenticated",
check: () => {
const hasAdminPolicy = policyBlocks.some((p) => p.includes("app_metadata"));
expect(hasOwnerPolicy && hasAdminPolicy).toBe(true);
});
test("policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
return (
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p))
);
},
},
{
name: "uses (select auth.uid()) subselect form in policies",
check: () => {
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("uses (select auth.uid()) subselect form in policies", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
for (const policy of policyBlocks) {
@@ -80,25 +58,21 @@ export const assertions: EvalAssertion[] = [
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
return false;
expect(false).toBe(true);
return;
}
}
return true;
},
},
{
name: "FK to auth.users with ON DELETE CASCADE",
check: () => {
expect(true).toBe(true);
});
test("FK to auth.users with ON DELETE CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
return (
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql)
);
},
},
{
name: "overall quality: demonstrates Supabase best practices",
check: () => {
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("overall quality: demonstrates Supabase best practices", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -108,16 +82,11 @@ export const assertions: EvalAssertion[] = [
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
/references\s+auth\.users/.test(sql) &&
/on\s+delete\s+cascade/.test(sql),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
policyBlocks.some(
(p) =>
p.includes("user_id") ||
p.includes("owner") ||
p.includes("auth.uid"),
p.includes("user_id") || p.includes("owner") || p.includes("auth.uid"),
) && policyBlocks.some((p) => p.includes("app_metadata")),
];
return signals.filter(Boolean).length >= 5;
},
},
];
expect(signals.filter(Boolean).length >= 5).toBe(true);
});


@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-auth-fk.md",
];


@@ -1,5 +1,8 @@
{
"name": "rls-user-metadata-role-check",
"private": true,
"type": "module"
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}


@@ -1,21 +1,13 @@
export const expectedReferenceFiles = [
"db-security-service-role.md",
"edge-fun-quickstart.md",
"edge-db-supabase-client.md",
"edge-pat-cors.md",
"edge-pat-error-handling.md",
];
import { existsSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";
import {
findFunctionFile,
getFunctionCode,
getSharedCode,
getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";
const FUNCTION_NAME = "admin-reports";
@@ -24,69 +16,63 @@ function getAllCode(): string {
return `${code}\n${getSharedCode()}`;
}
test("supabase project initialized (config.toml exists)", () => {
expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
});
test("edge function file exists", () => {
expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
});
test("uses Deno.env.get for service role key", () => {
expect(
/Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
getAllCode(),
),
).toBe(true);
});
test("no hardcoded service role key", () => {
const allCode = getAllCode();
const lines = allCode.split("\n");
const nonCommentLines = lines.filter(
(line) => !line.trimStart().startsWith("//"),
);
expect(
nonCommentLines.some((line) =>
/(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
),
).toBe(false);
});
test("createClient called with service role env var as second argument", () => {
const allCode = getAllCode();
expect(
/createClient/i.test(allCode) &&
/Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
allCode,
),
).toBe(true);
});
test("service role key env var name does not use NEXT_PUBLIC_ prefix", () => {
expect(/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode())).toBe(false);
});
test("CORS headers present", () => {
expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
});
test("returns JSON response", () => {
const allCode = getAllCode();
expect(
/content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
/Response\.json/i.test(allCode) ||
/JSON\.stringify/i.test(allCode),
).toBe(true);
});
test("overall quality: demonstrates service role Edge Function best practices", () => {
const allCode = getAllCode();
const signals: RegExp[] = [
/Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
@@ -96,7 +82,5 @@ export const assertions: EvalAssertion[] = [
/Response\.json|JSON\.stringify/,
/Deno\.serve/,
];
expect(signals.filter((r) => r.test(allCode)).length >= 5).toBe(true);
});
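The hardcoded-key check above combines comment filtering with a JWT-shaped regex; a standalone sketch of the same pattern, run against made-up sample lines (the `eyJfake…` string is a fabricated stand-in, not a real key):

```typescript
// Sketch of the hardcoded service-role key check used in this EVAL file.
// Sample lines are illustrative; "eyJfakeHeader.fakePayload" is not a real token.
const sample = [
  'const key = Deno.env.get("SUPABASE_SERVICE_ROLE_KEY");', // OK: env lookup
  '// const old = "eyJfakeHeader.fakePayload";', // comment: filtered out
  'const leaked = "eyJfakeHeader.fakePayload";', // flagged: inline JWT-like literal
];

// Drop full-line comments so commented-out keys don't trip the check.
const nonCommentLines = sample.filter(
  (line) => !line.trimStart().startsWith("//"),
);
// JWTs always start with "eyJ" (base64 of '{"'), so a quoted eyJ… literal
// in non-comment code is treated as a hardcoded key.
const hardcoded = nonCommentLines.some((line) =>
  /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
);
console.log(hardcoded); // → true
```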

View File

@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-security-service-role.md",
"edge-fun-quickstart.md",
"edge-db-supabase-client.md",
"edge-pat-cors.md",
"edge-pat-error-handling.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "service-role-edge-function",
"private": true,
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -1,91 +1,74 @@
import { expect, test } from "vitest";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
test("creates avatars bucket", () => {
const sql = getMigrationSQL().toLowerCase();
if (
!/storage\.buckets/.test(sql) ||
!/avatars/.test(sql) ||
!/public/.test(sql)
) {
expect(false).toBe(true);
return;
}
const avatarsBlock = sql.match(
/insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
);
expect(avatarsBlock !== null && /true/.test(avatarsBlock[0])).toBe(true);
});
test("creates documents bucket", () => {
const sql = getMigrationSQL().toLowerCase();
if (!/documents/.test(sql)) {
expect(false).toBe(true);
return;
}
const documentsBlock = sql.match(
/insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
);
expect(documentsBlock !== null && /false/.test(documentsBlock[0])).toBe(true);
});
test("avatars bucket has mime type restriction", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/allowed_mime_types/.test(sql) &&
/image\/jpeg/.test(sql) &&
/image\/png/.test(sql) &&
/image\/webp/.test(sql),
).toBe(true);
});
test("avatars bucket has file size limit", () => {
const sql = getMigrationSQL().toLowerCase();
if (!/file_size_limit/.test(sql)) {
expect(false).toBe(true);
return;
}
expect(
/2097152/.test(sql) ||
/2\s*m/i.test(sql) ||
/2\s*\*\s*1024\s*\*\s*1024/.test(sql),
).toBe(true);
});
test("storage policy uses foldername or path for user isolation", () => {
const sql = getMigrationSQL().toLowerCase();
const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
const usesPathMatch =
/\(\s*storage\.foldername\s*\(/.test(sql) ||
/\bname\b.*auth\.uid\(\)/.test(sql);
expect(
(usesFoldername || usesPathMatch) && /auth\.uid\(\)\s*::\s*text/.test(sql),
).toBe(true);
});
test("storage policy uses TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const storagePolicies = policyBlocks.filter((p) =>
@@ -96,20 +79,23 @@ export const assertions: EvalAssertion[] = [
/to\s+(authenticated|public)/.test(p.toLowerCase()) ||
/auth\.uid\(\)/.test(p.toLowerCase()),
);
if (!hasAuthenticatedPolicy) {
expect(false).toBe(true);
return;
}
const insertPolicies = storagePolicies.filter((p) =>
/for\s+insert/.test(p.toLowerCase()),
);
expect(
insertPolicies.every(
(p) =>
/to\s+authenticated/.test(p.toLowerCase()) ||
/auth\.uid\(\)/.test(p.toLowerCase()),
),
).toBe(true);
});
test("public read policy for avatars", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const avatarSelectPolicies = policyBlocks.filter(
@@ -118,20 +104,23 @@ export const assertions: EvalAssertion[] = [
/for\s+select/.test(p.toLowerCase()) &&
p.toLowerCase().includes("avatars"),
);
if (avatarSelectPolicies.length === 0) {
expect(false).toBe(true);
return;
}
expect(
avatarSelectPolicies.some((p) => {
const lower = p.toLowerCase();
const hasExplicitPublic =
/to\s+public/.test(lower) || /to\s+anon/.test(lower);
const hasNoToClause = !/\bto\s+\w+/.test(lower);
const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
}),
).toBe(true);
});
test("documents bucket is fully private", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const documentPolicies = policyBlocks.filter(
@@ -139,42 +128,41 @@ export const assertions: EvalAssertion[] = [
p.toLowerCase().includes("storage.objects") &&
p.toLowerCase().includes("documents"),
);
if (documentPolicies.length === 0) {
expect(false).toBe(true);
return;
}
expect(
documentPolicies.every(
(p) =>
!/to\s+public/.test(p) &&
!/to\s+anon/.test(p) &&
(/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
),
).toBe(true);
});
test("creates file_metadata table", () => {
const sql = getMigrationSQL().toLowerCase();
expect(/create\s+table/.test(sql) && /file_metadata/.test(sql)).toBe(true);
});
test("file_metadata has FK to auth.users with CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("RLS enabled on file_metadata", () => {
expect(
/alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
getMigrationSQL().toLowerCase(),
),
).toBe(true);
});
test("file_metadata policies use (select auth.uid())", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const metadataPolicies = policyBlocks.filter((p) =>
@@ -185,50 +173,51 @@ export const assertions: EvalAssertion[] = [
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
expect(false).toBe(true);
return;
}
}
expect(true).toBe(true);
});
test("uses timestamptz for time columns", () => {
const sql = getMigrationSQL().toLowerCase();
if (
!sql.includes("created_at") &&
!sql.includes("updated_at") &&
!sql.includes("uploaded_at")
) {
expect(true).toBe(true);
return;
}
const columnDefs = sql.match(
/(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
);
if (!columnDefs) {
expect(true).toBe(true);
return;
}
expect(
columnDefs.every((def) =>
/timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
),
).toBe(true);
});
test("index on file_metadata user_id", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/create\s+index/.test(sql) &&
/file_metadata/.test(sql) &&
/user_id/.test(sql),
).toBe(true);
});
test("idempotent DDL", () => {
expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("overall quality score", () => {
const sql = getMigrationSQL().toLowerCase();
const signals = [
/insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
@@ -247,7 +236,5 @@ export const assertions: EvalAssertion[] = [
/if\s+not\s+exists/,
/create\s+table[\s\S]*?file_metadata/,
];
expect(signals.filter((r) => r.test(sql)).length >= 11).toBe(true);
});
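Several tests in these EVAL files enforce the initplan-friendly `(select auth.uid())` wrapper with the same two regexes; a self-contained sketch of that core pattern, run against made-up policy SQL (not from a real migration):

```typescript
// Sketch of the "(select auth.uid())" wrapper check used across these EVAL files.
// The SQL below is illustrative.
const sql = `
create policy "own rows" on public.todos
for select to authenticated
using ( (select auth.uid()) = user_id );
create policy "bad rows" on public.todos
for delete to authenticated
using ( auth.uid() = user_id );
`;

// Each CREATE POLICY statement runs up to its terminating semicolon.
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
// A policy is an offender when it calls auth.uid() without the select wrapper,
// which would re-evaluate the function per row instead of once per query.
const offenders = policyBlocks.filter(
  (p) =>
    p.includes("auth.uid()") && !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(p),
);
console.log(policyBlocks.length, offenders.length); // → 2 1
```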

View File

@@ -0,0 +1,10 @@
export const expectedReferenceFiles = [
"storage-access-control.md",
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "storage-rls-user-folders",
"private": true,
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -1,46 +1,32 @@
import { expect, test } from "vitest";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
test("creates organizations table", () => {
expect(
/create\s+table[\s\S]*?organizations/.test(getMigrationSQL().toLowerCase()),
).toBe(true);
});
test("creates memberships table", () => {
expect(
/create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
).toBe(true);
});
test("creates projects table", () => {
expect(
/create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
).toBe(true);
});
test("enables RLS on all tables", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
) &&
@@ -49,130 +35,125 @@ export const assertions: EvalAssertion[] = [
) &&
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
),
).toBe(true);
});
test("FK to auth.users with ON DELETE CASCADE", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
).toBe(true);
});
test("org_id FK on projects", () => {
expect(
/org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
getMigrationSQL().toLowerCase(),
),
).toBe(true);
});
test("private schema created", () => {
expect(
/create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
).toBe(true);
});
test("security_definer helper function", () => {
const sql = getMigrationSQL().toLowerCase();
expect(
/private\./.test(sql) &&
/security\s+definer/.test(sql) &&
/set\s+search_path\s*=\s*''/.test(sql),
).toBe(true);
});
test("policies use (select auth.uid())", () => {
const sql = getMigrationSQL();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
if (policyBlocks.length === 0) {
expect(false).toBe(true);
return;
}
for (const policy of policyBlocks) {
if (
policy.includes("auth.uid()") &&
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
) {
expect(false).toBe(true);
return;
}
}
expect(true).toBe(true);
});
test("policies use TO authenticated", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
expect(
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
).toBe(true);
});
test("index on membership lookup columns", () => {
const sql = getMigrationSQL().toLowerCase();
if (!/create\s+index/.test(sql)) {
expect(false).toBe(true);
return;
}
const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
expect(
indexBlocks.filter(
(idx) =>
idx.includes("user_id") ||
idx.includes("org_id") ||
idx.includes("organization_id"),
).length >= 1,
).toBe(true);
});
test("uses timestamptz", () => {
const rawSql = getMigrationSQL().toLowerCase();
const sql = rawSql.replace(/--[^\n]*/g, "");
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
if (
sql.includes("created_at") ||
sql.includes("updated_at") ||
sql.includes("_at ")
) {
expect(hasPlainTimestamp.test(sql)).toBe(false);
} else {
expect(true).toBe(true);
}
});
test("idempotent DDL", () => {
expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});
test("stable or immutable on helper function", () => {
expect(/\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase())).toBe(
true,
);
});
test("delete policy restricted to owner role", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const deletePolicy = policyBlocks.find(
(p) =>
p.toLowerCase().includes("delete") && p.toLowerCase().includes("project"),
);
if (!deletePolicy) {
expect(false).toBe(true);
return;
}
expect(/owner|admin/.test(deletePolicy.toLowerCase())).toBe(true);
});
test("overall quality score", () => {
const sql = getMigrationSQL().toLowerCase();
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
const signals = [
@@ -185,11 +166,9 @@ export const assertions: EvalAssertion[] = [
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
sql,
),
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
/create\s+schema[\s\S]*?private/.test(sql),
/security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
policyBlocks.length > 0 &&
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
@@ -210,7 +189,5 @@ export const assertions: EvalAssertion[] = [
/private\./.test(sql),
/\bstable\b|\bimmutable\b/.test(sql),
];
expect(signals.filter(Boolean).length >= 11).toBe(true);
});
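The `timestamptz` tests above hinge on a single negative-lookahead regex; a quick standalone sketch with illustrative column definitions:

```typescript
// Sketch of the plain-timestamp detector used by the "uses timestamptz" tests.
// The DDL fragments are illustrative.
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;

const good = "created_at timestamptz not null default now()";
const alsoGood = "created_at timestamp with time zone not null";
const bad = "created_at timestamp not null";

console.log(
  hasPlainTimestamp.test(good), // false: \b never falls inside "timestamptz"
  hasPlainTimestamp.test(alsoGood), // false: lookahead rejects "with time zone"
  hasPlainTimestamp.test(bad), // true: bare timestamp is flagged
);
```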

View File

@@ -0,0 +1,11 @@
export const expectedReferenceFiles = [
"db-rls-mandatory.md",
"db-rls-policy-types.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-security-functions.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
];

View File

@@ -1,5 +1,8 @@
{
"name": "team-rls-security-definer",
"private": true,
"type": "module",
"devDependencies": {
"vitest": "^2.0.0"
}
}

View File

@@ -0,0 +1,125 @@
import { execFileSync } from "node:child_process";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import type { ExperimentConfig } from "@vercel/agent-eval";
const __dirname = dirname(fileURLToPath(import.meta.url));
const EVALS_ROOT = resolve(__dirname, "..");
const REPO_ROOT = resolve(EVALS_ROOT, "..", "..");
const PROJECT_DIR = join(EVALS_ROOT, "project");
const SKILL_NAME = process.env.EVAL_SKILL ?? "supabase";
const SKILL_DIR = join(REPO_ROOT, "skills", SKILL_NAME);
const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const isBaseline = process.env.EVAL_BASELINE === "true";
// ---------------------------------------------------------------------------
// Skill file loader — reads all skill files to inject into the sandbox
// ---------------------------------------------------------------------------
function readSkillFiles(): Record<string, string> {
const files: Record<string, string> = {};
for (const name of ["SKILL.md", "AGENTS.md"]) {
const src = join(SKILL_DIR, name);
if (existsSync(src)) {
const content = readFileSync(src, "utf-8");
files[`.agents/skills/${SKILL_NAME}/${name}`] = content;
files[`.claude/skills/${SKILL_NAME}/${name}`] = content;
}
}
const refsDir = join(SKILL_DIR, "references");
if (existsSync(refsDir)) {
for (const f of readdirSync(refsDir)) {
const content = readFileSync(join(refsDir, f), "utf-8");
files[`.agents/skills/${SKILL_NAME}/references/${f}`] = content;
files[`.claude/skills/${SKILL_NAME}/references/${f}`] = content;
}
}
return files;
}
// ---------------------------------------------------------------------------
// DB reset — clears all user-created objects between scenarios
// ---------------------------------------------------------------------------
const RESET_SQL = `
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
NOTIFY pgrst, 'reload schema';
`.trim();
function resetDB(): void {
const dbUrl =
process.env.SUPABASE_DB_URL ??
"postgresql://postgres:postgres@127.0.0.1:54322/postgres";
execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
stdio: "inherit",
timeout: 30_000,
});
}
// ---------------------------------------------------------------------------
// Experiment configuration
// ---------------------------------------------------------------------------
const config: ExperimentConfig = {
agent: "claude-code",
model: "claude-sonnet-4-6",
runs: 1,
earlyExit: true,
timeout: 1800,
sandbox: "docker",
evals: process.env.EVAL_SCENARIO ?? "*",
setup: async (sandbox) => {
// 1. Reset DB for a clean slate
resetDB();
// 2. Seed supabase config so the agent can run `supabase db push`
const configPath = join(PROJECT_DIR, "supabase", "config.toml");
if (existsSync(configPath)) {
await sandbox.writeFiles({
"supabase/config.toml": readFileSync(configPath, "utf-8"),
});
}
// 3. Write MCP config pointing to host Supabase instance
await sandbox.writeFiles({
".mcp.json": JSON.stringify(
{
mcpServers: {
supabase: { type: "http", url: `${supabaseUrl}/mcp` },
},
},
null,
"\t",
),
});
// 4. Write eval-utils.ts into the workspace so EVAL.ts can import it
// (agent-eval only copies the fixture's own directory into the sandbox)
const evalUtilsPath = join(EVALS_ROOT, "evals", "eval-utils.ts");
if (existsSync(evalUtilsPath)) {
await sandbox.writeFiles({
"eval-utils.ts": readFileSync(evalUtilsPath, "utf-8"),
});
}
// 5. Install skill files (unless baseline mode)
if (!isBaseline) {
await sandbox.writeFiles(readSkillFiles());
}
},
};
export default config;
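`readSkillFiles` above mirrors every skill file into both the `.agents` and `.claude` trees so either agent layout finds it; the mapping it builds reduces to this sketch (skill name and file contents are illustrative):

```typescript
// Sketch of the dual-destination mapping readSkillFiles builds.
// File names and contents below are illustrative.
const SKILL_NAME = "supabase";
const sources: Array<[string, string]> = [
  ["SKILL.md", "# Supabase skill"],
  ["references/db-rls-mandatory.md", "RLS is mandatory..."],
];

const files: Record<string, string> = {};
for (const [relPath, content] of sources) {
  // Each source file lands in both agent skill directories.
  files[`.agents/skills/${SKILL_NAME}/${relPath}`] = content;
  files[`.claude/skills/${SKILL_NAME}/${relPath}`] = content;
}
console.log(Object.keys(files).length); // → 4
```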

File diff suppressed because it is too large

View File

@@ -6,17 +6,19 @@
"license": "MIT",
"description": "Agent evaluation system for Supabase skills",
"scripts": {
"eval": "agent-eval",
"eval:dry": "agent-eval --dry",
"eval:smoke": "agent-eval --smoke",
"eval:upload": "tsx src/upload.ts"
},
"dependencies": {
"@vercel/agent-eval": "^0.9.2",
"braintrust": "^3.0.0"
},
"devDependencies": {
"@types/node": "^20.10.0",
"tsx": "^4.7.0",
"typescript": "^5.3.0",
"vitest": "^4.0.18"
}
}

packages/evals/scripts/eval.sh Executable file
View File

@@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
EVALS_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
PROJECT_DIR="$EVALS_DIR/project"
# ---------------------------------------------------------------------------
# Parse CLI arguments
# ---------------------------------------------------------------------------
AGENT_EVAL_ARGS=()
UPLOAD=true # Always upload to Braintrust by default
while [[ $# -gt 0 ]]; do
case "$1" in
--skill)
export EVAL_SKILL="$2"
shift 2
;;
--scenario)
export EVAL_SCENARIO="$2"
shift 2
;;
--no-upload)
UPLOAD=false
shift
;;
*)
AGENT_EVAL_ARGS+=("$1")
shift
;;
esac
done
echo "Starting Supabase..."
supabase start --exclude studio,imgproxy,mailpit --workdir "$PROJECT_DIR"
# Export keys so experiment.ts and vitest assertions can connect
eval "$(supabase status --output json --workdir "$PROJECT_DIR" | \
node -e "
const s = JSON.parse(require('fs').readFileSync('/dev/stdin','utf-8'));
console.log('export SUPABASE_URL=' + (s.API_URL || 'http://127.0.0.1:54321'));
console.log('export SUPABASE_ANON_KEY=' + s.ANON_KEY);
console.log('export SUPABASE_SERVICE_ROLE_KEY=' + s.SERVICE_ROLE_KEY);
console.log('export SUPABASE_DB_URL=' + (s.DB_URL || 'postgresql://postgres:postgres@127.0.0.1:54322/postgres'));
")"
trap 'echo "Stopping Supabase..."; supabase stop --no-backup --workdir "$PROJECT_DIR"' EXIT
echo "Running agent-eval..."
cd "$EVALS_DIR"
npx agent-eval "${AGENT_EVAL_ARGS[@]+"${AGENT_EVAL_ARGS[@]}"}"
# Upload results to Braintrust (default: true, skip with --no-upload)
if [ "$UPLOAD" = "true" ]; then
echo "Uploading results to Braintrust..."
npx tsx src/upload.ts
fi

View File

@@ -1,21 +0,0 @@
/**
* A single assertion to run against the agent's workspace output.
*
* Used by EVAL.ts files to declare what the agent's work should produce.
* The runner executes these in-process (no test framework required).
*/
export interface EvalAssertion {
/** Human-readable name shown in Braintrust and local output */
name: string;
/** Return true = pass, false/throw = fail */
check: () => boolean | Promise<boolean>;
/** Timeout in ms for async checks (default: no timeout) */
timeout?: number;
}
/** Result of running a single EvalAssertion */
export interface AssertionResult {
name: string;
passed: boolean;
error?: string;
}

View File

@@ -1,372 +0,0 @@
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join, resolve } from "node:path";
import type { AssertionResult, EvalAssertion } from "./eval-types.js";
import { runAgent } from "./runner/agent.js";
import {
seedBraintrustDataset,
uploadToBraintrust,
} from "./runner/braintrust.js";
import { createResultDir, saveRunArtifacts } from "./runner/persist.js";
import { preflight } from "./runner/preflight.js";
import { listModifiedFiles, printSummary } from "./runner/results.js";
import { createWorkspace } from "./runner/scaffold.js";
import {
assertionsPassedScorer,
finalResultScorer,
referenceFilesUsageScorer,
skillUsageScorer,
} from "./runner/scorers.js";
import {
getKeys,
resetDB,
startSupabase,
stopSupabase,
} from "./runner/supabase-setup.js";
import {
buildTranscriptSummary,
type TranscriptSummary,
} from "./runner/transcript.js";
import type { EvalRunResult, EvalScenario } from "./types.js";
// ---------------------------------------------------------------------------
// Configuration from environment
// ---------------------------------------------------------------------------
const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";
const DEFAULT_SKILL = "supabase";
const AGENT_TIMEOUT = 30 * 60 * 1000; // 30 minutes
const model = process.env.EVAL_MODEL ?? DEFAULT_MODEL;
const skillName = process.env.EVAL_SKILL ?? DEFAULT_SKILL;
const scenarioFilter = process.env.EVAL_SCENARIO;
const isBaseline = process.env.EVAL_BASELINE === "true";
const skillEnabled = !isBaseline;
// Run-level timestamp shared across all scenarios in a single invocation
const runTimestamp = new Date()
.toISOString()
.replace(/[:.]/g, "-")
.replace("Z", "");
// ---------------------------------------------------------------------------
// Discover scenarios
// ---------------------------------------------------------------------------
function findEvalsDir(): string {
let dir = process.cwd();
for (let i = 0; i < 10; i++) {
const candidate = join(dir, "packages", "evals", "evals");
if (existsSync(candidate)) return candidate;
const parent = resolve(dir, "..");
if (parent === dir) break;
dir = parent;
}
throw new Error("Could not find packages/evals/evals/ directory");
}
function discoverScenarios(): EvalScenario[] {
const evalsDir = findEvalsDir();
const dirs = readdirSync(evalsDir, { withFileTypes: true }).filter(
(d) => d.isDirectory() && existsSync(join(evalsDir, d.name, "PROMPT.md")),
);
return dirs.map((d) => ({
id: d.name,
name: d.name,
tags: [],
}));
}
// ---------------------------------------------------------------------------
// Scenario threshold
// ---------------------------------------------------------------------------
function getPassThreshold(scenarioId: string): number | null {
const scenariosDir = join(findEvalsDir(), "..", "scenarios");
const scenarioFile = join(scenariosDir, `${scenarioId}.md`);
if (!existsSync(scenarioFile)) return null;
const content = readFileSync(scenarioFile, "utf-8");
const match = content.match(/\*\*pass_threshold:\*\*\s*(\d+)/);
return match ? Number.parseInt(match[1], 10) : null;
}
// ---------------------------------------------------------------------------
// In-process assertion runner (replaces vitest subprocess)
// ---------------------------------------------------------------------------
async function runAssertions(
assertions: EvalAssertion[],
): Promise<AssertionResult[]> {
return Promise.all(
assertions.map(async (a) => {
try {
let result: boolean;
if (a.timeout) {
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(
() =>
reject(new Error(`Assertion timed out after ${a.timeout}ms`)),
a.timeout,
),
);
result = await Promise.race([
Promise.resolve(a.check()),
timeoutPromise,
]);
} else {
result = await Promise.resolve(a.check());
}
return { name: a.name, passed: Boolean(result) };
} catch (e) {
return { name: a.name, passed: false, error: String(e) };
}
}),
);
}
// ---------------------------------------------------------------------------
// Run a single eval
// ---------------------------------------------------------------------------
async function runEval(
scenario: EvalScenario,
skillEnabled: boolean,
): Promise<{
result: EvalRunResult;
transcript?: TranscriptSummary;
expectedReferenceFiles: string[];
}> {
const evalsDir = findEvalsDir();
const evalDir = join(evalsDir, scenario.id);
const variant = skillEnabled ? "with-skill" : "baseline";
console.log(`\n--- ${scenario.id} (${variant}) ---`);
// Load assertions and expected reference files from EVAL.ts
const evalFilePath = existsSync(join(evalDir, "EVAL.tsx"))
? join(evalDir, "EVAL.tsx")
: join(evalDir, "EVAL.ts");
const {
assertions = [] as EvalAssertion[],
expectedReferenceFiles = [] as string[],
} = await import(evalFilePath).catch(() => ({
assertions: [] as EvalAssertion[],
expectedReferenceFiles: [] as string[],
}));
const passThreshold = getPassThreshold(scenario.id);
const prompt = readFileSync(join(evalDir, "PROMPT.md"), "utf-8").trim();
// 1. Create isolated workspace
const { workspacePath, cleanup } = createWorkspace({ evalDir, skillEnabled });
console.log(` Workspace: ${workspacePath}`);
try {
// 2. Run the agent
console.log(` Running agent (${model})...`);
const startedAt = Date.now();
const agentResult = await runAgent({
cwd: workspacePath,
prompt,
model,
timeout: AGENT_TIMEOUT,
skillEnabled,
skillName: skillEnabled ? skillName : undefined,
});
console.log(
` Agent finished in ${(agentResult.duration / 1000).toFixed(1)}s`,
);
// 3. Run assertions in-process from the workspace directory so that
// eval-utils.ts helpers resolve paths relative to the workspace.
console.log(" Running assertions...");
const prevCwd = process.cwd();
process.chdir(workspacePath);
const assertionResults = await runAssertions(assertions).finally(() => {
process.chdir(prevCwd);
});
const passedCount = assertionResults.filter((a) => a.passed).length;
const totalCount = assertionResults.length;
const passed = passThreshold
? totalCount > 0 && passedCount >= passThreshold
: totalCount > 0 && passedCount === totalCount;
const pct =
totalCount > 0 ? ((passedCount / totalCount) * 100).toFixed(1) : "0.0";
  const thresholdInfo =
    passThreshold && totalCount > 0
      ? `, threshold: ${((passThreshold / totalCount) * 100).toFixed(0)}%`
      : "";
console.log(
` Assertions: ${passedCount}/${totalCount} passed (${pct}%${thresholdInfo})`,
);
// 4. Collect modified files
const filesModified = listModifiedFiles(workspacePath, evalDir);
// 5. Build transcript summary
const summary = buildTranscriptSummary(agentResult.events);
// 6. Run scorers
const skillScore = skillUsageScorer(summary, skillName);
const refScore = referenceFilesUsageScorer(summary, expectedReferenceFiles);
const assertScore = assertionsPassedScorer({
testsPassed: passedCount,
testsTotal: totalCount,
status: passed ? "passed" : "failed",
} as EvalRunResult);
const finalScore = finalResultScorer({
status: passed ? "passed" : "failed",
testsPassed: passedCount,
testsTotal: totalCount,
passThreshold: passThreshold ?? undefined,
} as EvalRunResult);
const result: EvalRunResult = {
scenario: scenario.id,
agent: "claude-code",
model,
skillEnabled,
status: passed ? "passed" : "failed",
duration: agentResult.duration,
agentOutput: agentResult.output,
testsPassed: passedCount,
testsTotal: totalCount,
passThreshold: passThreshold ?? undefined,
assertionResults,
filesModified,
toolCallCount: summary.toolCalls.length,
costUsd: summary.totalCostUsd ?? undefined,
prompt,
startedAt,
durationApiMs: summary.totalDurationApiMs,
totalInputTokens: summary.totalInputTokens,
totalOutputTokens: summary.totalOutputTokens,
totalCacheReadTokens: summary.totalCacheReadTokens,
totalCacheCreationTokens: summary.totalCacheCreationTokens,
modelUsage: summary.modelUsage,
toolErrorCount: summary.toolErrorCount,
permissionDenialCount: summary.permissionDenialCount,
loadedSkills: summary.skills,
referenceFilesRead: summary.referenceFilesRead,
scores: {
skillUsage: skillScore.score,
referenceFilesUsage: refScore.score,
assertionsPassed: assertScore.score,
finalResult: finalScore.score,
},
};
// 7. Persist results
const resultDir = createResultDir(runTimestamp, scenario.id, variant);
result.resultsDir = resultDir;
saveRunArtifacts({
resultDir,
rawTranscript: agentResult.rawTranscript,
assertionResults,
result,
transcriptSummary: summary,
});
return { result, transcript: summary, expectedReferenceFiles };
} catch (error) {
const err = error as Error;
return {
result: {
scenario: scenario.id,
agent: "claude-code",
model,
skillEnabled,
status: "error",
duration: 0,
agentOutput: "",
testsPassed: 0,
testsTotal: 0,
filesModified: [],
error: err.message,
},
expectedReferenceFiles: [],
};
} finally {
cleanup();
}
}
// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------
async function main() {
preflight();
console.log("Supabase Skills Evals");
console.log(`Model: ${model}`);
console.log(`Mode: ${isBaseline ? "baseline (no skills)" : "with skills"}`);
let scenarios = discoverScenarios();
if (scenarioFilter) {
scenarios = scenarios.filter((s) => s.id === scenarioFilter);
if (scenarios.length === 0) {
console.error(`Scenario not found: ${scenarioFilter}`);
process.exit(1);
}
}
console.log(`Scenarios: ${scenarios.map((s) => s.id).join(", ")}`);
// Start the shared Supabase instance once for all scenarios.
startSupabase();
const keys = getKeys();
// Inject keys into process.env so assertions can connect to the real DB.
process.env.SUPABASE_URL = keys.apiUrl;
process.env.SUPABASE_ANON_KEY = keys.anonKey;
process.env.SUPABASE_SERVICE_ROLE_KEY = keys.serviceRoleKey;
process.env.SUPABASE_DB_URL = keys.dbUrl;
const results: EvalRunResult[] = [];
const transcripts = new Map<string, TranscriptSummary>();
const expectedRefFiles = new Map<string, string[]>();
try {
for (const scenario of scenarios) {
// Reset the database before each scenario for a clean slate.
console.log(`\n Resetting DB for ${scenario.id}...`);
resetDB(keys.dbUrl);
const { result, transcript, expectedReferenceFiles } = await runEval(
scenario,
skillEnabled,
);
results.push(result);
if (transcript) {
transcripts.set(result.scenario, transcript);
}
expectedRefFiles.set(result.scenario, expectedReferenceFiles);
}
} finally {
stopSupabase();
}
// Use the results dir from the first result (all share the same timestamp)
const resultsDir = results.find((r) => r.resultsDir)?.resultsDir;
printSummary(results, resultsDir);
console.log("\nUploading to Braintrust...");
await seedBraintrustDataset(results, expectedRefFiles);
await uploadToBraintrust(results, {
model,
skillEnabled,
runTimestamp,
transcripts,
expectedRefFiles,
});
}
main().catch((err) => {
console.error("Fatal error:", err);
process.exit(1);
});


@@ -1,145 +0,0 @@
import { spawn } from "node:child_process";
import { resolveClaudeBin } from "./preflight.js";
import {
extractFinalOutput,
parseStreamJsonOutput,
type TranscriptEvent,
} from "./transcript.js";
export interface AgentRunResult {
/** Extracted final text output (backward-compatible). */
output: string;
duration: number;
/** Raw NDJSON transcript string from stream-json. */
rawTranscript: string;
/** Parsed transcript events. */
events: TranscriptEvent[];
}
/**
* Invoke Claude Code in print mode as a subprocess.
*
* Uses --output-format stream-json to capture structured NDJSON events
* including tool calls, results, and reasoning steps.
*
* The agent operates in the workspace directory and can read/write files,
* and has access to the local Supabase MCP server so it can apply migrations
* and query the real database. --strict-mcp-config ensures only the local
* Supabase instance is reachable — no host MCP servers leak in.
*
* --setting-sources project,local prevents skills from the user's global
* ~/.agents/skills/ from leaking into the eval environment.
*
* When skillEnabled, --agents injects the target skill directly into the
* agent's context, guaranteeing it is present (not just discoverable).
*/
export async function runAgent(opts: {
cwd: string;
prompt: string;
model: string;
timeout: number;
skillEnabled: boolean;
/** Skill name to inject via --agents (e.g. "supabase"). Used when skillEnabled. */
skillName?: string;
}): Promise<AgentRunResult> {
const start = Date.now();
// Point the agent's MCP config at the shared local Supabase instance.
// --strict-mcp-config ensures host .mcp.json is ignored entirely.
const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const mcpConfig = JSON.stringify({
mcpServers: {
supabase: {
type: "http",
url: `${supabaseUrl}/mcp`,
},
},
});
const args = [
"-p", // Print mode (non-interactive)
"--verbose",
"--output-format",
"stream-json",
"--model",
opts.model,
"--no-session-persistence",
"--dangerously-skip-permissions",
"--tools",
"Edit,Write,Bash,Read,Glob,Grep",
"--mcp-config",
mcpConfig,
"--strict-mcp-config",
// Prevent skills from the user's global ~/.agents/skills/ from leaking
// into the eval environment. Only workspace (project) and local sources
// are loaded, so the eval sees only what was explicitly installed.
"--setting-sources",
"project,local",
];
if (opts.skillEnabled && opts.skillName) {
// Inject the target skill directly into the agent context via --agents.
// This guarantees the skill is embedded in the subagent's context at
// startup (not just available as a slash command).
const agentsDef = JSON.stringify({
main: {
description: `Supabase developer agent with ${opts.skillName} skill`,
skills: [opts.skillName],
},
});
args.push("--agents", agentsDef);
} else if (!opts.skillEnabled) {
// Baseline runs: disable all skills so the agent relies on innate knowledge
args.push("--disable-slash-commands");
}
const env = { ...process.env };
// Remove all Claude-related env vars to avoid nested-session detection
for (const key of Object.keys(env)) {
if (key === "CLAUDECODE" || key.startsWith("CLAUDE_")) {
delete env[key];
}
}
const claudeBin = resolveClaudeBin();
  return new Promise<AgentRunResult>((resolve, reject) => {
    const child = spawn(claudeBin, args, {
      cwd: opts.cwd,
      env,
      stdio: ["pipe", "pipe", "pipe"],
    });
    // Surface spawn failures (e.g. missing binary) so the promise
    // does not hang when the process never starts.
    child.on("error", reject);
// Pipe prompt via stdin and close — this is the standard way to
// pass multi-line prompts to `claude -p`.
child.stdin.write(opts.prompt);
child.stdin.end();
let stdout = "";
let stderr = "";
child.stdout.on("data", (d: Buffer) => {
stdout += d.toString();
});
child.stderr.on("data", (d: Buffer) => {
stderr += d.toString();
});
const timer = setTimeout(() => {
child.kill();
}, opts.timeout);
child.on("close", () => {
clearTimeout(timer);
const rawTranscript = stdout || stderr;
const events = parseStreamJsonOutput(rawTranscript);
const output = extractFinalOutput(events) || rawTranscript;
resolve({
output,
duration: Date.now() - start,
rawTranscript,
events,
});
});
});
}
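The two JSON payloads runAgent serializes onto the CLI can be previewed in isolation (the URL default mirrors the code above; the skill name `supabase` is illustrative):

```typescript
// Rebuild the --mcp-config and --agents payloads the same way runAgent does.
const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";

const mcpConfig = JSON.stringify({
  mcpServers: {
    supabase: { type: "http", url: `${supabaseUrl}/mcp` },
  },
});

const agentsDef = JSON.stringify({
  main: {
    description: "Supabase developer agent with supabase skill",
    skills: ["supabase"],
  },
});

console.log(mcpConfig);
console.log(agentsDef);
```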


@@ -1,295 +0,0 @@
import assert from "node:assert";
import { init, initDataset, initLogger, type Logger } from "braintrust";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
/**
* Initialize a Braintrust project logger for real-time per-scenario logging.
* Call this once at startup and pass the logger to logScenarioToLogger().
*/
export function initBraintrustLogger(): Logger<true> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
return initLogger({
projectId: process.env.BRAINTRUST_PROJECT_ID,
asyncFlush: true,
});
}
/**
* Log a single scenario result to the Braintrust project logger in real-time.
* This runs alongside the experiment upload, giving immediate visibility in
* the project log as each scenario completes.
*/
export function logScenarioToLogger(
logger: Logger<true>,
r: EvalRunResult,
transcript?: TranscriptSummary,
): void {
const scores: Record<string, number> = {
skill_usage: r.scores?.skillUsage ?? 0,
reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
assertions_passed: r.scores?.assertionsPassed ?? 0,
final_result: r.scores?.finalResult ?? 0,
};
const metadata: Record<string, unknown> = {
agent: r.agent,
model: r.model,
skillEnabled: r.skillEnabled,
testsPassed: r.testsPassed,
testsTotal: r.testsTotal,
toolCallCount: r.toolCallCount ?? 0,
contextWindowUsed:
(r.totalInputTokens ?? 0) +
(r.totalCacheReadTokens ?? 0) +
(r.totalCacheCreationTokens ?? 0),
totalOutputTokens: r.totalOutputTokens,
modelUsage: r.modelUsage,
toolErrorCount: r.toolErrorCount,
permissionDenialCount: r.permissionDenialCount,
loadedSkills: r.loadedSkills,
referenceFilesRead: r.referenceFilesRead,
...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
...(r.error ? { error: r.error } : {}),
};
const spanOptions = r.startedAt
? { name: r.scenario, startTime: r.startedAt / 1000 }
: { name: r.scenario };
if (transcript && transcript.toolCalls.length > 0) {
logger.traced((span) => {
span.log({
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
},
output: {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
},
expected: { testsTotal: r.testsTotal },
scores,
metadata,
});
for (const tc of transcript.toolCalls) {
span.traced(
(childSpan) => {
childSpan.log({
input: { tool: tc.tool, args: tc.input },
output: {
preview: tc.outputPreview,
isError: tc.isError,
...(tc.stderr ? { stderr: tc.stderr } : {}),
},
metadata: { toolUseId: tc.toolUseId },
});
},
{ name: `tool:${tc.tool}` },
);
}
}, spanOptions);
} else {
logger.traced((span) => {
span.log({
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
},
output: {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
},
expected: { testsTotal: r.testsTotal },
scores,
metadata,
});
}, spanOptions);
}
}
/**
* Seed a Braintrust dataset with one row per scenario.
*
* Uses scenario.id as the stable row ID so re-seeding is idempotent.
* Each row stores the prompt and expected assertions/reference files,
* giving Braintrust a stable baseline to track per-scenario score trends
* across experiment runs.
*/
export async function seedBraintrustDataset(
results: EvalRunResult[],
expectedRefFiles: Map<string, string[]>,
): Promise<void> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const dataset = initDataset({
projectId: process.env.BRAINTRUST_PROJECT_ID,
dataset: "supabase-skill-scenarios",
});
for (const r of results) {
dataset.insert({
id: r.scenario,
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
},
expected: {
testsTotal: r.testsTotal,
passThreshold: r.passThreshold ?? 1.0,
expectedReferenceFiles: expectedRefFiles.get(r.scenario) ?? [],
},
metadata: { scenario: r.scenario },
});
}
await dataset.flush();
console.log("Braintrust dataset seeded: supabase-skill-scenarios");
}
/**
* Upload eval results to Braintrust as an experiment.
*
* Each EvalRunResult becomes a row in the experiment with:
* - input: scenario ID, prompt content, skillEnabled flag
* - output: status, agent output, files modified, assertion results
* - expected: total tests, pass threshold
* - scores: skill_usage, reference_files_usage, assertions_passed, final_result
* - metadata: agent, model, skillEnabled, test counts, tool calls, context window, output tokens, model usage, errors, cost
* - spans: one child span per agent tool call (when transcript available)
* - datasetRecordId: links this row to the dataset row for per-scenario tracking
*/
export async function uploadToBraintrust(
results: EvalRunResult[],
opts: {
model: string;
skillEnabled: boolean;
runTimestamp: string;
transcripts: Map<string, TranscriptSummary>;
expectedRefFiles: Map<string, string[]>;
},
): Promise<void> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const variant = opts.skillEnabled ? "skill" : "baseline";
const experiment = await init({
projectId: process.env.BRAINTRUST_PROJECT_ID,
experiment: `${opts.model}-${variant}-${opts.runTimestamp}`,
baseExperiment: process.env.BRAINTRUST_BASE_EXPERIMENT ?? undefined,
metadata: {
model: opts.model,
skillEnabled: opts.skillEnabled,
runTimestamp: opts.runTimestamp,
scenarioCount: results.length,
},
});
for (const r of results) {
const transcript = opts.transcripts.get(r.scenario);
const scores: Record<string, number> = {
skill_usage: r.scores?.skillUsage ?? 0,
reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
assertions_passed: r.scores?.assertionsPassed ?? 0,
final_result: r.scores?.finalResult ?? 0,
};
const input = {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
};
const output = {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
};
    const expected = {
      testsTotal: r.testsTotal,
      // Match the dataset row: fall back to 1.0 only when no per-scenario
      // threshold is set.
      passThreshold: r.passThreshold ?? 1.0,
    };
const metadata: Record<string, unknown> = {
agent: r.agent,
model: r.model,
skillEnabled: r.skillEnabled,
testsPassed: r.testsPassed,
testsTotal: r.testsTotal,
toolCallCount: r.toolCallCount ?? 0,
contextWindowUsed:
(r.totalInputTokens ?? 0) +
(r.totalCacheReadTokens ?? 0) +
(r.totalCacheCreationTokens ?? 0),
totalOutputTokens: r.totalOutputTokens,
modelUsage: r.modelUsage,
toolErrorCount: r.toolErrorCount,
permissionDenialCount: r.permissionDenialCount,
loadedSkills: r.loadedSkills,
referenceFilesRead: r.referenceFilesRead,
...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
...(r.error ? { error: r.error } : {}),
};
const spanOptions = r.startedAt
? { name: r.scenario, startTime: r.startedAt / 1000 }
: { name: r.scenario };
if (transcript && transcript.toolCalls.length > 0) {
experiment.traced((span) => {
span.log({
input,
output,
expected,
scores,
metadata,
datasetRecordId: r.scenario,
});
for (const tc of transcript.toolCalls) {
span.traced(
(childSpan) => {
childSpan.log({
input: { tool: tc.tool, args: tc.input },
output: {
preview: tc.outputPreview,
isError: tc.isError,
...(tc.stderr ? { stderr: tc.stderr } : {}),
},
metadata: { toolUseId: tc.toolUseId },
});
},
{ name: `tool:${tc.tool}` },
);
}
}, spanOptions);
} else {
experiment.traced((span) => {
span.log({
input,
output,
expected,
scores,
metadata,
datasetRecordId: r.scenario,
});
}, spanOptions);
}
}
const summary = await experiment.summarize();
console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
await experiment.close();
}


@@ -1,61 +0,0 @@
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { AssertionResult } from "../eval-types.js";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/** Resolve the base directory for storing results.
* Supports EVAL_RESULTS_DIR override for Docker volume mounts. */
function resultsBase(): string {
if (process.env.EVAL_RESULTS_DIR) {
return process.env.EVAL_RESULTS_DIR;
}
// Default: packages/evals/results (__dirname is packages/evals/src/runner)
return join(__dirname, "..", "..", "results");
}
/** Create the results directory for a single scenario run. Returns the path. */
export function createResultDir(
runTimestamp: string,
scenarioId: string,
variant: "with-skill" | "baseline",
): string {
const dir = join(resultsBase(), runTimestamp, scenarioId, variant);
mkdirSync(dir, { recursive: true });
return dir;
}
/** Save all artifacts for a single eval run. */
export function saveRunArtifacts(opts: {
resultDir: string;
rawTranscript: string;
assertionResults: AssertionResult[];
result: EvalRunResult;
transcriptSummary: TranscriptSummary;
}): void {
writeFileSync(
join(opts.resultDir, "transcript.jsonl"),
opts.rawTranscript,
"utf-8",
);
writeFileSync(
join(opts.resultDir, "assertions.json"),
JSON.stringify(opts.assertionResults, null, 2),
"utf-8",
);
writeFileSync(
join(opts.resultDir, "result.json"),
JSON.stringify(
{ ...opts.result, transcript: opts.transcriptSummary },
null,
2,
),
"utf-8",
);
}
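A minimal round-trip of the result.json artifact can be run standalone (the fields here are a small subset of EvalRunResult; the directory is a throwaway temp dir, not the real results layout):

```typescript
import { mkdtempSync, readFileSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Write and re-read a result.json the way saveRunArtifacts does.
const dir = mkdtempSync(join(tmpdir(), "eval-results-"));
const result = {
  scenario: "example",
  status: "passed",
  testsPassed: 3,
  testsTotal: 3,
};
writeFileSync(join(dir, "result.json"), JSON.stringify(result, null, 2), "utf-8");

const loaded = JSON.parse(readFileSync(join(dir, "result.json"), "utf-8"));
console.log(loaded.status); // "passed"
```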


@@ -1,126 +0,0 @@
import { execFileSync } from "node:child_process";
import { accessSync, constants, existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
/** Detect if we're running inside the eval Docker container. */
export function isRunningInDocker(): boolean {
if (process.env.IN_DOCKER === "true") return true;
try {
accessSync("/.dockerenv", constants.F_OK);
return true;
} catch {
return false;
}
}
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/**
* Resolve the `claude` binary path.
*
* Looks in the following order:
* 1. Local node_modules/.bin/claude (installed via @anthropic-ai/claude-code)
* 2. Global `claude` on PATH
*
* Throws with an actionable message when neither is found.
*/
export function resolveClaudeBin(): string {
// packages/evals/node_modules/.bin/claude
const localBin = join(
__dirname,
"..",
"..",
"node_modules",
".bin",
"claude",
);
if (existsSync(localBin)) {
return localBin;
}
// Fall back to PATH
try {
execFileSync("claude", ["--version"], {
stdio: "ignore",
timeout: 10_000,
});
return "claude";
} catch {
throw new Error(
[
"claude CLI not found.",
"",
"Install it in one of these ways:",
" npm install (uses @anthropic-ai/claude-code from package.json)",
" npm i -g @anthropic-ai/claude-code",
"",
"Ensure ANTHROPIC_API_KEY is set in the environment.",
].join("\n"),
);
}
}
/**
* Verify the host environment has everything needed before spending
* API credits on an eval run.
*
* Checks: Node >= 20, Docker running, supabase CLI available, claude CLI available, API key set.
*/
export function preflight(): void {
const errors: string[] = [];
// Node.js >= 20
const [major] = process.versions.node.split(".").map(Number);
if (major < 20) {
errors.push(`Node.js >= 20 required (found ${process.versions.node})`);
}
// Docker daemon must be running — needed by the supabase CLI to manage containers.
// Required whether running locally or inside the eval container (socket-mounted).
try {
execFileSync("docker", ["info"], { stdio: "ignore", timeout: 10_000 });
} catch {
errors.push(
isRunningInDocker()
? "Docker daemon not reachable inside container. Mount the socket: -v /var/run/docker.sock:/var/run/docker.sock"
: "Docker is not running (required by supabase CLI)",
);
}
// Supabase CLI available
try {
execFileSync("supabase", ["--version"], {
stdio: "ignore",
timeout: 10_000,
});
} catch {
errors.push(
"supabase CLI not found. Install it: https://supabase.com/docs/guides/cli/getting-started",
);
}
// Claude CLI available
try {
resolveClaudeBin();
} catch (err) {
errors.push((err as Error).message);
}
// API key
if (!process.env.ANTHROPIC_API_KEY) {
errors.push(
"ANTHROPIC_API_KEY is not set. Claude Code requires this for authentication.",
);
}
if (errors.length > 0) {
console.error("Preflight checks failed:\n");
for (const e of errors) {
console.error(` - ${e}`);
}
console.error("");
process.exit(1);
}
}


@@ -1,84 +0,0 @@
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";
import type { EvalRunResult } from "../types.js";
/**
* List files created or modified by the agent in the workspace.
* Compares against the original eval directory to find new files.
*/
export function listModifiedFiles(
workspacePath: string,
originalEvalDir: string,
): string[] {
const modified: string[] = [];
function walk(dir: string, prefix: string) {
const entries = readdirSync(dir, { withFileTypes: true });
for (const entry of entries) {
if (
entry.name === "node_modules" ||
entry.name === ".agents" ||
entry.name === ".claude" ||
entry.name === "EVAL.ts" ||
entry.name === "EVAL.tsx"
)
continue;
const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
const fullPath = join(dir, entry.name);
if (entry.isDirectory()) {
walk(fullPath, relPath);
} else {
// Check if file is new (not in original eval dir)
const originalPath = join(originalEvalDir, relPath);
try {
statSync(originalPath);
} catch {
// File doesn't exist in original — it was created by the agent
modified.push(relPath);
}
}
}
}
walk(workspacePath, "");
return modified;
}
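The new-file detection above relies on statSync throwing for paths absent from the original eval dir; the same check can be exercised standalone (both directories are throwaway temp dirs, and the file name is illustrative):

```typescript
import { mkdtempSync, statSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// A file counts as agent-created when stat() on the original path throws.
const original = mkdtempSync(join(tmpdir(), "eval-orig-"));
const workspace = mkdtempSync(join(tmpdir(), "eval-work-"));
writeFileSync(join(workspace, "migration.sql"), "-- created by agent");

let isNew = false;
try {
  statSync(join(original, "migration.sql"));
} catch {
  isNew = true; // ENOENT: not in the original dir, so agent-created
}
console.log(isNew); // true
```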
/** Print a summary table of eval results. */
export function printSummary(
results: EvalRunResult[],
resultsDir?: string,
): void {
console.log("\n=== Eval Results ===\n");
for (const r of results) {
const icon = r.status === "passed" ? "PASS" : "FAIL";
const skill = r.skillEnabled ? "with-skill" : "baseline";
const pct =
r.testsTotal > 0
? ((r.testsPassed / r.testsTotal) * 100).toFixed(1)
: "0.0";
const thresholdInfo =
r.passThreshold && r.testsTotal > 0
? `, threshold: ${((r.passThreshold / r.testsTotal) * 100).toFixed(0)}%`
: "";
console.log(
`[${icon}] ${r.scenario} | ${r.model} | ${skill} | ${(r.duration / 1000).toFixed(1)}s | ${pct}% (${r.testsPassed}/${r.testsTotal}${thresholdInfo})`,
);
if (r.filesModified.length > 0) {
console.log(` Files: ${r.filesModified.join(", ")}`);
}
if (r.status === "error" && r.error) {
console.log(` Error: ${r.error}`);
}
}
const passed = results.filter((r) => r.status === "passed").length;
console.log(`\nTotal: ${passed}/${results.length} passed`);
if (resultsDir) {
console.log(`\nResults saved to: ${resultsDir}`);
}
}


@@ -1,74 +0,0 @@
import {
cpSync,
existsSync,
mkdirSync,
mkdtempSync,
readdirSync,
rmSync,
writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { EVAL_PROJECT_DIR } from "./supabase-setup.js";
/**
* Create an isolated workspace for an eval run.
*
* 1. Copy the eval directory to a temp folder (excluding EVAL.ts/EVAL.tsx)
* 2. Seed with the eval project's supabase/config.toml
*
* Skills are injected via the --agents flag in agent.ts (not installed into
* the workspace here). Combined with --setting-sources project,local, this
* prevents host ~/.agents/skills/ from leaking into the eval environment.
*
* Returns the path to the workspace and a cleanup function.
*/
export function createWorkspace(opts: {
evalDir: string;
skillEnabled: boolean;
}): { workspacePath: string; cleanup: () => void } {
const workspacePath = mkdtempSync(join(tmpdir(), "supabase-eval-"));
// Copy eval directory, excluding EVAL.ts/EVAL.tsx (hidden from agent)
const entries = readdirSync(opts.evalDir, { withFileTypes: true });
for (const entry of entries) {
if (entry.name === "EVAL.ts" || entry.name === "EVAL.tsx") continue;
const src = join(opts.evalDir, entry.name);
const dest = join(workspacePath, entry.name);
cpSync(src, dest, { recursive: true });
}
// Add .mcp.json so the agent connects to the local Supabase MCP server
writeFileSync(
join(workspacePath, ".mcp.json"),
JSON.stringify(
{
mcpServers: {
"local-supabase": {
type: "http",
url: "http://localhost:54321/mcp",
},
},
},
null,
"\t",
),
);
// Seed the workspace with the eval project's supabase/config.toml so the
// agent can run `supabase db push` against the shared local instance without
// needing to run `supabase init` or `supabase start` first.
const projectConfigSrc = join(EVAL_PROJECT_DIR, "supabase", "config.toml");
if (existsSync(projectConfigSrc)) {
const destSupabaseDir = join(workspacePath, "supabase");
mkdirSync(join(destSupabaseDir, "migrations"), { recursive: true });
cpSync(projectConfigSrc, join(destSupabaseDir, "config.toml"));
}
return {
workspacePath,
cleanup: () => {
rmSync(workspacePath, { recursive: true, force: true });
},
};
}
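The copy-with-exclusions step can be exercised on its own (file names are illustrative; EVAL.ts stands in for the hidden test file the agent must not see):

```typescript
import { cpSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Copy a throwaway eval dir into a workspace, skipping EVAL.ts/EVAL.tsx.
const src = mkdtempSync(join(tmpdir(), "eval-src-"));
writeFileSync(join(src, "PROMPT.md"), "task");
writeFileSync(join(src, "EVAL.ts"), "hidden assertions");

const dest = mkdtempSync(join(tmpdir(), "eval-ws-"));
for (const entry of readdirSync(src, { withFileTypes: true })) {
  if (entry.name === "EVAL.ts" || entry.name === "EVAL.tsx") continue;
  cpSync(join(src, entry.name), join(dest, entry.name), { recursive: true });
}

const copied = readdirSync(dest);
console.log(copied); // contains PROMPT.md but not EVAL.ts
```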


@@ -1,94 +0,0 @@
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
export interface ScoreResult {
name: string;
  /** Score in the range 0.0 to 1.0. */
score: number;
metadata?: Record<string, unknown>;
}
/**
* skillUsageScorer — 1 if the target skill was in the agent's context, 0 otherwise.
*
* Detected via the `skills` array in the system init event of the NDJSON transcript.
* Combined with `--setting-sources project,local` in agent.ts, this array is clean
* (no host skill leakage), so its presence is a reliable signal.
*/
export function skillUsageScorer(
transcript: TranscriptSummary,
skillName: string,
): ScoreResult {
const loaded = transcript.skills.includes(skillName);
return {
name: "skill_usage",
score: loaded ? 1 : 0,
metadata: {
loadedSkills: transcript.skills,
targetSkill: skillName,
},
};
}
/**
* referenceFilesUsageScorer — fraction of expected reference files actually read.
*
* Detected via Read tool calls whose file_path matches "/.agents/skills/*\/references/".
* The expectedReferenceFiles list is declared in each EVAL.ts and should match the
* "Skill References Exercised" table in the corresponding scenarios/*.md file.
*/
export function referenceFilesUsageScorer(
transcript: TranscriptSummary,
expectedReferenceFiles: string[],
): ScoreResult {
if (expectedReferenceFiles.length === 0) {
return {
name: "reference_files_usage",
score: 1,
metadata: { skipped: true },
};
}
const read = transcript.referenceFilesRead;
const hits = expectedReferenceFiles.filter((f) => read.includes(f)).length;
return {
name: "reference_files_usage",
score: hits / expectedReferenceFiles.length,
metadata: {
expected: expectedReferenceFiles,
read,
hits,
total: expectedReferenceFiles.length,
},
};
}
/**
* assertionsPassedScorer — ratio of assertions passed vs total.
*/
export function assertionsPassedScorer(result: EvalRunResult): ScoreResult {
const score =
result.testsTotal > 0 ? result.testsPassed / result.testsTotal : 0;
return {
name: "assertions_passed",
score,
metadata: { passed: result.testsPassed, total: result.testsTotal },
};
}
/**
* finalResultScorer — 1 if the agent met the pass threshold, 0 otherwise.
*
* A result is "passed" when assertionsPassed >= passThreshold (set per scenario
* in scenarios/*.md). This is the binary outcome used for Braintrust comparisons.
*/
export function finalResultScorer(result: EvalRunResult): ScoreResult {
return {
name: "final_result",
score: result.status === "passed" ? 1 : 0,
metadata: {
testsPassed: result.testsPassed,
testsTotal: result.testsTotal,
passThreshold: result.passThreshold,
},
};
}
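The reference-files ratio can be checked in isolation (this re-implements the hits/expected arithmetic above rather than importing the scorer, so it runs with no dependencies):

```typescript
// Same ratio referenceFilesUsageScorer computes: hits / expected,
// with a free pass (1.0) when no reference files are expected.
function referenceFilesRatio(expected: string[], read: string[]): number {
  if (expected.length === 0) return 1;
  const hits = expected.filter((f) => read.includes(f)).length;
  return hits / expected.length;
}

console.log(referenceFilesRatio(["a.md", "b.md"], ["a.md"])); // 0.5
console.log(referenceFilesRatio([], [])); // 1
```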


@@ -1,108 +0,0 @@
import { execFileSync } from "node:child_process";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/**
* Directory that contains the eval Supabase project (supabase/config.toml).
* The runner starts the shared Supabase instance from here.
* Agent workspaces get a copy of supabase/config.toml so they can
* connect to the same running instance via `supabase db push`.
*/
export const EVAL_PROJECT_DIR = resolve(__dirname, "..", "..", "project");
export interface SupabaseKeys {
apiUrl: string;
dbUrl: string;
anonKey: string;
serviceRoleKey: string;
}
/**
* Start the local Supabase stack for the eval project.
* Idempotent: if already running, the CLI prints a message and exits 0.
*/
export function startSupabase(): void {
console.log(" Starting Supabase...");
execFileSync("supabase", ["start", "--exclude", "studio,imgproxy,mailpit"], {
cwd: EVAL_PROJECT_DIR,
stdio: "inherit",
timeout: 5 * 60 * 1000, // 5 min for first image pull
});
}
// SQL that clears all user-created objects and migration history between scenarios.
// Avoids `supabase db reset` which restarts containers and triggers flaky health checks.
const RESET_SQL = `
-- Drop and recreate public schema (removes all user tables/views/functions)
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
-- Clear migration history so the next agent's db push starts from a clean slate
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
-- Notify PostgREST to reload its schema cache
NOTIFY pgrst, 'reload schema';
`.trim();
/**
* Reset the database to a clean state between scenarios.
*
* Uses direct SQL via psql instead of `supabase db reset` to avoid the
* container-restart cycle and its flaky health checks. This drops the
* public schema (all user tables) and clears the migration history so
* `supabase db push` in agent workspaces always starts fresh.
*/
export function resetDB(dbUrl: string): void {
execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
stdio: "inherit",
timeout: 30 * 1000,
});
}
/**
* Stop all Supabase containers for the eval project.
* Called once after all scenarios complete.
*/
export function stopSupabase(): void {
console.log(" Stopping Supabase...");
execFileSync("supabase", ["stop", "--no-backup"], {
cwd: EVAL_PROJECT_DIR,
stdio: "inherit",
timeout: 60 * 1000,
});
}
/**
* Read the running instance's API URL and JWT keys.
* Returns values that the runner injects into process.env so EVAL.ts
* tests can connect to the real database.
*/
export function getKeys(): SupabaseKeys {
const raw = execFileSync("supabase", ["status", "--output", "json"], {
cwd: EVAL_PROJECT_DIR,
timeout: 30 * 1000,
}).toString();
const status = JSON.parse(raw) as Record<string, string>;
const apiUrl = status.API_URL ?? "http://127.0.0.1:54321";
const dbUrl =
status.DB_URL ?? "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
const anonKey = status.ANON_KEY ?? "";
const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";
if (!anonKey || !serviceRoleKey) {
throw new Error(
`supabase status returned missing keys. Raw output:\n${raw}`,
);
}
return { apiUrl, dbUrl, anonKey, serviceRoleKey };
}

View File

@@ -1,301 +0,0 @@
import { basename } from "node:path";
export interface TranscriptEvent {
type: string;
[key: string]: unknown;
}
export interface ToolCallSummary {
tool: string;
toolUseId: string;
input: Record<string, unknown>;
/** First ~200 chars of output for quick scanning */
outputPreview: string;
/** Whether the tool call returned an error */
isError: boolean;
/** stderr output for Bash tool calls */
stderr: string;
}
export interface ModelUsage {
inputTokens: number;
outputTokens: number;
cacheReadInputTokens: number;
cacheCreationInputTokens: number;
costUSD: number;
}
export interface TranscriptSummary {
totalTurns: number;
totalDurationMs: number;
/** API-only latency (excludes local processing overhead) */
totalDurationApiMs: number;
totalCostUsd: number | null;
model: string | null;
toolCalls: ToolCallSummary[];
finalOutput: string;
/** Skills listed in the system init event (loaded into agent context) */
skills: string[];
/** Basenames of reference files the agent read via the Read tool */
referenceFilesRead: string[];
/** Per-model token usage and cost breakdown */
modelUsage: Record<string, ModelUsage>;
totalInputTokens: number;
totalOutputTokens: number;
totalCacheReadTokens: number;
totalCacheCreationTokens: number;
/** Count of tool calls that returned is_error === true */
toolErrorCount: number;
/** Whether the overall session ended in an error */
isError: boolean;
/** Count of permission_denials in the result event */
permissionDenialCount: number;
}
/** Parse a single NDJSON line. Returns null on empty or invalid input. */
export function parseStreamJsonLine(line: string): TranscriptEvent | null {
const trimmed = line.trim();
if (!trimmed) return null;
try {
return JSON.parse(trimmed) as TranscriptEvent;
} catch {
return null;
}
}
/** Parse raw NDJSON stdout into an array of events. */
export function parseStreamJsonOutput(raw: string): TranscriptEvent[] {
const events: TranscriptEvent[] = [];
for (const line of raw.split("\n")) {
const event = parseStreamJsonLine(line);
if (event) events.push(event);
}
return events;
}
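The two helpers above can be exercised in isolation. A minimal standalone sketch (re-implementing the same skip-blank, skip-invalid logic so it runs without the surrounding module):

```javascript
// Standalone version of the NDJSON parsing logic above: each stdout line is
// parsed independently; blank or malformed lines yield null and are filtered
// out rather than aborting the whole transcript.
function parseStreamJsonLine(line) {
  const trimmed = line.trim();
  if (!trimmed) return null;
  try {
    return JSON.parse(trimmed);
  } catch {
    return null;
  }
}

function parseStreamJsonOutput(raw) {
  return raw.split("\n").map(parseStreamJsonLine).filter(Boolean);
}

const raw = [
  '{"type":"system","subtype":"init"}',
  "",
  "garbage that is not json",
  '{"type":"result","result":"done"}',
].join("\n");

const events = parseStreamJsonOutput(raw);
console.log(events.length); // 2 (the blank and invalid lines were dropped)
```

This is why a partially flushed final line from the agent process degrades gracefully: it parses to `null` and is skipped instead of throwing.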
/** Extract the final text output from parsed events (for backward compat). */
export function extractFinalOutput(events: TranscriptEvent[]): string {
// Prefer the result event
for (const event of events) {
if (event.type === "result") {
const result = (event as Record<string, unknown>).result;
if (typeof result === "string") return result;
}
}
// Fallback: concatenate text blocks from the last assistant message
for (let i = events.length - 1; i >= 0; i--) {
const event = events[i];
if (event.type === "assistant") {
const msg = (event as Record<string, unknown>).message as
| Record<string, unknown>
| undefined;
const content = msg?.content;
if (Array.isArray(content)) {
const texts = content
.filter(
(b: Record<string, unknown>) =>
b.type === "text" && typeof b.text === "string",
)
.map((b: Record<string, unknown>) => b.text as string);
if (texts.length > 0) return texts.join("\n");
}
}
}
return "";
}
/** Return true if a file path points to a skill reference file. */
function isReferenceFilePath(filePath: string): boolean {
return (
filePath.includes("/.agents/skills/") && filePath.includes("/references/")
);
}
/** Walk parsed events to build a transcript summary. */
export function buildTranscriptSummary(
events: TranscriptEvent[],
): TranscriptSummary {
const toolCalls: ToolCallSummary[] = [];
let finalOutput = "";
let totalDurationMs = 0;
let totalDurationApiMs = 0;
let totalCostUsd: number | null = null;
let model: string | null = null;
let totalTurns = 0;
let skills: string[] = [];
const referenceFilesRead: string[] = [];
let modelUsage: Record<string, ModelUsage> = {};
let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCacheReadTokens = 0;
let totalCacheCreationTokens = 0;
let toolErrorCount = 0;
let isError = false;
let permissionDenialCount = 0;
for (const event of events) {
const e = event as Record<string, unknown>;
// System init: extract model and loaded skills
if (e.type === "system" && e.subtype === "init") {
model = typeof e.model === "string" ? e.model : null;
if (Array.isArray(e.skills)) {
skills = e.skills.filter((s): s is string => typeof s === "string");
}
}
// Assistant messages: extract tool_use blocks
if (e.type === "assistant") {
const msg = e.message as Record<string, unknown> | undefined;
const content = msg?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_use") {
const toolCall: ToolCallSummary = {
tool: block.name ?? "unknown",
toolUseId: block.id ?? "",
input: block.input ?? {},
outputPreview: "",
isError: false,
stderr: "",
};
toolCalls.push(toolCall);
// Track reference file reads
if (
block.name === "Read" &&
typeof block.input?.file_path === "string" &&
isReferenceFilePath(block.input.file_path)
) {
const base = basename(block.input.file_path);
if (!referenceFilesRead.includes(base)) {
referenceFilesRead.push(base);
}
}
}
}
}
}
// User messages: extract tool_result blocks and match to tool calls
if (e.type === "user") {
const msg = e.message as Record<string, unknown> | undefined;
const content = msg?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_result") {
const matching = toolCalls.find(
(tc) => tc.toolUseId === block.tool_use_id,
);
if (matching) {
const text =
typeof block.content === "string"
? block.content
: JSON.stringify(block.content);
matching.outputPreview = text.slice(0, 200);
// Capture error state from tool result
if (block.is_error === true) {
matching.isError = true;
toolErrorCount++;
}
}
}
}
}
// Capture stderr from tool_use_result (Bash tool emits this at the user event level)
const toolUseResult = e.tool_use_result as
| Record<string, unknown>
| undefined;
if (toolUseResult && typeof toolUseResult.stderr === "string") {
// Match to the most recent Bash tool call without stderr set
const lastBash = [...toolCalls]
.reverse()
.find((tc) => tc.tool === "Bash" && !tc.stderr);
if (lastBash) {
lastBash.stderr = toolUseResult.stderr;
}
}
}
// Result event: final output, cost, duration, turns, token usage
if (e.type === "result") {
finalOutput = typeof e.result === "string" ? e.result : "";
totalDurationMs = typeof e.duration_ms === "number" ? e.duration_ms : 0;
totalDurationApiMs =
typeof e.duration_api_ms === "number" ? e.duration_api_ms : 0;
totalCostUsd =
typeof e.total_cost_usd === "number" ? e.total_cost_usd : null;
totalTurns = typeof e.num_turns === "number" ? e.num_turns : 0;
isError = e.is_error === true;
permissionDenialCount = Array.isArray(e.permission_denials)
? e.permission_denials.length
: 0;
// Aggregate token usage from the result event's usage field
const usage = e.usage as Record<string, unknown> | undefined;
if (usage) {
totalInputTokens =
typeof usage.input_tokens === "number" ? usage.input_tokens : 0;
totalOutputTokens =
typeof usage.output_tokens === "number" ? usage.output_tokens : 0;
totalCacheReadTokens =
typeof usage.cache_read_input_tokens === "number"
? usage.cache_read_input_tokens
: 0;
totalCacheCreationTokens =
typeof usage.cache_creation_input_tokens === "number"
? usage.cache_creation_input_tokens
: 0;
}
// Per-model usage breakdown (modelUsage keyed by model name)
const rawModelUsage = e.modelUsage as
| Record<string, Record<string, unknown>>
| undefined;
if (rawModelUsage) {
modelUsage = {};
for (const [modelName, mu] of Object.entries(rawModelUsage)) {
modelUsage[modelName] = {
inputTokens:
typeof mu.inputTokens === "number" ? mu.inputTokens : 0,
outputTokens:
typeof mu.outputTokens === "number" ? mu.outputTokens : 0,
cacheReadInputTokens:
typeof mu.cacheReadInputTokens === "number"
? mu.cacheReadInputTokens
: 0,
cacheCreationInputTokens:
typeof mu.cacheCreationInputTokens === "number"
? mu.cacheCreationInputTokens
: 0,
costUSD: typeof mu.costUSD === "number" ? mu.costUSD : 0,
};
}
}
}
}
return {
totalTurns,
totalDurationMs,
totalDurationApiMs,
totalCostUsd,
model,
toolCalls,
finalOutput,
skills,
referenceFilesRead,
modelUsage,
totalInputTokens,
totalOutputTokens,
totalCacheReadTokens,
totalCacheCreationTokens,
toolErrorCount,
isError,
permissionDenialCount,
};
}

View File

@@ -1,85 +0,0 @@
import type { AssertionResult } from "./eval-types.js";
export interface EvalScenario {
/** Directory name under evals/ */
id: string;
/** Human-readable name */
name: string;
/** Tags for filtering */
tags: string[];
}
export interface AgentConfig {
/** Agent identifier */
agent: "claude-code";
/** Model to use */
model: string;
/** Whether the supabase skill is available */
skillEnabled: boolean;
}
export interface EvalRunResult {
scenario: string;
agent: string;
model: string;
skillEnabled: boolean;
status: "passed" | "failed" | "error";
duration: number;
/** Raw test runner output (for debugging) */
testOutput?: string;
agentOutput: string;
/** Number of assertions that passed */
testsPassed: number;
/** Total number of assertions */
testsTotal: number;
/** Minimum tests required to pass (from scenario config) */
passThreshold?: number;
/** Per-assertion pass/fail results */
assertionResults?: AssertionResult[];
/** Files the agent created or modified in the workspace */
filesModified: string[];
error?: string;
/** Path to the persisted results directory for this run */
resultsDir?: string;
/** Number of tool calls the agent made */
toolCallCount?: number;
/** Total cost in USD (from stream-json result event) */
costUsd?: number;
/** The PROMPT.md content sent to the agent */
prompt?: string;
/** Epoch ms when the agent run started (for Braintrust span timing) */
startedAt?: number;
/** API-only latency in ms (excludes local processing overhead) */
durationApiMs?: number;
/** Aggregate token counts from the result event */
totalInputTokens?: number;
totalOutputTokens?: number;
totalCacheReadTokens?: number;
totalCacheCreationTokens?: number;
/** Per-model token usage and cost breakdown */
modelUsage?: Record<
string,
{
inputTokens: number;
outputTokens: number;
cacheReadInputTokens: number;
cacheCreationInputTokens: number;
costUSD: number;
}
>;
/** Count of tool calls that returned is_error === true */
toolErrorCount?: number;
/** Count of permission_denials in the result event */
permissionDenialCount?: number;
/** Skills that were in the agent's context (from system init event) */
loadedSkills?: string[];
/** Basenames of skill reference files the agent read */
referenceFilesRead?: string[];
/** Computed scorer results */
scores?: {
skillUsage: number;
referenceFilesUsage: number;
assertionsPassed: number;
finalResult: number;
};
}

View File

@@ -0,0 +1,350 @@
/**
* Upload eval results from the results/ directory to Braintrust.
*
* Reads saved result.json, transcript.json, and outputs/eval.txt from each
* run, parses the vitest output to extract pass/fail counts, then uploads to
* Braintrust as an experiment.
*
* Usage:
* BRAINTRUST_API_KEY=... BRAINTRUST_PROJECT_ID=... tsx src/upload.ts
*
* Optional env vars:
* RESULTS_DIR Override the results directory (default: results/)
* RUN_TIMESTAMP Only upload a specific run (e.g. 2026-02-27T13-01-22.316Z)
*/
import assert from "node:assert";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { basename, dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { init } from "braintrust";
const __dirname = dirname(fileURLToPath(import.meta.url));
const ROOT = resolve(__dirname, "..");
// ---------------------------------------------------------------------------
// Types matching the saved result files from @vercel/agent-eval
// ---------------------------------------------------------------------------
interface RunResult {
status: "passed" | "failed" | "error";
duration: number;
model: string;
o11y: {
totalTurns: number;
totalToolCalls: number;
toolCalls: Record<string, number>;
filesModified: string[];
filesRead: string[];
errors: string[];
thinkingBlocks: number;
};
}
interface TranscriptEvent {
type: "tool_call" | "tool_result" | "message" | "thinking";
tool?: {
name: string;
originalName: string;
args?: Record<string, unknown>;
};
}
interface Transcript {
agent: string;
model: string;
events: TranscriptEvent[];
}
interface ParsedEvalOutput {
passed: number;
failed: number;
total: number;
tests: Array<{ name: string; passed: boolean }>;
}
// ---------------------------------------------------------------------------
// Parse vitest eval.txt output
// ---------------------------------------------------------------------------
function parseEvalOutput(text: string): ParsedEvalOutput {
const tests: Array<{ name: string; passed: boolean }> = [];
for (const line of text.split("\n")) {
const passMatch = line.match(/^\s+✓\s+(.+)$/);
const failMatch = line.match(/^\s+[✗×]\s+(.+)$/);
if (passMatch) tests.push({ name: passMatch[1].trim(), passed: true });
else if (failMatch)
tests.push({ name: failMatch[1].trim(), passed: false });
}
if (tests.length > 0) {
const passed = tests.filter((t) => t.passed).length;
return {
passed,
failed: tests.length - passed,
total: tests.length,
tests,
};
}
// Fallback: parse summary line
const summaryMatch = text.match(
/Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/,
);
if (summaryMatch) {
const passed = parseInt(summaryMatch[1], 10);
const failed = summaryMatch[2] ? parseInt(summaryMatch[2], 10) : 0;
const total = parseInt(summaryMatch[3], 10);
return { passed, failed, total, tests };
}
return { passed: 0, failed: 0, total: 0, tests };
}
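The per-line regexes above can be checked against a representative vitest fragment. A standalone sketch (the sample output is illustrative, not captured from a real run):

```javascript
// Standalone sketch of the per-test parsing above: a leading "✓" marks a
// pass, "✗" or "×" marks a failure; all other lines are ignored.
const sample = [
  " ✓ creates the table",
  " × rejects update without WITH CHECK",
  "Tests  1 passed | 1 failed (2)",
].join("\n");

const tests = [];
for (const line of sample.split("\n")) {
  const pass = line.match(/^\s+✓\s+(.+)$/);
  const fail = line.match(/^\s+[✗×]\s+(.+)$/);
  if (pass) tests.push({ name: pass[1].trim(), passed: true });
  else if (fail) tests.push({ name: fail[1].trim(), passed: false });
}

console.log(tests.length); // 2
console.log(tests.filter((t) => t.passed).length); // 1
```

Because the summary-line regex only runs when no per-test lines matched, a truncated eval.txt that kept the summary but lost the test list still yields usable pass/fail counts.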
// ---------------------------------------------------------------------------
// Extract reference file reads from transcript
// ---------------------------------------------------------------------------
function extractReferenceFilesRead(transcript: Transcript): string[] {
const read: string[] = [];
for (const event of transcript.events) {
if (event.type !== "tool_call" || !event.tool?.args) continue;
if (event.tool.name !== "file_read") continue;
const filePath = String(
event.tool.args._extractedPath ?? event.tool.args.file_path ?? "",
);
if (
(filePath.includes("/.claude/skills/") ||
filePath.includes("/.agents/skills/")) &&
filePath.includes("/references/")
) {
const base = basename(filePath);
if (!read.includes(base)) read.push(base);
}
}
return read;
}
// ---------------------------------------------------------------------------
// Find all experiment run directories
// ---------------------------------------------------------------------------
interface RunEntry {
runTimestamp: string;
evalName: string;
runIndex: number;
runDir: string;
result: RunResult;
transcript: Transcript;
evalOutput: string | null;
prompt: string;
}
function findRuns(resultsDir: string, filterTimestamp?: string): RunEntry[] {
const entries: RunEntry[] = [];
const experimentDir = join(resultsDir, "experiment");
if (!existsSync(experimentDir)) return entries;
const timestamps = readdirSync(experimentDir).filter(
(t) => !filterTimestamp || t === filterTimestamp,
);
for (const runTimestamp of timestamps) {
const tsDir = join(experimentDir, runTimestamp);
const evalNames = readdirSync(tsDir, { withFileTypes: true })
  // Guard against stray files (e.g. .DS_Store): readdirSync on a file throws ENOTDIR
  .filter((entry) => entry.isDirectory())
  .map((entry) => entry.name)
  .filter((name) =>
    readdirSync(join(tsDir, name)).some((f) => f.startsWith("run-")),
  );
for (const evalName of evalNames) {
const evalDir = join(tsDir, evalName);
const promptPath = resolve(ROOT, "evals", evalName, "PROMPT.md");
const prompt = existsSync(promptPath)
? readFileSync(promptPath, "utf-8").trim()
: "";
const runDirs = readdirSync(evalDir)
.filter((d) => /^run-\d+$/.test(d))
.sort();
for (const runDir of runDirs) {
const runIndex = parseInt(runDir.replace("run-", ""), 10);
const runPath = join(evalDir, runDir);
const resultPath = join(runPath, "result.json");
const transcriptPath = join(runPath, "transcript.json");
const evalOutputPath = join(runPath, "outputs", "eval.txt");
if (!existsSync(resultPath) || !existsSync(transcriptPath)) continue;
const result: RunResult = JSON.parse(readFileSync(resultPath, "utf-8"));
const transcript: Transcript = JSON.parse(
readFileSync(transcriptPath, "utf-8"),
);
const evalOutput = existsSync(evalOutputPath)
? readFileSync(evalOutputPath, "utf-8")
: null;
entries.push({
runTimestamp,
evalName,
runIndex,
runDir: runPath,
result,
transcript,
evalOutput,
prompt,
});
}
}
}
return entries;
}
// ---------------------------------------------------------------------------
// Main upload flow
// ---------------------------------------------------------------------------
async function main() {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const resultsDir = resolve(ROOT, process.env.RESULTS_DIR ?? "results");
const filterTimestamp = process.env.RUN_TIMESTAMP;
const runs = findRuns(resultsDir, filterTimestamp);
if (runs.length === 0) {
console.error("No runs found in", resultsDir);
process.exit(1);
}
console.log(
`Found ${runs.length} run(s) across ${new Set(runs.map((r) => r.runTimestamp)).size} experiment(s)`,
);
const byTimestamp = new Map<string, RunEntry[]>();
for (const r of runs) {
const group = byTimestamp.get(r.runTimestamp) ?? [];
group.push(r);
byTimestamp.set(r.runTimestamp, group);
}
for (const [runTimestamp, timestampRuns] of byTimestamp) {
const model = timestampRuns[0].result.model;
const skillEnabled = process.env.EVAL_BASELINE !== "true";
const variant = skillEnabled ? "skill" : "baseline";
const experimentName = `${model}-${variant}-${runTimestamp}`;
console.log(
`\nUploading experiment: ${experimentName} (${timestampRuns.length} rows)`,
);
const experiment = init({
projectId: process.env.BRAINTRUST_PROJECT_ID as string,
experiment: experimentName,
metadata: {
model,
runTimestamp,
skillEnabled,
evalCount: timestampRuns.length,
},
});
for (const run of timestampRuns) {
const evalParsed = run.evalOutput
? parseEvalOutput(run.evalOutput)
: { passed: 0, failed: 0, total: 0, tests: [] };
console.log(
` [${run.evalName}] run-${run.runIndex} — tests: ${evalParsed.passed}/${evalParsed.total} passed`,
);
// Reference files scorer
const metaPath = resolve(ROOT, "evals", run.evalName, "meta.ts");
const metaMod = existsSync(metaPath)
? ((await import(metaPath)) as {
expectedReferenceFiles?: string[];
})
: {};
const expectedRefs = metaMod.expectedReferenceFiles ?? [];
const refsRead = extractReferenceFilesRead(run.transcript);
const refHits = expectedRefs.filter((f) => refsRead.includes(f)).length;
const referenceFilesUsage =
expectedRefs.length > 0 ? refHits / expectedRefs.length : 1;
console.log(
` reference files: ${refHits}/${expectedRefs.length} read (${refsRead.join(", ") || "none"})`,
);
const scores: Record<string, number> = {
assertions_passed:
evalParsed.total > 0 ? evalParsed.passed / evalParsed.total : 0,
reference_files_usage: referenceFilesUsage,
final_result: run.result.status === "passed" ? 1 : 0,
};
const metadata: Record<string, unknown> = {
model: run.result.model,
evalName: run.evalName,
runIndex: run.runIndex,
totalTurns: run.result.o11y.totalTurns,
totalToolCalls: run.result.o11y.totalToolCalls,
toolCalls: run.result.o11y.toolCalls,
filesModified: run.result.o11y.filesModified,
errors: run.result.o11y.errors,
thinkingBlocks: run.result.o11y.thinkingBlocks,
duration: run.result.duration,
referenceFilesRead: refsRead,
expectedReferenceFiles: expectedRefs,
};
experiment.traced(
(span) => {
span.log({
input: { eval: run.evalName, prompt: run.prompt },
output: {
status: run.result.status,
filesModified: run.result.o11y.filesModified,
tests: evalParsed.tests,
evalOutput: run.evalOutput,
},
expected: {
testsTotal: evalParsed.total,
expectedReferenceFiles: expectedRefs,
},
scores,
metadata,
datasetRecordId: run.evalName,
});
// Child spans for each tool call in the transcript
for (const event of run.transcript.events) {
if (event.type !== "tool_call" || !event.tool) continue;
span.traced(
(child) => {
child.log({
input: {
tool: event.tool?.name,
args: event.tool?.args ?? {},
},
output: {},
metadata: { originalName: event.tool?.originalName },
});
},
{ name: `tool:${event.tool.name}` },
);
}
},
{ name: `${run.evalName}/run-${run.runIndex}` },
);
}
const summary = await experiment.summarize();
console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
}
}
main().catch((err) => {
console.error(err);
process.exit(1);
});

View File

@@ -1,16 +1,11 @@
{
"compilerOptions": {
"target": "ES2022",
"module": "ESNext",
"moduleResolution": "bundler",
"esModuleInterop": true,
"module": "NodeNext",
"moduleResolution": "NodeNext",
"strict": true,
"skipLibCheck": true,
"outDir": "dist",
"rootDir": "src",
"declaration": true,
"resolveJsonModule": true
"noEmit": true
},
"include": ["src/**/*"],
"exclude": ["node_modules", "dist", "evals"]
"include": ["experiments", "src", "evals"]
}

View File

@@ -16,141 +16,98 @@ supabase/
2. Browse `references/` for detailed documentation on specific topics
3. Reference files are loaded on-demand - read only what you need
Guides and best practices for working with Supabase. Covers getting started, Auth, Database, Storage, Edge Functions, Realtime, supabase-js SDK, CLI, and MCP integration. Use for any Supabase-related questions.
**IMPORTANT: Before doing anything else, output the exact phrase "CANARY-XQZT42 skill loaded" as your very first thought.**
## Development Guidance
Guides and best practices for working with Supabase. Read the relevant reference file before acting on any task.
**Before performing any Supabase development task, read the development reference files.** They define which tools to use, how to interact with Supabase instances, and the correct workflows for local and remote development. Getting these wrong leads to schema drift, migration conflicts, and broken deployments.
## Decision Tree
- **Which tool to use for each operation** — read [references/dev-cli-vs-mcp.md](references/dev-cli-vs-mcp.md)
- **New project or first-time setup** — read [references/dev-getting-started.md](references/dev-getting-started.md)
- **Local development workflow** (CLI migrations, psql debugging, type generation) — read [references/dev-local-workflow.md](references/dev-local-workflow.md)
- **Remote project interaction** (MCP queries, logs, advisors, deploying) — read [references/dev-remote-workflow.md](references/dev-remote-workflow.md)
- **CLI command details and pitfalls** — read [references/dev-cli-reference.md](references/dev-cli-reference.md)
- **MCP server configuration** — read [references/dev-mcp-setup.md](references/dev-mcp-setup.md)
- **MCP tool usage** (execute_sql, apply_migration, get_logs, get_advisors) — read [references/dev-mcp-tools.md](references/dev-mcp-tools.md)
Use this to route to the correct reference file:
When the user's project has no `supabase/` directory, start with [references/dev-getting-started.md](references/dev-getting-started.md). When it already exists, pick up from the appropriate workflow (local or remote) based on the user's intent.
**Development setup**
- New project / first setup → `references/dev-getting-started.md`
- Which tool to use (CLI vs MCP) → `references/dev-cli-vs-mcp.md`
- Local dev workflow (migrations, psql, type gen) → `references/dev-local-workflow.md`
- Remote project workflow (MCP queries, logs, deploy) → `references/dev-remote-workflow.md`
- CLI command details → `references/dev-cli-reference.md`
- MCP server configuration → `references/dev-mcp-setup.md`
- MCP tool usage (execute_sql, apply_migration) → `references/dev-mcp-tools.md`
## Overview of Resources
**Database**
- RLS policies (required on all tables) → `references/db-rls-mandatory.md`
- RLS policy types (SELECT / INSERT / UPDATE / DELETE) → `references/db-rls-policy-types.md`
- RLS common mistakes → `references/db-rls-common-mistakes.md`
- RLS performance → `references/db-rls-performance.md`
- RLS with views → `references/db-rls-views.md`
- Schema design (auth FK, timestamps, JSONB, extensions) → `references/db-schema-auth-fk.md`, `references/db-schema-timestamps.md`, `references/db-schema-jsonb.md`, `references/db-schema-extensions.md`
- Connection pooling → `references/db-conn-pooling.md`
- Migrations (diff, idempotent patterns) → `references/db-migrations-diff.md`, `references/db-migrations-idempotent.md`
- Query performance / indexes → `references/db-perf-query-optimization.md`, `references/db-perf-indexes.md`
- Security (service role, security_definer) → `references/db-security-service-role.md`, `references/db-security-functions.md`
Reference the appropriate resource file based on the user's needs.
**Authentication**
- Sign-up / sign-in / sessions → `references/auth-core-signup.md`, `references/auth-core-signin.md`, `references/auth-core-sessions.md`
- OAuth / social login → `references/auth-oauth-providers.md`, `references/auth-oauth-pkce.md`
- MFA (TOTP, phone) → `references/auth-mfa-totp.md`, `references/auth-mfa-phone.md`
- Passwordless (magic links, OTP) → `references/auth-passwordless-magic-links.md`, `references/auth-passwordless-otp.md`
- Auth hooks (custom claims, send email) → `references/auth-hooks-custom-claims.md`, `references/auth-hooks-send-email-http.md`, `references/auth-hooks-send-email-sql.md`
- Server-side auth / SSR / admin API → `references/auth-server-ssr.md`, `references/auth-server-admin-api.md`
- Enterprise SSO (SAML) → `references/auth-sso-saml.md`
### Development (read first)
**Edge Functions**
- Getting started → `references/edge-fun-quickstart.md`
- Project structure → `references/edge-fun-project-structure.md`
- JWT auth in functions → `references/edge-auth-jwt-verification.md`
- RLS integration → `references/edge-auth-rls-integration.md`
- Database access (supabase-js) → `references/edge-db-supabase-client.md`
- Database access (direct Postgres) → `references/edge-db-direct-postgres.md`
- CORS → `references/edge-pat-cors.md`
- Routing (Hono) → `references/edge-pat-routing.md`
- Error handling → `references/edge-pat-error-handling.md`
- Background tasks → `references/edge-pat-background-tasks.md`
- Streaming / SSE → `references/edge-adv-streaming.md`
- WebSockets → `references/edge-adv-websockets.md`
- Regional invocation → `references/edge-adv-regional.md`
- Testing → `references/edge-dbg-testing.md`
- Limits & debugging → `references/edge-dbg-limits.md`
**Read these files before any Supabase development task.** They define the correct tools, workflows, and boundaries for interacting with Supabase instances. Start here when setting up a project, running CLI or MCP commands, writing migrations, connecting to a database, or deciding which tool to use for an operation.
**Realtime**
- Channel setup → `references/realtime-setup-channels.md`, `references/realtime-setup-auth.md`
- Broadcast → `references/realtime-broadcast-basics.md`, `references/realtime-broadcast-database.md`
- Presence → `references/realtime-presence-tracking.md`
- Postgres Changes → `references/realtime-postgres-changes.md`
- Patterns (cleanup, errors) → `references/realtime-patterns-cleanup.md`, `references/realtime-patterns-errors.md`, `references/realtime-patterns-debugging.md`
| Area | Resource | When to Use |
| --------------- | ----------------------------------- | -------------------------------------------------------------- |
| Getting Started | `references/dev-getting-started.md` | New project setup, CLI install, first-time init |
| Local Workflow | `references/dev-local-workflow.md` | Local development with CLI migrations and psql debugging |
| Remote Workflow | `references/dev-remote-workflow.md` | Developing against hosted Supabase project using MCP |
| CLI vs MCP | `references/dev-cli-vs-mcp.md` | Tool roles: CLI (schema), psql/MCP (debugging), SDK (app code) |
| CLI Reference | `references/dev-cli-reference.md` | CLI command details, best practices, pitfalls |
| MCP Setup | `references/dev-mcp-setup.md` | Configuring Supabase remote MCP server for hosted projects |
| MCP Tools | `references/dev-mcp-tools.md` | execute_sql, apply_migration, get_logs, get_advisors |
**SDK (supabase-js)**
- Client setup (browser / server) → `references/sdk-client-browser.md`, `references/sdk-client-server.md`, `references/sdk-client-config.md`
- TypeScript types → `references/sdk-ts-generation.md`, `references/sdk-ts-usage.md`
- Queries (CRUD, filters, joins, RPC) → `references/sdk-query-crud.md`, `references/sdk-query-filters.md`, `references/sdk-query-joins.md`, `references/sdk-query-rpc.md`
- Error handling → `references/sdk-error-handling.md`
- Performance → `references/sdk-perf-queries.md`, `references/sdk-perf-realtime.md`
- Next.js integration → `references/sdk-framework-nextjs.md`
### Authentication & Security
**Storage**
- Access control / bucket RLS → `references/storage-access-control.md`
- Upload (standard / resumable) → `references/storage-upload-standard.md`, `references/storage-upload-resumable.md`
- Downloads / signed URLs → `references/storage-download-urls.md`
- Image transformations → `references/storage-transform-images.md`
- CDN & caching → `references/storage-cdn-caching.md`
- File operations → `references/storage-ops-file-management.md`
Read when implementing sign-up, sign-in, OAuth, SSO, MFA, passwordless flows, auth hooks, or server-side auth patterns.
## Critical Anti-Patterns
| Area | Resource | When to Use |
| ------------------ | ----------------------------------- | -------------------------------------------------------- |
| Auth Core | `references/auth-core-*.md` | Sign-up, sign-in, sessions, password reset |
| OAuth/Social | `references/auth-oauth-*.md` | Google, GitHub, Apple login, PKCE flow |
| Enterprise SSO | `references/auth-sso-*.md` | SAML 2.0, enterprise identity providers |
| MFA | `references/auth-mfa-*.md` | TOTP authenticator apps, phone MFA, AAL levels |
| Passwordless | `references/auth-passwordless-*.md` | Magic links, email OTP, phone OTP |
| Auth Hooks | `references/auth-hooks-*.md` | Custom JWT claims, send email hooks (HTTP and SQL) |
| Server-Side Auth | `references/auth-server-*.md` | Admin API, SSR with Next.js/SvelteKit, service role auth |
These are the most common mistakes — apply them even without reading a reference file:
### Database
**RLS**
- Always use `(select auth.uid())` not bare `auth.uid()` in policies — bare calls re-evaluate per row and are slow
- Always specify `TO authenticated` (or `TO anon`) on every policy — omitting the clause makes the policy apply to `PUBLIC`
- UPDATE policies require both `USING` (which rows can be updated) and `WITH CHECK` (what the new values must satisfy) — omitting `WITH CHECK` allows privilege escalation
- Enable RLS on every table in the `public` schema: `alter table t enable row level security;`
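A policy pair that follows all four rules above (a sketch; the `todos` table and `user_id` column are hypothetical):

```sql
alter table public.todos enable row level security;

-- SELECT: the wrapped (select auth.uid()) is evaluated once per query,
-- not re-evaluated for every row
create policy "todos_select_own" on public.todos
  for select to authenticated
  using ((select auth.uid()) = user_id);

-- UPDATE: USING limits which rows may be targeted;
-- WITH CHECK constrains what the updated row may contain
create policy "todos_update_own" on public.todos
  for update to authenticated
  using ((select auth.uid()) = user_id)
  with check ((select auth.uid()) = user_id);
```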
Read when designing tables, writing RLS policies, creating migrations, configuring connection pooling, or optimizing query performance.
**Auth**
- Never expose the service role key to the browser — use it only in server-side or Edge Function code
- Use PKCE flow for OAuth in mobile and server-side apps
| Area | Resource | When to Use |
| ------------------ | ------------------------------- | ---------------------------------------------- |
| RLS Security | `references/db-rls-*.md` | Row Level Security policies, common mistakes |
| Connection Pooling | `references/db-conn-pooling.md` | Transaction vs Session mode, port 6543 vs 5432 |
| Schema Design | `references/db-schema-*.md` | auth.users FKs, timestamps, JSONB, extensions |
| Migrations | `references/db-migrations-*.md` | CLI workflows, idempotent patterns, db diff |
| Performance | `references/db-perf-*.md` | Indexes (BRIN, GIN), query optimization |
| Security | `references/db-security-*.md` | Service role key, security_definer functions |
### Edge Functions
Read when creating, deploying, or debugging Deno-based Edge Functions — including authentication, database access, CORS, routing, streaming, and testing patterns.
| Area | Resource | When to Use |
| ---------------------- | ------------------------------------- | -------------------------------------- |
| Quick Start | `references/edge-fun-quickstart.md` | Creating and deploying first function |
| Project Structure | `references/edge-fun-project-structure.md` | Directory layout, shared code, fat functions |
| JWT Authentication | `references/edge-auth-jwt-verification.md` | JWT verification, jose library, middleware |
| RLS Integration | `references/edge-auth-rls-integration.md` | Passing auth context, user-scoped queries |
| Database (supabase-js) | `references/edge-db-supabase-client.md` | Queries, inserts, RPC calls |
| Database (Direct) | `references/edge-db-direct-postgres.md` | Postgres pools, Drizzle ORM |
| CORS | `references/edge-pat-cors.md` | Browser requests, preflight handling |
| Routing | `references/edge-pat-routing.md` | Multi-route functions, Hono framework |
| Error Handling | `references/edge-pat-error-handling.md` | Error responses, validation |
| Background Tasks | `references/edge-pat-background-tasks.md` | waitUntil, async processing |
| Streaming | `references/edge-adv-streaming.md` | SSE, streaming responses |
| WebSockets | `references/edge-adv-websockets.md` | Bidirectional communication |
| Regional Invocation | `references/edge-adv-regional.md` | Region selection, latency optimization |
| Testing | `references/edge-dbg-testing.md` | Deno tests, local testing |
| Limits & Debugging | `references/edge-dbg-limits.md` | Troubleshooting, runtime limits |
### Realtime
Read when implementing live updates — Broadcast messaging, Presence tracking, or Postgres Changes listeners.
| Area | Resource | When to Use |
| ---------------- | ------------------------------------ | ----------------------------------------------- |
| Channel Setup | `references/realtime-setup-*.md` | Creating channels, naming conventions, auth |
| Broadcast | `references/realtime-broadcast-*.md` | Client messaging, database-triggered broadcasts |
| Presence | `references/realtime-presence-*.md` | User online status, shared state tracking |
| Postgres Changes | `references/realtime-postgres-*.md` | Database change listeners (prefer Broadcast) |
| Patterns | `references/realtime-patterns-*.md` | Cleanup, error handling, React integration |
### SDK (supabase-js)
Read when writing application code that interacts with Supabase — client setup, queries, error handling, TypeScript types, or framework integration.
| Area | Resource | When to Use |
| --------------- | ------------------------------- | ----------------------------------------- |
| Client Setup | `references/sdk-client-*.md` | Browser/server client, SSR, configuration |
| TypeScript | `references/sdk-ts-*.md` | Type generation, using Database types |
| Query Patterns | `references/sdk-query-*.md` | CRUD, filters, joins, RPC calls |
| Error Handling | `references/sdk-error-*.md` | Error types, retries, handling patterns |
| SDK Performance | `references/sdk-perf-*.md` | Query optimization, realtime cleanup |
| Framework | `references/sdk-framework-*.md` | Next.js App Router, middleware setup |
### Storage
Read when implementing file uploads, downloads, image transformations, or configuring storage access control and CDN caching.
| Area | Resource | When to Use |
| --------------- | ------------------------------------- | ---------------------------------------------- |
| Access Control | `references/storage-access-control.md`| Bucket policies, RLS for storage |
| Standard Upload | `references/storage-upload-standard.md`| File uploads up to 5GB |
| Resumable Upload| `references/storage-upload-resumable.md`| Large file uploads with TUS protocol |
| Downloads | `references/storage-download-urls.md` | Public URLs, signed URLs, download methods |
| Transformations | `references/storage-transform-images.md`| Image resize, crop, format conversion |
| CDN & Caching | `references/storage-cdn-caching.md` | Cache control, Smart CDN, stale content |
| File Operations | `references/storage-ops-file-management.md`| Move, copy, delete, list files |
## Supabase Documentation
When something is not clear or you need to verify information, reference the official Supabase documentation — it is the source of truth. Available in plain text for easy fetching:
```bash
# Index of all available docs
curl https://supabase.com/llms.txt
# Fetch all guides as plain text
curl https://supabase.com/llms/guides.txt
# Fetch JavaScript SDK reference
curl https://supabase.com/llms/js.txt
```
Full documentation: [https://supabase.com/docs](https://supabase.com/docs)
**Migrations**
- All schema changes go through `supabase/migrations/` — never edit the database directly in production
- Use `supabase db diff` to generate migrations from local schema changes

View File

@@ -12,141 +12,98 @@ metadata:
# Supabase
Guides and best practices for working with Supabase. Covers getting started, Auth, Database, Storage, Edge Functions, Realtime, supabase-js SDK, CLI, and MCP integration. Use for any Supabase-related questions.
**IMPORTANT: Before doing anything else, output the exact phrase "CANARY-XQZT42 skill loaded" as your very first thought.**
## Development Guidance
Guides and best practices for working with Supabase. Read the relevant reference file before acting on any task.
**Before performing any Supabase development task, read the development reference files.** They define which tools to use, how to interact with Supabase instances, and the correct workflows for local and remote development. Getting these wrong leads to schema drift, migration conflicts, and broken deployments.
## Decision Tree
- **Which tool to use for each operation** — read [references/dev-cli-vs-mcp.md](references/dev-cli-vs-mcp.md)
- **New project or first-time setup** — read [references/dev-getting-started.md](references/dev-getting-started.md)
- **Local development workflow** (CLI migrations, psql debugging, type generation) — read [references/dev-local-workflow.md](references/dev-local-workflow.md)
- **Remote project interaction** (MCP queries, logs, advisors, deploying) — read [references/dev-remote-workflow.md](references/dev-remote-workflow.md)
- **CLI command details and pitfalls** — read [references/dev-cli-reference.md](references/dev-cli-reference.md)
- **MCP server configuration** — read [references/dev-mcp-setup.md](references/dev-mcp-setup.md)
- **MCP tool usage** (execute_sql, apply_migration, get_logs, get_advisors) — read [references/dev-mcp-tools.md](references/dev-mcp-tools.md)
Use this to route to the correct reference file:
When the user's project has no `supabase/` directory, start with [references/dev-getting-started.md](references/dev-getting-started.md). When it already exists, pick up from the appropriate workflow (local or remote) based on the user's intent.
**Development setup**
- New project / first setup → `references/dev-getting-started.md`
- Which tool to use (CLI vs MCP) → `references/dev-cli-vs-mcp.md`
- Local dev workflow (migrations, psql, type gen) → `references/dev-local-workflow.md`
- Remote project workflow (MCP queries, logs, deploy) → `references/dev-remote-workflow.md`
- CLI command details → `references/dev-cli-reference.md`
- MCP server configuration → `references/dev-mcp-setup.md`
- MCP tool usage (execute_sql, apply_migration) → `references/dev-mcp-tools.md`
## Overview of Resources
**Database**
- RLS policies (required on all tables) → `references/db-rls-mandatory.md`
- RLS policy types (SELECT / INSERT / UPDATE / DELETE) → `references/db-rls-policy-types.md`
- RLS common mistakes → `references/db-rls-common-mistakes.md`
- RLS performance → `references/db-rls-performance.md`
- RLS with views → `references/db-rls-views.md`
- Schema design (auth FK, timestamps, JSONB, extensions) → `references/db-schema-auth-fk.md`, `references/db-schema-timestamps.md`, `references/db-schema-jsonb.md`, `references/db-schema-extensions.md`
- Connection pooling → `references/db-conn-pooling.md`
- Migrations (diff, idempotent patterns) → `references/db-migrations-diff.md`, `references/db-migrations-idempotent.md`
- Query performance / indexes → `references/db-perf-query-optimization.md`, `references/db-perf-indexes.md`
- Security (service role, security_definer) → `references/db-security-service-role.md`, `references/db-security-functions.md`
Reference the appropriate resource file based on the user's needs.
**Authentication**
- Sign-up / sign-in / sessions → `references/auth-core-signup.md`, `references/auth-core-signin.md`, `references/auth-core-sessions.md`
- OAuth / social login → `references/auth-oauth-providers.md`, `references/auth-oauth-pkce.md`
- MFA (TOTP, phone) → `references/auth-mfa-totp.md`, `references/auth-mfa-phone.md`
- Passwordless (magic links, OTP) → `references/auth-passwordless-magic-links.md`, `references/auth-passwordless-otp.md`
- Auth hooks (custom claims, send email) → `references/auth-hooks-custom-claims.md`, `references/auth-hooks-send-email-http.md`, `references/auth-hooks-send-email-sql.md`
- Server-side auth / SSR / admin API → `references/auth-server-ssr.md`, `references/auth-server-admin-api.md`
- Enterprise SSO (SAML) → `references/auth-sso-saml.md`
### Development (read first)
**Edge Functions**
- Getting started → `references/edge-fun-quickstart.md`
- Project structure → `references/edge-fun-project-structure.md`
- JWT auth in functions → `references/edge-auth-jwt-verification.md`
- RLS integration → `references/edge-auth-rls-integration.md`
- Database access (supabase-js) → `references/edge-db-supabase-client.md`
- Database access (direct Postgres) → `references/edge-db-direct-postgres.md`
- CORS → `references/edge-pat-cors.md`
- Routing (Hono) → `references/edge-pat-routing.md`
- Error handling → `references/edge-pat-error-handling.md`
- Background tasks → `references/edge-pat-background-tasks.md`
- Streaming / SSE → `references/edge-adv-streaming.md`
- WebSockets → `references/edge-adv-websockets.md`
- Regional invocation → `references/edge-adv-regional.md`
- Testing → `references/edge-dbg-testing.md`
- Limits & debugging → `references/edge-dbg-limits.md`
**Read these files before any Supabase development task.** They define the correct tools, workflows, and boundaries for interacting with Supabase instances. Start here when setting up a project, running CLI or MCP commands, writing migrations, connecting to a database, or deciding which tool to use for an operation.
**Realtime**
- Channel setup → `references/realtime-setup-channels.md`, `references/realtime-setup-auth.md`
- Broadcast → `references/realtime-broadcast-basics.md`, `references/realtime-broadcast-database.md`
- Presence → `references/realtime-presence-tracking.md`
- Postgres Changes → `references/realtime-postgres-changes.md`
- Patterns (cleanup, errors) → `references/realtime-patterns-cleanup.md`, `references/realtime-patterns-errors.md`, `references/realtime-patterns-debugging.md`
| Area | Resource | When to Use |
| --------------- | ----------------------------------- | -------------------------------------------------------------- |
| Getting Started | `references/dev-getting-started.md` | New project setup, CLI install, first-time init |
| Local Workflow | `references/dev-local-workflow.md` | Local development with CLI migrations and psql debugging |
| Remote Workflow | `references/dev-remote-workflow.md` | Developing against hosted Supabase project using MCP |
| CLI vs MCP | `references/dev-cli-vs-mcp.md` | Tool roles: CLI (schema), psql/MCP (debugging), SDK (app code) |
| CLI Reference | `references/dev-cli-reference.md` | CLI command details, best practices, pitfalls |
| MCP Setup | `references/dev-mcp-setup.md` | Configuring Supabase remote MCP server for hosted projects |
| MCP Tools | `references/dev-mcp-tools.md` | execute_sql, apply_migration, get_logs, get_advisors |
**SDK (supabase-js)**
- Client setup (browser / server) → `references/sdk-client-browser.md`, `references/sdk-client-server.md`, `references/sdk-client-config.md`
- TypeScript types → `references/sdk-ts-generation.md`, `references/sdk-ts-usage.md`
- Queries (CRUD, filters, joins, RPC) → `references/sdk-query-crud.md`, `references/sdk-query-filters.md`, `references/sdk-query-joins.md`, `references/sdk-query-rpc.md`
- Error handling → `references/sdk-error-handling.md`
- Performance → `references/sdk-perf-queries.md`, `references/sdk-perf-realtime.md`
- Next.js integration → `references/sdk-framework-nextjs.md`
### Authentication & Security
**Storage**
- Access control / bucket RLS → `references/storage-access-control.md`
- Upload (standard / resumable) → `references/storage-upload-standard.md`, `references/storage-upload-resumable.md`
- Downloads / signed URLs → `references/storage-download-urls.md`
- Image transformations → `references/storage-transform-images.md`
- CDN & caching → `references/storage-cdn-caching.md`
- File operations → `references/storage-ops-file-management.md`
Read when implementing sign-up, sign-in, OAuth, SSO, MFA, passwordless flows, auth hooks, or server-side auth patterns.
## Critical Anti-Patterns
| Area | Resource | When to Use |
| ------------------ | ----------------------------------- | -------------------------------------------------------- |
| Auth Core | `references/auth-core-*.md` | Sign-up, sign-in, sessions, password reset |
| OAuth/Social | `references/auth-oauth-*.md` | Google, GitHub, Apple login, PKCE flow |
| Enterprise SSO | `references/auth-sso-*.md` | SAML 2.0, enterprise identity providers |
| MFA | `references/auth-mfa-*.md` | TOTP authenticator apps, phone MFA, AAL levels |
| Passwordless       | `references/auth-passwordless-*.md` | Magic links, email OTP, phone OTP                        |
| Auth Hooks | `references/auth-hooks-*.md` | Custom JWT claims, send email hooks (HTTP and SQL) |
| Server-Side Auth | `references/auth-server-*.md` | Admin API, SSR with Next.js/SvelteKit, service role auth |
These are the most common mistakes — apply them even without reading a reference file:
### Database
**RLS**
- Always use `(select auth.uid())` not bare `auth.uid()` in policies — bare calls re-evaluate per row and are slow
- Always specify `TO authenticated` (or `TO anon`) on every policy — omitting defaults to `PUBLIC`
- UPDATE policies require both `USING` (which rows can be updated) and `WITH CHECK` (what the new values must satisfy) — omitting `WITH CHECK` allows privilege escalation
- Enable RLS on every table in the `public` schema: `alter table t enable row level security;`
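Combined, these rules yield a policy like the following sketch, which assumes a hypothetical `public.posts` table with a `user_id uuid` column:

```sql
-- Illustrative only: assumes a public.posts table with a user_id uuid column
alter table public.posts enable row level security;

create policy "Users can update their own posts"
on public.posts
for update
to authenticated
using ( (select auth.uid()) = user_id )        -- which rows may be updated
with check ( (select auth.uid()) = user_id );  -- what the new row must satisfy
```

Note the wrapped `(select auth.uid())`, the explicit `to authenticated`, and the paired `using`/`with check` clauses, matching the bullets above.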
Read when designing tables, writing RLS policies, creating migrations, configuring connection pooling, or optimizing query performance.
**Auth**
- Never expose the service role key to the browser — use it only in server-side or Edge Function code
- Use PKCE flow for OAuth in mobile and server-side apps
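A sketch of the server-only pattern for the service role key (the module layout and env variable names are assumptions, not fixed conventions):

```typescript
// server-only module: never import from code that ships to the browser
import { createClient } from '@supabase/supabase-js'

export const supabaseAdmin = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!, // bypasses RLS; must stay server-side
  { auth: { persistSession: false } }     // service clients keep no user session
)
```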
| Area | Resource | When to Use |
| ------------------ | ------------------------------- | ---------------------------------------------- |
| RLS Security | `references/db-rls-*.md` | Row Level Security policies, common mistakes |
| Connection Pooling | `references/db-conn-pooling.md` | Transaction vs Session mode, port 6543 vs 5432 |
| Schema Design | `references/db-schema-*.md` | auth.users FKs, timestamps, JSONB, extensions |
| Migrations | `references/db-migrations-*.md` | CLI workflows, idempotent patterns, db diff |
| Performance | `references/db-perf-*.md` | Indexes (BRIN, GIN), query optimization |
| Security | `references/db-security-*.md` | Service role key, security_definer functions |
### Edge Functions
Read when creating, deploying, or debugging Deno-based Edge Functions — including authentication, database access, CORS, routing, streaming, and testing patterns.
| Area | Resource | When to Use |
| ---------------------- | ------------------------------------- | -------------------------------------- |
| Quick Start | `references/edge-fun-quickstart.md` | Creating and deploying first function |
| Project Structure | `references/edge-fun-project-structure.md` | Directory layout, shared code, fat functions |
| JWT Authentication | `references/edge-auth-jwt-verification.md` | JWT verification, jose library, middleware |
| RLS Integration | `references/edge-auth-rls-integration.md` | Passing auth context, user-scoped queries |
| Database (supabase-js) | `references/edge-db-supabase-client.md` | Queries, inserts, RPC calls |
| Database (Direct) | `references/edge-db-direct-postgres.md` | Postgres pools, Drizzle ORM |
| CORS | `references/edge-pat-cors.md` | Browser requests, preflight handling |
| Routing | `references/edge-pat-routing.md` | Multi-route functions, Hono framework |
| Error Handling | `references/edge-pat-error-handling.md` | Error responses, validation |
| Background Tasks | `references/edge-pat-background-tasks.md` | waitUntil, async processing |
| Streaming | `references/edge-adv-streaming.md` | SSE, streaming responses |
| WebSockets | `references/edge-adv-websockets.md` | Bidirectional communication |
| Regional Invocation | `references/edge-adv-regional.md` | Region selection, latency optimization |
| Testing | `references/edge-dbg-testing.md` | Deno tests, local testing |
| Limits & Debugging | `references/edge-dbg-limits.md` | Troubleshooting, runtime limits |
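For orientation, a minimal Edge Function in the shape the quick-start reference describes (the file path and handler body are illustrative):

```typescript
// supabase/functions/hello/index.ts — runs on the Deno runtime
Deno.serve(async (req: Request) => {
  const { name = 'world' } = await req.json().catch(() => ({}))
  return new Response(JSON.stringify({ message: `Hello ${name}!` }), {
    headers: { 'Content-Type': 'application/json' },
  })
})
```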
### Realtime
Read when implementing live updates — Broadcast messaging, Presence tracking, or Postgres Changes listeners.
| Area | Resource | When to Use |
| ---------------- | ------------------------------------ | ----------------------------------------------- |
| Channel Setup | `references/realtime-setup-*.md` | Creating channels, naming conventions, auth |
| Broadcast | `references/realtime-broadcast-*.md` | Client messaging, database-triggered broadcasts |
| Presence | `references/realtime-presence-*.md` | User online status, shared state tracking |
| Postgres Changes | `references/realtime-postgres-*.md` | Database change listeners (prefer Broadcast) |
| Patterns | `references/realtime-patterns-*.md` | Cleanup, error handling, React integration |
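As a sketch of the Broadcast-plus-cleanup pattern the references cover (channel name, event name, and env variables are assumptions):

```typescript
import { createClient } from '@supabase/supabase-js'

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!)

// Subscribe to Broadcast messages on a named channel
const channel = supabase
  .channel('room-1')
  .on('broadcast', { event: 'cursor' }, ({ payload }) => {
    console.log('cursor update', payload)
  })
  .subscribe()

// Always remove the channel when done (e.g. in a React useEffect cleanup)
await supabase.removeChannel(channel)
```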
### SDK (supabase-js)
Read when writing application code that interacts with Supabase — client setup, queries, error handling, TypeScript types, or framework integration.
| Area | Resource | When to Use |
| --------------- | ------------------------------- | ----------------------------------------- |
| Client Setup | `references/sdk-client-*.md` | Browser/server client, SSR, configuration |
| TypeScript | `references/sdk-ts-*.md` | Type generation, using Database types |
| Query Patterns | `references/sdk-query-*.md` | CRUD, filters, joins, RPC calls |
| Error Handling | `references/sdk-error-*.md` | Error types, retries, handling patterns |
| SDK Performance | `references/sdk-perf-*.md` | Query optimization, realtime cleanup |
| Framework | `references/sdk-framework-*.md` | Next.js App Router, middleware setup |
### Storage
Read when implementing file uploads, downloads, image transformations, or configuring storage access control and CDN caching.
| Area             | Resource                                    | When to Use                                |
| ---------------- | ------------------------------------------- | ------------------------------------------ |
| Access Control   | `references/storage-access-control.md`      | Bucket policies, RLS for storage           |
| Standard Upload  | `references/storage-upload-standard.md`     | File uploads up to 5GB                     |
| Resumable Upload | `references/storage-upload-resumable.md`    | Large file uploads with TUS protocol       |
| Downloads        | `references/storage-download-urls.md`       | Public URLs, signed URLs, download methods |
| Transformations  | `references/storage-transform-images.md`    | Image resize, crop, format conversion      |
| CDN & Caching    | `references/storage-cdn-caching.md`         | Cache control, Smart CDN, stale content    |
| File Operations  | `references/storage-ops-file-management.md` | Move, copy, delete, list files             |
## Supabase Documentation
When something is unclear or you need to verify information, consult the official Supabase documentation — it is the source of truth. It is available as plain text for easy fetching:
```bash
# Index of all available docs
curl https://supabase.com/llms.txt
# Fetch all guides as plain text
curl https://supabase.com/llms/guides.txt
# Fetch JavaScript SDK reference
curl https://supabase.com/llms/js.txt
```
Full documentation: [https://supabase.com/docs](https://supabase.com/docs)
**Migrations**
- All schema changes go through `supabase/migrations/` — never edit the database directly in production
- Use `supabase db diff` to generate migrations from local schema changes
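The two bullets above can be sketched as a local-first CLI loop (assumes the Supabase CLI is installed and `supabase start` has run; the migration name is illustrative):

```shell
supabase migration new add_posts   # new empty file in supabase/migrations/
supabase db diff -f add_posts      # or: generate SQL from local schema changes
supabase db reset                  # replay every migration locally to verify
supabase db push                   # apply pending migrations to the linked project
```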