mirror of
https://github.com/supabase/agent-skills.git
synced 2026-03-27 10:09:26 +08:00
use agent-evals package
@@ -1,57 +1,56 @@
 # Evals — Agent Guide
 
 This package evaluates whether AI agents correctly implement Supabase tasks
-when using skill documentation. Modeled after
-[Vercel's next-evals-oss](https://github.com/vercel-labs/next-evals-oss): each
-eval is a self-contained project with a task prompt, the agent works on it, and
-hidden tests check the result. Binary pass/fail.
+when using skill documentation. Built on
+[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
+self-contained scenario with a task prompt, the agent works in a Docker sandbox,
+and hidden vitest assertions check the result. Binary pass/fail.
 
 ## Architecture
 
 ```
-1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
-2. Install skills via `skills add` CLI (or skip for baseline)
-3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
-4. Agent reads skill, creates migrations/code in the workspace
-5. Copy hidden EVAL.ts into workspace, run vitest
-6. Capture pass/fail
+1. eval.sh starts Supabase, exports keys
+2. agent-eval reads experiments/experiment.ts
+3. For each scenario:
+   a. setup() resets DB, writes config + skills into Docker sandbox
+   b. Agent (Claude Code) runs PROMPT.md in the sandbox
+   c. EVAL.ts (vitest) asserts against agent output
+4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
+5. Optional: upload.ts pushes results to Braintrust
 ```
 
-The agent is **Claude Code** invoked via `claude -p` (print mode). It operates
-on a real filesystem in a temp directory and can read/write files freely.
+The agent is **Claude Code** running inside a Docker sandbox managed by
+`@vercel/agent-eval`. It operates on a real filesystem and can read/write files
+freely.
 
 **Important**: MCP servers are disabled via `--strict-mcp-config` with an empty
 config. This ensures the agent uses only local tools (Bash, Edit, Write, Read,
 Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All
 work must happen on the local filesystem — e.g., creating migration files in
 `supabase/migrations/`, not applying them to a remote project.
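An empty MCP config of the kind passed alongside `--strict-mcp-config` might look like this (a sketch; the exact filename and wiring live in the experiment setup, and `mcpServers` as the key is an assumption based on Claude Code's usual MCP config shape, not something this diff shows):

```json
{
  "mcpServers": {}
}
```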
 
-## Eval Structure
+## File Structure
 
-Each eval lives in `evals/{scenario-name}/`:
 
 ```
-evals/auth-rls-new-project/
-  PROMPT.md          # Task description (visible to agent)
-  EVAL.ts            # Vitest assertions (hidden from agent during run)
-  package.json       # Minimal project manifest
-  supabase/
-    config.toml      # Pre-initialized supabase config
-    migrations/      # Empty — agent creates files here
+packages/evals/
+  experiments/
+    experiment.ts    # ExperimentConfig — agent, sandbox, setup() hook
+  scripts/
+    eval.sh          # Supabase lifecycle wrapper (start → eval → stop)
+  src/
+    upload.ts        # Standalone Braintrust result uploader
+  evals/
+    eval-utils.ts    # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
+    {scenario}/
+      PROMPT.md      # Task description (visible to agent)
+      EVAL.ts        # Vitest assertions (hidden from agent during run)
+      meta.ts        # expectedReferenceFiles for scoring
+      package.json   # Minimal manifest with vitest devDep
+  project/
+    supabase/
+      config.toml    # Shared Supabase config seeded into each sandbox
+  scenarios/         # Workflow scenario proposals
+  results/           # Output from eval runs (gitignored)
 ```
 
 **EVAL.ts** is never copied to the workspace until after the agent finishes.
 This prevents the agent from "teaching to the test."
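The "hidden until the end" mechanic can be pictured as a small post-run step. This is an illustrative sketch, not the actual `@vercel/agent-eval` internals; `copyHiddenEval` is a hypothetical helper name:

```typescript
import { copyFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical sketch: only after the agent has finished is the scenario's
// hidden EVAL.ts copied into the workspace, where vitest can then run it.
export function copyHiddenEval(scenarioDir: string, workspace: string): string {
  const src = join(scenarioDir, "EVAL.ts");
  if (!existsSync(src)) {
    throw new Error(`No EVAL.ts found in ${scenarioDir}`);
  }
  const dest = join(workspace, "EVAL.ts");
  copyFileSync(src, dest); // the test file becomes visible only now
  return dest;
}
```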
 
 ## Running Evals
 
+Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
+source files haven't changed. Use `--force` to bypass caching when you need
+to re-run evals regardless (e.g., after changing environment variables or
+re-running the same scenario):
+
 ```bash
-# Run all scenarios with skills (default)
+# Run all scenarios with skills
 mise run eval
 
+# Force re-run (bypass source caching)
@@ -66,64 +65,52 @@ EVAL_MODEL=claude-opus-4-6 mise run eval
 
 # Run without skills (baseline)
 EVAL_BASELINE=true mise run eval
 
-# Install only a specific skill
-EVAL_SKILL=supabase mise run eval
+# Dry run (no API calls)
+mise run eval:dry
+
+# Upload results to Braintrust
+mise run eval:upload
+
+# Force upload (bypass cache)
+mise run --force eval:upload
 ```
 
 ## Baseline Mode
 
-Set `EVAL_BASELINE=true` to run scenarios **without** skills. By default,
-scenarios run with skills installed via the `skills` CLI.
+Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By
+default, skill files from `skills/supabase/` are written into the sandbox.
 
-To compare with-skill vs baseline, run evals twice:
+Compare with-skill vs baseline:
 
 ```bash
 mise run eval                     # with skills
 EVAL_BASELINE=true mise run eval  # without skills (baseline)
 ```
 
 Compare the results to measure how much skills improve agent output.
 
 ## Adding Scenarios
 
-1. Create `evals/{scenario-name}/` with `PROMPT.md`, `EVAL.ts`, `package.json`
-2. Add any starter files the agent should see (e.g., `supabase/config.toml`)
-3. Write vitest assertions in `EVAL.ts` that check the agent's output files
-4. Document the scenario in `scenarios/SCENARIOS.md`
+1. Create `evals/{scenario-name}/` with:
+   - `PROMPT.md` — task description for the agent
+   - `EVAL.ts` — vitest assertions checking agent output
+   - `meta.ts` — export `expectedReferenceFiles` array for scoring
+   - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
+2. Add any starter files the agent should see (they get copied via `setup()`)
+3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
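The shared helpers might look roughly like this. This is a sketch under assumptions: the real `eval-utils.ts` is not part of this diff, and only the helper names and the `supabase/migrations/` convention come from the repo (the `cwd` parameter is added here for testability):

```typescript
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Sketch: locate migration files the agent wrote under supabase/migrations/.
export function findMigrationFiles(cwd: string = process.cwd()): string[] {
  const dir = join(cwd, "supabase", "migrations");
  if (!existsSync(dir)) return [];
  return readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .map((f) => join(dir, f));
}

// Sketch: concatenate all migration SQL so assertions can regex over it.
export function getMigrationSQL(cwd: string = process.cwd()): string {
  return findMigrationFiles(cwd)
    .map((f) => readFileSync(f, "utf-8"))
    .join("\n");
}
```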
 
 ## Environment
 
 ```
-ANTHROPIC_API_KEY=sk-ant-...     # Required: Claude Code authentication
-EVAL_MODEL=...                   # Optional: override model (default: claude-sonnet-4-5-20250929)
-EVAL_SCENARIO=...                # Optional: run single scenario
-EVAL_SKILL=...                   # Optional: install only this skill (e.g., "supabase")
-EVAL_BASELINE=true               # Optional: run without skills (baseline mode)
-BRAINTRUST_UPLOAD=true           # Optional: upload results to Braintrust
-BRAINTRUST_API_KEY=...           # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_PROJECT_ID=...        # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_BASE_EXPERIMENT=...   # Optional: compare against a named experiment
+ANTHROPIC_API_KEY=sk-ant-...     # Required: Claude Code authentication
+EVAL_MODEL=...                   # Optional: override model (default: claude-sonnet-4-6)
+EVAL_SCENARIO=...                # Optional: run single scenario
+EVAL_BASELINE=true               # Optional: run without skills
+BRAINTRUST_API_KEY=...           # Required for eval:upload
+BRAINTRUST_PROJECT_ID=...        # Required for eval:upload
 ```
 
-## Key Files
+## Docker Evals
 
-```
-src/
-  runner.ts              # Main orchestrator
-  types.ts               # Core interfaces
-  runner/
-    scaffold.ts          # Creates temp workspace from eval template
-    agent.ts             # Invokes claude -p as subprocess
-    test.ts              # Runs vitest EVAL.ts against workspace
-    results.ts           # Collects results and prints summary
-evals/
-  auth-rls-new-project/  # Scenario 1
-scenarios/
-  SCENARIOS.md           # Scenario descriptions
-```
+Build and run evals inside Docker (e.g., for CI):
 
+```bash
+mise run eval:docker:build   # Build the eval Docker image
+mise run eval:docker         # Run evals in Docker
+mise run eval:docker:shell   # Debug shell in eval container
+```
@@ -1,85 +1,74 @@
-export const expectedReferenceFiles = [
-  "db-schema-auth-fk.md",
-  "db-security-functions.md",
-  "db-rls-mandatory.md",
-  "db-rls-common-mistakes.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates profiles table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /profiles/.test(sql);
-    },
-  },
-  {
-    name: "FK references auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "ON DELETE CASCADE present",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on profiles",
-    check: () =>
-      /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "trigger function uses SECURITY DEFINER",
-    check: () => /security\s+definer/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger function sets search_path",
-    check: () =>
-      /set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger created on auth.users",
-    check: () =>
-      /create\s+trigger[\s\S]*?on\s+auth\.users/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "policies scoped to authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
-        /security\s+definer/.test(sql),
-        /set\s+search_path\s*=\s*''/.test(sql),
-        /create\s+trigger[\s\S]*?on\s+auth\.users/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates profiles table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /profiles/.test(sql)).toBe(true);
+});
+
+test("FK references auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("ON DELETE CASCADE present", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("RLS enabled on profiles", () => {
+  expect(
+    /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("trigger function uses SECURITY DEFINER", () => {
+  expect(/security\s+definer/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("trigger function sets search_path", () => {
+  expect(
+    /set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("trigger created on auth.users", () => {
+  expect(
+    /create\s+trigger[\s\S]*?on\s+auth\.users/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("policies scoped to authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
+    /security\s+definer/.test(sql),
+    /set\s+search_path\s*=\s*''/.test(sql),
+    /create\s+trigger[\s\S]*?on\s+auth\.users/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/auth-fk-cascade-delete/meta.ts (new file, +6)
@@ -0,0 +1,6 @@
+export const expectedReferenceFiles = [
+  "db-schema-auth-fk.md",
+  "db-security-functions.md",
+  "db-rls-mandatory.md",
+  "db-rls-common-mistakes.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "auth-fk-cascade-delete",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,16 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "db-rls-mandatory.md",
-  "db-rls-policy-types.md",
-  "db-rls-common-mistakes.md",
-  "db-schema-auth-fk.md",
-  "db-schema-timestamps.md",
-  "db-migrations-idempotent.md",
-];
-
 import { existsSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 import {
   anonSeeesNoRows,
@@ -19,132 +9,116 @@ import {
   getSupabaseDir,
   queryTable,
   tableExists,
-} from "./eval-utils.ts";
+} from "../eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "supabase project initialized (config.toml exists)",
-    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
-  },
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates tasks table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /tasks/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on tasks table",
-    check: () =>
-      /alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "has foreign key to auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses ON DELETE CASCADE for auth FK",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for time columns",
-    check: () => {
-      const rawSql = getMigrationSQL().toLowerCase();
-      const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
-      if (
-        sql.includes("created_at") ||
-        sql.includes("updated_at") ||
-        sql.includes("due_date")
-      ) {
-        return !hasPlainTimestamp.test(sql);
-      }
-      return true;
-    },
-  },
-  {
-    name: "creates index on user_id column",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+index/.test(sql) && /user_id/.test(sql);
-    },
-  },
-  {
-    name: "does not use SERIAL or BIGSERIAL for primary key",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return !/\bserial\b/.test(sql) && !/\bbigserial\b/.test(sql);
-    },
-  },
-  {
-    name: "migration is idempotent (uses IF NOT EXISTS)",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const signals = [
-        /enable\s+row\s+level\s+security/,
-        /\(select\s+auth\.uid\(\)\)/,
-        /to\s+authenticated/,
-        /on\s+delete\s+cascade/,
-        /create\s+index/,
-      ];
-      return signals.filter((r) => r.test(sql)).length >= 4;
-    },
-  },
-  {
-    name: "tasks table exists in the database after migration",
-    check: () => tableExists("tasks"),
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table is queryable with service role",
-    check: async () => {
-      const { error } = await queryTable("tasks", "service_role");
-      return error === null;
-    },
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table returns no rows for anon (RLS is active)",
-    check: () => anonSeeesNoRows("tasks"),
-    timeout: 10_000,
-  },
-];
+test("supabase project initialized (config.toml exists)", () => {
+  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
+});
+
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates tasks table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /tasks/.test(sql)).toBe(true);
+});
+
+test("enables RLS on tasks table", () => {
+  expect(
+    /alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("has foreign key to auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses ON DELETE CASCADE for auth FK", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
+  const sql = getMigrationSQL();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  for (const policy of policyBlocks) {
+    if (
+      policy.includes("auth.uid()") &&
+      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
+    ) {
+      expect(false).toBe(true);
+      return;
+    }
+  }
+  expect(true).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for time columns", () => {
+  const rawSql = getMigrationSQL().toLowerCase();
+  const sql = rawSql.replace(/--[^\n]*/g, "");
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  if (
+    sql.includes("created_at") ||
+    sql.includes("updated_at") ||
+    sql.includes("due_date")
+  ) {
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
+  }
+});
+
+test("creates index on user_id column", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+index/.test(sql) && /user_id/.test(sql)).toBe(true);
+});
+
+test("does not use SERIAL or BIGSERIAL for primary key", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/\bserial\b/.test(sql)).toBe(false);
+  expect(/\bbigserial\b/.test(sql)).toBe(false);
+});
+
+test("migration is idempotent (uses IF NOT EXISTS)", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const signals = [
+    /enable\s+row\s+level\s+security/,
+    /\(select\s+auth\.uid\(\)\)/,
+    /to\s+authenticated/,
+    /on\s+delete\s+cascade/,
+    /create\s+index/,
+  ];
+  expect(signals.filter((r) => r.test(sql)).length >= 4).toBe(true);
+});
+
+test("tasks table exists in the database after migration", async () => {
+  expect(await tableExists("tasks")).toBe(true);
+}, 10_000);
+
+test("tasks table is queryable with service role", async () => {
+  const { error } = await queryTable("tasks", "service_role");
+  expect(error === null).toBe(true);
+}, 10_000);
+
+test("tasks table returns no rows for anon (RLS is active)", async () => {
+  expect(await anonSeeesNoRows("tasks")).toBe(true);
+}, 10_000);
packages/evals/evals/auth-rls-new-project/meta.ts (new file, +9)
@@ -0,0 +1,9 @@
+export const expectedReferenceFiles = [
+  "dev-getting-started.md",
+  "db-rls-mandatory.md",
+  "db-rls-policy-types.md",
+  "db-rls-common-mistakes.md",
+  "db-schema-auth-fk.md",
+  "db-schema-timestamps.md",
+  "db-migrations-idempotent.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "auth-rls-new-project",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,11 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "edge-fun-quickstart.md",
-];
-
 import { readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const cwd = process.cwd();
 
@@ -27,102 +22,93 @@ function getReferenceContent(): string {
   return readFileSync(file, "utf-8");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "CLI_REFERENCE.md exists in project root",
-    check: () => findReferenceFile() !== null,
-  },
-  {
-    name: "no hallucinated functions log command",
-    check: () => {
-      const content = getReferenceContent();
-      return (
-        !/`supabase\s+functions\s+log`/.test(content) &&
-        !/^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) &&
-        !/^\s*supabase\s+functions\s+log\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "no hallucinated db query command",
-    check: () => {
-      const content = getReferenceContent();
-      return (
-        !/`supabase\s+db\s+query`/.test(content) &&
-        !/^\s*npx\s+supabase\s+db\s+query\b/m.test(content) &&
-        !/^\s*supabase\s+db\s+query\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase functions serve for local development",
-    check: () =>
-      /supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions supabase functions deploy",
-    check: () =>
-      /supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions psql or SQL Editor or connection string for ad-hoc SQL",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return (
-        /\bpsql\b/.test(content) ||
-        /sql\s+editor/.test(content) ||
-        /connection\s+string/.test(content) ||
-        /supabase\s+db\s+dump/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase db push or supabase db reset for migrations",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return (
-        /supabase\s+db\s+push/.test(content) ||
-        /supabase\s+db\s+reset/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase start for local stack",
-    check: () => /supabase\s+start/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions Dashboard or Logs Explorer for production log viewing",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return /\bdashboard\b/.test(content) || /logs\s+explorer/.test(content);
-    },
-  },
-  {
-    name: "overall quality: uses real CLI commands throughout",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      const signals = [
-        /supabase\s+start/,
-        /supabase\s+stop/,
-        /supabase\s+functions\s+serve/,
-        /supabase\s+functions\s+deploy/,
-        /supabase\s+db\s+(push|reset|diff)/,
-        /\bpsql\b|\bsql\s+editor\b|\bconnection\s+string\b/,
-        /\bdashboard\b|\blogs\s+explorer\b/,
-      ];
-      const hallucinations = [
-        /`supabase\s+functions\s+log`/,
-        /^\s*npx\s+supabase\s+functions\s+log\b/m,
-        /^\s*supabase\s+functions\s+log\b/m,
-        /`supabase\s+db\s+query`/,
-        /^\s*npx\s+supabase\s+db\s+query\b/m,
-        /^\s*supabase\s+db\s+query\b/m,
-      ];
-      const positiveMatches = signals.filter((r) => r.test(content)).length;
-      const hallucinationMatches = hallucinations.filter((r) =>
-        r.test(content),
-      ).length;
-      return positiveMatches >= 5 && hallucinationMatches === 0;
-    },
-  },
-];
+test("CLI_REFERENCE.md exists in project root", () => {
+  expect(findReferenceFile() !== null).toBe(true);
+});
+
+test("no hallucinated functions log command", () => {
+  const content = getReferenceContent();
+  expect(
+    /`supabase\s+functions\s+log`/.test(content) ||
+      /^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) ||
+      /^\s*supabase\s+functions\s+log\b/m.test(content),
+  ).toBe(false);
+});
+
+test("no hallucinated db query command", () => {
+  const content = getReferenceContent();
+  expect(
+    /`supabase\s+db\s+query`/.test(content) ||
+      /^\s*npx\s+supabase\s+db\s+query\b/m.test(content) ||
+      /^\s*supabase\s+db\s+query\b/m.test(content),
+  ).toBe(false);
+});
+
+test("mentions supabase functions serve for local development", () => {
+  expect(
+    /supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
+  ).toBe(true);
+});
+
+test("mentions supabase functions deploy", () => {
+  expect(
+    /supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
+  ).toBe(true);
+});
+
+test("mentions psql or SQL Editor or connection string for ad-hoc SQL", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(
+    /\bpsql\b/.test(content) ||
+      /sql\s+editor/.test(content) ||
+      /connection\s+string/.test(content) ||
+      /supabase\s+db\s+dump/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase db push or supabase db reset for migrations", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(
+    /supabase\s+db\s+push/.test(content) ||
+      /supabase\s+db\s+reset/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase start for local stack", () => {
+  expect(/supabase\s+start/.test(getReferenceContent().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("mentions Dashboard or Logs Explorer for production log viewing", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(/\bdashboard\b/.test(content) || /logs\s+explorer/.test(content)).toBe(
+    true,
+  );
+});
+
+test("overall quality: uses real CLI commands throughout", () => {
+  const content = getReferenceContent().toLowerCase();
+  const signals = [
+    /supabase\s+start/,
+    /supabase\s+stop/,
+    /supabase\s+functions\s+serve/,
+    /supabase\s+functions\s+deploy/,
+    /supabase\s+db\s+(push|reset|diff)/,
+    /\bpsql\b|\bsql\s+editor\b|\bconnection\s+string\b/,
+    /\bdashboard\b|\blogs\s+explorer\b/,
+  ];
+  const hallucinations = [
+    /`supabase\s+functions\s+log`/,
+    /^\s*npx\s+supabase\s+functions\s+log\b/m,
+    /^\s*supabase\s+functions\s+log\b/m,
+    /`supabase\s+db\s+query`/,
+    /^\s*npx\s+supabase\s+db\s+query\b/m,
+    /^\s*supabase\s+db\s+query\b/m,
+  ];
+  const positiveMatches = signals.filter((r) => r.test(content)).length;
+  const hallucinationMatches = hallucinations.filter((r) =>
+    r.test(content),
+  ).length;
+  expect(positiveMatches >= 5 && hallucinationMatches === 0).toBe(true);
+});
packages/evals/evals/cli-hallucinated-commands/meta.ts (new file, +4)
@@ -0,0 +1,4 @@
+export const expectedReferenceFiles = [
+  "dev-getting-started.md",
+  "edge-fun-quickstart.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "cli-hallucinated-commands",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,354 +1,322 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-schema-realtime.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
  "realtime-setup-auth.md",
  "realtime-broadcast-database.md",
  "realtime-setup-channels.md",
];
import { expect, test } from "vitest";

import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";

import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
  expect(findMigrationFiles().length > 0).toBe(true);
});

export const assertions: EvalAssertion[] = [
  {
    name: "migration file exists",
    check: () => findMigrationFiles().length > 0,
  },
  {
    name: "creates rooms table",
    check: () =>
      /create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "creates room_members table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+table[\s\S]*?room_members/.test(sql) ||
        /create\s+table[\s\S]*?room_users/.test(sql) ||
        /create\s+table[\s\S]*?memberships/.test(sql)
      );
    },
  },
  {
    name: "creates content table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+table[\s\S]*?content/.test(sql) ||
        /create\s+table[\s\S]*?items/.test(sql) ||
        /create\s+table[\s\S]*?documents/.test(sql) ||
        /create\s+table[\s\S]*?posts/.test(sql) ||
        /create\s+table[\s\S]*?messages/.test(sql)
      );
    },
  },
  {
    name: "room_members has role column with owner/editor/viewer",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /role/.test(sql) &&
        /owner/.test(sql) &&
        /editor/.test(sql) &&
        /viewer/.test(sql)
      );
    },
  },
  {
    name: "enables RLS on all application tables",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const roomsRls =
        /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      const membershipRls =
        /alter\s+table[\s\S]*?room_members[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?room_users[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      const contentRls =
        /alter\s+table[\s\S]*?content[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?items[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?posts[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      return roomsRls && membershipRls && contentRls;
    },
  },
  {
    name: "FK to auth.users with ON DELETE CASCADE",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /references\s+auth\.users/.test(sql) &&
        /on\s+delete\s+cascade/.test(sql)
      );
    },
  },
  {
    name: "content has room_id FK referencing rooms",
    check: () =>
      /room_id[\s\S]*?references[\s\S]*?rooms/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      if (policyBlocks.length === 0) return false;
      for (const policy of policyBlocks) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "policies use TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const appPolicies = policyBlocks.filter(
        (p) => !p.includes("realtime.messages"),
      );
      return (
        appPolicies.length > 0 &&
        appPolicies.every((p) => /to\s+authenticated/.test(p))
      );
    },
  },
  {
    name: "private schema with security_definer helper function",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+schema[\s\S]*?private/.test(sql) &&
        /private\./.test(sql) &&
        /security\s+definer/.test(sql) &&
        /set\s+search_path\s*=\s*''/.test(sql)
      );
    },
  },
  {
    name: "role-based write policies: content INSERT/UPDATE restricted to owner or editor",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const writePolicies = policyBlocks.filter(
        (p) =>
          (/for\s+(insert|update|all)/.test(p) || /insert|update/.test(p)) &&
          (p.includes("content") ||
            p.includes("items") ||
            p.includes("documents") ||
            p.includes("posts") ||
            p.includes("messages")),
      );
      return writePolicies.some(
        (p) => p.includes("owner") || p.includes("editor"),
      );
    },
  },
  {
    name: "viewer role is read-only (no write access to content)",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const contentWritePolicies = policyBlocks.filter(
        (p) =>
          /for\s+(insert|update|delete)/.test(p) &&
          (p.includes("content") ||
            p.includes("items") ||
            p.includes("documents") ||
            p.includes("posts") ||
            p.includes("messages")),
      );
      if (contentWritePolicies.length === 0) return true;
      return !contentWritePolicies.some((p) => {
        const mentionsRole =
          p.includes("owner") || p.includes("editor") || p.includes("viewer");
        if (!mentionsRole) return true;
        return (
          p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
        );
      });
    },
  },
  {
    name: "indexes on membership lookup columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/create\s+index/.test(sql)) return false;
      const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
      return (
        indexBlocks.filter(
          (idx) =>
            idx.toLowerCase().includes("user_id") ||
            idx.toLowerCase().includes("room_id"),
        ).length >= 1
      );
    },
  },
  {
    name: "uses timestamptz not plain timestamp",
    check: () => {
      const rawSql = getMigrationSQL().toLowerCase();
      const sql = rawSql.replace(/--[^\n]*/g, "");
      const hasPlainTimestamp =
        /(?:created_at|updated_at|invited_at|joined_at)\s+timestamp(?!\s*tz)(?!\s+with\s+time\s+zone)/;
      if (
        sql.includes("created_at") ||
        sql.includes("updated_at") ||
        sql.includes("_at ")
      ) {
        return !hasPlainTimestamp.test(sql);
      }
      return true;
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "realtime publication enabled for content table",
    check: () =>
      /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "broadcast trigger for content changes",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        (/realtime\.broadcast_changes/.test(sql) ||
          /realtime\.send/.test(sql)) &&
        /create\s+trigger/.test(sql)
      );
    },
  },
  {
    name: "broadcast trigger function uses security definer",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const functionBlocks =
        sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
      const realtimeFunctions = functionBlocks.filter(
        (f) =>
          f.toLowerCase().includes("realtime.broadcast_changes") ||
          f.toLowerCase().includes("realtime.send"),
      );
      if (realtimeFunctions.length === 0) return false;
      return realtimeFunctions.some(
        (f) =>
          /security\s+definer/.test(f.toLowerCase()) &&
          /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
      );
    },
  },
  {
    name: "RLS policies on realtime.messages",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const realtimePolicies = policyBlocks.filter((p) =>
        p.includes("realtime.messages"),
      );
      if (realtimePolicies.length === 0) return false;
      return realtimePolicies.some(
        (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
      );
    },
  },
  {
    name: "realtime policy checks extension column",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const realtimePolicies = policyBlocks.filter((p) =>
        p.includes("realtime.messages"),
      );
      return realtimePolicies.some(
        (p) =>
          p.includes("extension") &&
          (p.includes("broadcast") || p.includes("presence")),
      );
    },
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const signals = [
        /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /alter\s+table[\s\S]*?(room_members|room_users|memberships)[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /references\s+auth\.users/.test(sql) &&
          /on\s+delete\s+cascade/.test(sql),
        /create\s+schema[\s\S]*?private/.test(sql),
        /security\s+definer/.test(sql) &&
          /set\s+search_path\s*=\s*''/.test(sql),
        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
        policyBlocks.length > 0 &&
          policyBlocks.filter((p) => !p.includes("realtime.messages")).length >
            0 &&
          policyBlocks
            .filter((p) => !p.includes("realtime.messages"))
            .every((p) => /to\s+authenticated/.test(p)),
        /create\s+index/.test(sql),
        /timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
        /if\s+not\s+exists/.test(sql),
        sql.includes("owner") &&
          sql.includes("editor") &&
          sql.includes("viewer"),
        /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
        /realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
        /create\s+trigger/.test(sql),
        policyBlocks.some((p) => p.includes("realtime.messages")),
        policyBlocks
          .filter((p) => p.includes("realtime.messages"))
          .some((p) => p.includes("extension")),
        /room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
      ];
      return signals.filter(Boolean).length >= 13;
    },
  },
];
test("creates rooms table", () => {
  expect(
    /create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
  ).toBe(true);
});

test("creates room_members table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+table[\s\S]*?room_members/.test(sql) ||
      /create\s+table[\s\S]*?room_users/.test(sql) ||
      /create\s+table[\s\S]*?memberships/.test(sql),
  ).toBe(true);
});

test("creates content table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+table[\s\S]*?content/.test(sql) ||
      /create\s+table[\s\S]*?items/.test(sql) ||
      /create\s+table[\s\S]*?documents/.test(sql) ||
      /create\s+table[\s\S]*?posts/.test(sql) ||
      /create\s+table[\s\S]*?messages/.test(sql),
  ).toBe(true);
});

test("room_members has role column with owner/editor/viewer", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /role/.test(sql) &&
      /owner/.test(sql) &&
      /editor/.test(sql) &&
      /viewer/.test(sql),
  ).toBe(true);
});

test("enables RLS on all application tables", () => {
  const sql = getMigrationSQL().toLowerCase();
  const roomsRls =
    /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  const membershipRls =
    /alter\s+table[\s\S]*?room_members[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?room_users[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  const contentRls =
    /alter\s+table[\s\S]*?content[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?items[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?posts[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  expect(roomsRls && membershipRls && contentRls).toBe(true);
});

test("FK to auth.users with ON DELETE CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("content has room_id FK referencing rooms", () => {
  expect(
    /room_id[\s\S]*?references[\s\S]*?rooms/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  if (policyBlocks.length === 0) {
    expect(false).toBe(true);
    return;
  }
  for (const policy of policyBlocks) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("policies use TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const appPolicies = policyBlocks.filter(
    (p) => !p.includes("realtime.messages"),
  );
  expect(
    appPolicies.length > 0 &&
      appPolicies.every((p) => /to\s+authenticated/.test(p)),
  ).toBe(true);
});

test("private schema with security_definer helper function", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+schema[\s\S]*?private/.test(sql) &&
      /private\./.test(sql) &&
      /security\s+definer/.test(sql) &&
      /set\s+search_path\s*=\s*''/.test(sql),
  ).toBe(true);
});

test("role-based write policies: content INSERT/UPDATE restricted to owner or editor", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const writePolicies = policyBlocks.filter(
    (p) =>
      (/for\s+(insert|update|all)/.test(p) || /insert|update/.test(p)) &&
      (p.includes("content") ||
        p.includes("items") ||
        p.includes("documents") ||
        p.includes("posts") ||
        p.includes("messages")),
  );
  expect(
    writePolicies.some((p) => p.includes("owner") || p.includes("editor")),
  ).toBe(true);
});

test("viewer role is read-only (no write access to content)", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const contentWritePolicies = policyBlocks.filter(
    (p) =>
      /for\s+(insert|update|delete)/.test(p) &&
      (p.includes("content") ||
        p.includes("items") ||
        p.includes("documents") ||
        p.includes("posts") ||
        p.includes("messages")),
  );
  if (contentWritePolicies.length === 0) {
    expect(true).toBe(true);
    return;
  }
  const result = !contentWritePolicies.some((p) => {
    const mentionsRole =
      p.includes("owner") || p.includes("editor") || p.includes("viewer");
    if (!mentionsRole) return true;
    return (
      p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
    );
  });
  expect(result).toBe(true);
});

test("indexes on membership lookup columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/create\s+index/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
  expect(
    indexBlocks.filter(
      (idx) =>
        idx.toLowerCase().includes("user_id") ||
        idx.toLowerCase().includes("room_id"),
    ).length >= 1,
  ).toBe(true);
});

test("uses timestamptz not plain timestamp", () => {
  const rawSql = getMigrationSQL().toLowerCase();
  const sql = rawSql.replace(/--[^\n]*/g, "");
  const hasPlainTimestamp =
    /(?:created_at|updated_at|invited_at|joined_at)\s+timestamp(?!\s*tz)(?!\s+with\s+time\s+zone)/;
  if (
    sql.includes("created_at") ||
    sql.includes("updated_at") ||
    sql.includes("_at ")
  ) {
    expect(hasPlainTimestamp.test(sql)).toBe(false);
  } else {
    expect(true).toBe(true);
  }
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("realtime publication enabled for content table", () => {
  expect(
    /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("broadcast trigger for content changes", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    (/realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql)) &&
      /create\s+trigger/.test(sql),
  ).toBe(true);
});

test("broadcast trigger function uses security definer", () => {
  const sql = getMigrationSQL().toLowerCase();
  const functionBlocks =
    sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
  const realtimeFunctions = functionBlocks.filter(
    (f) =>
      f.toLowerCase().includes("realtime.broadcast_changes") ||
      f.toLowerCase().includes("realtime.send"),
  );
  if (realtimeFunctions.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    realtimeFunctions.some(
      (f) =>
        /security\s+definer/.test(f.toLowerCase()) &&
        /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
    ),
  ).toBe(true);
});

test("RLS policies on realtime.messages", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const realtimePolicies = policyBlocks.filter((p) =>
    p.includes("realtime.messages"),
  );
  if (realtimePolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    realtimePolicies.some(
      (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
    ),
  ).toBe(true);
});

test("realtime policy checks extension column", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const realtimePolicies = policyBlocks.filter((p) =>
    p.includes("realtime.messages"),
  );
  expect(
    realtimePolicies.some(
      (p) =>
        p.includes("extension") &&
        (p.includes("broadcast") || p.includes("presence")),
    ),
  ).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const signals = [
    /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /alter\s+table[\s\S]*?(room_members|room_users|memberships)[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
    /create\s+schema[\s\S]*?private/.test(sql),
    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
    policyBlocks.length > 0 &&
      policyBlocks.filter((p) => !p.includes("realtime.messages")).length > 0 &&
      policyBlocks
        .filter((p) => !p.includes("realtime.messages"))
        .every((p) => /to\s+authenticated/.test(p)),
    /create\s+index/.test(sql),
    /timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
    /if\s+not\s+exists/.test(sql),
    sql.includes("owner") && sql.includes("editor") && sql.includes("viewer"),
    /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
    /realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
    /create\s+trigger/.test(sql),
    policyBlocks.some((p) => p.includes("realtime.messages")),
    policyBlocks
      .filter((p) => p.includes("realtime.messages"))
      .some((p) => p.includes("extension")),
    /room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
  ];
  expect(signals.filter(Boolean).length >= 13).toBe(true);
});

14
packages/evals/evals/collaborative-rooms-realtime/meta.ts
Normal file
@@ -0,0 +1,14 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-schema-realtime.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
  "realtime-setup-auth.md",
  "realtime-broadcast-database.md",
  "realtime-setup-channels.md",
];
@@ -1,5 +1,8 @@
{
  "name": "collaborative-rooms-realtime",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,12 +1,6 @@
export const expectedReferenceFiles = [
  "db-conn-pooling.md",
  "db-migrations-idempotent.md",
  "db-schema-auth-fk.md",
];

import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

const cwd = process.cwd();

@@ -65,70 +59,60 @@ function getAllOutputContent(): string {
  return parts.join("\n");
}

export const assertions: EvalAssertion[] = [
  {
    name: "prisma schema file exists",
    check: () => findPrismaSchema() !== null,
  },
  {
    name: "prisma schema references pooler port 6543",
    check: () => /6543/.test(getAllOutputContent()),
  },
  {
    name: "pgbouncer=true param present",
    check: () =>
      /pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase()),
  },
  {
    name: "DIRECT_URL provided for migrations",
    check: () => {
      const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
      return /directurl/.test(allContent) || /direct_url/.test(allContent);
    },
  },
  {
    name: "datasource block references directUrl or DIRECT_URL env var",
    check: () => {
      const schema = getPrismaSchema().toLowerCase();
      const datasourceBlock =
        schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
      return (
        /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock)
      );
    },
  },
  {
    name: "connection limit set to 1 for serverless",
    check: () => {
      const content = getAllOutputContent().toLowerCase();
      return (
        /connection_limit\s*=\s*1/.test(content) ||
        /connection_limit:\s*1/.test(content) ||
        /connectionlimit\s*=\s*1/.test(content)
      );
    },
  },
  {
    name: "explanation distinguishes port 6543 vs 5432",
    check: () => {
      const content = getAllOutputContent();
      return /6543/.test(content) && /5432/.test(content);
    },
  },
  {
    name: "overall quality: demonstrates correct Prisma + Supabase pooler setup",
    check: () => {
      const schema = getPrismaSchema().toLowerCase();
      const envContent = getAllEnvContent().toLowerCase();
      const allContent = `${schema}\n${envContent}`;
      const signals = [
        /6543/,
        /pgbouncer\s*=\s*true/,
        /directurl|direct_url/,
        /connection_limit\s*=\s*1|connection_limit:\s*1/,
        /5432/,
      ];
      return signals.filter((r) => r.test(allContent)).length >= 4;
    },
  },
];
test("prisma schema file exists", () => {
  expect(findPrismaSchema() !== null).toBe(true);
});

test("prisma schema references pooler port 6543", () => {
  expect(/6543/.test(getAllOutputContent())).toBe(true);
});

test("pgbouncer=true param present", () => {
  expect(/pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase())).toBe(
    true,
  );
});

test("DIRECT_URL provided for migrations", () => {
  const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
  expect(/directurl/.test(allContent) || /direct_url/.test(allContent)).toBe(
    true,
  );
});

test("datasource block references directUrl or DIRECT_URL env var", () => {
  const schema = getPrismaSchema().toLowerCase();
  const datasourceBlock =
    schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
  expect(
    /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock),
  ).toBe(true);
});

test("connection limit set to 1 for serverless", () => {
  const content = getAllOutputContent().toLowerCase();
  expect(
    /connection_limit\s*=\s*1/.test(content) ||
      /connection_limit:\s*1/.test(content) ||
      /connectionlimit\s*=\s*1/.test(content),
  ).toBe(true);
});

test("explanation distinguishes port 6543 vs 5432", () => {
  const content = getAllOutputContent();
  expect(/6543/.test(content) && /5432/.test(content)).toBe(true);
});

test("overall quality: demonstrates correct Prisma + Supabase pooler setup", () => {
  const schema = getPrismaSchema().toLowerCase();
  const envContent = getAllEnvContent().toLowerCase();
  const allContent = `${schema}\n${envContent}`;
  const signals = [
    /6543/,
    /pgbouncer\s*=\s*true/,
    /directurl|direct_url/,
    /connection_limit\s*=\s*1|connection_limit:\s*1/,
    /5432/,
  ];
  expect(signals.filter((r) => r.test(allContent)).length >= 4).toBe(true);
});

5
packages/evals/evals/connection-pooling-prisma/meta.ts
Normal file
@@ -0,0 +1,5 @@
export const expectedReferenceFiles = [
  "db-conn-pooling.md",
  "db-migrations-idempotent.md",
  "db-schema-auth-fk.md",
];
@@ -1,5 +1,8 @@
{
  "name": "connection-pooling-prisma",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,14 +1,6 @@
export const expectedReferenceFiles = [
  "edge-fun-quickstart.md",
  "edge-fun-project-structure.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
  "dev-getting-started.md",
];

import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

import {
  findFunctionFile,
@@ -17,7 +9,7 @@ import {
  getFunctionsDir,
  getSharedCode,
  getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";

const FUNCTION_NAME = "hello-world";

@@ -33,125 +25,113 @@ function getCatchBlockCode(): string {
  return code.slice(catchIndex);
}

export const assertions: EvalAssertion[] = [
  {
    name: "supabase project initialized",
    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
  },
  {
    name: "function directory exists",
    check: () => existsSync(join(getFunctionsDir(), FUNCTION_NAME)),
  },
  {
    name: "function index file exists",
    check: () => findFunctionFile(FUNCTION_NAME) !== null,
  },
  {
    name: "uses Deno.serve",
    check: () => /Deno\.serve/.test(getFunctionCode(FUNCTION_NAME)),
  },
  {
    name: "returns JSON response",
    check: () => {
      const allCode = getAllCode();
      return (
        /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
        /Response\.json/i.test(allCode) ||
        /JSON\.stringify/i.test(allCode)
      );
    },
  },
  {
    name: "handles OPTIONS preflight",
    check: () => {
      const allCode = getAllCode();
      return /['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode);
    },
  },
  {
    name: "defines CORS headers",
    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
  },
  {
    name: "CORS allows required headers",
    check: () => {
      const allCode = getAllCode().toLowerCase();
      return (
        /access-control-allow-headers/.test(allCode) &&
        /authorization/.test(allCode) &&
        /apikey/.test(allCode)
      );
    },
  },
  {
    name: "error response has CORS headers",
    check: () => {
      const catchCode = getCatchBlockCode();
      if (catchCode.length === 0) return false;
      const sharedCode = getSharedCode();
      const directCors =
        /corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
      const callsSharedHelper =
        /errorResponse|jsonResponse|json_response|error_response/i.test(
          catchCode,
        ) && /Access-Control-Allow-Origin/i.test(sharedCode);
      return directCors || callsSharedHelper;
    },
  },
  {
    name: "has try-catch for error handling",
    check: () => {
      const code = getFunctionCode(FUNCTION_NAME);
      return /\btry\s*\{/.test(code) && /\bcatch\b/.test(code);
    },
  },
  {
    name: "returns proper error status code",
    check: () => {
      const catchCode = getCatchBlockCode();
      if (catchCode.length === 0) return false;
      return (
        /status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
        /[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode)
      );
    },
  },
  {
    name: "shared CORS module exists",
    check: () => findSharedCorsFile() !== null,
  },
  {
    name: "function imports from shared",
    check: () =>
      /from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
  },
  {
    name: "function uses hyphenated name",
    check: () => {
      const dirs = existsSync(getFunctionsDir())
        ? readdirSync(getFunctionsDir())
        : [];
      const helloDir = dirs.find(
        (d) => d.includes("hello") && d.includes("world"),
      );
      return helloDir !== undefined && /^hello-world$/.test(helloDir);
    },
  },
  {
    name: "overall quality: demonstrates Edge Function best practices",
    check: () => {
      const allCode = getAllCode().toLowerCase();
      const signals = [
        /deno\.serve/,
        /['"]options['"]/,
        /access-control-allow-origin/,
        /\btry\s*\{/,
        /status:\s*(400|500|4\d{2}|5\d{2})|[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/,
        /from\s+['"]\.\.\/(_shared|_utils)/,
|
||||
/authorization/,
|
||||
/apikey/,
|
||||
];
|
||||
return signals.filter((r) => r.test(allCode)).length >= 6;
|
||||
},
|
||||
},
|
||||
];
|
||||
test("supabase project initialized", () => {
|
||||
expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
|
||||
});
|
||||
|
||||
test("function directory exists", () => {
|
||||
expect(existsSync(join(getFunctionsDir(), FUNCTION_NAME))).toBe(true);
|
||||
});
|
||||
|
||||
test("function index file exists", () => {
|
||||
expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
|
||||
});
|
||||
|
||||
test("uses Deno.serve", () => {
|
||||
expect(/Deno\.serve/.test(getFunctionCode(FUNCTION_NAME))).toBe(true);
|
||||
});
|
||||
|
||||
test("returns JSON response", () => {
|
||||
const allCode = getAllCode();
|
||||
expect(
|
||||
/content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
|
||||
/Response\.json/i.test(allCode) ||
|
||||
/JSON\.stringify/i.test(allCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("handles OPTIONS preflight", () => {
|
||||
const allCode = getAllCode();
|
||||
expect(/['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode)).toBe(
|
||||
true,
|
||||
);
|
||||
});
|
||||
|
||||
test("defines CORS headers", () => {
|
||||
expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
|
||||
});
|
||||
|
||||
test("CORS allows required headers", () => {
|
||||
const allCode = getAllCode().toLowerCase();
|
||||
expect(
|
||||
/access-control-allow-headers/.test(allCode) &&
|
||||
/authorization/.test(allCode) &&
|
||||
/apikey/.test(allCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("error response has CORS headers", () => {
|
||||
const catchCode = getCatchBlockCode();
|
||||
if (catchCode.length === 0) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
const sharedCode = getSharedCode();
|
||||
const directCors =
|
||||
/corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
|
||||
const callsSharedHelper =
|
||||
/errorResponse|jsonResponse|json_response|error_response/i.test(
|
||||
catchCode,
|
||||
) && /Access-Control-Allow-Origin/i.test(sharedCode);
|
||||
expect(directCors || callsSharedHelper).toBe(true);
|
||||
});
|
||||
|
||||
test("has try-catch for error handling", () => {
|
||||
const code = getFunctionCode(FUNCTION_NAME);
|
||||
expect(/\btry\s*\{/.test(code) && /\bcatch\b/.test(code)).toBe(true);
|
||||
});
|
||||
|
||||
test("returns proper error status code", () => {
|
||||
const catchCode = getCatchBlockCode();
|
||||
if (catchCode.length === 0) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
expect(
|
||||
/status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
|
||||
/[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("shared CORS module exists", () => {
|
||||
expect(findSharedCorsFile() !== null).toBe(true);
|
||||
});
|
||||
|
||||
test("function imports from shared", () => {
|
||||
expect(
|
||||
/from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("function uses hyphenated name", () => {
|
||||
const dirs = existsSync(getFunctionsDir())
|
||||
? readdirSync(getFunctionsDir())
|
||||
: [];
|
||||
const helloDir = dirs.find((d) => d.includes("hello") && d.includes("world"));
|
||||
expect(helloDir !== undefined && /^hello-world$/.test(helloDir)).toBe(true);
|
||||
});
|
||||
|
||||
test("overall quality: demonstrates Edge Function best practices", () => {
|
||||
const allCode = getAllCode().toLowerCase();
|
||||
const signals = [
|
||||
/deno\.serve/,
|
||||
/['"]options['"]/,
|
||||
/access-control-allow-origin/,
|
||||
/\btry\s*\{/,
|
||||
/status:\s*(400|500|4\d{2}|5\d{2})|[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/,
|
||||
/from\s+['"]\.\.\/(_shared|_utils)/,
|
||||
/authorization/,
|
||||
/apikey/,
|
||||
];
|
||||
expect(signals.filter((r) => r.test(allCode)).length >= 6).toBe(true);
|
||||
});
|
||||
|
||||
packages/evals/evals/edge-function-hello-world/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "edge-fun-quickstart.md",
+  "edge-fun-project-structure.md",
+  "edge-pat-cors.md",
+  "edge-pat-error-handling.md",
+  "dev-getting-started.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "edge-function-hello-world",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,100 +1,84 @@
-export const expectedReferenceFiles = [
-  "db-schema-extensions.md",
-  "db-rls-mandatory.md",
-  "db-migrations-idempotent.md",
-  "db-schema-auth-fk.md",
-  "db-rls-common-mistakes.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "extension installed in extensions schema",
-    check: () =>
-      /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "IF NOT EXISTS on extension creation",
-    check: () =>
-      /create\s+extension\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "vector column with correct dimensions",
-    check: () =>
-      /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "HNSW index used instead of IVFFlat",
-    check: () => /using\s+hnsw/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
-      /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "idempotent table creation (IF NOT EXISTS)",
-    check: () =>
-      /create\s+table\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "overall quality: demonstrates pgvector best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(sql),
-        /create\s+extension\s+if\s+not\s+exists/.test(sql),
-        /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(sql),
-        /using\s+hnsw/.test(sql),
-        /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
-          sql,
-        ),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /if\s+not\s+exists/.test(sql),
-      ];
-      return signals.filter(Boolean).length >= 6;
-    },
-  },
-];
+test("extension installed in extensions schema", () => {
+  expect(
+    /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("IF NOT EXISTS on extension creation", () => {
+  expect(
+    /create\s+extension\s+if\s+not\s+exists/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("vector column with correct dimensions", () => {
+  expect(
+    /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("HNSW index used instead of IVFFlat", () => {
+  expect(/using\s+hnsw/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
+    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("idempotent table creation (IF NOT EXISTS)", () => {
+  expect(
+    /create\s+table\s+if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates pgvector best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(sql),
+    /create\s+extension\s+if\s+not\s+exists/.test(sql),
+    /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(sql),
+    /using\s+hnsw/.test(sql),
+    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
+      sql,
+    ),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+    /if\s+not\s+exists/.test(sql),
+  ];
+  expect(signals.filter(Boolean).length >= 6).toBe(true);
+});
packages/evals/evals/extension-wrong-schema/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-schema-extensions.md",
+  "db-rls-mandatory.md",
+  "db-migrations-idempotent.md",
+  "db-schema-auth-fk.md",
+  "db-rls-common-mistakes.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "extension-wrong-schema",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,14 +1,6 @@
-export const expectedReferenceFiles = [
-  "db-rls-views.md",
-  "db-migrations-idempotent.md",
-  "db-rls-mandatory.md",
-  "db-rls-performance.md",
-  "db-schema-timestamps.md",
-];
-
 import { existsSync, readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const migrationsDir = join(process.cwd(), "supabase", "migrations");
 const STARTER_MIGRATION = "20240101000000_create_products.sql";
@@ -29,86 +21,83 @@ function getAgentMigrationSQL(): string {
   return files.map((f) => readFileSync(f, "utf-8")).join("\n");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "new migration file exists",
-    check: () => findAgentMigrationFiles().length > 0,
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for description",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+description/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for published_at",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "published_at uses timestamptz not plain timestamp",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      return (
-        /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
-          sql,
-        ) &&
-        !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
-          sql,
-        )
-      );
-    },
-  },
-  {
-    name: "view public_products is created",
-    check: () =>
-      /create\s+(or\s+replace\s+)?view\s+public_products/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "view uses security_invoker = true",
-    check: () =>
-      /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "SELECT policy on products for authenticated role",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some(
-        (p) =>
-          p.includes("select") &&
-          p.includes("products") &&
-          /to\s+authenticated/.test(p),
-      );
-    },
-  },
-  {
-    name: "NOTIFY pgrst reload schema is present",
-    check: () => /notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates PostgREST and schema best practices",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /add\s+column\s+if\s+not\s+exists/.test(sql),
-        /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
-          sql,
-        ),
-        /create\s+(or\s+replace\s+)?view\s+public_products/.test(sql),
-        /security_invoker\s*=\s*true/.test(sql),
-        policyBlocks.some(
-          (p) => p.includes("select") && /to\s+authenticated/.test(p),
-        ),
-        /notify\s+pgrst/.test(sql),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("new migration file exists", () => {
+  expect(findAgentMigrationFiles().length > 0).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for description", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+description/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for published_at", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("published_at uses timestamptz not plain timestamp", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  expect(
+    /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
+      sql,
+    ) &&
+      !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(sql),
+  ).toBe(true);
+});
+
+test("view public_products is created", () => {
+  expect(
+    /create\s+(or\s+replace\s+)?view\s+public_products/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("view uses security_invoker = true", () => {
+  expect(
+    /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("SELECT policy on products for authenticated role", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.some(
+      (p) =>
+        p.includes("select") &&
+        p.includes("products") &&
+        /to\s+authenticated/.test(p),
+    ),
+  ).toBe(true);
+});
+
+test("NOTIFY pgrst reload schema is present", () => {
+  expect(/notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("overall quality: demonstrates PostgREST and schema best practices", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /add\s+column\s+if\s+not\s+exists/.test(sql),
+    /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
+      sql,
+    ),
+    /create\s+(or\s+replace\s+)?view\s+public_products/.test(sql),
+    /security_invoker\s*=\s*true/.test(sql),
+    policyBlocks.some(
+      (p) => p.includes("select") && /to\s+authenticated/.test(p),
+    ),
+    /notify\s+pgrst/.test(sql),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/postgrest-schema-cache/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-rls-views.md",
+  "db-migrations-idempotent.md",
+  "db-rls-mandatory.md",
+  "db-rls-performance.md",
+  "db-schema-timestamps.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "postgrest-schema-cache",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,122 +1,97 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-timestamps.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates orders table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /orders/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on orders table",
-    check: () =>
-      /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "has SELECT policy on orders",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some((p) => p.includes("for select"));
-    },
-  },
-  {
-    name: "has UPDATE policy with WITH CHECK on orders",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
-      return updatePolicy !== undefined && /with\s+check/.test(updatePolicy);
-    },
-  },
-  {
-    name: "all policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for created_at",
-    check: () => {
-      const rawSql = getMigrationSQL().toLowerCase();
-      const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
-      if (sql.includes("created_at")) {
-        return !hasPlainTimestamp.test(sql);
-      }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(sql),
-        policyBlocks.some((p) => p.includes("for select")),
-        policyBlocks.some(
-          (p) => p.includes("for update") && /with\s+check/.test(p),
-        ),
-        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
-          sql.replace(/--[^\n]*/g, ""),
-        ),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates orders table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /orders/.test(sql)).toBe(true);
+});
+
+test("enables RLS on orders table", () => {
+  expect(
+    /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("has SELECT policy on orders", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(policyBlocks.some((p) => p.includes("for select"))).toBe(true);
+});
+
+test("has UPDATE policy with WITH CHECK on orders", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
+  expect(updatePolicy !== undefined && /with\s+check/.test(updatePolicy)).toBe(
+    true,
+  );
+});
+
+test("all policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
+  const sql = getMigrationSQL();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  for (const policy of policyBlocks) {
+    if (
+      policy.includes("auth.uid()") &&
+      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
+    ) {
+      expect(false).toBe(true);
+      return;
+    }
+  }
+  expect(true).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for created_at", () => {
+  const rawSql = getMigrationSQL().toLowerCase();
+  const sql = rawSql.replace(/--[^\n]*/g, "");
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  if (sql.includes("created_at")) {
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
+  }
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(sql),
+    policyBlocks.some((p) => p.includes("for select")),
+    policyBlocks.some(
+      (p) => p.includes("for update") && /with\s+check/.test(p),
+    ),
+    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
+      sql.replace(/--[^\n]*/g, ""),
+    ),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/rls-update-needs-select/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-rls-common-mistakes.md",
+  "db-rls-policy-types.md",
+  "db-rls-performance.md",
+  "db-rls-mandatory.md",
+  "db-schema-timestamps.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "rls-update-needs-select",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,123 +1,92 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-auth-fk.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates documents table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /documents/.test(sql);
-    },
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
-      /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "uses app_metadata not user_metadata for role check",
-    check: () => /app_metadata/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "user_metadata does not appear in policy USING clauses",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.every((p) => !p.includes("user_metadata"));
-    },
-  },
-  {
-    name: "has at least two SELECT policies (owner and admin)",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const hasOwnerPolicy = policyBlocks.some(
-        (p) =>
-          (p.includes("select") || !p.includes("insert")) &&
-          (p.includes("user_id") ||
-            p.includes("owner") ||
-            p.includes("auth.uid")),
-      );
-      const hasAdminPolicy = policyBlocks.some((p) =>
-        p.includes("app_metadata"),
-      );
-      return hasOwnerPolicy && hasAdminPolicy;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) subselect form in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(sql),
-        /app_metadata/.test(sql),
-        policyBlocks.every((p) => !p.includes("user_metadata")),
-        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        policyBlocks.some(
-          (p) =>
-            p.includes("user_id") ||
-            p.includes("owner") ||
-            p.includes("auth.uid"),
-        ) && policyBlocks.some((p) => p.includes("app_metadata")),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates documents table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /documents/.test(sql)).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
+    /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("uses app_metadata not user_metadata for role check", () => {
+  expect(/app_metadata/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("user_metadata does not appear in policy USING clauses", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
expect(policyBlocks.every((p) => !p.includes("user_metadata"))).toBe(true);
|
||||
});
|
||||
|
||||
test("has at least two SELECT policies (owner and admin)", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
const hasOwnerPolicy = policyBlocks.some(
|
||||
(p) =>
|
||||
(p.includes("select") || !p.includes("insert")) &&
|
||||
(p.includes("user_id") || p.includes("owner") || p.includes("auth.uid")),
|
||||
);
|
||||
const hasAdminPolicy = policyBlocks.some((p) => p.includes("app_metadata"));
|
||||
expect(hasOwnerPolicy && hasAdminPolicy).toBe(true);
|
||||
});
|
||||
|
||||
test("policies use TO authenticated", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
expect(
|
||||
policyBlocks.length > 0 &&
|
||||
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("uses (select auth.uid()) subselect form in policies", () => {
|
||||
const sql = getMigrationSQL();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
for (const policy of policyBlocks) {
|
||||
if (
|
||||
policy.includes("auth.uid()") &&
|
||||
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
|
||||
) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
}
|
||||
expect(true).toBe(true);
|
||||
});
|
||||
|
||||
test("FK to auth.users with ON DELETE CASCADE", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
expect(
|
||||
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("overall quality: demonstrates Supabase best practices", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
const signals = [
|
||||
/alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(sql),
|
||||
/app_metadata/.test(sql),
|
||||
policyBlocks.every((p) => !p.includes("user_metadata")),
|
||||
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
|
||||
policyBlocks.length > 0 &&
|
||||
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
|
||||
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
|
||||
policyBlocks.some(
|
||||
(p) =>
|
||||
p.includes("user_id") || p.includes("owner") || p.includes("auth.uid"),
|
||||
) && policyBlocks.some((p) => p.includes("app_metadata")),
|
||||
];
|
||||
expect(signals.filter(Boolean).length >= 5).toBe(true);
|
||||
});
|
||||
|
||||
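The tests above repeat one extraction idiom: split the migration SQL into individual `CREATE POLICY` statements with a non-greedy regex, then inspect each block. A standalone sketch of that idiom (illustrative only; `extractPolicyBlocks` and the sample SQL are not part of the repository):

```typescript
// Illustrative helper, not repository code: the same regex the eval tests use
// inline, extracted into a function for demonstration.
function extractPolicyBlocks(sql: string): string[] {
  // Each block runs from "create policy" to its terminating semicolon.
  return sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
}

// Hypothetical migration SQL shaped like the output these evals grade.
const sample = `
create policy "owner read" on documents for select to authenticated
  using ((select auth.uid()) = user_id);
create policy "admin read" on documents for select to authenticated
  using ((auth.jwt() -> 'app_metadata' ->> 'role') = 'admin');
`;

console.log(extractPolicyBlocks(sample).length); // 2
```

Because the match is non-greedy, a semicolon inside a policy body (for example in a quoted string) would truncate the block early; the checks tolerate that since they only look for substrings.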
@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
  "db-rls-common-mistakes.md",
  "db-rls-policy-types.md",
  "db-rls-performance.md",
  "db-rls-mandatory.md",
  "db-schema-auth-fk.md",
];
@@ -1,5 +1,8 @@
{
  "name": "rls-user-metadata-role-check",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,21 +1,13 @@
export const expectedReferenceFiles = [
  "db-security-service-role.md",
  "edge-fun-quickstart.md",
  "edge-db-supabase-client.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
];

import { existsSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

import {
  findFunctionFile,
  getFunctionCode,
  getSharedCode,
  getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";

const FUNCTION_NAME = "admin-reports";

@@ -24,79 +16,71 @@ function getAllCode(): string {
  return `${code}\n${getSharedCode()}`;
}

export const assertions: EvalAssertion[] = [
  {
    name: "supabase project initialized (config.toml exists)",
    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
  },
  {
    name: "edge function file exists",
    check: () => findFunctionFile(FUNCTION_NAME) !== null,
  },
  {
    name: "uses Deno.env.get for service role key",
    check: () =>
test("supabase project initialized (config.toml exists)", () => {
  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
});

test("edge function file exists", () => {
  expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
});

test("uses Deno.env.get for service role key", () => {
  expect(
    /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
      getAllCode(),
    ),
  ).toBe(true);
});

test("no hardcoded service role key", () => {
  const allCode = getAllCode();
  const lines = allCode.split("\n");
  const nonCommentLines = lines.filter(
    (line) => !line.trimStart().startsWith("//"),
  );
  expect(
    nonCommentLines.some((line) =>
      /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
    ),
  ).toBe(false);
});

test("createClient called with service role env var as second argument", () => {
  const allCode = getAllCode();
  expect(
    /createClient/i.test(allCode) &&
      /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
        getAllCode(),
        allCode,
      ),
  },
  {
    name: "no hardcoded service role key",
    check: () => {
      const allCode = getAllCode();
      const lines = allCode.split("\n");
      const nonCommentLines = lines.filter(
        (line) => !line.trimStart().startsWith("//"),
      );
      return !nonCommentLines.some((line) =>
        /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
      );
    },
  },
  {
    name: "createClient called with service role env var as second argument",
    check: () => {
      const allCode = getAllCode();
      return (
        /createClient/i.test(allCode) &&
        /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
          allCode,
        )
      );
    },
  },
  {
    name: "service role key env var name does not use NEXT_PUBLIC_ prefix",
    check: () => !/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode()),
  },
  {
    name: "CORS headers present",
    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
  },
  {
    name: "returns JSON response",
    check: () => {
      const allCode = getAllCode();
      return (
        /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
        /Response\.json/i.test(allCode) ||
        /JSON\.stringify/i.test(allCode)
      );
    },
  },
  {
    name: "overall quality: demonstrates service role Edge Function best practices",
    check: () => {
      const allCode = getAllCode();
      const signals: RegExp[] = [
        /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
        /Access-Control-Allow-Origin/,
        /createClient/i,
        /\btry\s*\{/,
        /Response\.json|JSON\.stringify/,
        /Deno\.serve/,
      ];
      return signals.filter((r) => r.test(allCode)).length >= 5;
    },
  },
];
  ).toBe(true);
});

test("service role key env var name does not use NEXT_PUBLIC_ prefix", () => {
  expect(/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode())).toBe(false);
});

test("CORS headers present", () => {
  expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
});

test("returns JSON response", () => {
  const allCode = getAllCode();
  expect(
    /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
      /Response\.json/i.test(allCode) ||
      /JSON\.stringify/i.test(allCode),
  ).toBe(true);
});

test("overall quality: demonstrates service role Edge Function best practices", () => {
  const allCode = getAllCode();
  const signals: RegExp[] = [
    /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
    /Access-Control-Allow-Origin/,
    /createClient/i,
    /\btry\s*\{/,
    /Response\.json|JSON\.stringify/,
    /Deno\.serve/,
  ];
  expect(signals.filter((r) => r.test(allCode)).length >= 5).toBe(true);
});
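The quality gate above is a signal count: match a list of regexes against the agent's function source and pass at a threshold. A self-contained sketch of that pattern (the `scoreSignals` helper and the sample function source are hypothetical, not repository code):

```typescript
// Illustrative helper, not repository code: count regex "signals" in a source
// string and pass when a threshold is met, as the overall-quality test does.
function scoreSignals(code: string, signals: RegExp[], threshold: number): boolean {
  return signals.filter((r) => r.test(code)).length >= threshold;
}

// Hypothetical Edge Function source, as a string, shaped to hit the signals.
const fnSource = `
Deno.serve(async (req) => {
  try {
    const client = createClient(url, Deno.env.get("SUPABASE_SERVICE_ROLE_KEY"));
    return new Response(JSON.stringify({ ok: true }), {
      headers: { "Access-Control-Allow-Origin": "*" },
    });
  } catch (e) {
    return new Response(JSON.stringify({ error: String(e) }), { status: 500 });
  }
});
`;

// The same six signals the eval checks for.
const signals: RegExp[] = [
  /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
  /Access-Control-Allow-Origin/,
  /createClient/i,
  /\btry\s*\{/,
  /Response\.json|JSON\.stringify/,
  /Deno\.serve/,
];

console.log(scoreSignals(fnSource, signals, 5)); // true
```

The threshold (5 of 6) leaves one signal of slack, so a solution can miss a single convention and still pass.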
packages/evals/evals/service-role-edge-function/meta.ts (new file, +7)
@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
  "db-security-service-role.md",
  "edge-fun-quickstart.md",
  "edge-db-supabase-client.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
];
@@ -1,5 +1,8 @@
{
  "name": "service-role-edge-function",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,253 +1,240 @@
export const expectedReferenceFiles = [
  "storage-access-control.md",
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
import { expect, test } from "vitest";

import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";

import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
  expect(findMigrationFiles().length > 0).toBe(true);
});

export const assertions: EvalAssertion[] = [
  {
    name: "migration file exists",
    check: () => findMigrationFiles().length > 0,
  },
  {
    name: "creates avatars bucket",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (
        !/storage\.buckets/.test(sql) ||
        !/avatars/.test(sql) ||
        !/public/.test(sql)
      )
        return false;
      const avatarsBlock = sql.match(
        /insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
      );
      return avatarsBlock !== null && /true/.test(avatarsBlock[0]);
    },
  },
  {
    name: "creates documents bucket",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/documents/.test(sql)) return false;
      const documentsBlock = sql.match(
        /insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
      );
      return documentsBlock !== null && /false/.test(documentsBlock[0]);
    },
  },
  {
    name: "avatars bucket has mime type restriction",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /allowed_mime_types/.test(sql) &&
        /image\/jpeg/.test(sql) &&
        /image\/png/.test(sql) &&
        /image\/webp/.test(sql)
      );
    },
  },
  {
    name: "avatars bucket has file size limit",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/file_size_limit/.test(sql)) return false;
      return (
        /2097152/.test(sql) ||
        /2\s*m/i.test(sql) ||
        /2\s*\*\s*1024\s*\*\s*1024/.test(sql)
      );
    },
  },
  {
    name: "storage policy uses foldername or path for user isolation",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
      const usesPathMatch =
        /\(\s*storage\.foldername\s*\(/.test(sql) ||
        /\bname\b.*auth\.uid\(\)/.test(sql);
      return (
        (usesFoldername || usesPathMatch) &&
        /auth\.uid\(\)\s*::\s*text/.test(sql)
      );
    },
  },
  {
    name: "storage policy uses TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const storagePolicies = policyBlocks.filter((p) =>
        p.toLowerCase().includes("storage.objects"),
      );
      const hasAuthenticatedPolicy = storagePolicies.some(
        (p) =>
          /to\s+(authenticated|public)/.test(p.toLowerCase()) ||
          /auth\.uid\(\)/.test(p.toLowerCase()),
      );
      if (!hasAuthenticatedPolicy) return false;
      const insertPolicies = storagePolicies.filter((p) =>
        /for\s+insert/.test(p.toLowerCase()),
      );
      return insertPolicies.every(
        (p) =>
          /to\s+authenticated/.test(p.toLowerCase()) ||
          /auth\.uid\(\)/.test(p.toLowerCase()),
      );
    },
  },
  {
    name: "public read policy for avatars",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const avatarSelectPolicies = policyBlocks.filter(
        (p) =>
          p.toLowerCase().includes("storage.objects") &&
          /for\s+select/.test(p.toLowerCase()) &&
          p.toLowerCase().includes("avatars"),
      );
      if (avatarSelectPolicies.length === 0) return false;
      return avatarSelectPolicies.some((p) => {
        const lower = p.toLowerCase();
        const hasExplicitPublic =
          /to\s+public/.test(lower) || /to\s+anon/.test(lower);
        const hasNoToClause = !/\bto\s+\w+/.test(lower);
        const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
        return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
      });
    },
  },
  {
    name: "documents bucket is fully private",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const documentPolicies = policyBlocks.filter(
        (p) =>
          p.toLowerCase().includes("storage.objects") &&
          p.toLowerCase().includes("documents"),
      );
      if (documentPolicies.length === 0) return false;
      return documentPolicies.every(
        (p) =>
          !/to\s+public/.test(p) &&
          !/to\s+anon/.test(p) &&
          (/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
      );
    },
  },
  {
    name: "creates file_metadata table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return /create\s+table/.test(sql) && /file_metadata/.test(sql);
    },
  },
  {
    name: "file_metadata has FK to auth.users with CASCADE",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /references\s+auth\.users/.test(sql) &&
        /on\s+delete\s+cascade/.test(sql)
      );
    },
  },
  {
    name: "RLS enabled on file_metadata",
    check: () =>
      /alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "file_metadata policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const metadataPolicies = policyBlocks.filter((p) =>
        p.toLowerCase().includes("file_metadata"),
      );
      for (const policy of metadataPolicies) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "uses timestamptz for time columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (
        !sql.includes("created_at") &&
        !sql.includes("updated_at") &&
        !sql.includes("uploaded_at")
      ) {
        return true;
      }
      const columnDefs = sql.match(
        /(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
      );
      if (!columnDefs) return true;
      return columnDefs.every((def) =>
        /timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
      );
    },
  },
  {
    name: "index on file_metadata user_id",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+index/.test(sql) &&
        /file_metadata/.test(sql) &&
        /user_id/.test(sql)
      );
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const signals = [
        /insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
        /insert\s+into\s+storage\.buckets[\s\S]*?documents/,
        /allowed_mime_types/,
        /file_size_limit/,
        /storage\.foldername/,
        /auth\.uid\(\)\s*::\s*text/,
        /to\s+authenticated/,
        /to\s+(public|anon)/,
        /enable\s+row\s+level\s+security/,
        /on\s+delete\s+cascade/,
        /\(select\s+auth\.uid\(\)\)/,
        /create\s+index/,
        /timestamptz/,
        /if\s+not\s+exists/,
        /create\s+table[\s\S]*?file_metadata/,
      ];
      return signals.filter((r) => r.test(sql)).length >= 11;
    },
  },
];
test("creates avatars bucket", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (
    !/storage\.buckets/.test(sql) ||
    !/avatars/.test(sql) ||
    !/public/.test(sql)
  ) {
    expect(false).toBe(true);
    return;
  }
  const avatarsBlock = sql.match(
    /insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
  );
  expect(avatarsBlock !== null && /true/.test(avatarsBlock[0])).toBe(true);
});

test("creates documents bucket", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/documents/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const documentsBlock = sql.match(
    /insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
  );
  expect(documentsBlock !== null && /false/.test(documentsBlock[0])).toBe(true);
});

test("avatars bucket has mime type restriction", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /allowed_mime_types/.test(sql) &&
      /image\/jpeg/.test(sql) &&
      /image\/png/.test(sql) &&
      /image\/webp/.test(sql),
  ).toBe(true);
});

test("avatars bucket has file size limit", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/file_size_limit/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  expect(
    /2097152/.test(sql) ||
      /2\s*m/i.test(sql) ||
      /2\s*\*\s*1024\s*\*\s*1024/.test(sql),
  ).toBe(true);
});

test("storage policy uses foldername or path for user isolation", () => {
  const sql = getMigrationSQL().toLowerCase();
  const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
  const usesPathMatch =
    /\(\s*storage\.foldername\s*\(/.test(sql) ||
    /\bname\b.*auth\.uid\(\)/.test(sql);
  expect(
    (usesFoldername || usesPathMatch) && /auth\.uid\(\)\s*::\s*text/.test(sql),
  ).toBe(true);
});

test("storage policy uses TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const storagePolicies = policyBlocks.filter((p) =>
    p.toLowerCase().includes("storage.objects"),
  );
  const hasAuthenticatedPolicy = storagePolicies.some(
    (p) =>
      /to\s+(authenticated|public)/.test(p.toLowerCase()) ||
      /auth\.uid\(\)/.test(p.toLowerCase()),
  );
  if (!hasAuthenticatedPolicy) {
    expect(false).toBe(true);
    return;
  }
  const insertPolicies = storagePolicies.filter((p) =>
    /for\s+insert/.test(p.toLowerCase()),
  );
  expect(
    insertPolicies.every(
      (p) =>
        /to\s+authenticated/.test(p.toLowerCase()) ||
        /auth\.uid\(\)/.test(p.toLowerCase()),
    ),
  ).toBe(true);
});

test("public read policy for avatars", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const avatarSelectPolicies = policyBlocks.filter(
    (p) =>
      p.toLowerCase().includes("storage.objects") &&
      /for\s+select/.test(p.toLowerCase()) &&
      p.toLowerCase().includes("avatars"),
  );
  if (avatarSelectPolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    avatarSelectPolicies.some((p) => {
      const lower = p.toLowerCase();
      const hasExplicitPublic =
        /to\s+public/.test(lower) || /to\s+anon/.test(lower);
      const hasNoToClause = !/\bto\s+\w+/.test(lower);
      const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
      return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
    }),
  ).toBe(true);
});

test("documents bucket is fully private", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const documentPolicies = policyBlocks.filter(
    (p) =>
      p.toLowerCase().includes("storage.objects") &&
      p.toLowerCase().includes("documents"),
  );
  if (documentPolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    documentPolicies.every(
      (p) =>
        !/to\s+public/.test(p) &&
        !/to\s+anon/.test(p) &&
        (/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
    ),
  ).toBe(true);
});

test("creates file_metadata table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(/create\s+table/.test(sql) && /file_metadata/.test(sql)).toBe(true);
});

test("file_metadata has FK to auth.users with CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("RLS enabled on file_metadata", () => {
  expect(
    /alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("file_metadata policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const metadataPolicies = policyBlocks.filter((p) =>
    p.toLowerCase().includes("file_metadata"),
  );
  for (const policy of metadataPolicies) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("uses timestamptz for time columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (
    !sql.includes("created_at") &&
    !sql.includes("updated_at") &&
    !sql.includes("uploaded_at")
  ) {
    expect(true).toBe(true);
    return;
  }
  const columnDefs = sql.match(
    /(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
  );
  if (!columnDefs) {
    expect(true).toBe(true);
    return;
  }
  expect(
    columnDefs.every((def) =>
      /timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
    ),
  ).toBe(true);
});

test("index on file_metadata user_id", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+index/.test(sql) &&
      /file_metadata/.test(sql) &&
      /user_id/.test(sql),
  ).toBe(true);
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const signals = [
    /insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
    /insert\s+into\s+storage\.buckets[\s\S]*?documents/,
    /allowed_mime_types/,
    /file_size_limit/,
    /storage\.foldername/,
    /auth\.uid\(\)\s*::\s*text/,
    /to\s+authenticated/,
    /to\s+(public|anon)/,
    /enable\s+row\s+level\s+security/,
    /on\s+delete\s+cascade/,
    /\(select\s+auth\.uid\(\)\)/,
    /create\s+index/,
    /timestamptz/,
    /if\s+not\s+exists/,
    /create\s+table[\s\S]*?file_metadata/,
  ];
  expect(signals.filter((r) => r.test(sql)).length >= 11).toBe(true);
});
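The folder-isolation checks above encode the convention that an object's first path segment must equal the uploader's user id, the effect of SQL along the lines of `(storage.foldername(name))[1] = auth.uid()::text`. A hypothetical TypeScript model of that rule (illustrative only; `isUserScopedPath` is not repository code):

```typescript
// Illustrative only: model the storage path-isolation rule the eval looks for.
// storage.foldername(name) yields the folder segments of an object name, and
// Postgres arrays are 1-indexed, so segment [1] is the top-level folder.
function isUserScopedPath(objectName: string, userId: string): boolean {
  const firstFolder = objectName.split("/")[0];
  return firstFolder === userId;
}

console.log(isUserScopedPath("9f8b/avatar.png", "9f8b")); // true
console.log(isUserScopedPath("other-user/avatar.png", "9f8b")); // false
```

Uploads therefore pass the policy only when the client writes to `<auth.uid()>/...`, which is why the tests also require the `auth.uid()::text` cast.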
packages/evals/evals/storage-rls-user-folders/meta.ts (new file, +10)
@@ -0,0 +1,10 @@
export const expectedReferenceFiles = [
  "storage-access-control.md",
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
@@ -1,5 +1,8 @@
{
  "name": "storage-rls-user-folders",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,216 +1,193 @@
|
||||
export const expectedReferenceFiles = [
|
||||
"db-rls-mandatory.md",
|
||||
"db-rls-policy-types.md",
|
||||
"db-rls-common-mistakes.md",
|
||||
"db-rls-performance.md",
|
||||
"db-security-functions.md",
|
||||
"db-schema-auth-fk.md",
|
||||
"db-schema-timestamps.md",
|
||||
"db-perf-indexes.md",
|
||||
"db-migrations-idempotent.md",
|
||||
];
|
||||
import { expect, test } from "vitest";
|
||||
|
||||
import type { EvalAssertion } from "../../src/eval-types.js";
|
||||
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
|
||||
|
||||
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
|
||||
test("migration file exists", () => {
|
||||
expect(findMigrationFiles().length > 0).toBe(true);
|
||||
});
|
||||
|
||||
export const assertions: EvalAssertion[] = [
|
||||
{
|
||||
name: "migration file exists",
|
||||
check: () => findMigrationFiles().length > 0,
|
||||
},
|
||||
{
|
||||
name: "creates organizations table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?organizations/.test(
|
||||
getMigrationSQL().toLowerCase(),
|
||||
test("creates organizations table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?organizations/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("creates memberships table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("creates projects table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("enables RLS on all tables", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
expect(
|
||||
/alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
),
|
||||
},
|
||||
{
|
||||
name: "creates memberships table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
|
||||
},
|
||||
{
|
||||
name: "creates projects table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
|
||||
},
|
||||
{
|
||||
name: "enables RLS on all tables",
|
||||
check: () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
return (
|
||||
/alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
)
|
||||
);
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "FK to auth.users with ON DELETE CASCADE",
|
||||
check: () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
return (
|
||||
/references\s+auth\.users/.test(sql) &&
|
||||
/on\s+delete\s+cascade/.test(sql)
|
||||
);
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "org_id FK on projects",
|
||||
check: () =>
|
||||
/org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
|
||||
getMigrationSQL().toLowerCase(),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("FK to auth.users with ON DELETE CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("org_id FK on projects", () => {
  expect(
    /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("private schema created", () => {
  expect(
    /create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
  ).toBe(true);
});

test("security_definer helper function", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /private\./.test(sql) &&
      /security\s+definer/.test(sql) &&
      /set\s+search_path\s*=\s*''/.test(sql),
  ).toBe(true);
});

test("policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  if (policyBlocks.length === 0) {
    expect(false).toBe(true);
    return;
  }
  for (const policy of policyBlocks) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("policies use TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  expect(
    policyBlocks.length > 0 &&
      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
  ).toBe(true);
});

test("index on membership lookup columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/create\s+index/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
  expect(
    indexBlocks.filter(
      (idx) =>
        idx.includes("user_id") ||
        idx.includes("org_id") ||
        idx.includes("organization_id"),
    ).length >= 1,
  ).toBe(true);
});

test("uses timestamptz", () => {
  const rawSql = getMigrationSQL().toLowerCase();
  const sql = rawSql.replace(/--[^\n]*/g, "");
  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
  if (
    sql.includes("created_at") ||
    sql.includes("updated_at") ||
    sql.includes("_at ")
  ) {
    expect(hasPlainTimestamp.test(sql)).toBe(false);
  } else {
    expect(true).toBe(true);
  }
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("stable or immutable on helper function", () => {
  expect(/\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase())).toBe(
    true,
  );
});

test("delete policy restricted to owner role", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const deletePolicy = policyBlocks.find(
    (p) =>
      p.toLowerCase().includes("delete") && p.toLowerCase().includes("project"),
  );
  if (!deletePolicy) {
    expect(false).toBe(true);
    return;
  }
  expect(/owner|admin/.test(deletePolicy.toLowerCase())).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const signals = [
    /alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) &&
      /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
        sql,
      ) &&
      /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
        sql,
      ),
|
||||
  },
  {
    name: "private schema created",
    check: () =>
      /create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "security_definer helper function",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /private\./.test(sql) &&
        /security\s+definer/.test(sql) &&
        /set\s+search_path\s*=\s*''/.test(sql)
      );
    },
  },
  {
    name: "policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      if (policyBlocks.length === 0) return false;
      for (const policy of policyBlocks) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "policies use TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      return (
        policyBlocks.length > 0 &&
        policyBlocks.every((p) => /to\s+authenticated/.test(p))
      );
    },
  },
  {
    name: "index on membership lookup columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/create\s+index/.test(sql)) return false;
      const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
      return (
        indexBlocks.filter(
          (idx) =>
            idx.includes("user_id") ||
            idx.includes("org_id") ||
            idx.includes("organization_id"),
        ).length >= 1
      );
    },
  },
  {
    name: "uses timestamptz",
    check: () => {
      const rawSql = getMigrationSQL().toLowerCase();
      const sql = rawSql.replace(/--[^\n]*/g, "");
      const hasPlainTimestamp =
        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
      if (
        sql.includes("created_at") ||
        sql.includes("updated_at") ||
        sql.includes("_at ")
      ) {
        return !hasPlainTimestamp.test(sql);
      }
      return true;
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "stable or immutable on helper function",
    check: () =>
      /\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "delete policy restricted to owner role",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const deletePolicy = policyBlocks.find(
        (p) =>
          p.toLowerCase().includes("delete") &&
          p.toLowerCase().includes("project"),
      );
      if (!deletePolicy) return false;
      return /owner|admin/.test(deletePolicy.toLowerCase());
    },
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const signals = [
        /alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) &&
          /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
            sql,
          ) &&
          /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
            sql,
          ),
        /references\s+auth\.users/.test(sql) &&
          /on\s+delete\s+cascade/.test(sql),
        /create\s+schema[\s\S]*?private/.test(sql),
        /security\s+definer/.test(sql) &&
          /set\s+search_path\s*=\s*''/.test(sql),
        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
        policyBlocks.length > 0 &&
          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
        /create\s+index/.test(sql),
        !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
          sql.replace(/--[^\n]*/g, ""),
        ),
        /if\s+not\s+exists/.test(sql),
        policyBlocks.some(
          (p) =>
            p.toLowerCase().includes("delete") &&
            p.toLowerCase().includes("project") &&
            /owner|admin/.test(p.toLowerCase()),
        ),
        /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(sql),
        policyBlocks.length >= 3,
        /role/.test(sql),
        /private\./.test(sql),
        /\bstable\b|\bimmutable\b/.test(sql),
      ];
      return signals.filter(Boolean).length >= 11;
    },
  },
];
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
    /create\s+schema[\s\S]*?private/.test(sql),
    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
    policyBlocks.length > 0 &&
      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
    /create\s+index/.test(sql),
    !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
      sql.replace(/--[^\n]*/g, ""),
    ),
    /if\s+not\s+exists/.test(sql),
    policyBlocks.some(
      (p) =>
        p.toLowerCase().includes("delete") &&
        p.toLowerCase().includes("project") &&
        /owner|admin/.test(p.toLowerCase()),
    ),
    /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(sql),
    policyBlocks.length >= 3,
    /role/.test(sql),
    /private\./.test(sql),
    /\bstable\b|\bimmutable\b/.test(sql),
  ];
  expect(signals.filter(Boolean).length >= 11).toBe(true);
});

11 packages/evals/evals/team-rls-security-definer/meta.ts Normal file
@@ -0,0 +1,11 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-policy-types.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
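meta.ts lists the reference files the agent is expected to open while solving the scenario. A scorer in the spirit of the deleted `referenceFilesUsageScorer` (the exact shape below is an illustrative assumption, not the real scorer) would compute the fraction of expected files the agent actually read:

```typescript
// Hypothetical scorer: fraction of expected reference files the agent
// read during the run. Paths read by the agent are reduced to their
// basenames so ".agents/skills/supabase/references/x.md" matches "x.md".
export function referenceFilesScore(
  expected: string[],
  filesRead: string[],
): number {
  if (expected.length === 0) return 1; // nothing expected, full credit
  const read = new Set(filesRead.map((f) => f.split("/").pop() ?? f));
  const hits = expected.filter((f) => read.has(f)).length;
  return hits / expected.length;
}
```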

@@ -1,5 +1,8 @@
 {
   "name": "team-rls-security-definer",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

125 packages/evals/experiments/experiment.ts Normal file
@@ -0,0 +1,125 @@
import { execFileSync } from "node:child_process";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import type { ExperimentConfig } from "@vercel/agent-eval";

const __dirname = dirname(fileURLToPath(import.meta.url));
const EVALS_ROOT = resolve(__dirname, "..");
const REPO_ROOT = resolve(EVALS_ROOT, "..", "..");
const PROJECT_DIR = join(EVALS_ROOT, "project");

const SKILL_NAME = process.env.EVAL_SKILL ?? "supabase";
const SKILL_DIR = join(REPO_ROOT, "skills", SKILL_NAME);

const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const isBaseline = process.env.EVAL_BASELINE === "true";

// ---------------------------------------------------------------------------
// Skill file loader — reads all skill files to inject into the sandbox
// ---------------------------------------------------------------------------

function readSkillFiles(): Record<string, string> {
  const files: Record<string, string> = {};

  for (const name of ["SKILL.md", "AGENTS.md"]) {
    const src = join(SKILL_DIR, name);
    if (existsSync(src)) {
      const content = readFileSync(src, "utf-8");
      files[`.agents/skills/${SKILL_NAME}/${name}`] = content;
      files[`.claude/skills/${SKILL_NAME}/${name}`] = content;
    }
  }

  const refsDir = join(SKILL_DIR, "references");
  if (existsSync(refsDir)) {
    for (const f of readdirSync(refsDir)) {
      const content = readFileSync(join(refsDir, f), "utf-8");
      files[`.agents/skills/${SKILL_NAME}/references/${f}`] = content;
      files[`.claude/skills/${SKILL_NAME}/references/${f}`] = content;
    }
  }

  return files;
}

// ---------------------------------------------------------------------------
// DB reset — clears all user-created objects between scenarios
// ---------------------------------------------------------------------------

const RESET_SQL = `
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
NOTIFY pgrst, 'reload schema';
`.trim();

function resetDB(): void {
  const dbUrl =
    process.env.SUPABASE_DB_URL ??
    "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
  execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
    stdio: "inherit",
    timeout: 30_000,
  });
}

// ---------------------------------------------------------------------------
// Experiment configuration
// ---------------------------------------------------------------------------

const config: ExperimentConfig = {
  agent: "claude-code",
  model: "claude-sonnet-4-6",
  runs: 1,
  earlyExit: true,
  timeout: 1800,
  sandbox: "docker",
  evals: process.env.EVAL_SCENARIO ?? "*",

  setup: async (sandbox) => {
    // 1. Reset DB for a clean slate
    resetDB();

    // 2. Seed supabase config so the agent can run `supabase db push`
    const configPath = join(PROJECT_DIR, "supabase", "config.toml");
    if (existsSync(configPath)) {
      await sandbox.writeFiles({
        "supabase/config.toml": readFileSync(configPath, "utf-8"),
      });
    }

    // 3. Write MCP config pointing to host Supabase instance
    await sandbox.writeFiles({
      ".mcp.json": JSON.stringify(
        {
          mcpServers: {
            supabase: { type: "http", url: `${supabaseUrl}/mcp` },
          },
        },
        null,
        "\t",
      ),
    });

    // 4. Write eval-utils.ts into the workspace so EVAL.ts can import it
    //    (agent-eval only copies the fixture's own directory into the sandbox)
    const evalUtilsPath = join(EVALS_ROOT, "evals", "eval-utils.ts");
    if (existsSync(evalUtilsPath)) {
      await sandbox.writeFiles({
        "eval-utils.ts": readFileSync(evalUtilsPath, "utf-8"),
      });
    }

    // 5. Install skill files (unless baseline mode)
    if (!isBaseline) {
      await sandbox.writeFiles(readSkillFiles());
    }
  },
};

export default config;
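Step 3 of `setup()` serializes the `.mcp.json` payload with plain `JSON.stringify` and tab indentation. Extracted as a standalone function for a quick sanity check (the function name is mine; the payload shape is taken verbatim from `setup()` above):

```typescript
// Reproduce the .mcp.json payload written into the sandbox for a given
// host Supabase URL, exactly as setup() builds it (tab-indented JSON).
export function mcpConfig(supabaseUrl: string): string {
  return JSON.stringify(
    {
      mcpServers: {
        supabase: { type: "http", url: `${supabaseUrl}/mcp` },
      },
    },
    null,
    "\t",
  );
}
```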

2356 packages/evals/package-lock.json generated
File diff suppressed because it is too large
@@ -6,17 +6,19 @@
   "license": "MIT",
   "description": "Agent evaluation system for Supabase skills",
   "scripts": {
-    "eval": "tsx src/runner.ts",
-    "eval:upload": "BRAINTRUST_UPLOAD=true tsx src/runner.ts"
+    "eval": "agent-eval",
+    "eval:dry": "agent-eval --dry",
+    "eval:smoke": "agent-eval --smoke",
+    "eval:upload": "tsx src/upload.ts"
   },
   "dependencies": {
-    "@anthropic-ai/claude-code": "^2.1.49",
-    "braintrust": "^3.0.0",
-    "skills": "^1.4.0"
+    "@vercel/agent-eval": "^0.9.2",
+    "braintrust": "^3.0.0"
   },
   "devDependencies": {
     "@types/node": "^20.10.0",
     "tsx": "^4.7.0",
-    "typescript": "^5.3.0"
+    "typescript": "^5.3.0",
+    "vitest": "^4.0.18"
   }
 }

55 packages/evals/scripts/eval.sh Executable file
@@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
EVALS_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
PROJECT_DIR="$EVALS_DIR/project"

# ---------------------------------------------------------------------------
# Parse CLI arguments
# ---------------------------------------------------------------------------

AGENT_EVAL_ARGS=()
UPLOAD=true # Always upload to Braintrust by default

while [[ $# -gt 0 ]]; do
  case "$1" in
    --skill)
      export EVAL_SKILL="$2"
      shift 2
      ;;
    --scenario)
      export EVAL_SCENARIO="$2"
      shift 2
      ;;
    *)
      AGENT_EVAL_ARGS+=("$1")
      shift
      ;;
  esac
done

echo "Starting Supabase..."
supabase start --exclude studio,imgproxy,mailpit --workdir "$PROJECT_DIR"

# Export keys so experiment.ts and vitest assertions can connect
eval "$(supabase status --output json --workdir "$PROJECT_DIR" | \
  node -e "
    const s = JSON.parse(require('fs').readFileSync('/dev/stdin','utf-8'));
    console.log('export SUPABASE_URL=' + (s.API_URL || 'http://127.0.0.1:54321'));
    console.log('export SUPABASE_ANON_KEY=' + s.ANON_KEY);
    console.log('export SUPABASE_SERVICE_ROLE_KEY=' + s.SERVICE_ROLE_KEY);
    console.log('export SUPABASE_DB_URL=' + (s.DB_URL || 'postgresql://postgres:postgres@127.0.0.1:54322/postgres'));
  ")"

trap 'echo "Stopping Supabase..."; supabase stop --no-backup --workdir "$PROJECT_DIR"' EXIT

echo "Running agent-eval..."
cd "$EVALS_DIR"
npx agent-eval "${AGENT_EVAL_ARGS[@]+"${AGENT_EVAL_ARGS[@]}"}"

# Upload results to Braintrust (default: true, skip with --no-upload)
if [ "$UPLOAD" = "true" ]; then
  echo "Uploading results to Braintrust..."
  npx tsx src/upload.ts
fi
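The inline `node -e` snippet in eval.sh maps `supabase status --output json` fields to shell export lines, with hard-coded local-stack fallbacks. The same mapping as a testable TypeScript function (field names and fallbacks copied from the script; the function itself is illustrative):

```typescript
// Mirror of the node -e snippet in eval.sh: turn `supabase status
// --output json` fields into `export KEY=value` lines, with the same
// localhost fallbacks for URL and DB connection string.
export function exportLines(s: Record<string, string | undefined>): string[] {
  return [
    `export SUPABASE_URL=${s.API_URL || "http://127.0.0.1:54321"}`,
    `export SUPABASE_ANON_KEY=${s.ANON_KEY}`,
    `export SUPABASE_SERVICE_ROLE_KEY=${s.SERVICE_ROLE_KEY}`,
    `export SUPABASE_DB_URL=${s.DB_URL || "postgresql://postgres:postgres@127.0.0.1:54322/postgres"}`,
  ];
}
```

Feeding the four lines through `eval` in bash is what makes the keys visible to `experiment.ts` and the vitest assertions later in the script.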

@@ -1,21 +0,0 @@
/**
 * A single assertion to run against the agent's workspace output.
 *
 * Used by EVAL.ts files to declare what the agent's work should produce.
 * The runner executes these in-process (no test framework required).
 */
export interface EvalAssertion {
  /** Human-readable name shown in Braintrust and local output */
  name: string;
  /** Return true = pass, false/throw = fail */
  check: () => boolean | Promise<boolean>;
  /** Timeout in ms for async checks (default: no timeout) */
  timeout?: number;
}

/** Result of running a single EvalAssertion */
export interface AssertionResult {
  name: string;
  passed: boolean;
  error?: string;
}
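Before the switch to vitest, an EVAL.ts exported `EvalAssertion` objects like these. A self-contained example of declaring one and resolving it to an `AssertionResult` (the mini-runner here is illustrative, not the deleted runner):

```typescript
interface EvalAssertion {
  name: string;
  check: () => boolean | Promise<boolean>;
  timeout?: number;
}

// Example assertion in the old style: a regex check against SQL text.
const assertion: EvalAssertion = {
  name: "migration enables RLS",
  check: () =>
    /enable\s+row\s+level\s+security/.test(
      "alter table t enable row level security;",
    ),
};

// Minimal illustrative runner: resolve the check, treat throws as failure.
async function run(a: EvalAssertion): Promise<{ name: string; passed: boolean }> {
  try {
    return { name: a.name, passed: Boolean(await a.check()) };
  } catch {
    return { name: a.name, passed: false };
  }
}
```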

@@ -1,372 +0,0 @@
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join, resolve } from "node:path";
import type { AssertionResult, EvalAssertion } from "./eval-types.js";
import { runAgent } from "./runner/agent.js";
import {
  seedBraintrustDataset,
  uploadToBraintrust,
} from "./runner/braintrust.js";
import { createResultDir, saveRunArtifacts } from "./runner/persist.js";
import { preflight } from "./runner/preflight.js";
import { listModifiedFiles, printSummary } from "./runner/results.js";
import { createWorkspace } from "./runner/scaffold.js";
import {
  assertionsPassedScorer,
  finalResultScorer,
  referenceFilesUsageScorer,
  skillUsageScorer,
} from "./runner/scorers.js";
import {
  getKeys,
  resetDB,
  startSupabase,
  stopSupabase,
} from "./runner/supabase-setup.js";
import {
  buildTranscriptSummary,
  type TranscriptSummary,
} from "./runner/transcript.js";
import type { EvalRunResult, EvalScenario } from "./types.js";

// ---------------------------------------------------------------------------
// Configuration from environment
// ---------------------------------------------------------------------------

const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";
const DEFAULT_SKILL = "supabase";
const AGENT_TIMEOUT = 30 * 60 * 1000; // 30 minutes

const model = process.env.EVAL_MODEL ?? DEFAULT_MODEL;
const skillName = process.env.EVAL_SKILL ?? DEFAULT_SKILL;
const scenarioFilter = process.env.EVAL_SCENARIO;
const isBaseline = process.env.EVAL_BASELINE === "true";
const skillEnabled = !isBaseline;

// Run-level timestamp shared across all scenarios in a single invocation
const runTimestamp = new Date()
  .toISOString()
  .replace(/[:.]/g, "-")
  .replace("Z", "");

// ---------------------------------------------------------------------------
// Discover scenarios
// ---------------------------------------------------------------------------

function findEvalsDir(): string {
  let dir = process.cwd();
  for (let i = 0; i < 10; i++) {
    const candidate = join(dir, "packages", "evals", "evals");
    if (existsSync(candidate)) return candidate;
    const parent = resolve(dir, "..");
    if (parent === dir) break;
    dir = parent;
  }
  throw new Error("Could not find packages/evals/evals/ directory");
}

function discoverScenarios(): EvalScenario[] {
  const evalsDir = findEvalsDir();
  const dirs = readdirSync(evalsDir, { withFileTypes: true }).filter(
    (d) => d.isDirectory() && existsSync(join(evalsDir, d.name, "PROMPT.md")),
  );

  return dirs.map((d) => ({
    id: d.name,
    name: d.name,
    tags: [],
  }));
}

// ---------------------------------------------------------------------------
// Scenario threshold
// ---------------------------------------------------------------------------

function getPassThreshold(scenarioId: string): number | null {
  const scenariosDir = join(findEvalsDir(), "..", "scenarios");
  const scenarioFile = join(scenariosDir, `${scenarioId}.md`);
  if (!existsSync(scenarioFile)) return null;

  const content = readFileSync(scenarioFile, "utf-8");
  const match = content.match(/\*\*pass_threshold:\*\*\s*(\d+)/);
  return match ? Number.parseInt(match[1], 10) : null;
}

// ---------------------------------------------------------------------------
// In-process assertion runner (replaces vitest subprocess)
// ---------------------------------------------------------------------------

async function runAssertions(
  assertions: EvalAssertion[],
): Promise<AssertionResult[]> {
  return Promise.all(
    assertions.map(async (a) => {
      try {
        let result: boolean;
        if (a.timeout) {
          const timeoutPromise = new Promise<never>((_, reject) =>
            setTimeout(
              () =>
                reject(new Error(`Assertion timed out after ${a.timeout}ms`)),
              a.timeout,
            ),
          );
          result = await Promise.race([
            Promise.resolve(a.check()),
            timeoutPromise,
          ]);
        } else {
          result = await Promise.resolve(a.check());
        }
        return { name: a.name, passed: Boolean(result) };
      } catch (e) {
        return { name: a.name, passed: false, error: String(e) };
      }
    }),
  );
}

// ---------------------------------------------------------------------------
// Run a single eval
// ---------------------------------------------------------------------------

async function runEval(
  scenario: EvalScenario,
  skillEnabled: boolean,
): Promise<{
  result: EvalRunResult;
  transcript?: TranscriptSummary;
  expectedReferenceFiles: string[];
}> {
  const evalsDir = findEvalsDir();
  const evalDir = join(evalsDir, scenario.id);
  const variant = skillEnabled ? "with-skill" : "baseline";

  console.log(`\n--- ${scenario.id} (${variant}) ---`);

  // Load assertions and expected reference files from EVAL.ts
  const evalFilePath = existsSync(join(evalDir, "EVAL.tsx"))
    ? join(evalDir, "EVAL.tsx")
    : join(evalDir, "EVAL.ts");

  const {
    assertions = [] as EvalAssertion[],
    expectedReferenceFiles = [] as string[],
  } = await import(evalFilePath).catch(() => ({
    assertions: [] as EvalAssertion[],
    expectedReferenceFiles: [] as string[],
  }));

  const passThreshold = getPassThreshold(scenario.id);
  const prompt = readFileSync(join(evalDir, "PROMPT.md"), "utf-8").trim();

  // 1. Create isolated workspace
  const { workspacePath, cleanup } = createWorkspace({ evalDir, skillEnabled });
  console.log(`  Workspace: ${workspacePath}`);

  try {
    // 2. Run the agent
    console.log(`  Running agent (${model})...`);
    const startedAt = Date.now();
    const agentResult = await runAgent({
      cwd: workspacePath,
      prompt,
      model,
      timeout: AGENT_TIMEOUT,
      skillEnabled,
      skillName: skillEnabled ? skillName : undefined,
    });
    console.log(
      `  Agent finished in ${(agentResult.duration / 1000).toFixed(1)}s`,
    );

    // 3. Run assertions in-process from the workspace directory so that
    //    eval-utils.ts helpers resolve paths relative to the workspace.
    console.log("  Running assertions...");
    const prevCwd = process.cwd();
    process.chdir(workspacePath);
    const assertionResults = await runAssertions(assertions).finally(() => {
      process.chdir(prevCwd);
    });
    const passedCount = assertionResults.filter((a) => a.passed).length;
    const totalCount = assertionResults.length;

    const passed = passThreshold
      ? totalCount > 0 && passedCount >= passThreshold
      : totalCount > 0 && passedCount === totalCount;

    const pct =
      totalCount > 0 ? ((passedCount / totalCount) * 100).toFixed(1) : "0.0";
    const thresholdInfo = passThreshold
      ? `, threshold: ${((passThreshold / totalCount) * 100).toFixed(0)}%`
      : "";
    console.log(
      `  Assertions: ${passedCount}/${totalCount} passed (${pct}%${thresholdInfo})`,
    );

    // 4. Collect modified files
    const filesModified = listModifiedFiles(workspacePath, evalDir);

    // 5. Build transcript summary
    const summary = buildTranscriptSummary(agentResult.events);

    // 6. Run scorers
    const skillScore = skillUsageScorer(summary, skillName);
    const refScore = referenceFilesUsageScorer(summary, expectedReferenceFiles);
    const assertScore = assertionsPassedScorer({
      testsPassed: passedCount,
      testsTotal: totalCount,
      status: passed ? "passed" : "failed",
    } as EvalRunResult);
    const finalScore = finalResultScorer({
      status: passed ? "passed" : "failed",
      testsPassed: passedCount,
      testsTotal: totalCount,
      passThreshold: passThreshold ?? undefined,
    } as EvalRunResult);

    const result: EvalRunResult = {
      scenario: scenario.id,
      agent: "claude-code",
      model,
      skillEnabled,
      status: passed ? "passed" : "failed",
      duration: agentResult.duration,
      agentOutput: agentResult.output,
      testsPassed: passedCount,
      testsTotal: totalCount,
      passThreshold: passThreshold ?? undefined,
      assertionResults,
      filesModified,
      toolCallCount: summary.toolCalls.length,
      costUsd: summary.totalCostUsd ?? undefined,
      prompt,
      startedAt,
      durationApiMs: summary.totalDurationApiMs,
      totalInputTokens: summary.totalInputTokens,
      totalOutputTokens: summary.totalOutputTokens,
      totalCacheReadTokens: summary.totalCacheReadTokens,
      totalCacheCreationTokens: summary.totalCacheCreationTokens,
      modelUsage: summary.modelUsage,
      toolErrorCount: summary.toolErrorCount,
      permissionDenialCount: summary.permissionDenialCount,
      loadedSkills: summary.skills,
      referenceFilesRead: summary.referenceFilesRead,
      scores: {
        skillUsage: skillScore.score,
        referenceFilesUsage: refScore.score,
        assertionsPassed: assertScore.score,
        finalResult: finalScore.score,
      },
    };

    // 7. Persist results
    const resultDir = createResultDir(runTimestamp, scenario.id, variant);
    result.resultsDir = resultDir;
    saveRunArtifacts({
      resultDir,
      rawTranscript: agentResult.rawTranscript,
      assertionResults,
      result,
      transcriptSummary: summary,
    });

    return { result, transcript: summary, expectedReferenceFiles };
  } catch (error) {
    const err = error as Error;
    return {
      result: {
        scenario: scenario.id,
        agent: "claude-code",
        model,
        skillEnabled,
        status: "error",
        duration: 0,
        agentOutput: "",
        testsPassed: 0,
        testsTotal: 0,
        filesModified: [],
        error: err.message,
      },
      expectedReferenceFiles: [],
    };
  } finally {
    cleanup();
  }
}

// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------

async function main() {
  preflight();

  console.log("Supabase Skills Evals");
  console.log(`Model: ${model}`);
  console.log(`Mode: ${isBaseline ? "baseline (no skills)" : "with skills"}`);

  let scenarios = discoverScenarios();

  if (scenarioFilter) {
    scenarios = scenarios.filter((s) => s.id === scenarioFilter);
    if (scenarios.length === 0) {
      console.error(`Scenario not found: ${scenarioFilter}`);
      process.exit(1);
    }
  }

  console.log(`Scenarios: ${scenarios.map((s) => s.id).join(", ")}`);

  // Start the shared Supabase instance once for all scenarios.
  startSupabase();
  const keys = getKeys();

  // Inject keys into process.env so assertions can connect to the real DB.
  process.env.SUPABASE_URL = keys.apiUrl;
  process.env.SUPABASE_ANON_KEY = keys.anonKey;
  process.env.SUPABASE_SERVICE_ROLE_KEY = keys.serviceRoleKey;
  process.env.SUPABASE_DB_URL = keys.dbUrl;

  const results: EvalRunResult[] = [];
  const transcripts = new Map<string, TranscriptSummary>();
  const expectedRefFiles = new Map<string, string[]>();

  try {
    for (const scenario of scenarios) {
      // Reset the database before each scenario for a clean slate.
      console.log(`\n  Resetting DB for ${scenario.id}...`);
      resetDB(keys.dbUrl);

      const { result, transcript, expectedReferenceFiles } = await runEval(
        scenario,
        skillEnabled,
      );
      results.push(result);
      if (transcript) {
        transcripts.set(result.scenario, transcript);
      }
      expectedRefFiles.set(result.scenario, expectedReferenceFiles);
    }
  } finally {
    stopSupabase();
  }

  // Use the results dir from the first result (all share the same timestamp)
  const resultsDir = results.find((r) => r.resultsDir)?.resultsDir;
  printSummary(results, resultsDir);

  console.log("\nUploading to Braintrust...");
  await seedBraintrustDataset(results, expectedRefFiles);
  await uploadToBraintrust(results, {
    model,
    skillEnabled,
    runTimestamp,
    transcripts,
    expectedRefFiles,
  });
}

main().catch((err) => {
  console.error("Fatal error:", err);
  process.exit(1);
});
|
||||
@@ -1,145 +0,0 @@
import { spawn } from "node:child_process";
import { resolveClaudeBin } from "./preflight.js";
import {
  extractFinalOutput,
  parseStreamJsonOutput,
  type TranscriptEvent,
} from "./transcript.js";

export interface AgentRunResult {
  /** Extracted final text output (backward-compatible). */
  output: string;
  duration: number;
  /** Raw NDJSON transcript string from stream-json. */
  rawTranscript: string;
  /** Parsed transcript events. */
  events: TranscriptEvent[];
}

/**
 * Invoke Claude Code in print mode as a subprocess.
 *
 * Uses --output-format stream-json to capture structured NDJSON events
 * including tool calls, results, and reasoning steps.
 *
 * The agent operates in the workspace directory and can read/write files,
 * and has access to the local Supabase MCP server so it can apply migrations
 * and query the real database. --strict-mcp-config ensures only the local
 * Supabase instance is reachable — no host MCP servers leak in.
 *
 * --setting-sources project,local prevents skills from the user's global
 * ~/.agents/skills/ from leaking into the eval environment.
 *
 * When skillEnabled, --agents injects the target skill directly into the
 * agent's context, guaranteeing it is present (not just discoverable).
 */
export async function runAgent(opts: {
  cwd: string;
  prompt: string;
  model: string;
  timeout: number;
  skillEnabled: boolean;
  /** Skill name to inject via --agents (e.g. "supabase"). Used when skillEnabled. */
  skillName?: string;
}): Promise<AgentRunResult> {
  const start = Date.now();

  // Point the agent's MCP config at the shared local Supabase instance.
  // --strict-mcp-config ensures host .mcp.json is ignored entirely.
  const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
  const mcpConfig = JSON.stringify({
    mcpServers: {
      supabase: {
        type: "http",
        url: `${supabaseUrl}/mcp`,
      },
    },
  });

  const args = [
    "-p", // Print mode (non-interactive)
    "--verbose",
    "--output-format",
    "stream-json",
    "--model",
    opts.model,
    "--no-session-persistence",
    "--dangerously-skip-permissions",
    "--tools",
    "Edit,Write,Bash,Read,Glob,Grep",
    "--mcp-config",
    mcpConfig,
    "--strict-mcp-config",
    // Prevent skills from the user's global ~/.agents/skills/ from leaking
    // into the eval environment. Only workspace (project) and local sources
    // are loaded, so the eval sees only what was explicitly installed.
    "--setting-sources",
    "project,local",
  ];

  if (opts.skillEnabled && opts.skillName) {
    // Inject the target skill directly into the agent context via --agents.
    // This guarantees the skill is embedded in the subagent's context at
    // startup (not just available as a slash command).
    const agentsDef = JSON.stringify({
      main: {
        description: `Supabase developer agent with ${opts.skillName} skill`,
        skills: [opts.skillName],
      },
    });
    args.push("--agents", agentsDef);
  } else if (!opts.skillEnabled) {
    // Baseline runs: disable all skills so the agent relies on innate knowledge
    args.push("--disable-slash-commands");
  }

  const env = { ...process.env };
  // Remove all Claude-related env vars to avoid nested-session detection
  for (const key of Object.keys(env)) {
    if (key === "CLAUDECODE" || key.startsWith("CLAUDE_")) {
      delete env[key];
    }
  }

  const claudeBin = resolveClaudeBin();

  return new Promise<AgentRunResult>((resolve) => {
    const child = spawn(claudeBin, args, {
      cwd: opts.cwd,
      env,
      stdio: ["pipe", "pipe", "pipe"],
    });

    // Pipe prompt via stdin and close — this is the standard way to
    // pass multi-line prompts to `claude -p`.
    child.stdin.write(opts.prompt);
    child.stdin.end();

    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (d: Buffer) => {
      stdout += d.toString();
    });
    child.stderr.on("data", (d: Buffer) => {
      stderr += d.toString();
    });

    const timer = setTimeout(() => {
      child.kill();
    }, opts.timeout);

    child.on("close", () => {
      clearTimeout(timer);
      const rawTranscript = stdout || stderr;
      const events = parseStreamJsonOutput(rawTranscript);
      const output = extractFinalOutput(events) || rawTranscript;

      resolve({
        output,
        duration: Date.now() - start,
        rawTranscript,
        events,
      });
    });
  });
}
@@ -1,295 +0,0 @@
import assert from "node:assert";
import { init, initDataset, initLogger, type Logger } from "braintrust";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

/**
 * Initialize a Braintrust project logger for real-time per-scenario logging.
 * Call this once at startup and pass the logger to logScenarioToLogger().
 */
export function initBraintrustLogger(): Logger<true> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
  return initLogger({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    asyncFlush: true,
  });
}

/**
 * Log a single scenario result to the Braintrust project logger in real-time.
 * This runs alongside the experiment upload, giving immediate visibility in
 * the project log as each scenario completes.
 */
export function logScenarioToLogger(
  logger: Logger<true>,
  r: EvalRunResult,
  transcript?: TranscriptSummary,
): void {
  const scores: Record<string, number> = {
    skill_usage: r.scores?.skillUsage ?? 0,
    reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
    assertions_passed: r.scores?.assertionsPassed ?? 0,
    final_result: r.scores?.finalResult ?? 0,
  };

  const metadata: Record<string, unknown> = {
    agent: r.agent,
    model: r.model,
    skillEnabled: r.skillEnabled,
    testsPassed: r.testsPassed,
    testsTotal: r.testsTotal,
    toolCallCount: r.toolCallCount ?? 0,
    contextWindowUsed:
      (r.totalInputTokens ?? 0) +
      (r.totalCacheReadTokens ?? 0) +
      (r.totalCacheCreationTokens ?? 0),
    totalOutputTokens: r.totalOutputTokens,
    modelUsage: r.modelUsage,
    toolErrorCount: r.toolErrorCount,
    permissionDenialCount: r.permissionDenialCount,
    loadedSkills: r.loadedSkills,
    referenceFilesRead: r.referenceFilesRead,
    ...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
    ...(r.error ? { error: r.error } : {}),
  };

  const spanOptions = r.startedAt
    ? { name: r.scenario, startTime: r.startedAt / 1000 }
    : { name: r.scenario };

  if (transcript && transcript.toolCalls.length > 0) {
    logger.traced((span) => {
      span.log({
        input: {
          scenario: r.scenario,
          prompt: r.prompt ?? "",
          skillEnabled: r.skillEnabled,
        },
        output: {
          status: r.status,
          agentOutput: r.agentOutput,
          filesModified: r.filesModified,
          assertionResults: r.assertionResults,
        },
        expected: { testsTotal: r.testsTotal },
        scores,
        metadata,
      });

      for (const tc of transcript.toolCalls) {
        span.traced(
          (childSpan) => {
            childSpan.log({
              input: { tool: tc.tool, args: tc.input },
              output: {
                preview: tc.outputPreview,
                isError: tc.isError,
                ...(tc.stderr ? { stderr: tc.stderr } : {}),
              },
              metadata: { toolUseId: tc.toolUseId },
            });
          },
          { name: `tool:${tc.tool}` },
        );
      }
    }, spanOptions);
  } else {
    logger.traced((span) => {
      span.log({
        input: {
          scenario: r.scenario,
          prompt: r.prompt ?? "",
          skillEnabled: r.skillEnabled,
        },
        output: {
          status: r.status,
          agentOutput: r.agentOutput,
          filesModified: r.filesModified,
          assertionResults: r.assertionResults,
        },
        expected: { testsTotal: r.testsTotal },
        scores,
        metadata,
      });
    }, spanOptions);
  }
}

/**
 * Seed a Braintrust dataset with one row per scenario.
 *
 * Uses scenario.id as the stable row ID so re-seeding is idempotent.
 * Each row stores the prompt and expected assertions/reference files,
 * giving Braintrust a stable baseline to track per-scenario score trends
 * across experiment runs.
 */
export async function seedBraintrustDataset(
  results: EvalRunResult[],
  expectedRefFiles: Map<string, string[]>,
): Promise<void> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const dataset = initDataset({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    dataset: "supabase-skill-scenarios",
  });

  for (const r of results) {
    dataset.insert({
      id: r.scenario,
      input: {
        scenario: r.scenario,
        prompt: r.prompt ?? "",
      },
      expected: {
        testsTotal: r.testsTotal,
        passThreshold: r.passThreshold ?? 1.0,
        expectedReferenceFiles: expectedRefFiles.get(r.scenario) ?? [],
      },
      metadata: { scenario: r.scenario },
    });
  }

  await dataset.flush();
  console.log("Braintrust dataset seeded: supabase-skill-scenarios");
}

/**
 * Upload eval results to Braintrust as an experiment.
 *
 * Each EvalRunResult becomes a row in the experiment with:
 * - input: scenario ID, prompt content, skillEnabled flag
 * - output: status, agent output, files modified, assertion results
 * - expected: total tests, pass threshold
 * - scores: skill_usage, reference_files_usage, assertions_passed, final_result
 * - metadata: agent, model, skillEnabled, test counts, tool calls, context window, output tokens, model usage, errors, cost
 * - spans: one child span per agent tool call (when transcript available)
 * - datasetRecordId: links this row to the dataset row for per-scenario tracking
 */
export async function uploadToBraintrust(
  results: EvalRunResult[],
  opts: {
    model: string;
    skillEnabled: boolean;
    runTimestamp: string;
    transcripts: Map<string, TranscriptSummary>;
    expectedRefFiles: Map<string, string[]>;
  },
): Promise<void> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const variant = opts.skillEnabled ? "skill" : "baseline";
  const experiment = await init({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    experiment: `${opts.model}-${variant}-${opts.runTimestamp}`,
    baseExperiment: process.env.BRAINTRUST_BASE_EXPERIMENT ?? undefined,
    metadata: {
      model: opts.model,
      skillEnabled: opts.skillEnabled,
      runTimestamp: opts.runTimestamp,
      scenarioCount: results.length,
    },
  });

  for (const r of results) {
    const transcript = opts.transcripts.get(r.scenario);

    const scores: Record<string, number> = {
      skill_usage: r.scores?.skillUsage ?? 0,
      reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
      assertions_passed: r.scores?.assertionsPassed ?? 0,
      final_result: r.scores?.finalResult ?? 0,
    };

    const input = {
      scenario: r.scenario,
      prompt: r.prompt ?? "",
      skillEnabled: r.skillEnabled,
    };

    const output = {
      status: r.status,
      agentOutput: r.agentOutput,
      filesModified: r.filesModified,
      assertionResults: r.assertionResults,
    };

    const expected = {
      testsTotal: r.testsTotal,
      passThreshold: 1.0,
    };

    const metadata: Record<string, unknown> = {
      agent: r.agent,
      model: r.model,
      skillEnabled: r.skillEnabled,
      testsPassed: r.testsPassed,
      testsTotal: r.testsTotal,
      toolCallCount: r.toolCallCount ?? 0,
      contextWindowUsed:
        (r.totalInputTokens ?? 0) +
        (r.totalCacheReadTokens ?? 0) +
        (r.totalCacheCreationTokens ?? 0),
      totalOutputTokens: r.totalOutputTokens,
      modelUsage: r.modelUsage,
      toolErrorCount: r.toolErrorCount,
      permissionDenialCount: r.permissionDenialCount,
      loadedSkills: r.loadedSkills,
      referenceFilesRead: r.referenceFilesRead,
      ...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
      ...(r.error ? { error: r.error } : {}),
    };

    const spanOptions = r.startedAt
      ? { name: r.scenario, startTime: r.startedAt / 1000 }
      : { name: r.scenario };

    if (transcript && transcript.toolCalls.length > 0) {
      experiment.traced((span) => {
        span.log({
          input,
          output,
          expected,
          scores,
          metadata,
          datasetRecordId: r.scenario,
        });

        for (const tc of transcript.toolCalls) {
          span.traced(
            (childSpan) => {
              childSpan.log({
                input: { tool: tc.tool, args: tc.input },
                output: {
                  preview: tc.outputPreview,
                  isError: tc.isError,
                  ...(tc.stderr ? { stderr: tc.stderr } : {}),
                },
                metadata: { toolUseId: tc.toolUseId },
              });
            },
            { name: `tool:${tc.tool}` },
          );
        }
      }, spanOptions);
    } else {
      experiment.traced((span) => {
        span.log({
          input,
          output,
          expected,
          scores,
          metadata,
          datasetRecordId: r.scenario,
        });
      }, spanOptions);
    }
  }

  const summary = await experiment.summarize();
  console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
  await experiment.close();
}
@@ -1,61 +0,0 @@
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { AssertionResult } from "../eval-types.js";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/** Resolve the base directory for storing results.
 * Supports EVAL_RESULTS_DIR override for Docker volume mounts. */
function resultsBase(): string {
  if (process.env.EVAL_RESULTS_DIR) {
    return process.env.EVAL_RESULTS_DIR;
  }
  // Default: packages/evals/results (__dirname is packages/evals/src/runner)
  return join(__dirname, "..", "..", "results");
}

/** Create the results directory for a single scenario run. Returns the path. */
export function createResultDir(
  runTimestamp: string,
  scenarioId: string,
  variant: "with-skill" | "baseline",
): string {
  const dir = join(resultsBase(), runTimestamp, scenarioId, variant);
  mkdirSync(dir, { recursive: true });
  return dir;
}

/** Save all artifacts for a single eval run. */
export function saveRunArtifacts(opts: {
  resultDir: string;
  rawTranscript: string;
  assertionResults: AssertionResult[];
  result: EvalRunResult;
  transcriptSummary: TranscriptSummary;
}): void {
  writeFileSync(
    join(opts.resultDir, "transcript.jsonl"),
    opts.rawTranscript,
    "utf-8",
  );

  writeFileSync(
    join(opts.resultDir, "assertions.json"),
    JSON.stringify(opts.assertionResults, null, 2),
    "utf-8",
  );

  writeFileSync(
    join(opts.resultDir, "result.json"),
    JSON.stringify(
      { ...opts.result, transcript: opts.transcriptSummary },
      null,
      2,
    ),
    "utf-8",
  );
}
@@ -1,126 +0,0 @@
import { execFileSync } from "node:child_process";
import { accessSync, constants, existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

/** Detect if we're running inside the eval Docker container. */
export function isRunningInDocker(): boolean {
  if (process.env.IN_DOCKER === "true") return true;
  try {
    accessSync("/.dockerenv", constants.F_OK);
    return true;
  } catch {
    return false;
  }
}

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/**
 * Resolve the `claude` binary path.
 *
 * Looks in the following order:
 * 1. Local node_modules/.bin/claude (installed via @anthropic-ai/claude-code)
 * 2. Global `claude` on PATH
 *
 * Throws with an actionable message when neither is found.
 */
export function resolveClaudeBin(): string {
  // packages/evals/node_modules/.bin/claude
  const localBin = join(
    __dirname,
    "..",
    "..",
    "node_modules",
    ".bin",
    "claude",
  );
  if (existsSync(localBin)) {
    return localBin;
  }

  // Fall back to PATH
  try {
    execFileSync("claude", ["--version"], {
      stdio: "ignore",
      timeout: 10_000,
    });
    return "claude";
  } catch {
    throw new Error(
      [
        "claude CLI not found.",
        "",
        "Install it in one of these ways:",
        "  npm install (uses @anthropic-ai/claude-code from package.json)",
        "  npm i -g @anthropic-ai/claude-code",
        "",
        "Ensure ANTHROPIC_API_KEY is set in the environment.",
      ].join("\n"),
    );
  }
}

/**
 * Verify the host environment has everything needed before spending
 * API credits on an eval run.
 *
 * Checks: Node >= 20, Docker running, supabase CLI available, claude CLI available, API key set.
 */
export function preflight(): void {
  const errors: string[] = [];

  // Node.js >= 20
  const [major] = process.versions.node.split(".").map(Number);
  if (major < 20) {
    errors.push(`Node.js >= 20 required (found ${process.versions.node})`);
  }

  // Docker daemon must be running — needed by the supabase CLI to manage containers.
  // Required whether running locally or inside the eval container (socket-mounted).
  try {
    execFileSync("docker", ["info"], { stdio: "ignore", timeout: 10_000 });
  } catch {
    errors.push(
      isRunningInDocker()
        ? "Docker daemon not reachable inside container. Mount the socket: -v /var/run/docker.sock:/var/run/docker.sock"
        : "Docker is not running (required by supabase CLI)",
    );
  }

  // Supabase CLI available
  try {
    execFileSync("supabase", ["--version"], {
      stdio: "ignore",
      timeout: 10_000,
    });
  } catch {
    errors.push(
      "supabase CLI not found. Install it: https://supabase.com/docs/guides/cli/getting-started",
    );
  }

  // Claude CLI available
  try {
    resolveClaudeBin();
  } catch (err) {
    errors.push((err as Error).message);
  }

  // API key
  if (!process.env.ANTHROPIC_API_KEY) {
    errors.push(
      "ANTHROPIC_API_KEY is not set. Claude Code requires this for authentication.",
    );
  }

  if (errors.length > 0) {
    console.error("Preflight checks failed:\n");
    for (const e of errors) {
      console.error(`  - ${e}`);
    }
    console.error("");
    process.exit(1);
  }
}
@@ -1,84 +0,0 @@
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";
import type { EvalRunResult } from "../types.js";

/**
 * List files created or modified by the agent in the workspace.
 * Compares against the original eval directory to find new files.
 */
export function listModifiedFiles(
  workspacePath: string,
  originalEvalDir: string,
): string[] {
  const modified: string[] = [];

  function walk(dir: string, prefix: string) {
    const entries = readdirSync(dir, { withFileTypes: true });
    for (const entry of entries) {
      if (
        entry.name === "node_modules" ||
        entry.name === ".agents" ||
        entry.name === ".claude" ||
        entry.name === "EVAL.ts" ||
        entry.name === "EVAL.tsx"
      )
        continue;

      const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
      const fullPath = join(dir, entry.name);

      if (entry.isDirectory()) {
        walk(fullPath, relPath);
      } else {
        // Check if file is new (not in original eval dir)
        const originalPath = join(originalEvalDir, relPath);
        try {
          statSync(originalPath);
        } catch {
          // File doesn't exist in original — it was created by the agent
          modified.push(relPath);
        }
      }
    }
  }

  walk(workspacePath, "");
  return modified;
}

/** Print a summary table of eval results. */
export function printSummary(
  results: EvalRunResult[],
  resultsDir?: string,
): void {
  console.log("\n=== Eval Results ===\n");

  for (const r of results) {
    const icon = r.status === "passed" ? "PASS" : "FAIL";
    const skill = r.skillEnabled ? "with-skill" : "baseline";
    const pct =
      r.testsTotal > 0
        ? ((r.testsPassed / r.testsTotal) * 100).toFixed(1)
        : "0.0";
    const thresholdInfo =
      r.passThreshold && r.testsTotal > 0
        ? `, threshold: ${((r.passThreshold / r.testsTotal) * 100).toFixed(0)}%`
        : "";
    console.log(
      `[${icon}] ${r.scenario} | ${r.model} | ${skill} | ${(r.duration / 1000).toFixed(1)}s | ${pct}% (${r.testsPassed}/${r.testsTotal}${thresholdInfo})`,
    );
    if (r.filesModified.length > 0) {
      console.log(`  Files: ${r.filesModified.join(", ")}`);
    }
    if (r.status === "error" && r.error) {
      console.log(`  Error: ${r.error}`);
    }
  }

  const passed = results.filter((r) => r.status === "passed").length;
  console.log(`\nTotal: ${passed}/${results.length} passed`);

  if (resultsDir) {
    console.log(`\nResults saved to: ${resultsDir}`);
  }
}
@@ -1,74 +0,0 @@
import {
  cpSync,
  existsSync,
  mkdirSync,
  mkdtempSync,
  readdirSync,
  rmSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { EVAL_PROJECT_DIR } from "./supabase-setup.js";

/**
 * Create an isolated workspace for an eval run.
 *
 * 1. Copy the eval directory to a temp folder (excluding EVAL.ts/EVAL.tsx)
 * 2. Seed with the eval project's supabase/config.toml
 *
 * Skills are injected via the --agents flag in agent.ts (not installed into
 * the workspace here). Combined with --setting-sources project,local, this
 * prevents host ~/.agents/skills/ from leaking into the eval environment.
 *
 * Returns the path to the workspace and a cleanup function.
 */
export function createWorkspace(opts: {
  evalDir: string;
  skillEnabled: boolean;
}): { workspacePath: string; cleanup: () => void } {
  const workspacePath = mkdtempSync(join(tmpdir(), "supabase-eval-"));

  // Copy eval directory, excluding EVAL.ts/EVAL.tsx (hidden from agent)
  const entries = readdirSync(opts.evalDir, { withFileTypes: true });
  for (const entry of entries) {
    if (entry.name === "EVAL.ts" || entry.name === "EVAL.tsx") continue;
    const src = join(opts.evalDir, entry.name);
    const dest = join(workspacePath, entry.name);
    cpSync(src, dest, { recursive: true });
  }

  // Add .mcp.json so the agent connects to the local Supabase MCP server
  writeFileSync(
    join(workspacePath, ".mcp.json"),
    JSON.stringify(
      {
        mcpServers: {
          "local-supabase": {
            type: "http",
            url: "http://localhost:54321/mcp",
          },
        },
      },
      null,
      "\t",
    ),
  );

  // Seed the workspace with the eval project's supabase/config.toml so the
  // agent can run `supabase db push` against the shared local instance without
  // needing to run `supabase init` or `supabase start` first.
  const projectConfigSrc = join(EVAL_PROJECT_DIR, "supabase", "config.toml");
  if (existsSync(projectConfigSrc)) {
    const destSupabaseDir = join(workspacePath, "supabase");
    mkdirSync(join(destSupabaseDir, "migrations"), { recursive: true });
    cpSync(projectConfigSrc, join(destSupabaseDir, "config.toml"));
  }

  return {
    workspacePath,
    cleanup: () => {
      rmSync(workspacePath, { recursive: true, force: true });
    },
  };
}
@@ -1,94 +0,0 @@
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

export interface ScoreResult {
  name: string;
  /** 0.0 – 1.0 */
  score: number;
  metadata?: Record<string, unknown>;
}

/**
 * skillUsageScorer — 1 if the target skill was in the agent's context, 0 otherwise.
 *
 * Detected via the `skills` array in the system init event of the NDJSON transcript.
 * Combined with `--setting-sources project,local` in agent.ts, this array is clean
 * (no host skill leakage), so its presence is a reliable signal.
 */
export function skillUsageScorer(
  transcript: TranscriptSummary,
  skillName: string,
): ScoreResult {
  const loaded = transcript.skills.includes(skillName);
  return {
    name: "skill_usage",
    score: loaded ? 1 : 0,
    metadata: {
      loadedSkills: transcript.skills,
      targetSkill: skillName,
    },
  };
}

/**
 * referenceFilesUsageScorer — fraction of expected reference files actually read.
 *
 * Detected via Read tool calls whose file_path matches "/.agents/skills/*\/references/".
 * The expectedReferenceFiles list is declared in each EVAL.ts and should match the
 * "Skill References Exercised" table in the corresponding scenarios/*.md file.
 */
export function referenceFilesUsageScorer(
  transcript: TranscriptSummary,
  expectedReferenceFiles: string[],
): ScoreResult {
  if (expectedReferenceFiles.length === 0) {
    return {
      name: "reference_files_usage",
      score: 1,
      metadata: { skipped: true },
    };
  }
  const read = transcript.referenceFilesRead;
  const hits = expectedReferenceFiles.filter((f) => read.includes(f)).length;
  return {
    name: "reference_files_usage",
    score: hits / expectedReferenceFiles.length,
    metadata: {
      expected: expectedReferenceFiles,
      read,
      hits,
      total: expectedReferenceFiles.length,
    },
  };
}

/**
 * assertionsPassedScorer — ratio of assertions passed vs total.
 */
export function assertionsPassedScorer(result: EvalRunResult): ScoreResult {
  const score =
    result.testsTotal > 0 ? result.testsPassed / result.testsTotal : 0;
  return {
    name: "assertions_passed",
    score,
    metadata: { passed: result.testsPassed, total: result.testsTotal },
  };
}

/**
 * finalResultScorer — 1 if the agent met the pass threshold, 0 otherwise.
 *
 * A result is "passed" when assertionsPassed >= passThreshold (set per scenario
 * in scenarios/*.md). This is the binary outcome used for Braintrust comparisons.
 */
export function finalResultScorer(result: EvalRunResult): ScoreResult {
  return {
    name: "final_result",
    score: result.status === "passed" ? 1 : 0,
    metadata: {
      testsPassed: result.testsPassed,
      testsTotal: result.testsTotal,
      passThreshold: result.passThreshold,
    },
  };
}
@@ -1,108 +0,0 @@
import { execFileSync } from "node:child_process";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/**
 * Directory that contains the eval Supabase project (supabase/config.toml).
 * The runner starts the shared Supabase instance from here.
 * Agent workspaces get a copy of supabase/config.toml so they can
 * connect to the same running instance via `supabase db push`.
 */
export const EVAL_PROJECT_DIR = resolve(__dirname, "..", "..", "project");

export interface SupabaseKeys {
  apiUrl: string;
  dbUrl: string;
  anonKey: string;
  serviceRoleKey: string;
}

/**
 * Start the local Supabase stack for the eval project.
 * Idempotent: if already running, the CLI prints a message and exits 0.
 */
export function startSupabase(): void {
  console.log(" Starting Supabase...");
  execFileSync("supabase", ["start", "--exclude", "studio,imgproxy,mailpit"], {
    cwd: EVAL_PROJECT_DIR,
    stdio: "inherit",
    timeout: 5 * 60 * 1000, // 5 min for first image pull
  });
}

// SQL that clears all user-created objects and migration history between scenarios.
// Avoids `supabase db reset` which restarts containers and triggers flaky health checks.
const RESET_SQL = `
-- Drop and recreate public schema (removes all user tables/views/functions)
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;

-- Clear migration history so the next agent's db push starts from a clean slate
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;

-- Notify PostgREST to reload its schema cache
NOTIFY pgrst, 'reload schema';
`.trim();

/**
 * Reset the database to a clean state between scenarios.
 *
 * Uses direct SQL via psql instead of `supabase db reset` to avoid the
 * container-restart cycle and its flaky health checks. This drops the
 * public schema (all user tables) and clears the migration history so
 * `supabase db push` in agent workspaces always starts fresh.
 */
export function resetDB(dbUrl: string): void {
  execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
    stdio: "inherit",
    timeout: 30 * 1000,
  });
}

/**
 * Stop all Supabase containers for the eval project.
 * Called once after all scenarios complete.
 */
export function stopSupabase(): void {
  console.log(" Stopping Supabase...");
  execFileSync("supabase", ["stop", "--no-backup"], {
    cwd: EVAL_PROJECT_DIR,
    stdio: "inherit",
    timeout: 60 * 1000,
  });
}

/**
 * Read the running instance's API URL and JWT keys.
 * Returns values that the runner injects into process.env so EVAL.ts
 * tests can connect to the real database.
 */
export function getKeys(): SupabaseKeys {
  const raw = execFileSync("supabase", ["status", "--output", "json"], {
    cwd: EVAL_PROJECT_DIR,
    timeout: 30 * 1000,
  }).toString();

  const status = JSON.parse(raw) as Record<string, string>;

  const apiUrl = status.API_URL ?? "http://127.0.0.1:54321";
  const dbUrl =
    status.DB_URL ?? "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
  const anonKey = status.ANON_KEY ?? "";
  const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";

  if (!anonKey || !serviceRoleKey) {
    throw new Error(
      `supabase status returned missing keys. Raw output:\n${raw}`,
    );
  }

  return { apiUrl, dbUrl, anonKey, serviceRoleKey };
}
||||
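The shape getKeys parses can be sketched against a synthetic `supabase status --output json` payload. This is a self-contained sketch: the key names (API_URL, DB_URL, ANON_KEY, SERVICE_ROLE_KEY) are the ones getKeys reads above, but the values here are made up.

```typescript
// Synthetic payload mimicking `supabase status --output json`.
// Values are hypothetical; only the key names match what getKeys expects.
const raw = JSON.stringify({
  API_URL: "http://127.0.0.1:54321",
  DB_URL: "postgresql://postgres:postgres@127.0.0.1:54322/postgres",
  ANON_KEY: "eyJ...anon",
  SERVICE_ROLE_KEY: "eyJ...service",
});

// Same defensive parsing as getKeys above.
const status = JSON.parse(raw) as Record<string, string>;
const anonKey = status.ANON_KEY ?? "";
const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";

if (!anonKey || !serviceRoleKey) {
  throw new Error("supabase status returned missing keys");
}
console.log(anonKey); // "eyJ...anon"
```

The `??` fallbacks mean a partially populated status payload still yields usable defaults for URL fields, while missing keys fail loudly.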
@@ -1,301 +0,0 @@
import { basename } from "node:path";

export interface TranscriptEvent {
  type: string;
  [key: string]: unknown;
}

export interface ToolCallSummary {
  tool: string;
  toolUseId: string;
  input: Record<string, unknown>;
  /** First ~200 chars of output for quick scanning */
  outputPreview: string;
  /** Whether the tool call returned an error */
  isError: boolean;
  /** stderr output for Bash tool calls */
  stderr: string;
}

export interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadInputTokens: number;
  cacheCreationInputTokens: number;
  costUSD: number;
}

export interface TranscriptSummary {
  totalTurns: number;
  totalDurationMs: number;
  /** API-only latency (excludes local processing overhead) */
  totalDurationApiMs: number;
  totalCostUsd: number | null;
  model: string | null;
  toolCalls: ToolCallSummary[];
  finalOutput: string;
  /** Skills listed in the system init event (loaded into agent context) */
  skills: string[];
  /** Basenames of reference files the agent read via the Read tool */
  referenceFilesRead: string[];
  /** Per-model token usage and cost breakdown */
  modelUsage: Record<string, ModelUsage>;
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCacheReadTokens: number;
  totalCacheCreationTokens: number;
  /** Count of tool calls that returned is_error === true */
  toolErrorCount: number;
  /** Whether the overall session ended in an error */
  isError: boolean;
  /** Count of permission_denials in the result event */
  permissionDenialCount: number;
}

/** Parse a single NDJSON line. Returns null on empty or invalid input. */
export function parseStreamJsonLine(line: string): TranscriptEvent | null {
  const trimmed = line.trim();
  if (!trimmed) return null;
  try {
    return JSON.parse(trimmed) as TranscriptEvent;
  } catch {
    return null;
  }
}

/** Parse raw NDJSON stdout into an array of events. */
export function parseStreamJsonOutput(raw: string): TranscriptEvent[] {
  const events: TranscriptEvent[] = [];
  for (const line of raw.split("\n")) {
    const event = parseStreamJsonLine(line);
    if (event) events.push(event);
  }
  return events;
}

/** Extract the final text output from parsed events (for backward compat). */
export function extractFinalOutput(events: TranscriptEvent[]): string {
  // Prefer the result event
  for (const event of events) {
    if (event.type === "result") {
      const result = (event as Record<string, unknown>).result;
      if (typeof result === "string") return result;
    }
  }

  // Fallback: concatenate text blocks from the last assistant message
  for (let i = events.length - 1; i >= 0; i--) {
    const event = events[i];
    if (event.type === "assistant") {
      const msg = (event as Record<string, unknown>).message as
        | Record<string, unknown>
        | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        const texts = content
          .filter(
            (b: Record<string, unknown>) =>
              b.type === "text" && typeof b.text === "string",
          )
          .map((b: Record<string, unknown>) => b.text as string);
        if (texts.length > 0) return texts.join("\n");
      }
    }
  }

  return "";
}

/** Return true if a file path points to a skill reference file. */
function isReferenceFilePath(filePath: string): boolean {
  return (
    filePath.includes("/.agents/skills/") && filePath.includes("/references/")
  );
}

/** Walk parsed events to build a transcript summary. */
export function buildTranscriptSummary(
  events: TranscriptEvent[],
): TranscriptSummary {
  const toolCalls: ToolCallSummary[] = [];
  let finalOutput = "";
  let totalDurationMs = 0;
  let totalDurationApiMs = 0;
  let totalCostUsd: number | null = null;
  let model: string | null = null;
  let totalTurns = 0;
  let skills: string[] = [];
  const referenceFilesRead: string[] = [];
  let modelUsage: Record<string, ModelUsage> = {};
  let totalInputTokens = 0;
  let totalOutputTokens = 0;
  let totalCacheReadTokens = 0;
  let totalCacheCreationTokens = 0;
  let toolErrorCount = 0;
  let isError = false;
  let permissionDenialCount = 0;

  for (const event of events) {
    const e = event as Record<string, unknown>;

    // System init: extract model and loaded skills
    if (e.type === "system" && e.subtype === "init") {
      model = typeof e.model === "string" ? e.model : null;
      if (Array.isArray(e.skills)) {
        skills = e.skills.filter((s): s is string => typeof s === "string");
      }
    }

    // Assistant messages: extract tool_use blocks
    if (e.type === "assistant") {
      const msg = e.message as Record<string, unknown> | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "tool_use") {
            const toolCall: ToolCallSummary = {
              tool: block.name ?? "unknown",
              toolUseId: block.id ?? "",
              input: block.input ?? {},
              outputPreview: "",
              isError: false,
              stderr: "",
            };
            toolCalls.push(toolCall);

            // Track reference file reads
            if (
              block.name === "Read" &&
              typeof block.input?.file_path === "string" &&
              isReferenceFilePath(block.input.file_path)
            ) {
              const base = basename(block.input.file_path);
              if (!referenceFilesRead.includes(base)) {
                referenceFilesRead.push(base);
              }
            }
          }
        }
      }
    }

    // User messages: extract tool_result blocks and match to tool calls
    if (e.type === "user") {
      const msg = e.message as Record<string, unknown> | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "tool_result") {
            const matching = toolCalls.find(
              (tc) => tc.toolUseId === block.tool_use_id,
            );
            if (matching) {
              const text =
                typeof block.content === "string"
                  ? block.content
                  : JSON.stringify(block.content);
              matching.outputPreview = text.slice(0, 200);

              // Capture error state from tool result
              if (block.is_error === true) {
                matching.isError = true;
                toolErrorCount++;
              }
            }
          }
        }
      }

      // Capture stderr from tool_use_result (Bash tool emits this at the user event level)
      const toolUseResult = e.tool_use_result as
        | Record<string, unknown>
        | undefined;
      if (toolUseResult && typeof toolUseResult.stderr === "string") {
        // Match to the most recent Bash tool call without stderr set
        const lastBash = [...toolCalls]
          .reverse()
          .find((tc) => tc.tool === "Bash" && !tc.stderr);
        if (lastBash) {
          lastBash.stderr = toolUseResult.stderr;
        }
      }
    }

    // Result event: final output, cost, duration, turns, token usage
    if (e.type === "result") {
      finalOutput = typeof e.result === "string" ? e.result : "";
      totalDurationMs = typeof e.duration_ms === "number" ? e.duration_ms : 0;
      totalDurationApiMs =
        typeof e.duration_api_ms === "number" ? e.duration_api_ms : 0;
      totalCostUsd =
        typeof e.total_cost_usd === "number" ? e.total_cost_usd : null;
      totalTurns = typeof e.num_turns === "number" ? e.num_turns : 0;
      isError = e.is_error === true;
      permissionDenialCount = Array.isArray(e.permission_denials)
        ? e.permission_denials.length
        : 0;

      // Aggregate token usage from the result event's usage field
      const usage = e.usage as Record<string, unknown> | undefined;
      if (usage) {
        totalInputTokens =
          typeof usage.input_tokens === "number" ? usage.input_tokens : 0;
        totalOutputTokens =
          typeof usage.output_tokens === "number" ? usage.output_tokens : 0;
        totalCacheReadTokens =
          typeof usage.cache_read_input_tokens === "number"
            ? usage.cache_read_input_tokens
            : 0;
        totalCacheCreationTokens =
          typeof usage.cache_creation_input_tokens === "number"
            ? usage.cache_creation_input_tokens
            : 0;
      }

      // Per-model usage breakdown (modelUsage keyed by model name)
      const rawModelUsage = e.modelUsage as
        | Record<string, Record<string, unknown>>
        | undefined;
      if (rawModelUsage) {
        modelUsage = {};
        for (const [modelName, mu] of Object.entries(rawModelUsage)) {
          modelUsage[modelName] = {
            inputTokens:
              typeof mu.inputTokens === "number" ? mu.inputTokens : 0,
            outputTokens:
              typeof mu.outputTokens === "number" ? mu.outputTokens : 0,
            cacheReadInputTokens:
              typeof mu.cacheReadInputTokens === "number"
                ? mu.cacheReadInputTokens
                : 0,
            cacheCreationInputTokens:
              typeof mu.cacheCreationInputTokens === "number"
                ? mu.cacheCreationInputTokens
                : 0,
            costUSD: typeof mu.costUSD === "number" ? mu.costUSD : 0,
          };
        }
      }
    }
  }

  return {
    totalTurns,
    totalDurationMs,
    totalDurationApiMs,
    totalCostUsd,
    model,
    toolCalls,
    finalOutput,
    skills,
    referenceFilesRead,
    modelUsage,
    totalInputTokens,
    totalOutputTokens,
    totalCacheReadTokens,
    totalCacheCreationTokens,
    toolErrorCount,
    isError,
    permissionDenialCount,
  };
}
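The NDJSON parsing in this (now removed) module tolerated blank and malformed lines rather than failing the whole transcript. That behavior can be sanity-checked with a self-contained sketch that inlines the same logic against a tiny synthetic transcript (event shapes here are illustrative):

```typescript
// Minimal inline copy of the parseStreamJsonLine logic above,
// exercised against a synthetic two-event transcript.
interface Event {
  type: string;
  [key: string]: unknown;
}

function parseLine(line: string): Event | null {
  const trimmed = line.trim();
  if (!trimmed) return null;
  try {
    return JSON.parse(trimmed) as Event;
  } catch {
    return null; // malformed lines are skipped, not fatal
  }
}

const raw = [
  '{"type":"system","subtype":"init","model":"claude-sonnet"}',
  "not json", // dropped
  "", // dropped
  '{"type":"result","result":"done","num_turns":3}',
].join("\n");

const events = raw
  .split("\n")
  .map(parseLine)
  .filter((e): e is Event => e !== null);

console.log(events.length); // 2
```

Swallowing bad lines is deliberate: a streaming agent process can emit partial or interleaved output, and one corrupt line should not discard the rest of the transcript.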
@@ -1,85 +0,0 @@
import type { AssertionResult } from "./eval-types.js";

export interface EvalScenario {
  /** Directory name under evals/ */
  id: string;
  /** Human-readable name */
  name: string;
  /** Tags for filtering */
  tags: string[];
}

export interface AgentConfig {
  /** Agent identifier */
  agent: "claude-code";
  /** Model to use */
  model: string;
  /** Whether the supabase skill is available */
  skillEnabled: boolean;
}

export interface EvalRunResult {
  scenario: string;
  agent: string;
  model: string;
  skillEnabled: boolean;
  status: "passed" | "failed" | "error";
  duration: number;
  /** Raw test runner output (for debugging) */
  testOutput?: string;
  agentOutput: string;
  /** Number of assertions that passed */
  testsPassed: number;
  /** Total number of assertions */
  testsTotal: number;
  /** Minimum tests required to pass (from scenario config) */
  passThreshold?: number;
  /** Per-assertion pass/fail results */
  assertionResults?: AssertionResult[];
  /** Files the agent created or modified in the workspace */
  filesModified: string[];
  error?: string;
  /** Path to the persisted results directory for this run */
  resultsDir?: string;
  /** Number of tool calls the agent made */
  toolCallCount?: number;
  /** Total cost in USD (from stream-json result event) */
  costUsd?: number;
  /** The PROMPT.md content sent to the agent */
  prompt?: string;
  /** Epoch ms when the agent run started (for Braintrust span timing) */
  startedAt?: number;
  /** API-only latency in ms (excludes local processing overhead) */
  durationApiMs?: number;
  /** Aggregate token counts from the result event */
  totalInputTokens?: number;
  totalOutputTokens?: number;
  totalCacheReadTokens?: number;
  totalCacheCreationTokens?: number;
  /** Per-model token usage and cost breakdown */
  modelUsage?: Record<
    string,
    {
      inputTokens: number;
      outputTokens: number;
      cacheReadInputTokens: number;
      cacheCreationInputTokens: number;
      costUSD: number;
    }
  >;
  /** Count of tool calls that returned is_error === true */
  toolErrorCount?: number;
  /** Count of permission_denials in the result event */
  permissionDenialCount?: number;
  /** Skills that were in the agent's context (from system init event) */
  loadedSkills?: string[];
  /** Basenames of skill reference files the agent read */
  referenceFilesRead?: string[];
  /** Computed scorer results */
  scores?: {
    skillUsage: number;
    referenceFilesUsage: number;
    assertionsPassed: number;
    finalResult: number;
  };
}
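The `scores.referenceFilesUsage` field in the interface above is a simple hit ratio, matching the computation the upload script performs (function name and file names here are illustrative):

```typescript
// Hit-ratio scorer for the referenceFilesUsage field: the fraction of
// expected skill reference files that the agent actually read.
function referenceFilesUsage(expected: string[], read: string[]): number {
  if (expected.length === 0) return 1; // nothing expected => full score
  const hits = expected.filter((f) => read.includes(f)).length;
  return hits / expected.length;
}

console.log(referenceFilesUsage(["rls.md", "schema.md"], ["rls.md"])); // 0.5
console.log(referenceFilesUsage([], ["anything.md"])); // 1
```

Defaulting to 1 when no reference files are expected keeps scenarios without reference material from being penalized.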
packages/evals/src/upload.ts (new file, 350 lines)
@@ -0,0 +1,350 @@
/**
 * Upload eval results from the results/ directory to Braintrust.
 *
 * Reads saved result.json, transcript.json, and outputs/eval.txt from each
 * run, parses the vitest output to extract pass/fail counts, then uploads to
 * Braintrust as an experiment.
 *
 * Usage:
 *   BRAINTRUST_API_KEY=... BRAINTRUST_PROJECT_ID=... tsx src/upload.ts
 *
 * Optional env vars:
 *   RESULTS_DIR    Override the results directory (default: results/)
 *   RUN_TIMESTAMP  Only upload a specific run (e.g. 2026-02-27T13-01-22.316Z)
 */

import assert from "node:assert";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { basename, dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { init } from "braintrust";

const __dirname = dirname(fileURLToPath(import.meta.url));
const ROOT = resolve(__dirname, "..");

// ---------------------------------------------------------------------------
// Types matching the saved result files from @vercel/agent-eval
// ---------------------------------------------------------------------------

interface RunResult {
  status: "passed" | "failed" | "error";
  duration: number;
  model: string;
  o11y: {
    totalTurns: number;
    totalToolCalls: number;
    toolCalls: Record<string, number>;
    filesModified: string[];
    filesRead: string[];
    errors: string[];
    thinkingBlocks: number;
  };
}

interface TranscriptEvent {
  type: "tool_call" | "tool_result" | "message" | "thinking";
  tool?: {
    name: string;
    originalName: string;
    args?: Record<string, unknown>;
  };
}

interface Transcript {
  agent: string;
  model: string;
  events: TranscriptEvent[];
}

interface ParsedEvalOutput {
  passed: number;
  failed: number;
  total: number;
  tests: Array<{ name: string; passed: boolean }>;
}

// ---------------------------------------------------------------------------
// Parse vitest eval.txt output
// ---------------------------------------------------------------------------

function parseEvalOutput(text: string): ParsedEvalOutput {
  const tests: Array<{ name: string; passed: boolean }> = [];

  for (const line of text.split("\n")) {
    const passMatch = line.match(/^\s+✓\s+(.+)$/);
    const failMatch = line.match(/^\s+[✗×]\s+(.+)$/);
    if (passMatch) tests.push({ name: passMatch[1].trim(), passed: true });
    else if (failMatch)
      tests.push({ name: failMatch[1].trim(), passed: false });
  }

  if (tests.length > 0) {
    const passed = tests.filter((t) => t.passed).length;
    return {
      passed,
      failed: tests.length - passed,
      total: tests.length,
      tests,
    };
  }

  // Fallback: parse summary line
  const summaryMatch = text.match(
    /Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/,
  );
  if (summaryMatch) {
    const passed = parseInt(summaryMatch[1], 10);
    const failed = summaryMatch[2] ? parseInt(summaryMatch[2], 10) : 0;
    const total = parseInt(summaryMatch[3], 10);
    return { passed, failed, total, tests };
  }

  return { passed: 0, failed: 0, total: 0, tests };
}

// ---------------------------------------------------------------------------
// Extract reference file reads from transcript
// ---------------------------------------------------------------------------

function extractReferenceFilesRead(transcript: Transcript): string[] {
  const read: string[] = [];
  for (const event of transcript.events) {
    if (event.type !== "tool_call" || !event.tool?.args) continue;
    if (event.tool.name !== "file_read") continue;
    const filePath = String(
      event.tool.args._extractedPath ?? event.tool.args.file_path ?? "",
    );
    if (
      (filePath.includes("/.claude/skills/") ||
        filePath.includes("/.agents/skills/")) &&
      filePath.includes("/references/")
    ) {
      const base = basename(filePath);
      if (!read.includes(base)) read.push(base);
    }
  }
  return read;
}

// ---------------------------------------------------------------------------
// Find all experiment run directories
// ---------------------------------------------------------------------------

interface RunEntry {
  runTimestamp: string;
  evalName: string;
  runIndex: number;
  runDir: string;
  result: RunResult;
  transcript: Transcript;
  evalOutput: string | null;
  prompt: string;
}

function findRuns(resultsDir: string, filterTimestamp?: string): RunEntry[] {
  const entries: RunEntry[] = [];
  const experimentDir = join(resultsDir, "experiment");
  if (!existsSync(experimentDir)) return entries;

  const timestamps = readdirSync(experimentDir).filter(
    (t) => !filterTimestamp || t === filterTimestamp,
  );

  for (const runTimestamp of timestamps) {
    const tsDir = join(experimentDir, runTimestamp);
    const evalNames = readdirSync(tsDir).filter((name) =>
      readdirSync(join(tsDir, name)).some((f) => f.startsWith("run-")),
    );

    for (const evalName of evalNames) {
      const evalDir = join(tsDir, evalName);
      const promptPath = resolve(ROOT, "evals", evalName, "PROMPT.md");
      const prompt = existsSync(promptPath)
        ? readFileSync(promptPath, "utf-8").trim()
        : "";

      const runDirs = readdirSync(evalDir)
        .filter((d) => /^run-\d+$/.test(d))
        .sort();

      for (const runDir of runDirs) {
        const runIndex = parseInt(runDir.replace("run-", ""), 10);
        const runPath = join(evalDir, runDir);
        const resultPath = join(runPath, "result.json");
        const transcriptPath = join(runPath, "transcript.json");
        const evalOutputPath = join(runPath, "outputs", "eval.txt");

        if (!existsSync(resultPath) || !existsSync(transcriptPath)) continue;

        const result: RunResult = JSON.parse(readFileSync(resultPath, "utf-8"));
        const transcript: Transcript = JSON.parse(
          readFileSync(transcriptPath, "utf-8"),
        );
        const evalOutput = existsSync(evalOutputPath)
          ? readFileSync(evalOutputPath, "utf-8")
          : null;

        entries.push({
          runTimestamp,
          evalName,
          runIndex,
          runDir: runPath,
          result,
          transcript,
          evalOutput,
          prompt,
        });
      }
    }
  }

  return entries;
}

// ---------------------------------------------------------------------------
// Main upload flow
// ---------------------------------------------------------------------------

async function main() {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const resultsDir = resolve(ROOT, process.env.RESULTS_DIR ?? "results");
  const filterTimestamp = process.env.RUN_TIMESTAMP;

  const runs = findRuns(resultsDir, filterTimestamp);
  if (runs.length === 0) {
    console.error("No runs found in", resultsDir);
    process.exit(1);
  }

  console.log(
    `Found ${runs.length} run(s) across ${new Set(runs.map((r) => r.runTimestamp)).size} experiment(s)`,
  );

  const byTimestamp = new Map<string, RunEntry[]>();
  for (const r of runs) {
    const group = byTimestamp.get(r.runTimestamp) ?? [];
    group.push(r);
    byTimestamp.set(r.runTimestamp, group);
  }

  for (const [runTimestamp, timestampRuns] of byTimestamp) {
    const model = timestampRuns[0].result.model;
    const skillEnabled = process.env.EVAL_BASELINE !== "true";
    const variant = skillEnabled ? "skill" : "baseline";
    const experimentName = `${model}-${variant}-${runTimestamp}`;

    console.log(
      `\nUploading experiment: ${experimentName} (${timestampRuns.length} rows)`,
    );

    const experiment = init({
      projectId: process.env.BRAINTRUST_PROJECT_ID as string,
      experiment: experimentName,
      metadata: {
        model,
        runTimestamp,
        skillEnabled,
        evalCount: timestampRuns.length,
      },
    });

    for (const run of timestampRuns) {
      const evalParsed = run.evalOutput
        ? parseEvalOutput(run.evalOutput)
        : { passed: 0, failed: 0, total: 0, tests: [] };

      console.log(
        `  [${run.evalName}] run-${run.runIndex} — tests: ${evalParsed.passed}/${evalParsed.total} passed`,
      );

      // Reference files scorer
      const metaPath = resolve(ROOT, "evals", run.evalName, "meta.ts");
      const metaMod = existsSync(metaPath)
        ? ((await import(metaPath)) as {
            expectedReferenceFiles?: string[];
          })
        : {};
      const expectedRefs = metaMod.expectedReferenceFiles ?? [];
      const refsRead = extractReferenceFilesRead(run.transcript);
      const refHits = expectedRefs.filter((f) => refsRead.includes(f)).length;
      const referenceFilesUsage =
        expectedRefs.length > 0 ? refHits / expectedRefs.length : 1;

      console.log(
        `    reference files: ${refHits}/${expectedRefs.length} read (${refsRead.join(", ") || "none"})`,
      );

      const scores: Record<string, number> = {
        assertions_passed:
          evalParsed.total > 0 ? evalParsed.passed / evalParsed.total : 0,
        reference_files_usage: referenceFilesUsage,
        final_result: run.result.status === "passed" ? 1 : 0,
      };

      const metadata: Record<string, unknown> = {
        model: run.result.model,
        evalName: run.evalName,
        runIndex: run.runIndex,
        totalTurns: run.result.o11y.totalTurns,
        totalToolCalls: run.result.o11y.totalToolCalls,
        toolCalls: run.result.o11y.toolCalls,
        filesModified: run.result.o11y.filesModified,
        errors: run.result.o11y.errors,
        thinkingBlocks: run.result.o11y.thinkingBlocks,
        duration: run.result.duration,
        referenceFilesRead: refsRead,
        expectedReferenceFiles: expectedRefs,
      };

      experiment.traced(
        (span) => {
          span.log({
            input: { eval: run.evalName, prompt: run.prompt },
            output: {
              status: run.result.status,
              filesModified: run.result.o11y.filesModified,
              tests: evalParsed.tests,
              evalOutput: run.evalOutput,
            },
            expected: {
              testsTotal: evalParsed.total,
              expectedReferenceFiles: expectedRefs,
            },
            scores,
            metadata,
            datasetRecordId: run.evalName,
          });

          // Child spans for each tool call in the transcript
          for (const event of run.transcript.events) {
            if (event.type !== "tool_call" || !event.tool) continue;
            span.traced(
              (child) => {
                child.log({
                  input: {
                    tool: event.tool?.name,
                    args: event.tool?.args ?? {},
                  },
                  output: {},
                  metadata: { originalName: event.tool?.originalName },
                });
              },
              { name: `tool:${event.tool.name}` },
            );
          }
        },
        { name: `${run.evalName}/run-${run.runIndex}` },
      );
    }

    const summary = await experiment.summarize();
    console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
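The fallback summary-line regex in parseEvalOutput can be exercised against typical vitest summary lines. The regex is copied verbatim from the function above; the sample strings are illustrative:

```typescript
// Same fallback regex as parseEvalOutput: matches "Tests N passed (T)"
// and "Tests N passed | M failed (T)".
const summaryRe =
  /Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/;

const allPass = "Tests  5 passed (5)".match(summaryRe);
const mixed = "Tests  3 passed | 2 failed (5)".match(summaryRe);

console.log(allPass?.[1], allPass?.[3]); // "5" "5" (failed group is undefined)
console.log(mixed?.[2]); // "2"

// The per-test line regexes work the same way on indented ✓/✗ lines.
const passLine = "  ✓ creates the todos table".match(/^\s+✓\s+(.+)$/);
console.log(passLine?.[1]); // "creates the todos table"
```

Per-test ✓/✗ lines are preferred when present; the summary line is only a fallback, so a reporter that collapses test names still yields pass/fail counts.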
@@ -1,16 +1,11 @@
 {
   "compilerOptions": {
     "target": "ES2022",
-    "module": "ESNext",
-    "moduleResolution": "bundler",
     "esModuleInterop": true,
+    "module": "NodeNext",
+    "moduleResolution": "NodeNext",
     "strict": true,
     "skipLibCheck": true,
-    "outDir": "dist",
-    "rootDir": "src",
-    "declaration": true,
-    "resolveJsonModule": true
+    "noEmit": true
   },
-  "include": ["src/**/*"],
-  "exclude": ["node_modules", "dist", "evals"]
+  "include": ["experiments", "src", "evals"]
 }