use agent-evals package

Pedro Rodrigues
2026-02-27 15:32:55 +00:00
parent 0894f5683e
commit 9c6fd293eb
61 changed files with 4208 additions and 4652 deletions


@@ -46,14 +46,19 @@ sources = ["test/**", "skills/**"]
 # ── Eval tasks ────────────────────────────────────────────────────────
 [tasks.eval]
-description = "Run workflow evals"
-run = "npm --prefix packages/evals run eval"
-sources = ["packages/evals/src/**", "packages/evals/evals/**"]
+description = "Run workflow evals (use -- to pass args, e.g. mise run eval -- --skill supabase --scenario rls-update-needs-select)"
+run = "bash packages/evals/scripts/eval.sh"
+sources = ["packages/evals/evals/**", "packages/evals/experiments/**"]
+
+[tasks."eval:dry"]
+description = "Dry run workflow evals (no API calls)"
+run = "npm --prefix packages/evals run eval:dry"
+sources = ["packages/evals/evals/**", "packages/evals/experiments/**"]
 
 [tasks."eval:upload"]
-description = "Run workflow evals and upload to Braintrust"
+description = "Upload eval results to Braintrust"
 run = "npm --prefix packages/evals run eval:upload"
-sources = ["packages/evals/src/**", "packages/evals/evals/**"]
+sources = ["packages/evals/results/**"]
 
 # ── Docker eval tasks ────────────────────────────────────────────────
@@ -71,7 +76,6 @@ docker run --rm \
   -e EVAL_SCENARIO \
   -e EVAL_BASELINE \
   -e EVAL_SKILL \
-  -e BRAINTRUST_UPLOAD \
   -e BRAINTRUST_API_KEY \
   -e BRAINTRUST_PROJECT_ID \
   -e EVAL_RESULTS_DIR=/app/results \


@@ -1,57 +1,56 @@
 # Evals — Agent Guide
 
 This package evaluates whether AI agents correctly implement Supabase tasks
-when using skill documentation. Modeled after
-[Vercel's next-evals-oss](https://github.com/vercel-labs/next-evals-oss): each
-eval is a self-contained project with a task prompt, the agent works on it, and
-hidden tests check the result. Binary pass/fail.
+when using skill documentation. Built on
+[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
+self-contained scenario with a task prompt, the agent works in a Docker sandbox,
+and hidden vitest assertions check the result. Binary pass/fail.
 
 ## Architecture
 
 ```
-1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
-2. Install skills via `skills add` CLI (or skip for baseline)
-3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
-4. Agent reads skill, creates migrations/code in the workspace
-5. Copy hidden EVAL.ts into workspace, run vitest
-6. Capture pass/fail
+1. eval.sh starts Supabase, exports keys
+2. agent-eval reads experiments/experiment.ts
+3. For each scenario:
+   a. setup() resets DB, writes config + skills into Docker sandbox
+   b. Agent (Claude Code) runs PROMPT.md in the sandbox
+   c. EVAL.ts (vitest) asserts against agent output
+4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
+5. Optional: upload.ts pushes results to Braintrust
 ```
 
-The agent is **Claude Code** invoked via `claude -p` (print mode). It operates
-on a real filesystem in a temp directory and can read/write files freely.
+The agent is **Claude Code** running inside a Docker sandbox managed by
+`@vercel/agent-eval`. It operates on a real filesystem and can read/write files
+freely.
 
-**Important**: MCP servers are disabled via `--strict-mcp-config` with an empty
-config. This ensures the agent uses only local tools (Bash, Edit, Write, Read,
-Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All
-work must happen on the local filesystem — e.g., creating migration files in
-`supabase/migrations/`, not applying them to a remote project.
-
-## Eval Structure
-
-Each eval lives in `evals/{scenario-name}/`:
+## File Structure
 
 ```
-evals/auth-rls-new-project/
+packages/evals/
+  experiments/
+    experiment.ts    # ExperimentConfig — agent, sandbox, setup() hook
+  scripts/
+    eval.sh          # Supabase lifecycle wrapper (start → eval → stop)
+  src/
+    upload.ts        # Standalone Braintrust result uploader
+  evals/
+    eval-utils.ts    # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
+    {scenario}/
       PROMPT.md      # Task description (visible to agent)
       EVAL.ts        # Vitest assertions (hidden from agent during run)
-  package.json       # Minimal project manifest
+      meta.ts        # expectedReferenceFiles for scoring
+      package.json   # Minimal manifest with vitest devDep
+  project/
     supabase/
-    config.toml      # Pre-initialized supabase config
-    migrations/      # Empty — agent creates files here
+      config.toml    # Shared Supabase config seeded into each sandbox
+  scenarios/         # Workflow scenario proposals
+  results/           # Output from eval runs (gitignored)
 ```
 
-**EVAL.ts** is never copied to the workspace until after the agent finishes.
-This prevents the agent from "teaching to the test."
-
 ## Running Evals
 
-Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
-source files haven't changed. Use `--force` to bypass caching when you need
-to re-run evals regardless (e.g., after changing environment variables or
-re-running the same scenario):
-
 ```bash
-# Run all scenarios with skills (default)
+# Run all scenarios with skills
 mise run eval
 
 # Force re-run (bypass source caching)
@@ -66,64 +65,52 @@ EVAL_MODEL=claude-opus-4-6 mise run eval
 
 # Run without skills (baseline)
 EVAL_BASELINE=true mise run eval
 
-# Install only a specific skill
-EVAL_SKILL=supabase mise run eval
+# Dry run (no API calls)
+mise run eval:dry
 
 # Upload results to Braintrust
 mise run eval:upload
-
-# Force upload (bypass cache)
-mise run --force eval:upload
 ```
 
 ## Baseline Mode
 
-Set `EVAL_BASELINE=true` to run scenarios **without** skills. By default,
-scenarios run with skills installed via the `skills` CLI.
+Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By
+default, skill files from `skills/supabase/` are written into the sandbox.
 
-To compare with-skill vs baseline, run evals twice:
+Compare with-skill vs baseline:
 
 ```bash
 mise run eval                     # with skills
 EVAL_BASELINE=true mise run eval  # without skills (baseline)
 ```
 
-Compare the results to measure how much skills improve agent output.
-
 ## Adding Scenarios
 
-1. Create `evals/{scenario-name}/` with `PROMPT.md`, `EVAL.ts`, `package.json`
-2. Add any starter files the agent should see (e.g., `supabase/config.toml`)
-3. Write vitest assertions in `EVAL.ts` that check the agent's output files
-4. Document the scenario in `scenarios/SCENARIOS.md`
+1. Create `evals/{scenario-name}/` with:
+   - `PROMPT.md` — task description for the agent
+   - `EVAL.ts` — vitest assertions checking agent output
+   - `meta.ts` — export `expectedReferenceFiles` array for scoring
+   - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
+2. Add any starter files the agent should see (they get copied via `setup()`)
+3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
 
 ## Environment
 
 ```
 ANTHROPIC_API_KEY=sk-ant-...    # Required: Claude Code authentication
-EVAL_MODEL=...                  # Optional: override model (default: claude-sonnet-4-5-20250929)
+EVAL_MODEL=...                  # Optional: override model (default: claude-sonnet-4-6)
 EVAL_SCENARIO=...               # Optional: run single scenario
-EVAL_SKILL=...                  # Optional: install only this skill (e.g., "supabase")
-EVAL_BASELINE=true              # Optional: run without skills (baseline mode)
-BRAINTRUST_UPLOAD=true          # Optional: upload results to Braintrust
-BRAINTRUST_API_KEY=...          # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_PROJECT_ID=...       # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_BASE_EXPERIMENT=...  # Optional: compare against a named experiment
+EVAL_BASELINE=true              # Optional: run without skills
+BRAINTRUST_API_KEY=...          # Required for eval:upload
+BRAINTRUST_PROJECT_ID=...       # Required for eval:upload
 ```
 
-## Key Files
+## Docker Evals
 
-```
-src/
-  runner.ts       # Main orchestrator
-  types.ts        # Core interfaces
-  runner/
-    scaffold.ts   # Creates temp workspace from eval template
-    agent.ts      # Invokes claude -p as subprocess
-    test.ts       # Runs vitest EVAL.ts against workspace
-    results.ts    # Collects results and prints summary
-evals/
-  auth-rls-new-project/  # Scenario 1
-scenarios/
-  SCENARIOS.md    # Scenario descriptions
-```
+Build and run evals inside Docker (e.g., for CI):
+
+```bash
+mise run eval:docker:build  # Build the eval Docker image
+mise run eval:docker        # Run evals in Docker
+mise run eval:docker:shell  # Debug shell in eval container
+```
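The `meta.ts` files in this commit export `expectedReferenceFiles` arrays "for scoring". The scoring logic itself lives elsewhere; as a rough sketch of what such a score could be (the `referenceRecall` function and its semantics are assumptions, not part of this package), one might compute the fraction of expected skill reference files the agent actually read:

```typescript
// Hypothetical recall score over expectedReferenceFiles.
// NOT part of this package — illustrative only.
const expectedReferenceFiles = [
  "db-schema-auth-fk.md",
  "db-rls-mandatory.md",
];

// Fraction of expected reference files present in the agent's read set.
function referenceRecall(expected: string[], read: string[]): number {
  const seen = new Set(read);
  const hits = expected.filter((f) => seen.has(f)).length;
  return expected.length === 0 ? 1 : hits / expected.length;
}

console.log(referenceRecall(expectedReferenceFiles, ["db-rls-mandatory.md"])); // 0.5
```

A recall-style score keeps the metric binary-friendly per file while still rewarding partial coverage across scenarios.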


@@ -1,77 +1,68 @@
-export const expectedReferenceFiles = [
-  "db-schema-auth-fk.md",
-  "db-security-functions.md",
-  "db-rls-mandatory.md",
-  "db-rls-common-mistakes.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates profiles table",
-    check: () => {
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates profiles table", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /profiles/.test(sql);
-    },
-  },
-  {
-    name: "FK references auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "ON DELETE CASCADE present",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on profiles",
-    check: () =>
+  expect(/create\s+table/.test(sql) && /profiles/.test(sql)).toBe(true);
+});
+
+test("FK references auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("ON DELETE CASCADE present", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("RLS enabled on profiles", () => {
+  expect(
     /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "trigger function uses SECURITY DEFINER",
-    check: () => /security\s+definer/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger function sets search_path",
-    check: () =>
+  ).toBe(true);
+});
+
+test("trigger function uses SECURITY DEFINER", () => {
+  expect(/security\s+definer/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("trigger function sets search_path", () => {
+  expect(
     /set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger created on auth.users",
-    check: () =>
+  ).toBe(true);
+});
+
+test("trigger created on auth.users", () => {
+  expect(
     /create\s+trigger[\s\S]*?on\s+auth\.users/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "policies scoped to authenticated",
-    check: () => {
+  ).toBe(true);
+});
+
+test("policies scoped to authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
-      /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
     /security\s+definer/.test(sql),
     /set\s+search_path\s*=\s*''/.test(sql),
@@ -79,7 +70,5 @@ export const assertions: EvalAssertion[] = [
     policyBlocks.length > 0 &&
       policyBlocks.every((p) => /to\s+authenticated/.test(p)),
   ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
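Several assertions in this file (and the other EVAL.ts files below) share the same two regex idioms: extracting `CREATE POLICY ... ;` blocks with a non-greedy match, and checking that `auth.uid()` is wrapped in a scalar subquery. A standalone sketch of that logic (the helper names here are illustrative, not exports of `eval-utils.ts`):

```typescript
// Illustrative helpers mirroring the regex idioms in the EVAL.ts files.
function extractPolicyBlocks(sql: string): string[] {
  // Non-greedy: each match runs from CREATE POLICY to the next semicolon.
  return sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
}

function usesWrappedAuthUid(policy: string): boolean {
  // Bare auth.uid() is re-evaluated per row; (select auth.uid()) lets
  // Postgres cache it as an initplan, which is what the eval rewards.
  if (!policy.includes("auth.uid()")) return true;
  return /\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy);
}

const sample = `
create policy "read own" on profiles for select
  to authenticated using ((select auth.uid()) = id);
create policy "bad update" on profiles for update
  to authenticated using (auth.uid() = id);
`;

const blocks = extractPolicyBlocks(sample);
console.log(blocks.length);                  // 2
console.log(blocks.map(usesWrappedAuthUid)); // [ true, false ]
```

The non-greedy `[\s\S]*?;` stops at the first semicolon, so a policy body containing a literal `;` in a string would be split early; the scenarios here do not hit that edge case.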


@@ -0,0 +1,6 @@
export const expectedReferenceFiles = [
"db-schema-auth-fk.md",
"db-security-functions.md",
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
];


@@ -1,5 +1,8 @@
 {
   "name": "auth-fk-cascade-delete",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }


@@ -1,16 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "db-rls-mandatory.md",
-  "db-rls-policy-types.md",
-  "db-rls-common-mistakes.md",
-  "db-schema-auth-fk.md",
-  "db-schema-timestamps.md",
-  "db-migrations-idempotent.md",
-];
-
 import { existsSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 import {
   anonSeeesNoRows,
@@ -19,43 +9,42 @@ import {
   getSupabaseDir,
   queryTable,
   tableExists,
-} from "../eval-utils.ts";
+} from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "supabase project initialized (config.toml exists)",
-    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
-  },
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates tasks table",
-    check: () => {
+test("supabase project initialized (config.toml exists)", () => {
+  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
+});
+
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates tasks table", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /tasks/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on tasks table",
-    check: () =>
+  expect(/create\s+table/.test(sql) && /tasks/.test(sql)).toBe(true);
+});
+
+test("enables RLS on tasks table", () => {
+  expect(
     /alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "has foreign key to auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses ON DELETE CASCADE for auth FK",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
+  ).toBe(true);
+});
+
+test("has foreign key to auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses ON DELETE CASCADE for auth FK", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
   const sql = getMigrationSQL();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   for (const policy of policyBlocks) {
@@ -63,61 +52,53 @@ export const assertions: EvalAssertion[] = [
       policy.includes("auth.uid()") &&
       !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
     ) {
-        return false;
+      expect(false).toBe(true);
+      return;
     }
   }
-      return true;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
+  expect(true).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for time columns",
-    check: () => {
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for time columns", () => {
   const rawSql = getMigrationSQL().toLowerCase();
   const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
   if (
     sql.includes("created_at") ||
     sql.includes("updated_at") ||
     sql.includes("due_date")
   ) {
-        return !hasPlainTimestamp.test(sql);
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
   }
-      return true;
-    },
-  },
-  {
-    name: "creates index on user_id column",
-    check: () => {
+});
+
+test("creates index on user_id column", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+index/.test(sql) && /user_id/.test(sql);
-    },
-  },
-  {
-    name: "does not use SERIAL or BIGSERIAL for primary key",
-    check: () => {
+  expect(/create\s+index/.test(sql) && /user_id/.test(sql)).toBe(true);
+});
+
+test("does not use SERIAL or BIGSERIAL for primary key", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return !/\bserial\b/.test(sql) && !/\bbigserial\b/.test(sql);
-    },
-  },
-  {
-    name: "migration is idempotent (uses IF NOT EXISTS)",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
+  expect(/\bserial\b/.test(sql)).toBe(false);
+  expect(/\bbigserial\b/.test(sql)).toBe(false);
+});
+
+test("migration is idempotent (uses IF NOT EXISTS)", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
   const sql = getMigrationSQL().toLowerCase();
   const signals = [
     /enable\s+row\s+level\s+security/,
@@ -126,25 +107,18 @@ export const assertions: EvalAssertion[] = [
     /on\s+delete\s+cascade/,
     /create\s+index/,
   ];
-      return signals.filter((r) => r.test(sql)).length >= 4;
-    },
-  },
-  {
-    name: "tasks table exists in the database after migration",
-    check: () => tableExists("tasks"),
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table is queryable with service role",
-    check: async () => {
+  expect(signals.filter((r) => r.test(sql)).length >= 4).toBe(true);
+});
+
+test("tasks table exists in the database after migration", async () => {
+  expect(await tableExists("tasks")).toBe(true);
+}, 10_000);
+
+test("tasks table is queryable with service role", async () => {
   const { error } = await queryTable("tasks", "service_role");
-      return error === null;
-    },
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table returns no rows for anon (RLS is active)",
-    check: () => anonSeeesNoRows("tasks"),
-    timeout: 10_000,
-  },
-];
+  expect(error === null).toBe(true);
+}, 10_000);
+
+test("tasks table returns no rows for anon (RLS is active)", async () => {
+  expect(await anonSeeesNoRows("tasks")).toBe(true);
+}, 10_000);
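The timestamptz assertion above leans on two negative lookaheads to reject plain `timestamp` while accepting both spellings of the timezone-aware type. Isolated, with invented sample SQL:

```typescript
// Same regex as the "timestamptz not plain timestamp" test above;
// the sample column definitions are made up for illustration.
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;

const good = "created_at timestamptz not null default now()";
const alsoGood = "created_at timestamp with time zone default now()";
const bad = "created_at timestamp not null";

console.log(hasPlainTimestamp.test(good));     // false
console.log(hasPlainTimestamp.test(alsoGood)); // false
console.log(hasPlainTimestamp.test(bad));      // true
```

Note the `\b` after `timestamp` already prevents a match inside the single token `timestamptz`; the `(?!\s*tz)` lookahead additionally covers a spaced `timestamp tz` spelling.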


@@ -0,0 +1,9 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"db-rls-mandatory.md",
"db-rls-policy-types.md",
"db-rls-common-mistakes.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-migrations-idempotent.md",
];


@@ -1,5 +1,8 @@
 {
   "name": "auth-rls-new-project",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }


@@ -1,11 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "edge-fun-quickstart.md",
-];
-
 import { readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const cwd = process.cwd();
@@ -27,79 +22,72 @@ function getReferenceContent(): string {
   return readFileSync(file, "utf-8");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "CLI_REFERENCE.md exists in project root",
-    check: () => findReferenceFile() !== null,
-  },
-  {
-    name: "no hallucinated functions log command",
-    check: () => {
+test("CLI_REFERENCE.md exists in project root", () => {
+  expect(findReferenceFile() !== null).toBe(true);
+});
+
+test("no hallucinated functions log command", () => {
   const content = getReferenceContent();
-      return (
-        !/`supabase\s+functions\s+log`/.test(content) &&
-        !/^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) &&
-        !/^\s*supabase\s+functions\s+log\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "no hallucinated db query command",
-    check: () => {
+  expect(
    /`supabase\s+functions\s+log`/.test(content) ||
      /^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) ||
      /^\s*supabase\s+functions\s+log\b/m.test(content),
+  ).toBe(false);
+});
+
+test("no hallucinated db query command", () => {
   const content = getReferenceContent();
-      return (
-        !/`supabase\s+db\s+query`/.test(content) &&
-        !/^\s*npx\s+supabase\s+db\s+query\b/m.test(content) &&
-        !/^\s*supabase\s+db\s+query\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase functions serve for local development",
-    check: () =>
+  expect(
    /`supabase\s+db\s+query`/.test(content) ||
      /^\s*npx\s+supabase\s+db\s+query\b/m.test(content) ||
      /^\s*supabase\s+db\s+query\b/m.test(content),
+  ).toBe(false);
+});
+
+test("mentions supabase functions serve for local development", () => {
+  expect(
     /supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions supabase functions deploy",
-    check: () =>
+  ).toBe(true);
+});
+
+test("mentions supabase functions deploy", () => {
+  expect(
     /supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions psql or SQL Editor or connection string for ad-hoc SQL",
-    check: () => {
+  ).toBe(true);
+});
+
+test("mentions psql or SQL Editor or connection string for ad-hoc SQL", () => {
   const content = getReferenceContent().toLowerCase();
-      return (
+  expect(
     /\bpsql\b/.test(content) ||
       /sql\s+editor/.test(content) ||
      /connection\s+string/.test(content) ||
-        /supabase\s+db\s+dump/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase db push or supabase db reset for migrations",
-    check: () => {
+      /supabase\s+db\s+dump/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase db push or supabase db reset for migrations", () => {
   const content = getReferenceContent().toLowerCase();
-      return (
+  expect(
     /supabase\s+db\s+push/.test(content) ||
-        /supabase\s+db\s+reset/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase start for local stack",
-    check: () => /supabase\s+start/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions Dashboard or Logs Explorer for production log viewing",
-    check: () => {
+      /supabase\s+db\s+reset/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase start for local stack", () => {
+  expect(/supabase\s+start/.test(getReferenceContent().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("mentions Dashboard or Logs Explorer for production log viewing", () => {
   const content = getReferenceContent().toLowerCase();
-      return /\bdashboard\b/.test(content) || /logs\s+explorer/.test(content);
-    },
-  },
-  {
-    name: "overall quality: uses real CLI commands throughout",
-    check: () => {
+  expect(/\bdashboard\b/.test(content) || /logs\s+explorer/.test(content)).toBe(
+    true,
+  );
+});
+
+test("overall quality: uses real CLI commands throughout", () => {
   const content = getReferenceContent().toLowerCase();
   const signals = [
     /supabase\s+start/,
@@ -122,7 +110,5 @@ export const assertions: EvalAssertion[] = [
   const hallucinationMatches = hallucinations.filter((r) =>
     r.test(content),
   ).length;
-      return positiveMatches >= 5 && hallucinationMatches === 0;
-    },
-  },
-];
+  expect(positiveMatches >= 5 && hallucinationMatches === 0).toBe(true);
+});
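The hallucination checks above combine three regexes per fake command: a backtick form for inline code spans plus two `m`-flagged anchors for commands at line starts (with and without `npx`). A standalone sketch against an invented sample document:

```typescript
// Same pattern as the "no hallucinated functions log command" test;
// the sample document is made up for illustration.
const hallucinated = [
  /`supabase\s+functions\s+log`/,
  /^\s*npx\s+supabase\s+functions\s+log\b/m,
  /^\s*supabase\s+functions\s+log\b/m,
];

const doc = [
  "## Logs",
  "Use the Dashboard Logs Explorer for production logs.",
  "supabase functions serve --env-file .env",
].join("\n");

// Flag the doc if any variant of the fake command appears.
const flagged = hallucinated.some((r) => r.test(doc));
console.log(flagged); // false
```

The `m` flag makes `^` match at every line start, so commands inside fenced blocks are caught without having to parse markdown.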


@@ -0,0 +1,4 @@
export const expectedReferenceFiles = [
"dev-getting-started.md",
"edge-fun-quickstart.md",
];


@@ -1,5 +1,8 @@
 {
   "name": "cli-hallucinated-commands",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }


@@ -1,71 +1,48 @@
export const expectedReferenceFiles = [ import { expect, test } from "vitest";
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-security-functions.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-schema-realtime.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
"realtime-setup-auth.md",
"realtime-broadcast-database.md",
"realtime-setup-channels.md",
];
import type { EvalAssertion } from "../../src/eval-types.js"; import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts"; test("migration file exists", () => {
expect(findMigrationFiles().length > 0).toBe(true);
});
export const assertions: EvalAssertion[] = [ test("creates rooms table", () => {
{ expect(
name: "migration file exists",
check: () => findMigrationFiles().length > 0,
},
{
name: "creates rooms table",
check: () =>
/create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()), /create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
}, ).toBe(true);
{ });
name: "creates room_members table",
check: () => { test("creates room_members table", () => {
const sql = getMigrationSQL().toLowerCase(); const sql = getMigrationSQL().toLowerCase();
return ( expect(
/create\s+table[\s\S]*?room_members/.test(sql) || /create\s+table[\s\S]*?room_members/.test(sql) ||
/create\s+table[\s\S]*?room_users/.test(sql) || /create\s+table[\s\S]*?room_users/.test(sql) ||
/create\s+table[\s\S]*?memberships/.test(sql) /create\s+table[\s\S]*?memberships/.test(sql),
); ).toBe(true);
}, });
},
{ test("creates content table", () => {
name: "creates content table",
check: () => {
const sql = getMigrationSQL().toLowerCase(); const sql = getMigrationSQL().toLowerCase();
return ( expect(
/create\s+table[\s\S]*?content/.test(sql) || /create\s+table[\s\S]*?content/.test(sql) ||
/create\s+table[\s\S]*?items/.test(sql) || /create\s+table[\s\S]*?items/.test(sql) ||
/create\s+table[\s\S]*?documents/.test(sql) || /create\s+table[\s\S]*?documents/.test(sql) ||
/create\s+table[\s\S]*?posts/.test(sql) || /create\s+table[\s\S]*?posts/.test(sql) ||
/create\s+table[\s\S]*?messages/.test(sql) /create\s+table[\s\S]*?messages/.test(sql),
); ).toBe(true);
}, });
},
{ test("room_members has role column with owner/editor/viewer", () => {
name: "room_members has role column with owner/editor/viewer",
check: () => {
const sql = getMigrationSQL().toLowerCase(); const sql = getMigrationSQL().toLowerCase();
return ( expect(
/role/.test(sql) && /role/.test(sql) &&
/owner/.test(sql) && /owner/.test(sql) &&
/editor/.test(sql) && /editor/.test(sql) &&
/viewer/.test(sql) /viewer/.test(sql),
); ).toBe(true);
}, });
},
{ test("enables RLS on all application tables", () => {
name: "enables RLS on all application tables",
check: () => {
const sql = getMigrationSQL().toLowerCase(); const sql = getMigrationSQL().toLowerCase();
const roomsRls = const roomsRls =
/alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test( /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
@@ -97,72 +74,66 @@ export const assertions: EvalAssertion[] = [
     /alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
       sql,
     );
-      return roomsRls && membershipRls && contentRls;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "content has room_id FK referencing rooms",
-    check: () =>
-      /room_id[\s\S]*?references[\s\S]*?rooms/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "policies use (select auth.uid())",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      if (policyBlocks.length === 0) return false;
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const appPolicies = policyBlocks.filter(
-        (p) => !p.includes("realtime.messages"),
-      );
-      return (
-        appPolicies.length > 0 &&
-        appPolicies.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "private schema with security_definer helper function",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /create\s+schema[\s\S]*?private/.test(sql) &&
-        /private\./.test(sql) &&
-        /security\s+definer/.test(sql) &&
-        /set\s+search_path\s*=\s*''/.test(sql)
-      );
-    },
-  },
-  {
-    name: "role-based write policies: content INSERT/UPDATE restricted to owner or editor",
-    check: () => {
+  expect(roomsRls && membershipRls && contentRls).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("content has room_id FK referencing rooms", () => {
+  expect(
+    /room_id[\s\S]*?references[\s\S]*?rooms/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("policies use (select auth.uid())", () => {
+  const sql = getMigrationSQL();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  if (policyBlocks.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  for (const policy of policyBlocks) {
+    if (
+      policy.includes("auth.uid()") &&
+      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
+    ) {
+      expect(false).toBe(true);
+      return;
+    }
+  }
+  expect(true).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const appPolicies = policyBlocks.filter(
+    (p) => !p.includes("realtime.messages"),
+  );
+  expect(
+    appPolicies.length > 0 &&
+      appPolicies.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("private schema with security_definer helper function", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /create\s+schema[\s\S]*?private/.test(sql) &&
+      /private\./.test(sql) &&
+      /security\s+definer/.test(sql) &&
+      /set\s+search_path\s*=\s*''/.test(sql),
+  ).toBe(true);
+});
+
+test("role-based write policies: content INSERT/UPDATE restricted to owner or editor", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const writePolicies = policyBlocks.filter(
@@ -174,14 +145,12 @@ export const assertions: EvalAssertion[] = [
       p.includes("posts") ||
       p.includes("messages")),
   );
-      return writePolicies.some(
-        (p) => p.includes("owner") || p.includes("editor"),
-      );
-    },
-  },
-  {
-    name: "viewer role is read-only (no write access to content)",
-    check: () => {
+  expect(
+    writePolicies.some((p) => p.includes("owner") || p.includes("editor")),
+  ).toBe(true);
+});
+
+test("viewer role is read-only (no write access to content)", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const contentWritePolicies = policyBlocks.filter(
@@ -193,8 +162,11 @@ export const assertions: EvalAssertion[] = [
       p.includes("posts") ||
       p.includes("messages")),
   );
-      if (contentWritePolicies.length === 0) return true;
-      return !contentWritePolicies.some((p) => {
+  if (contentWritePolicies.length === 0) {
+    expect(true).toBe(true);
+    return;
+  }
+  const result = !contentWritePolicies.some((p) => {
     const mentionsRole =
       p.includes("owner") || p.includes("editor") || p.includes("viewer");
     if (!mentionsRole) return true;
@@ -202,26 +174,26 @@ export const assertions: EvalAssertion[] = [
       p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
     );
   });
-    },
-  },
-  {
-    name: "indexes on membership lookup columns",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      if (!/create\s+index/.test(sql)) return false;
-      const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
-      return (
-        indexBlocks.filter(
-          (idx) =>
-            idx.toLowerCase().includes("user_id") ||
-            idx.toLowerCase().includes("room_id"),
-        ).length >= 1
-      );
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp",
-    check: () => {
+  expect(result).toBe(true);
+});
+
+test("indexes on membership lookup columns", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  if (!/create\s+index/.test(sql)) {
+    expect(false).toBe(true);
+    return;
+  }
+  const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
+  expect(
+    indexBlocks.filter(
+      (idx) =>
+        idx.toLowerCase().includes("user_id") ||
+        idx.toLowerCase().includes("room_id"),
+    ).length >= 1,
+  ).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp", () => {
   const rawSql = getMigrationSQL().toLowerCase();
   const sql = rawSql.replace(/--[^\n]*/g, "");
   const hasPlainTimestamp =
@@ -231,36 +203,33 @@ export const assertions: EvalAssertion[] = [
     sql.includes("updated_at") ||
     sql.includes("_at ")
   ) {
-        return !hasPlainTimestamp.test(sql);
-      }
-      return true;
-    },
-  },
-  {
-    name: "idempotent DDL",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "realtime publication enabled for content table",
-    check: () =>
-      /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "broadcast trigger for content changes",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        (/realtime\.broadcast_changes/.test(sql) ||
-          /realtime\.send/.test(sql)) &&
-        /create\s+trigger/.test(sql)
-      );
-    },
-  },
-  {
-    name: "broadcast trigger function uses security definer",
-    check: () => {
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
+  }
+});
+
+test("idempotent DDL", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("realtime publication enabled for content table", () => {
+  expect(
+    /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("broadcast trigger for content changes", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    (/realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql)) &&
+      /create\s+trigger/.test(sql),
+  ).toBe(true);
+});
+
+test("broadcast trigger function uses security definer", () => {
   const sql = getMigrationSQL().toLowerCase();
   const functionBlocks =
     sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
@@ -269,46 +238,52 @@ export const assertions: EvalAssertion[] = [
     f.toLowerCase().includes("realtime.broadcast_changes") ||
     f.toLowerCase().includes("realtime.send"),
   );
-      if (realtimeFunctions.length === 0) return false;
-      return realtimeFunctions.some(
-        (f) =>
-          /security\s+definer/.test(f.toLowerCase()) &&
-          /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
-      );
-    },
-  },
-  {
-    name: "RLS policies on realtime.messages",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const realtimePolicies = policyBlocks.filter((p) =>
-        p.includes("realtime.messages"),
-      );
-      if (realtimePolicies.length === 0) return false;
-      return realtimePolicies.some(
-        (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
-      );
-    },
-  },
-  {
-    name: "realtime policy checks extension column",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const realtimePolicies = policyBlocks.filter((p) =>
-        p.includes("realtime.messages"),
-      );
-      return realtimePolicies.some(
-        (p) =>
-          p.includes("extension") &&
-          (p.includes("broadcast") || p.includes("presence")),
-      );
-    },
-  },
-  {
-    name: "overall quality score",
-    check: () => {
+  if (realtimeFunctions.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
+    realtimeFunctions.some(
+      (f) =>
+        /security\s+definer/.test(f.toLowerCase()) &&
+        /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
+    ),
+  ).toBe(true);
+});
+
+test("RLS policies on realtime.messages", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const realtimePolicies = policyBlocks.filter((p) =>
+    p.includes("realtime.messages"),
+  );
+  if (realtimePolicies.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
+    realtimePolicies.some(
+      (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
+    ),
+  ).toBe(true);
+});
+
+test("realtime policy checks extension column", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const realtimePolicies = policyBlocks.filter((p) =>
+    p.includes("realtime.messages"),
+  );
+  expect(
+    realtimePolicies.some(
+      (p) =>
+        p.includes("extension") &&
+        (p.includes("broadcast") || p.includes("presence")),
+    ),
+  ).toBe(true);
+});
+
+test("overall quality score", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -321,24 +296,19 @@ export const assertions: EvalAssertion[] = [
     /alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
       sql,
     ),
-    /references\s+auth\.users/.test(sql) &&
-      /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     /create\s+schema[\s\S]*?private/.test(sql),
-    /security\s+definer/.test(sql) &&
-      /set\s+search_path\s*=\s*''/.test(sql),
+    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
     /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
     policyBlocks.length > 0 &&
-      policyBlocks.filter((p) => !p.includes("realtime.messages")).length >
-        0 &&
+      policyBlocks.filter((p) => !p.includes("realtime.messages")).length > 0 &&
       policyBlocks
         .filter((p) => !p.includes("realtime.messages"))
         .every((p) => /to\s+authenticated/.test(p)),
     /create\s+index/.test(sql),
     /timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
     /if\s+not\s+exists/.test(sql),
-    sql.includes("owner") &&
-      sql.includes("editor") &&
-      sql.includes("viewer"),
+    sql.includes("owner") && sql.includes("editor") && sql.includes("viewer"),
     /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
     /realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
     /create\s+trigger/.test(sql),
@@ -348,7 +318,5 @@ export const assertions: EvalAssertion[] = [
     .some((p) => p.includes("extension")),
     /room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
   ];
-      return signals.filter(Boolean).length >= 13;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 13).toBe(true);
+});

@@ -0,0 +1,14 @@
+export const expectedReferenceFiles = [
+  "db-rls-mandatory.md",
+  "db-rls-common-mistakes.md",
+  "db-rls-performance.md",
+  "db-security-functions.md",
+  "db-schema-auth-fk.md",
+  "db-schema-timestamps.md",
+  "db-schema-realtime.md",
+  "db-perf-indexes.md",
+  "db-migrations-idempotent.md",
+  "realtime-setup-auth.md",
+  "realtime-broadcast-database.md",
+  "realtime-setup-channels.md",
+];
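Each scenario now ships an `expectedReferenceFiles` manifest like the one above. As a rough sketch of how such a manifest could back a coverage check, the helper below shows one possible shape; the helper name and harness wiring are assumptions, not part of this commit:

```typescript
// Hypothetical helper: given a scenario's expected reference docs and the
// list of files the agent actually consulted, report what was missed.
export function missingReferenceFiles(
  expected: readonly string[],
  consulted: readonly string[],
): string[] {
  const seen = new Set(consulted);
  return expected.filter((file) => !seen.has(file));
}
```

A spec could then assert that `missingReferenceFiles(expectedReferenceFiles, filesRead)` comes back empty.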

@@ -1,5 +1,8 @@
 {
   "name": "collaborative-rooms-realtime",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

@@ -1,12 +1,6 @@
-export const expectedReferenceFiles = [
-  "db-conn-pooling.md",
-  "db-migrations-idempotent.md",
-  "db-schema-auth-fk.md",
-];
 import { existsSync, readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const cwd = process.cwd();
@@ -65,59 +59,51 @@ function getAllOutputContent(): string {
   return parts.join("\n");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "prisma schema file exists",
-    check: () => findPrismaSchema() !== null,
-  },
-  {
-    name: "prisma schema references pooler port 6543",
-    check: () => /6543/.test(getAllOutputContent()),
-  },
-  {
-    name: "pgbouncer=true param present",
-    check: () =>
-      /pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase()),
-  },
-  {
-    name: "DIRECT_URL provided for migrations",
-    check: () => {
-      const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
-      return /directurl/.test(allContent) || /direct_url/.test(allContent);
-    },
-  },
-  {
-    name: "datasource block references directUrl or DIRECT_URL env var",
-    check: () => {
-      const schema = getPrismaSchema().toLowerCase();
-      const datasourceBlock =
-        schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
-      return (
-        /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock)
-      );
-    },
-  },
-  {
-    name: "connection limit set to 1 for serverless",
-    check: () => {
-      const content = getAllOutputContent().toLowerCase();
-      return (
-        /connection_limit\s*=\s*1/.test(content) ||
-        /connection_limit:\s*1/.test(content) ||
-        /connectionlimit\s*=\s*1/.test(content)
-      );
-    },
-  },
-  {
-    name: "explanation distinguishes port 6543 vs 5432",
-    check: () => {
-      const content = getAllOutputContent();
-      return /6543/.test(content) && /5432/.test(content);
-    },
-  },
-  {
-    name: "overall quality: demonstrates correct Prisma + Supabase pooler setup",
-    check: () => {
+test("prisma schema file exists", () => {
+  expect(findPrismaSchema() !== null).toBe(true);
+});
+
+test("prisma schema references pooler port 6543", () => {
+  expect(/6543/.test(getAllOutputContent())).toBe(true);
+});
+
+test("pgbouncer=true param present", () => {
+  expect(/pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("DIRECT_URL provided for migrations", () => {
+  const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
+  expect(/directurl/.test(allContent) || /direct_url/.test(allContent)).toBe(
+    true,
+  );
+});
+
+test("datasource block references directUrl or DIRECT_URL env var", () => {
+  const schema = getPrismaSchema().toLowerCase();
+  const datasourceBlock =
+    schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
+  expect(
+    /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock),
+  ).toBe(true);
+});
+
+test("connection limit set to 1 for serverless", () => {
+  const content = getAllOutputContent().toLowerCase();
+  expect(
+    /connection_limit\s*=\s*1/.test(content) ||
+      /connection_limit:\s*1/.test(content) ||
+      /connectionlimit\s*=\s*1/.test(content),
+  ).toBe(true);
+});
+
+test("explanation distinguishes port 6543 vs 5432", () => {
+  const content = getAllOutputContent();
+  expect(/6543/.test(content) && /5432/.test(content)).toBe(true);
+});
+
+test("overall quality: demonstrates correct Prisma + Supabase pooler setup", () => {
   const schema = getPrismaSchema().toLowerCase();
   const envContent = getAllEnvContent().toLowerCase();
   const allContent = `${schema}\n${envContent}`;
@@ -128,7 +114,5 @@ export const assertions: EvalAssertion[] = [
     /connection_limit\s*=\s*1|connection_limit:\s*1/,
     /5432/,
   ];
-      return signals.filter((r) => r.test(allContent)).length >= 4;
-    },
-  },
-];
+  expect(signals.filter((r) => r.test(allContent)).length >= 4).toBe(true);
+});

@@ -0,0 +1,5 @@
+export const expectedReferenceFiles = [
+  "db-conn-pooling.md",
+  "db-migrations-idempotent.md",
+  "db-schema-auth-fk.md",
+];

@@ -1,5 +1,8 @@
 {
   "name": "connection-pooling-prisma",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

@@ -1,14 +1,6 @@
-export const expectedReferenceFiles = [
-  "edge-fun-quickstart.md",
-  "edge-fun-project-structure.md",
-  "edge-pat-cors.md",
-  "edge-pat-error-handling.md",
-  "dev-getting-started.md",
-];
 import { existsSync, readdirSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 import {
   findFunctionFile,
@@ -17,7 +9,7 @@ import {
   getFunctionsDir,
   getSharedCode,
   getSupabaseDir,
-} from "../eval-utils.ts";
+} from "./eval-utils.ts";
 
 const FUNCTION_NAME = "hello-world";
@@ -33,61 +25,57 @@ function getCatchBlockCode(): string {
   return code.slice(catchIndex);
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "supabase project initialized",
-    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
-  },
-  {
-    name: "function directory exists",
-    check: () => existsSync(join(getFunctionsDir(), FUNCTION_NAME)),
-  },
-  {
-    name: "function index file exists",
-    check: () => findFunctionFile(FUNCTION_NAME) !== null,
-  },
-  {
-    name: "uses Deno.serve",
-    check: () => /Deno\.serve/.test(getFunctionCode(FUNCTION_NAME)),
-  },
-  {
-    name: "returns JSON response",
-    check: () => {
-      const allCode = getAllCode();
-      return (
-        /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
-        /Response\.json/i.test(allCode) ||
-        /JSON\.stringify/i.test(allCode)
-      );
-    },
-  },
-  {
-    name: "handles OPTIONS preflight",
-    check: () => {
-      const allCode = getAllCode();
-      return /['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode);
-    },
-  },
-  {
-    name: "defines CORS headers",
-    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
-  },
-  {
-    name: "CORS allows required headers",
-    check: () => {
-      const allCode = getAllCode().toLowerCase();
-      return (
-        /access-control-allow-headers/.test(allCode) &&
-        /authorization/.test(allCode) &&
-        /apikey/.test(allCode)
-      );
-    },
-  },
-  {
-    name: "error response has CORS headers",
-    check: () => {
-      const catchCode = getCatchBlockCode();
-      if (catchCode.length === 0) return false;
+test("supabase project initialized", () => {
+  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
+});
+
+test("function directory exists", () => {
+  expect(existsSync(join(getFunctionsDir(), FUNCTION_NAME))).toBe(true);
+});
+
+test("function index file exists", () => {
+  expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
+});
+
+test("uses Deno.serve", () => {
+  expect(/Deno\.serve/.test(getFunctionCode(FUNCTION_NAME))).toBe(true);
+});
+
+test("returns JSON response", () => {
+  const allCode = getAllCode();
+  expect(
+    /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
+      /Response\.json/i.test(allCode) ||
+      /JSON\.stringify/i.test(allCode),
+  ).toBe(true);
+});
+
+test("handles OPTIONS preflight", () => {
+  const allCode = getAllCode();
+  expect(/['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode)).toBe(
+    true,
+  );
+});
+
+test("defines CORS headers", () => {
+  expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
+});
+
+test("CORS allows required headers", () => {
+  const allCode = getAllCode().toLowerCase();
+  expect(
+    /access-control-allow-headers/.test(allCode) &&
+      /authorization/.test(allCode) &&
+      /apikey/.test(allCode),
+  ).toBe(true);
+});
+
+test("error response has CORS headers", () => {
+  const catchCode = getCatchBlockCode();
+  if (catchCode.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
   const sharedCode = getSharedCode();
   const directCors =
     /corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
@@ -95,51 +83,45 @@ export const assertions: EvalAssertion[] = [
     /errorResponse|jsonResponse|json_response|error_response/i.test(
       catchCode,
     ) && /Access-Control-Allow-Origin/i.test(sharedCode);
-      return directCors || callsSharedHelper;
-    },
-  },
-  {
-    name: "has try-catch for error handling",
-    check: () => {
-      const code = getFunctionCode(FUNCTION_NAME);
-      return /\btry\s*\{/.test(code) && /\bcatch\b/.test(code);
-    },
-  },
-  {
-    name: "returns proper error status code",
-    check: () => {
-      const catchCode = getCatchBlockCode();
-      if (catchCode.length === 0) return false;
-      return (
-        /status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
-        /[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode)
-      );
-    },
-  },
-  {
-    name: "shared CORS module exists",
-    check: () => findSharedCorsFile() !== null,
-  },
-  {
-    name: "function imports from shared",
-    check: () =>
-      /from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
-  },
-  {
-    name: "function uses hyphenated name",
-    check: () => {
-      const dirs = existsSync(getFunctionsDir())
-        ? readdirSync(getFunctionsDir())
-        : [];
-      const helloDir = dirs.find(
-        (d) => d.includes("hello") && d.includes("world"),
-      );
-      return helloDir !== undefined && /^hello-world$/.test(helloDir);
-    },
-  },
-  {
-    name: "overall quality: demonstrates Edge Function best practices",
-    check: () => {
+  expect(directCors || callsSharedHelper).toBe(true);
+});
+
+test("has try-catch for error handling", () => {
+  const code = getFunctionCode(FUNCTION_NAME);
+  expect(/\btry\s*\{/.test(code) && /\bcatch\b/.test(code)).toBe(true);
+});
+
+test("returns proper error status code", () => {
+  const catchCode = getCatchBlockCode();
+  if (catchCode.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
+    /status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
+      /[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode),
+  ).toBe(true);
+});
+
+test("shared CORS module exists", () => {
+  expect(findSharedCorsFile() !== null).toBe(true);
+});
+
+test("function imports from shared", () => {
  expect(
+    /from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
+  ).toBe(true);
+});
+
+test("function uses hyphenated name", () => {
+  const dirs = existsSync(getFunctionsDir())
+    ? readdirSync(getFunctionsDir())
+    : [];
+  const helloDir = dirs.find((d) => d.includes("hello") && d.includes("world"));
+  expect(helloDir !== undefined && /^hello-world$/.test(helloDir)).toBe(true);
+});
+
+test("overall quality: demonstrates Edge Function best practices", () => {
   const allCode = getAllCode().toLowerCase();
   const signals = [
     /deno\.serve/,
@@ -151,7 +133,5 @@ export const assertions: EvalAssertion[] = [
     /authorization/,
     /apikey/,
   ];
-      return signals.filter((r) => r.test(allCode)).length >= 6;
-    },
-  },
-];
+  expect(signals.filter((r) => r.test(allCode)).length >= 6).toBe(true);
+});
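Several specs above now import shared helpers from a sibling `./eval-utils.ts`. As a minimal sketch of what the migration-oriented helpers might look like (an assumption about the module's shape, not its actual source):

```typescript
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Assumed default location for agent-produced migrations.
const defaultDir = join(process.cwd(), "supabase", "migrations");

// Collect every SQL migration file, sorted by their timestamp prefix.
export function findMigrationFiles(dir: string = defaultDir): string[] {
  if (!existsSync(dir)) return [];
  return readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .sort()
    .map((f) => join(dir, f));
}

// Concatenate all migrations so regex assertions can scan a single string.
export function getMigrationSQL(dir: string = defaultDir): string {
  return findMigrationFiles(dir)
    .map((f) => readFileSync(f, "utf-8"))
    .join("\n");
}
```

Joining everything into one string keeps the assertions simple at the cost of not knowing which migration matched; that trade-off fits regex-style checks like the ones in these specs.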

@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "edge-fun-quickstart.md",
+  "edge-fun-project-structure.md",
+  "edge-pat-cors.md",
+  "edge-pat-error-handling.md",
+  "dev-getting-started.md",
+];

@@ -1,5 +1,8 @@
 {
   "name": "edge-function-hello-world",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

@@ -1,83 +1,70 @@
-export const expectedReferenceFiles = [
-  "db-schema-extensions.md",
-  "db-rls-mandatory.md",
-  "db-migrations-idempotent.md",
-  "db-schema-auth-fk.md",
-  "db-rls-common-mistakes.md",
-];
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "extension installed in extensions schema",
-    check: () =>
-      /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "IF NOT EXISTS on extension creation",
-    check: () =>
-      /create\s+extension\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "vector column with correct dimensions",
-    check: () =>
-      /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "HNSW index used instead of IVFFlat",
-    check: () => /using\s+hnsw/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
-      /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "idempotent table creation (IF NOT EXISTS)",
-    check: () =>
-      /create\s+table\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "overall quality: demonstrates pgvector best practices",
-    check: () => {
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("extension installed in extensions schema", () => {
+  expect(
+    /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("IF NOT EXISTS on extension creation", () => {
+  expect(
+    /create\s+extension\s+if\s+not\s+exists/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("vector column with correct dimensions", () => {
+  expect(
+    /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("HNSW index used instead of IVFFlat", () => {
+  expect(/using\s+hnsw/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
+    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("idempotent table creation (IF NOT EXISTS)", () => {
+  expect(
+    /create\s+table\s+if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates pgvector best practices", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -88,13 +75,10 @@ export const assertions: EvalAssertion[] = [
     /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
       sql,
     ),
-    /references\s+auth\.users/.test(sql) &&
-      /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     policyBlocks.length > 0 &&
       policyBlocks.every((p) => /to\s+authenticated/.test(p)),
     /if\s+not\s+exists/.test(sql),
   ];
-      return signals.filter(Boolean).length >= 6;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 6).toBe(true);
+});

@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-schema-extensions.md",
+  "db-rls-mandatory.md",
+  "db-migrations-idempotent.md",
+  "db-schema-auth-fk.md",
+  "db-rls-common-mistakes.md",
+];

@@ -1,5 +1,8 @@
 {
   "name": "extension-wrong-schema",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

@@ -1,14 +1,6 @@
-export const expectedReferenceFiles = [
-  "db-rls-views.md",
-  "db-migrations-idempotent.md",
-  "db-rls-mandatory.md",
-  "db-rls-performance.md",
-  "db-schema-timestamps.md",
-];
 import { existsSync, readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const migrationsDir = join(process.cwd(), "supabase", "migrations");
 const STARTER_MIGRATION = "20240101000000_create_products.sql";
@@ -29,71 +21,70 @@ function getAgentMigrationSQL(): string {
   return files.map((f) => readFileSync(f, "utf-8")).join("\n");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "new migration file exists",
-    check: () => findAgentMigrationFiles().length > 0,
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for description",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+description/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for published_at",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "published_at uses timestamptz not plain timestamp",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      return (
-        /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
-          sql,
-        ) &&
-        !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
-          sql,
-        )
-      );
-    },
-  },
-  {
-    name: "view public_products is created",
-    check: () =>
-      /create\s+(or\s+replace\s+)?view\s+public_products/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "view uses security_invoker = true",
-    check: () =>
-      /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "SELECT policy on products for authenticated role",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some(
-        (p) =>
-          p.includes("select") &&
-          p.includes("products") &&
-          /to\s+authenticated/.test(p),
-      );
-    },
-  },
-  {
-    name: "NOTIFY pgrst reload schema is present",
-    check: () => /notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates PostgREST and schema best practices",
-    check: () => {
+test("new migration file exists", () => {
+  expect(findAgentMigrationFiles().length > 0).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for description", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+description/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for published_at", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("published_at uses timestamptz not plain timestamp", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  expect(
+    /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
+      sql,
+    ) &&
+      !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(sql),
+  ).toBe(true);
+});
+
+test("view public_products is created", () => {
+  expect(
+    /create\s+(or\s+replace\s+)?view\s+public_products/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("view uses security_invoker = true", () => {
+  expect(
+    /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("SELECT policy on products for authenticated role", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.some(
+      (p) =>
+        p.includes("select") &&
+        p.includes("products") &&
+        /to\s+authenticated/.test(p),
+    ),
+  ).toBe(true);
+});
+
+test("NOTIFY pgrst reload schema is present", () => {
+  expect(/notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("overall quality: demonstrates PostgREST and schema best practices", () => {
   const sql = getAgentMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -108,7 +99,5 @@ export const assertions: EvalAssertion[] = [
     ),
     /notify\s+pgrst/.test(sql),
   ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});

View File

@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-views.md",
"db-migrations-idempotent.md",
"db-rls-mandatory.md",
"db-rls-performance.md",
"db-schema-timestamps.md",
];

View File

@@ -1,5 +1,8 @@
 {
   "name": "postgrest-schema-cache",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

View File

@@ -1,65 +1,49 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-timestamps.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates orders table",
-    check: () => {
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates orders table", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /orders/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on orders table",
-    check: () =>
+  expect(/create\s+table/.test(sql) && /orders/.test(sql)).toBe(true);
+});
+
+test("enables RLS on orders table", () => {
+  expect(
     /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "has SELECT policy on orders",
-    check: () => {
+  ).toBe(true);
+});
+
+test("has SELECT policy on orders", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some((p) => p.includes("for select"));
-    },
-  },
-  {
-    name: "has UPDATE policy with WITH CHECK on orders",
-    check: () => {
+  expect(policyBlocks.some((p) => p.includes("for select"))).toBe(true);
+});
+
+test("has UPDATE policy with WITH CHECK on orders", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
-      return updatePolicy !== undefined && /with\s+check/.test(updatePolicy);
-    },
-  },
-  {
-    name: "all policies use TO authenticated",
-    check: () => {
+  expect(updatePolicy !== undefined && /with\s+check/.test(updatePolicy)).toBe(
+    true,
+  );
+});
+
+test("all policies use TO authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
   const sql = getMigrationSQL();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   for (const policy of policyBlocks) {
@@ -67,38 +51,32 @@ export const assertions: EvalAssertion[] = [
       policy.includes("auth.uid()") &&
       !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
     ) {
-        return false;
+      expect(false).toBe(true);
+      return;
     }
   }
-      return true;
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for created_at",
-    check: () => {
+  expect(true).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for created_at", () => {
   const rawSql = getMigrationSQL().toLowerCase();
   const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
   if (sql.includes("created_at")) {
-        return !hasPlainTimestamp.test(sql);
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
   }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -110,13 +88,10 @@ export const assertions: EvalAssertion[] = [
     /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
     policyBlocks.length > 0 &&
       policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-      /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
       sql.replace(/--[^\n]*/g, ""),
     ),
   ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});

View File

@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-timestamps.md",
];

View File

@@ -1,5 +1,8 @@
 {
   "name": "rls-update-needs-select",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
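The converted test files all import `findMigrationFiles`, `getMigrationSQL`, and `getSupabaseDir` from a sibling `eval-utils.ts` that this diff does not show. A minimal sketch consistent with those call sites might look like the following; the `EVAL_WORKSPACE` variable name and directory layout are assumptions, not taken from this commit:

```typescript
import { readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Hypothetical: the real helpers read the agent's output workspace;
// the env var name EVAL_WORKSPACE is an assumption.
function workspaceDir(): string {
  return process.env.EVAL_WORKSPACE ?? process.cwd();
}

export function getSupabaseDir(): string {
  return join(workspaceDir(), "supabase");
}

// All .sql files under supabase/migrations/, sorted so timestamped
// filenames come back in chronological order.
export function findMigrationFiles(): string[] {
  const dir = join(getSupabaseDir(), "migrations");
  try {
    return readdirSync(dir)
      .filter((f) => f.endsWith(".sql"))
      .sort()
      .map((f) => join(dir, f));
  } catch {
    return []; // agent produced no migrations directory
  }
}

// Concatenate every migration so the regex assertions see all DDL at once.
export function getMigrationSQL(): string {
  return findMigrationFiles()
    .map((f) => readFileSync(f, "utf8"))
    .join("\n");
}
```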

View File

@@ -1,78 +1,56 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-auth-fk.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates documents table",
-    check: () => {
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates documents table", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /documents/.test(sql);
-    },
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
+  expect(/create\s+table/.test(sql) && /documents/.test(sql)).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
     /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "uses app_metadata not user_metadata for role check",
-    check: () => /app_metadata/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "user_metadata does not appear in policy USING clauses",
-    check: () => {
+  ).toBe(true);
+});
+
+test("uses app_metadata not user_metadata for role check", () => {
+  expect(/app_metadata/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("user_metadata does not appear in policy USING clauses", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.every((p) => !p.includes("user_metadata"));
-    },
-  },
-  {
-    name: "has at least two SELECT policies (owner and admin)",
-    check: () => {
+  expect(policyBlocks.every((p) => !p.includes("user_metadata"))).toBe(true);
+});
+
+test("has at least two SELECT policies (owner and admin)", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const hasOwnerPolicy = policyBlocks.some(
     (p) =>
       (p.includes("select") || !p.includes("insert")) &&
-        (p.includes("user_id") ||
-          p.includes("owner") ||
-          p.includes("auth.uid")),
+      (p.includes("user_id") || p.includes("owner") || p.includes("auth.uid")),
   );
-      const hasAdminPolicy = policyBlocks.some((p) =>
-        p.includes("app_metadata"),
-      );
-      return hasOwnerPolicy && hasAdminPolicy;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
+  const hasAdminPolicy = policyBlocks.some((p) => p.includes("app_metadata"));
+  expect(hasOwnerPolicy && hasAdminPolicy).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) subselect form in policies",
-    check: () => {
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses (select auth.uid()) subselect form in policies", () => {
   const sql = getMigrationSQL();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   for (const policy of policyBlocks) {
@@ -80,25 +58,21 @@ export const assertions: EvalAssertion[] = [
     policy.includes("auth.uid()") &&
     !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
   ) {
-        return false;
+      expect(false).toBe(true);
+      return;
     }
   }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
+  expect(true).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
+  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -108,16 +82,11 @@ export const assertions: EvalAssertion[] = [
     /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
     policyBlocks.length > 0 &&
       policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-      /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     policyBlocks.some(
       (p) =>
-          p.includes("user_id") ||
-          p.includes("owner") ||
-          p.includes("auth.uid"),
+        p.includes("user_id") || p.includes("owner") || p.includes("auth.uid"),
     ) && policyBlocks.some((p) => p.includes("app_metadata")),
   ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});

View File

@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-rls-common-mistakes.md",
"db-rls-policy-types.md",
"db-rls-performance.md",
"db-rls-mandatory.md",
"db-schema-auth-fk.md",
];

View File

@@ -1,5 +1,8 @@
 {
   "name": "rls-user-metadata-role-check",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

View File

@@ -1,21 +1,13 @@
-export const expectedReferenceFiles = [
-  "db-security-service-role.md",
-  "edge-fun-quickstart.md",
-  "edge-db-supabase-client.md",
-  "edge-pat-cors.md",
-  "edge-pat-error-handling.md",
-];
-
 import { existsSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 import {
   findFunctionFile,
   getFunctionCode,
   getSharedCode,
   getSupabaseDir,
-} from "../eval-utils.ts";
+} from "./eval-utils.ts";
 
 const FUNCTION_NAME = "admin-reports";
 
@@ -24,69 +16,63 @@
   return `${code}\n${getSharedCode()}`;
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "supabase project initialized (config.toml exists)",
-    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
-  },
-  {
-    name: "edge function file exists",
-    check: () => findFunctionFile(FUNCTION_NAME) !== null,
-  },
-  {
-    name: "uses Deno.env.get for service role key",
-    check: () =>
+test("supabase project initialized (config.toml exists)", () => {
+  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
+});
+
+test("edge function file exists", () => {
+  expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
+});
+
+test("uses Deno.env.get for service role key", () => {
+  expect(
     /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
       getAllCode(),
     ),
-  },
-  {
-    name: "no hardcoded service role key",
-    check: () => {
+  ).toBe(true);
+});
+
+test("no hardcoded service role key", () => {
   const allCode = getAllCode();
   const lines = allCode.split("\n");
   const nonCommentLines = lines.filter(
     (line) => !line.trimStart().startsWith("//"),
   );
-      return !nonCommentLines.some((line) =>
+  expect(
+    nonCommentLines.some((line) =>
       /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
-      );
-    },
-  },
-  {
-    name: "createClient called with service role env var as second argument",
-    check: () => {
+    ),
+  ).toBe(false);
+});
+
+test("createClient called with service role env var as second argument", () => {
   const allCode = getAllCode();
-      return (
+  expect(
     /createClient/i.test(allCode) &&
-        /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
-          allCode,
-        )
-      );
-    },
-  },
-  {
-    name: "service role key env var name does not use NEXT_PUBLIC_ prefix",
-    check: () => !/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode()),
-  },
-  {
-    name: "CORS headers present",
-    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
-  },
-  {
-    name: "returns JSON response",
-    check: () => {
+      /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
        allCode,
      ),
+  ).toBe(true);
+});
+
+test("service role key env var name does not use NEXT_PUBLIC_ prefix", () => {
+  expect(/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode())).toBe(false);
+});
+
+test("CORS headers present", () => {
+  expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
+});
+
+test("returns JSON response", () => {
   const allCode = getAllCode();
-      return (
+  expect(
     /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
     /Response\.json/i.test(allCode) ||
-        /JSON\.stringify/i.test(allCode)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates service role Edge Function best practices",
-    check: () => {
+      /JSON\.stringify/i.test(allCode),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates service role Edge Function best practices", () => {
   const allCode = getAllCode();
   const signals: RegExp[] = [
     /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
@@ -96,7 +82,5 @@ export const assertions: EvalAssertion[] = [
     /Response\.json|JSON\.stringify/,
     /Deno\.serve/,
   ];
-      return signals.filter((r) => r.test(allCode)).length >= 5;
-    },
-  },
-];
+  expect(signals.filter((r) => r.test(allCode)).length >= 5).toBe(true);
+});

View File

@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
"db-security-service-role.md",
"edge-fun-quickstart.md",
"edge-db-supabase-client.md",
"edge-pat-cors.md",
"edge-pat-error-handling.md",
];

View File

@@ -1,5 +1,8 @@
 {
   "name": "service-role-edge-function",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
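The Edge Function scenario above also leans on function-oriented helpers (`findFunctionFile`, `getFunctionCode`, `getSharedCode`) whose implementation is not part of this diff. A hypothetical sketch matching the call sites, assuming the conventional `supabase/functions/<name>/index.ts` layout:

```typescript
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Assumption: workspace root comes from an env var, as in the
// migration-helper sketch; the real eval-utils.ts may differ.
function getSupabaseDir(): string {
  return join(process.env.EVAL_WORKSPACE ?? process.cwd(), "supabase");
}

// Look for the entry point under supabase/functions/<name>/.
export function findFunctionFile(name: string): string | null {
  const candidate = join(getSupabaseDir(), "functions", name, "index.ts");
  return existsSync(candidate) ? candidate : null;
}

export function getFunctionCode(name: string): string {
  const file = findFunctionFile(name);
  return file ? readFileSync(file, "utf8") : "";
}

// Code shared between functions conventionally lives in _shared/.
export function getSharedCode(): string {
  const dir = join(getSupabaseDir(), "functions", "_shared");
  if (!existsSync(dir)) return "";
  return readdirSync(dir)
    .filter((f) => f.endsWith(".ts"))
    .map((f) => readFileSync(join(dir, f), "utf8"))
    .join("\n");
}
```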

View File

@@ -1,91 +1,74 @@
-export const expectedReferenceFiles = [
-  "storage-access-control.md",
-  "db-rls-mandatory.md",
-  "db-rls-common-mistakes.md",
-  "db-rls-performance.md",
-  "db-schema-auth-fk.md",
-  "db-schema-timestamps.md",
-  "db-perf-indexes.md",
-  "db-migrations-idempotent.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates avatars bucket",
-    check: () => {
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates avatars bucket", () => {
   const sql = getMigrationSQL().toLowerCase();
   if (
     !/storage\.buckets/.test(sql) ||
     !/avatars/.test(sql) ||
     !/public/.test(sql)
-      )
-        return false;
+  ) {
+    expect(false).toBe(true);
+    return;
+  }
   const avatarsBlock = sql.match(
     /insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
   );
-      return avatarsBlock !== null && /true/.test(avatarsBlock[0]);
-    },
-  },
-  {
-    name: "creates documents bucket",
-    check: () => {
+  expect(avatarsBlock !== null && /true/.test(avatarsBlock[0])).toBe(true);
+});
+
+test("creates documents bucket", () => {
   const sql = getMigrationSQL().toLowerCase();
-      if (!/documents/.test(sql)) return false;
+  if (!/documents/.test(sql)) {
+    expect(false).toBe(true);
+    return;
+  }
   const documentsBlock = sql.match(
     /insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
   );
-      return documentsBlock !== null && /false/.test(documentsBlock[0]);
-    },
-  },
-  {
-    name: "avatars bucket has mime type restriction",
-    check: () => {
+  expect(documentsBlock !== null && /false/.test(documentsBlock[0])).toBe(true);
+});
+
+test("avatars bucket has mime type restriction", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
+  expect(
     /allowed_mime_types/.test(sql) &&
     /image\/jpeg/.test(sql) &&
     /image\/png/.test(sql) &&
-        /image\/webp/.test(sql)
-      );
-    },
-  },
-  {
-    name: "avatars bucket has file size limit",
-    check: () => {
+      /image\/webp/.test(sql),
+  ).toBe(true);
+});
+
+test("avatars bucket has file size limit", () => {
   const sql = getMigrationSQL().toLowerCase();
-      if (!/file_size_limit/.test(sql)) return false;
-      return (
+  if (!/file_size_limit/.test(sql)) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
     /2097152/.test(sql) ||
     /2\s*m/i.test(sql) ||
-        /2\s*\*\s*1024\s*\*\s*1024/.test(sql)
-      );
-    },
-  },
-  {
-    name: "storage policy uses foldername or path for user isolation",
-    check: () => {
+      /2\s*\*\s*1024\s*\*\s*1024/.test(sql),
+  ).toBe(true);
+});
+
+test("storage policy uses foldername or path for user isolation", () => {
   const sql = getMigrationSQL().toLowerCase();
   const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
   const usesPathMatch =
     /\(\s*storage\.foldername\s*\(/.test(sql) ||
     /\bname\b.*auth\.uid\(\)/.test(sql);
-      return (
-        (usesFoldername || usesPathMatch) &&
-        /auth\.uid\(\)\s*::\s*text/.test(sql)
-      );
-    },
-  },
-  {
-    name: "storage policy uses TO authenticated",
-    check: () => {
+  expect(
+    (usesFoldername || usesPathMatch) && /auth\.uid\(\)\s*::\s*text/.test(sql),
+  ).toBe(true);
+});
+
+test("storage policy uses TO authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const storagePolicies = policyBlocks.filter((p) =>
@@ -96,20 +79,23 @@ export const assertions: EvalAssertion[] = [
     /to\s+(authenticated|public)/.test(p.toLowerCase()) ||
     /auth\.uid\(\)/.test(p.toLowerCase()),
   );
-      if (!hasAuthenticatedPolicy) return false;
+  if (!hasAuthenticatedPolicy) {
+    expect(false).toBe(true);
+    return;
+  }
   const insertPolicies = storagePolicies.filter((p) =>
     /for\s+insert/.test(p.toLowerCase()),
   );
-      return insertPolicies.every(
+  expect(
+    insertPolicies.every(
       (p) =>
         /to\s+authenticated/.test(p.toLowerCase()) ||
         /auth\.uid\(\)/.test(p.toLowerCase()),
-      );
-    },
-  },
-  {
-    name: "public read policy for avatars",
-    check: () => {
+    ),
+  ).toBe(true);
+});
+
+test("public read policy for avatars", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const avatarSelectPolicies = policyBlocks.filter(
@@ -118,20 +104,23 @@ export const assertions: EvalAssertion[] = [
     /for\s+select/.test(p.toLowerCase()) &&
     p.toLowerCase().includes("avatars"),
   );
-      if (avatarSelectPolicies.length === 0) return false;
-      return avatarSelectPolicies.some((p) => {
+  if (avatarSelectPolicies.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
+    avatarSelectPolicies.some((p) => {
       const lower = p.toLowerCase();
       const hasExplicitPublic =
         /to\s+public/.test(lower) || /to\s+anon/.test(lower);
       const hasNoToClause = !/\bto\s+\w+/.test(lower);
       const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
       return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
-      });
-    },
-  },
-  {
-    name: "documents bucket is fully private",
-    check: () => {
+    }),
+  ).toBe(true);
+});
+
+test("documents bucket is fully private", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const documentPolicies = policyBlocks.filter(
@@ -139,42 +128,41 @@ export const assertions: EvalAssertion[] = [
     p.toLowerCase().includes("storage.objects") &&
     p.toLowerCase().includes("documents"),
   );
-      if (documentPolicies.length === 0) return false;
-      return documentPolicies.every(
+  if (documentPolicies.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(
+    documentPolicies.every(
       (p) =>
         !/to\s+public/.test(p) &&
        !/to\s+anon/.test(p) &&
        (/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
-      );
-    },
-  },
-  {
-    name: "creates file_metadata table",
-    check: () => {
+    ),
+  ).toBe(true);
+});
+
+test("creates file_metadata table", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /file_metadata/.test(sql);
-    },
-  },
-  {
-    name: "file_metadata has FK to auth.users with CASCADE",
-    check: () => {
+  expect(/create\s+table/.test(sql) && /file_metadata/.test(sql)).toBe(true);
+});
+
+test("file_metadata has FK to auth.users with CASCADE", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "RLS enabled on file_metadata",
-    check: () =>
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("RLS enabled on file_metadata", () => {
+  expect(
     /alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "file_metadata policies use (select auth.uid())",
-    check: () => {
+  ).toBe(true);
+});
+
+test("file_metadata policies use (select auth.uid())", () => {
   const sql = getMigrationSQL();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const metadataPolicies = policyBlocks.filter((p) =>
@@ -185,50 +173,51 @@ export const assertions: EvalAssertion[] = [
     policy.includes("auth.uid()") &&
     !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
   ) {
-        return false;
+      expect(false).toBe(true);
+      return;
     }
   }
-      return true;
-    },
-  },
-  {
-    name: "uses timestamptz for time columns",
-    check: () => {
+  expect(true).toBe(true);
+});
+
+test("uses timestamptz for time columns", () => {
   const sql = getMigrationSQL().toLowerCase();
   if (
     !sql.includes("created_at") &&
     !sql.includes("updated_at") &&
     !sql.includes("uploaded_at")
   ) {
-        return true;
+    expect(true).toBe(true);
+    return;
   }
   const columnDefs = sql.match(
     /(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
   );
-      if (!columnDefs) return true;
-      return columnDefs.every((def) =>
+  if (!columnDefs) {
+    expect(true).toBe(true);
+    return;
+  }
+  expect(
+    columnDefs.every((def) =>
       /timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
-      );
-    },
-  },
-  {
-    name: "index on file_metadata user_id",
-    check: () => {
+    ),
+  ).toBe(true);
+});
+
+test("index on file_metadata user_id", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
+  expect(
     /create\s+index/.test(sql) &&
     /file_metadata/.test(sql) &&
-        /user_id/.test(sql)
-      );
-    },
-  },
-  {
-    name: "idempotent DDL",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality score",
-    check: () => {
+      /user_id/.test(sql),
+  ).toBe(true);
+});
+
+test("idempotent DDL", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("overall quality score", () => {
   const sql = getMigrationSQL().toLowerCase();
   const signals = [
     /insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
@@ -247,7 +236,5 @@ export const assertions: EvalAssertion[] = [
     /if\s+not\s+exists/,
     /create\s+table[\s\S]*?file_metadata/,
   ];
-      return signals.filter((r) => r.test(sql)).length >= 11;
-    },
-  },
-];
+  expect(signals.filter((r) => r.test(sql)).length >= 11).toBe(true);
+});

View File

@@ -0,0 +1,10 @@
export const expectedReferenceFiles = [
"storage-access-control.md",
"db-rls-mandatory.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
];

View File

@@ -1,5 +1,8 @@
 {
   "name": "storage-rls-user-folders",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }


@@ -1,46 +1,32 @@
-export const expectedReferenceFiles = [
-  "db-rls-mandatory.md",
-  "db-rls-policy-types.md",
-  "db-rls-common-mistakes.md",
-  "db-rls-performance.md",
-  "db-security-functions.md",
-  "db-schema-auth-fk.md",
-  "db-schema-timestamps.md",
-  "db-perf-indexes.md",
-  "db-migrations-idempotent.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
-
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates organizations table",
-    check: () =>
-      /create\s+table[\s\S]*?organizations/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "creates memberships table",
-    check: () =>
-      /create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "creates projects table",
-    check: () =>
-      /create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "enables RLS on all tables",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates organizations table", () => {
+  expect(
+    /create\s+table[\s\S]*?organizations/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("creates memberships table", () => {
+  expect(
+    /create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("creates projects table", () => {
+  expect(
+    /create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("enables RLS on all tables", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
     /alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
       sql,
     ) &&
@@ -49,130 +35,125 @@ export const assertions: EvalAssertion[] = [
       ) &&
       /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
         sql,
-      )
-      );
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
+      ),
+  ).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "org_id FK on projects",
-    check: () =>
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("org_id FK on projects", () => {
+  expect(
     /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
       getMigrationSQL().toLowerCase(),
     ),
-  },
-  {
-    name: "private schema created",
-    check: () =>
+  ).toBe(true);
+});
+
+test("private schema created", () => {
+  expect(
     /create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "security_definer helper function",
-    check: () => {
+  ).toBe(true);
+});
+
+test("security_definer helper function", () => {
   const sql = getMigrationSQL().toLowerCase();
-      return (
+  expect(
     /private\./.test(sql) &&
     /security\s+definer/.test(sql) &&
-        /set\s+search_path\s*=\s*''/.test(sql)
-      );
-    },
-  },
-  {
-    name: "policies use (select auth.uid())",
-    check: () => {
+      /set\s+search_path\s*=\s*''/.test(sql),
+  ).toBe(true);
+});
+
+test("policies use (select auth.uid())", () => {
   const sql = getMigrationSQL();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      if (policyBlocks.length === 0) return false;
+  if (policyBlocks.length === 0) {
+    expect(false).toBe(true);
+    return;
+  }
   for (const policy of policyBlocks) {
     if (
       policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
-        return false;
+      expect(false).toBe(true);
+      return;
    }
  }
-      return true;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
+  expect(true).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "index on membership lookup columns",
-    check: () => {
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("index on membership lookup columns", () => {
   const sql = getMigrationSQL().toLowerCase();
-      if (!/create\s+index/.test(sql)) return false;
+  if (!/create\s+index/.test(sql)) {
+    expect(false).toBe(true);
+    return;
+  }
   const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
-      return (
+  expect(
     indexBlocks.filter(
       (idx) =>
        idx.includes("user_id") ||
        idx.includes("org_id") ||
        idx.includes("organization_id"),
-        ).length >= 1
-      );
-    },
-  },
-  {
-    name: "uses timestamptz",
-    check: () => {
+    ).length >= 1,
+  ).toBe(true);
+});
+
+test("uses timestamptz", () => {
   const rawSql = getMigrationSQL().toLowerCase();
   const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
   if (
     sql.includes("created_at") ||
     sql.includes("updated_at") ||
    sql.includes("_at ")
  ) {
-        return !hasPlainTimestamp.test(sql);
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
  }
-      return true;
-    },
-  },
-  {
-    name: "idempotent DDL",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "stable or immutable on helper function",
-    check: () =>
-      /\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "delete policy restricted to owner role",
-    check: () => {
+});
+
+test("idempotent DDL", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("stable or immutable on helper function", () => {
+  expect(/\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("delete policy restricted to owner role", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const deletePolicy = policyBlocks.find(
     (p) =>
-        p.toLowerCase().includes("delete") &&
-        p.toLowerCase().includes("project"),
+      p.toLowerCase().includes("delete") && p.toLowerCase().includes("project"),
   );
-      if (!deletePolicy) return false;
-      return /owner|admin/.test(deletePolicy.toLowerCase());
-    },
-  },
-  {
-    name: "overall quality score",
-    check: () => {
+  if (!deletePolicy) {
+    expect(false).toBe(true);
+    return;
+  }
+  expect(/owner|admin/.test(deletePolicy.toLowerCase())).toBe(true);
+});
+
+test("overall quality score", () => {
   const sql = getMigrationSQL().toLowerCase();
   const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
   const signals = [
@@ -185,11 +166,9 @@ export const assertions: EvalAssertion[] = [
     /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
       sql,
     ),
-      /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
     /create\s+schema[\s\S]*?private/.test(sql),
-      /security\s+definer/.test(sql) &&
-        /set\s+search_path\s*=\s*''/.test(sql),
+    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
     /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
     policyBlocks.length > 0 &&
       policyBlocks.every((p) => /to\s+authenticated/.test(p)),
@@ -210,7 +189,5 @@ export const assertions: EvalAssertion[] = [
     /private\./.test(sql),
     /\bstable\b|\bimmutable\b/.test(sql),
   ];
-      return signals.filter(Boolean).length >= 11;
-    },
-  },
-];
+  expect(signals.filter(Boolean).length >= 11).toBe(true);
+});
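The "uses timestamptz" check hinges on one regex: a bare `timestamp` column is flagged via negative lookaheads, while `timestamptz` and `timestamp with time zone` pass. In isolation, with invented sample column definitions:

```typescript
// Negative lookaheads reject `timestamptz` and `timestamp with time zone`,
// so only a bare `timestamp` column matches (and fails the eval).
const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;

console.log(hasPlainTimestamp.test("created_at timestamp default now()")); // true
console.log(hasPlainTimestamp.test("created_at timestamptz default now()")); // false
console.log(hasPlainTimestamp.test("created_at timestamp with time zone")); // false
```

Note that `timestamptz` is already excluded by the trailing `\b` (the `tz` characters continue the word), so the first lookahead is belt-and-braces for a spaced `timestamp tz` spelling.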


@@ -0,0 +1,11 @@
export const expectedReferenceFiles = [
"db-rls-mandatory.md",
"db-rls-policy-types.md",
"db-rls-common-mistakes.md",
"db-rls-performance.md",
"db-security-functions.md",
"db-schema-auth-fk.md",
"db-schema-timestamps.md",
"db-perf-indexes.md",
"db-migrations-idempotent.md",
];


@@ -1,5 +1,8 @@
 {
   "name": "team-rls-security-definer",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }


@@ -0,0 +1,125 @@
import { execFileSync } from "node:child_process";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import type { ExperimentConfig } from "@vercel/agent-eval";
const __dirname = dirname(fileURLToPath(import.meta.url));
const EVALS_ROOT = resolve(__dirname, "..");
const REPO_ROOT = resolve(EVALS_ROOT, "..", "..");
const PROJECT_DIR = join(EVALS_ROOT, "project");
const SKILL_NAME = process.env.EVAL_SKILL ?? "supabase";
const SKILL_DIR = join(REPO_ROOT, "skills", SKILL_NAME);
const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const isBaseline = process.env.EVAL_BASELINE === "true";
// ---------------------------------------------------------------------------
// Skill file loader — reads all skill files to inject into the sandbox
// ---------------------------------------------------------------------------
function readSkillFiles(): Record<string, string> {
const files: Record<string, string> = {};
for (const name of ["SKILL.md", "AGENTS.md"]) {
const src = join(SKILL_DIR, name);
if (existsSync(src)) {
const content = readFileSync(src, "utf-8");
files[`.agents/skills/${SKILL_NAME}/${name}`] = content;
files[`.claude/skills/${SKILL_NAME}/${name}`] = content;
}
}
const refsDir = join(SKILL_DIR, "references");
if (existsSync(refsDir)) {
for (const f of readdirSync(refsDir)) {
const content = readFileSync(join(refsDir, f), "utf-8");
files[`.agents/skills/${SKILL_NAME}/references/${f}`] = content;
files[`.claude/skills/${SKILL_NAME}/references/${f}`] = content;
}
}
return files;
}
// ---------------------------------------------------------------------------
// DB reset — clears all user-created objects between scenarios
// ---------------------------------------------------------------------------
const RESET_SQL = `
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
NOTIFY pgrst, 'reload schema';
`.trim();
function resetDB(): void {
const dbUrl =
process.env.SUPABASE_DB_URL ??
"postgresql://postgres:postgres@127.0.0.1:54322/postgres";
execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
stdio: "inherit",
timeout: 30_000,
});
}
// ---------------------------------------------------------------------------
// Experiment configuration
// ---------------------------------------------------------------------------
const config: ExperimentConfig = {
agent: "claude-code",
model: "claude-sonnet-4-6",
runs: 1,
earlyExit: true,
timeout: 1800,
sandbox: "docker",
evals: process.env.EVAL_SCENARIO ?? "*",
setup: async (sandbox) => {
// 1. Reset DB for a clean slate
resetDB();
// 2. Seed supabase config so the agent can run `supabase db push`
const configPath = join(PROJECT_DIR, "supabase", "config.toml");
if (existsSync(configPath)) {
await sandbox.writeFiles({
"supabase/config.toml": readFileSync(configPath, "utf-8"),
});
}
// 3. Write MCP config pointing to host Supabase instance
await sandbox.writeFiles({
".mcp.json": JSON.stringify(
{
mcpServers: {
supabase: { type: "http", url: `${supabaseUrl}/mcp` },
},
},
null,
"\t",
),
});
// 4. Write eval-utils.ts into the workspace so EVAL.ts can import it
// (agent-eval only copies the fixture's own directory into the sandbox)
const evalUtilsPath = join(EVALS_ROOT, "evals", "eval-utils.ts");
if (existsSync(evalUtilsPath)) {
await sandbox.writeFiles({
"eval-utils.ts": readFileSync(evalUtilsPath, "utf-8"),
});
}
// 5. Install skill files (unless baseline mode)
if (!isBaseline) {
await sandbox.writeFiles(readSkillFiles());
}
},
};
export default config;
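readSkillFiles() above writes every skill file twice, once under `.agents/skills/` and once under `.claude/skills/`. That mirroring can be sketched on its own (`mirrorPaths` is an illustrative helper, not part of experiment.ts):

```typescript
// Each skill file is installed under both agent directories so either
// agent layout can discover it in the sandbox.
function mirrorPaths(skillName: string, fileName: string): string[] {
  return [
    `.agents/skills/${skillName}/${fileName}`,
    `.claude/skills/${skillName}/${fileName}`,
  ];
}

console.log(mirrorPaths("supabase", "SKILL.md"));
// [ '.agents/skills/supabase/SKILL.md', '.claude/skills/supabase/SKILL.md' ]
```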

File diff suppressed because it is too large


@@ -6,17 +6,19 @@
"license": "MIT", "license": "MIT",
"description": "Agent evaluation system for Supabase skills", "description": "Agent evaluation system for Supabase skills",
"scripts": { "scripts": {
"eval": "tsx src/runner.ts", "eval": "agent-eval",
"eval:upload": "BRAINTRUST_UPLOAD=true tsx src/runner.ts" "eval:dry": "agent-eval --dry",
"eval:smoke": "agent-eval --smoke",
"eval:upload": "tsx src/upload.ts"
}, },
"dependencies": { "dependencies": {
"@anthropic-ai/claude-code": "^2.1.49", "@vercel/agent-eval": "^0.9.2",
"braintrust": "^3.0.0", "braintrust": "^3.0.0"
"skills": "^1.4.0"
}, },
"devDependencies": { "devDependencies": {
"@types/node": "^20.10.0", "@types/node": "^20.10.0",
"tsx": "^4.7.0", "tsx": "^4.7.0",
"typescript": "^5.3.0" "typescript": "^5.3.0",
"vitest": "^4.0.18"
} }
} }

packages/evals/scripts/eval.sh Executable file

@@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
EVALS_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
PROJECT_DIR="$EVALS_DIR/project"
# ---------------------------------------------------------------------------
# Parse CLI arguments
# ---------------------------------------------------------------------------
AGENT_EVAL_ARGS=()
UPLOAD=true # Always upload to Braintrust by default
while [[ $# -gt 0 ]]; do
case "$1" in
--skill)
export EVAL_SKILL="$2"
shift 2
;;
--scenario)
export EVAL_SCENARIO="$2"
shift 2
;;
    --no-upload)
      UPLOAD=false
      shift
      ;;
    *)
      AGENT_EVAL_ARGS+=("$1")
      shift
      ;;
esac
done
echo "Starting Supabase..."
supabase start --exclude studio,imgproxy,mailpit --workdir "$PROJECT_DIR"
# Export keys so experiment.ts and vitest assertions can connect
eval "$(supabase status --output json --workdir "$PROJECT_DIR" | \
node -e "
const s = JSON.parse(require('fs').readFileSync('/dev/stdin','utf-8'));
console.log('export SUPABASE_URL=' + (s.API_URL || 'http://127.0.0.1:54321'));
console.log('export SUPABASE_ANON_KEY=' + s.ANON_KEY);
console.log('export SUPABASE_SERVICE_ROLE_KEY=' + s.SERVICE_ROLE_KEY);
console.log('export SUPABASE_DB_URL=' + (s.DB_URL || 'postgresql://postgres:postgres@127.0.0.1:54322/postgres'));
")"
trap 'echo "Stopping Supabase..."; supabase stop --no-backup --workdir "$PROJECT_DIR"' EXIT
echo "Running agent-eval..."
cd "$EVALS_DIR"
npx agent-eval "${AGENT_EVAL_ARGS[@]+"${AGENT_EVAL_ARGS[@]}"}"
# Upload results to Braintrust (default: true, skip with --no-upload)
if [ "$UPLOAD" = "true" ]; then
echo "Uploading results to Braintrust..."
npx tsx src/upload.ts
fi
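The `supabase status --output json` piping above can be exercised in isolation. A minimal sketch with a stubbed status payload (the real script reads live status; the JSON here is invented, and `node` is assumed to be on PATH):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stubbed payload standing in for `supabase status --output json`.
status_json='{"API_URL":"http://127.0.0.1:54321","ANON_KEY":"anon-key-stub"}'

# Same JSON-to-exports trick as eval.sh: node prints export statements,
# and eval applies them to the current shell.
eval "$(printf '%s' "$status_json" | node -e "
const s = JSON.parse(require('fs').readFileSync('/dev/stdin','utf-8'));
console.log('export SUPABASE_URL=' + (s.API_URL || 'http://127.0.0.1:54321'));
console.log('export SUPABASE_ANON_KEY=' + s.ANON_KEY);
")"

echo "$SUPABASE_URL"      # http://127.0.0.1:54321
echo "$SUPABASE_ANON_KEY" # anon-key-stub
```

The `eval "$(...)"` wrapper is what lets variables set in the subshell pipeline survive into the calling script.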


@@ -1,21 +0,0 @@
/**
* A single assertion to run against the agent's workspace output.
*
* Used by EVAL.ts files to declare what the agent's work should produce.
* The runner executes these in-process (no test framework required).
*/
export interface EvalAssertion {
/** Human-readable name shown in Braintrust and local output */
name: string;
/** Return true = pass, false/throw = fail */
check: () => boolean | Promise<boolean>;
/** Timeout in ms for async checks (default: no timeout) */
timeout?: number;
}
/** Result of running a single EvalAssertion */
export interface AssertionResult {
name: string;
passed: boolean;
error?: string;
}


@@ -1,372 +0,0 @@
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join, resolve } from "node:path";
import type { AssertionResult, EvalAssertion } from "./eval-types.js";
import { runAgent } from "./runner/agent.js";
import {
seedBraintrustDataset,
uploadToBraintrust,
} from "./runner/braintrust.js";
import { createResultDir, saveRunArtifacts } from "./runner/persist.js";
import { preflight } from "./runner/preflight.js";
import { listModifiedFiles, printSummary } from "./runner/results.js";
import { createWorkspace } from "./runner/scaffold.js";
import {
assertionsPassedScorer,
finalResultScorer,
referenceFilesUsageScorer,
skillUsageScorer,
} from "./runner/scorers.js";
import {
getKeys,
resetDB,
startSupabase,
stopSupabase,
} from "./runner/supabase-setup.js";
import {
buildTranscriptSummary,
type TranscriptSummary,
} from "./runner/transcript.js";
import type { EvalRunResult, EvalScenario } from "./types.js";
// ---------------------------------------------------------------------------
// Configuration from environment
// ---------------------------------------------------------------------------
const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";
const DEFAULT_SKILL = "supabase";
const AGENT_TIMEOUT = 30 * 60 * 1000; // 30 minutes
const model = process.env.EVAL_MODEL ?? DEFAULT_MODEL;
const skillName = process.env.EVAL_SKILL ?? DEFAULT_SKILL;
const scenarioFilter = process.env.EVAL_SCENARIO;
const isBaseline = process.env.EVAL_BASELINE === "true";
const skillEnabled = !isBaseline;
// Run-level timestamp shared across all scenarios in a single invocation
const runTimestamp = new Date()
.toISOString()
.replace(/[:.]/g, "-")
.replace("Z", "");
// ---------------------------------------------------------------------------
// Discover scenarios
// ---------------------------------------------------------------------------
function findEvalsDir(): string {
let dir = process.cwd();
for (let i = 0; i < 10; i++) {
const candidate = join(dir, "packages", "evals", "evals");
if (existsSync(candidate)) return candidate;
const parent = resolve(dir, "..");
if (parent === dir) break;
dir = parent;
}
throw new Error("Could not find packages/evals/evals/ directory");
}
function discoverScenarios(): EvalScenario[] {
const evalsDir = findEvalsDir();
const dirs = readdirSync(evalsDir, { withFileTypes: true }).filter(
(d) => d.isDirectory() && existsSync(join(evalsDir, d.name, "PROMPT.md")),
);
return dirs.map((d) => ({
id: d.name,
name: d.name,
tags: [],
}));
}
// ---------------------------------------------------------------------------
// Scenario threshold
// ---------------------------------------------------------------------------
function getPassThreshold(scenarioId: string): number | null {
const scenariosDir = join(findEvalsDir(), "..", "scenarios");
const scenarioFile = join(scenariosDir, `${scenarioId}.md`);
if (!existsSync(scenarioFile)) return null;
const content = readFileSync(scenarioFile, "utf-8");
const match = content.match(/\*\*pass_threshold:\*\*\s*(\d+)/);
return match ? Number.parseInt(match[1], 10) : null;
}
// ---------------------------------------------------------------------------
// In-process assertion runner (replaces vitest subprocess)
// ---------------------------------------------------------------------------
async function runAssertions(
assertions: EvalAssertion[],
): Promise<AssertionResult[]> {
return Promise.all(
assertions.map(async (a) => {
try {
let result: boolean;
if (a.timeout) {
const timeoutPromise = new Promise<never>((_, reject) =>
setTimeout(
() =>
reject(new Error(`Assertion timed out after ${a.timeout}ms`)),
a.timeout,
),
);
result = await Promise.race([
Promise.resolve(a.check()),
timeoutPromise,
]);
} else {
result = await Promise.resolve(a.check());
}
return { name: a.name, passed: Boolean(result) };
} catch (e) {
return { name: a.name, passed: false, error: String(e) };
}
}),
);
}
// ---------------------------------------------------------------------------
// Run a single eval
// ---------------------------------------------------------------------------
async function runEval(
scenario: EvalScenario,
skillEnabled: boolean,
): Promise<{
result: EvalRunResult;
transcript?: TranscriptSummary;
expectedReferenceFiles: string[];
}> {
const evalsDir = findEvalsDir();
const evalDir = join(evalsDir, scenario.id);
const variant = skillEnabled ? "with-skill" : "baseline";
console.log(`\n--- ${scenario.id} (${variant}) ---`);
// Load assertions and expected reference files from EVAL.ts
const evalFilePath = existsSync(join(evalDir, "EVAL.tsx"))
? join(evalDir, "EVAL.tsx")
: join(evalDir, "EVAL.ts");
const {
assertions = [] as EvalAssertion[],
expectedReferenceFiles = [] as string[],
} = await import(evalFilePath).catch(() => ({
assertions: [] as EvalAssertion[],
expectedReferenceFiles: [] as string[],
}));
const passThreshold = getPassThreshold(scenario.id);
const prompt = readFileSync(join(evalDir, "PROMPT.md"), "utf-8").trim();
// 1. Create isolated workspace
const { workspacePath, cleanup } = createWorkspace({ evalDir, skillEnabled });
console.log(` Workspace: ${workspacePath}`);
try {
// 2. Run the agent
console.log(` Running agent (${model})...`);
const startedAt = Date.now();
const agentResult = await runAgent({
cwd: workspacePath,
prompt,
model,
timeout: AGENT_TIMEOUT,
skillEnabled,
skillName: skillEnabled ? skillName : undefined,
});
console.log(
` Agent finished in ${(agentResult.duration / 1000).toFixed(1)}s`,
);
// 3. Run assertions in-process from the workspace directory so that
// eval-utils.ts helpers resolve paths relative to the workspace.
console.log(" Running assertions...");
const prevCwd = process.cwd();
process.chdir(workspacePath);
const assertionResults = await runAssertions(assertions).finally(() => {
process.chdir(prevCwd);
});
const passedCount = assertionResults.filter((a) => a.passed).length;
const totalCount = assertionResults.length;
const passed = passThreshold
? totalCount > 0 && passedCount >= passThreshold
: totalCount > 0 && passedCount === totalCount;
const pct =
totalCount > 0 ? ((passedCount / totalCount) * 100).toFixed(1) : "0.0";
const thresholdInfo = passThreshold
? `, threshold: ${((passThreshold / totalCount) * 100).toFixed(0)}%`
: "";
console.log(
` Assertions: ${passedCount}/${totalCount} passed (${pct}%${thresholdInfo})`,
);
// 4. Collect modified files
const filesModified = listModifiedFiles(workspacePath, evalDir);
// 5. Build transcript summary
const summary = buildTranscriptSummary(agentResult.events);
// 6. Run scorers
const skillScore = skillUsageScorer(summary, skillName);
const refScore = referenceFilesUsageScorer(summary, expectedReferenceFiles);
const assertScore = assertionsPassedScorer({
testsPassed: passedCount,
testsTotal: totalCount,
status: passed ? "passed" : "failed",
} as EvalRunResult);
const finalScore = finalResultScorer({
status: passed ? "passed" : "failed",
testsPassed: passedCount,
testsTotal: totalCount,
passThreshold: passThreshold ?? undefined,
} as EvalRunResult);
const result: EvalRunResult = {
scenario: scenario.id,
agent: "claude-code",
model,
skillEnabled,
status: passed ? "passed" : "failed",
duration: agentResult.duration,
agentOutput: agentResult.output,
testsPassed: passedCount,
testsTotal: totalCount,
passThreshold: passThreshold ?? undefined,
assertionResults,
filesModified,
toolCallCount: summary.toolCalls.length,
costUsd: summary.totalCostUsd ?? undefined,
prompt,
startedAt,
durationApiMs: summary.totalDurationApiMs,
totalInputTokens: summary.totalInputTokens,
totalOutputTokens: summary.totalOutputTokens,
totalCacheReadTokens: summary.totalCacheReadTokens,
totalCacheCreationTokens: summary.totalCacheCreationTokens,
modelUsage: summary.modelUsage,
toolErrorCount: summary.toolErrorCount,
permissionDenialCount: summary.permissionDenialCount,
loadedSkills: summary.skills,
referenceFilesRead: summary.referenceFilesRead,
scores: {
skillUsage: skillScore.score,
referenceFilesUsage: refScore.score,
assertionsPassed: assertScore.score,
finalResult: finalScore.score,
},
};
// 7. Persist results
const resultDir = createResultDir(runTimestamp, scenario.id, variant);
result.resultsDir = resultDir;
saveRunArtifacts({
resultDir,
rawTranscript: agentResult.rawTranscript,
assertionResults,
result,
transcriptSummary: summary,
});
return { result, transcript: summary, expectedReferenceFiles };
} catch (error) {
const err = error as Error;
return {
result: {
scenario: scenario.id,
agent: "claude-code",
model,
skillEnabled,
status: "error",
duration: 0,
agentOutput: "",
testsPassed: 0,
testsTotal: 0,
filesModified: [],
error: err.message,
},
expectedReferenceFiles: [],
};
} finally {
cleanup();
}
}
// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------
async function main() {
preflight();
console.log("Supabase Skills Evals");
console.log(`Model: ${model}`);
console.log(`Mode: ${isBaseline ? "baseline (no skills)" : "with skills"}`);
let scenarios = discoverScenarios();
if (scenarioFilter) {
scenarios = scenarios.filter((s) => s.id === scenarioFilter);
if (scenarios.length === 0) {
console.error(`Scenario not found: ${scenarioFilter}`);
process.exit(1);
}
}
console.log(`Scenarios: ${scenarios.map((s) => s.id).join(", ")}`);
// Start the shared Supabase instance once for all scenarios.
startSupabase();
const keys = getKeys();
// Inject keys into process.env so assertions can connect to the real DB.
process.env.SUPABASE_URL = keys.apiUrl;
process.env.SUPABASE_ANON_KEY = keys.anonKey;
process.env.SUPABASE_SERVICE_ROLE_KEY = keys.serviceRoleKey;
process.env.SUPABASE_DB_URL = keys.dbUrl;
const results: EvalRunResult[] = [];
const transcripts = new Map<string, TranscriptSummary>();
const expectedRefFiles = new Map<string, string[]>();
try {
for (const scenario of scenarios) {
// Reset the database before each scenario for a clean slate.
console.log(`\n Resetting DB for ${scenario.id}...`);
resetDB(keys.dbUrl);
const { result, transcript, expectedReferenceFiles } = await runEval(
scenario,
skillEnabled,
);
results.push(result);
if (transcript) {
transcripts.set(result.scenario, transcript);
}
expectedRefFiles.set(result.scenario, expectedReferenceFiles);
}
} finally {
stopSupabase();
}
// Use the results dir from the first result (all share the same timestamp)
const resultsDir = results.find((r) => r.resultsDir)?.resultsDir;
printSummary(results, resultsDir);
console.log("\nUploading to Braintrust...");
await seedBraintrustDataset(results, expectedRefFiles);
await uploadToBraintrust(results, {
model,
skillEnabled,
runTimestamp,
transcripts,
expectedRefFiles,
});
}
main().catch((err) => {
console.error("Fatal error:", err);
process.exit(1);
});
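The timeout handling in runAssertions() above is a Promise.race between the check and a rejecting timer. A self-contained sketch of that pattern (`runWithTimeout` is an illustrative name, not part of the deleted runner):

```typescript
// Race a check against a timer; a timeout surfaces as a failed result
// rather than an exception, mirroring runAssertions() above.
async function runWithTimeout(
  check: () => Promise<boolean>,
  timeoutMs: number,
): Promise<{ passed: boolean; error?: string }> {
  try {
    const timer = new Promise<never>((_, reject) =>
      setTimeout(
        () => reject(new Error(`Assertion timed out after ${timeoutMs}ms`)),
        timeoutMs,
      ),
    );
    const passed = await Promise.race([check(), timer]);
    return { passed: Boolean(passed) };
  } catch (e) {
    return { passed: false, error: String(e) };
  }
}

// A fast check passes; a check that never settles is marked failed.
runWithTimeout(async () => true, 50).then((r) => console.log(r.passed)); // true
runWithTimeout(() => new Promise<boolean>(() => {}), 50).then((r) =>
  console.log(r.passed), // false
);
```

Because Promise.race subscribes to both promises, the timer's late rejection after a successful check is still observed and does not trigger an unhandled-rejection crash.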


@@ -1,145 +0,0 @@
import { spawn } from "node:child_process";
import { resolveClaudeBin } from "./preflight.js";
import {
extractFinalOutput,
parseStreamJsonOutput,
type TranscriptEvent,
} from "./transcript.js";
export interface AgentRunResult {
/** Extracted final text output (backward-compatible). */
output: string;
duration: number;
/** Raw NDJSON transcript string from stream-json. */
rawTranscript: string;
/** Parsed transcript events. */
events: TranscriptEvent[];
}
/**
* Invoke Claude Code in print mode as a subprocess.
*
* Uses --output-format stream-json to capture structured NDJSON events
* including tool calls, results, and reasoning steps.
*
* The agent operates in the workspace directory and can read/write files,
* and has access to the local Supabase MCP server so it can apply migrations
* and query the real database. --strict-mcp-config ensures only the local
* Supabase instance is reachable — no host MCP servers leak in.
*
* --setting-sources project,local prevents skills from the user's global
* ~/.agents/skills/ from leaking into the eval environment.
*
* When skillEnabled, --agents injects the target skill directly into the
* agent's context, guaranteeing it is present (not just discoverable).
*/
export async function runAgent(opts: {
cwd: string;
prompt: string;
model: string;
timeout: number;
skillEnabled: boolean;
/** Skill name to inject via --agents (e.g. "supabase"). Used when skillEnabled. */
skillName?: string;
}): Promise<AgentRunResult> {
const start = Date.now();
// Point the agent's MCP config at the shared local Supabase instance.
// --strict-mcp-config ensures host .mcp.json is ignored entirely.
const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const mcpConfig = JSON.stringify({
mcpServers: {
supabase: {
type: "http",
url: `${supabaseUrl}/mcp`,
},
},
});
const args = [
"-p", // Print mode (non-interactive)
"--verbose",
"--output-format",
"stream-json",
"--model",
opts.model,
"--no-session-persistence",
"--dangerously-skip-permissions",
"--tools",
"Edit,Write,Bash,Read,Glob,Grep",
"--mcp-config",
mcpConfig,
"--strict-mcp-config",
// Prevent skills from the user's global ~/.agents/skills/ from leaking
// into the eval environment. Only workspace (project) and local sources
// are loaded, so the eval sees only what was explicitly installed.
"--setting-sources",
"project,local",
];
if (opts.skillEnabled && opts.skillName) {
// Inject the target skill directly into the agent context via --agents.
// This guarantees the skill is embedded in the subagent's context at
// startup (not just available as a slash command).
const agentsDef = JSON.stringify({
main: {
description: `Supabase developer agent with ${opts.skillName} skill`,
skills: [opts.skillName],
},
});
args.push("--agents", agentsDef);
} else if (!opts.skillEnabled) {
// Baseline runs: disable all skills so the agent relies on innate knowledge
args.push("--disable-slash-commands");
}
const env = { ...process.env };
// Remove all Claude-related env vars to avoid nested-session detection
for (const key of Object.keys(env)) {
if (key === "CLAUDECODE" || key.startsWith("CLAUDE_")) {
delete env[key];
}
}
const claudeBin = resolveClaudeBin();
return new Promise<AgentRunResult>((resolve) => {
const child = spawn(claudeBin, args, {
cwd: opts.cwd,
env,
stdio: ["pipe", "pipe", "pipe"],
});
// Pipe prompt via stdin and close — this is the standard way to
// pass multi-line prompts to `claude -p`.
child.stdin.write(opts.prompt);
child.stdin.end();
let stdout = "";
let stderr = "";
child.stdout.on("data", (d: Buffer) => {
stdout += d.toString();
});
child.stderr.on("data", (d: Buffer) => {
stderr += d.toString();
});
const timer = setTimeout(() => {
child.kill();
}, opts.timeout);
child.on("close", () => {
clearTimeout(timer);
const rawTranscript = stdout || stderr;
const events = parseStreamJsonOutput(rawTranscript);
const output = extractFinalOutput(events) || rawTranscript;
resolve({
output,
duration: Date.now() - start,
rawTranscript,
events,
});
});
});
}


@@ -1,295 +0,0 @@
import assert from "node:assert";
import { init, initDataset, initLogger, type Logger } from "braintrust";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
/**
* Initialize a Braintrust project logger for real-time per-scenario logging.
* Call this once at startup and pass the logger to logScenarioToLogger().
*/
export function initBraintrustLogger(): Logger<true> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
return initLogger({
projectId: process.env.BRAINTRUST_PROJECT_ID,
asyncFlush: true,
});
}
/**
* Log a single scenario result to the Braintrust project logger in real-time.
* This runs alongside the experiment upload, giving immediate visibility in
* the project log as each scenario completes.
*/
export function logScenarioToLogger(
logger: Logger<true>,
r: EvalRunResult,
transcript?: TranscriptSummary,
): void {
const scores: Record<string, number> = {
skill_usage: r.scores?.skillUsage ?? 0,
reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
assertions_passed: r.scores?.assertionsPassed ?? 0,
final_result: r.scores?.finalResult ?? 0,
};
const metadata: Record<string, unknown> = {
agent: r.agent,
model: r.model,
skillEnabled: r.skillEnabled,
testsPassed: r.testsPassed,
testsTotal: r.testsTotal,
toolCallCount: r.toolCallCount ?? 0,
contextWindowUsed:
(r.totalInputTokens ?? 0) +
(r.totalCacheReadTokens ?? 0) +
(r.totalCacheCreationTokens ?? 0),
totalOutputTokens: r.totalOutputTokens,
modelUsage: r.modelUsage,
toolErrorCount: r.toolErrorCount,
permissionDenialCount: r.permissionDenialCount,
loadedSkills: r.loadedSkills,
referenceFilesRead: r.referenceFilesRead,
...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
...(r.error ? { error: r.error } : {}),
};
const spanOptions = r.startedAt
? { name: r.scenario, startTime: r.startedAt / 1000 }
: { name: r.scenario };
if (transcript && transcript.toolCalls.length > 0) {
logger.traced((span) => {
span.log({
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
},
output: {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
},
expected: { testsTotal: r.testsTotal },
scores,
metadata,
});
for (const tc of transcript.toolCalls) {
span.traced(
(childSpan) => {
childSpan.log({
input: { tool: tc.tool, args: tc.input },
output: {
preview: tc.outputPreview,
isError: tc.isError,
...(tc.stderr ? { stderr: tc.stderr } : {}),
},
metadata: { toolUseId: tc.toolUseId },
});
},
{ name: `tool:${tc.tool}` },
);
}
}, spanOptions);
} else {
logger.traced((span) => {
span.log({
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
},
output: {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
},
expected: { testsTotal: r.testsTotal },
scores,
metadata,
});
}, spanOptions);
}
}
/**
* Seed a Braintrust dataset with one row per scenario.
*
* Uses scenario.id as the stable row ID so re-seeding is idempotent.
* Each row stores the prompt and expected assertions/reference files,
* giving Braintrust a stable baseline to track per-scenario score trends
* across experiment runs.
*/
export async function seedBraintrustDataset(
results: EvalRunResult[],
expectedRefFiles: Map<string, string[]>,
): Promise<void> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const dataset = initDataset({
projectId: process.env.BRAINTRUST_PROJECT_ID,
dataset: "supabase-skill-scenarios",
});
for (const r of results) {
dataset.insert({
id: r.scenario,
input: {
scenario: r.scenario,
prompt: r.prompt ?? "",
},
expected: {
testsTotal: r.testsTotal,
passThreshold: r.passThreshold ?? 1.0,
expectedReferenceFiles: expectedRefFiles.get(r.scenario) ?? [],
},
metadata: { scenario: r.scenario },
});
}
await dataset.flush();
console.log("Braintrust dataset seeded: supabase-skill-scenarios");
}
/**
* Upload eval results to Braintrust as an experiment.
*
* Each EvalRunResult becomes a row in the experiment with:
* - input: scenario ID, prompt content, skillEnabled flag
* - output: status, agent output, files modified, assertion results
* - expected: total tests, pass threshold
* - scores: skill_usage, reference_files_usage, assertions_passed, final_result
* - metadata: agent, model, skillEnabled, test counts, tool calls, context window, output tokens, model usage, errors, cost
* - spans: one child span per agent tool call (when transcript available)
* - datasetRecordId: links this row to the dataset row for per-scenario tracking
*/
export async function uploadToBraintrust(
results: EvalRunResult[],
opts: {
model: string;
skillEnabled: boolean;
runTimestamp: string;
transcripts: Map<string, TranscriptSummary>;
expectedRefFiles: Map<string, string[]>;
},
): Promise<void> {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const variant = opts.skillEnabled ? "skill" : "baseline";
const experiment = await init({
projectId: process.env.BRAINTRUST_PROJECT_ID,
experiment: `${opts.model}-${variant}-${opts.runTimestamp}`,
baseExperiment: process.env.BRAINTRUST_BASE_EXPERIMENT ?? undefined,
metadata: {
model: opts.model,
skillEnabled: opts.skillEnabled,
runTimestamp: opts.runTimestamp,
scenarioCount: results.length,
},
});
for (const r of results) {
const transcript = opts.transcripts.get(r.scenario);
const scores: Record<string, number> = {
skill_usage: r.scores?.skillUsage ?? 0,
reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
assertions_passed: r.scores?.assertionsPassed ?? 0,
final_result: r.scores?.finalResult ?? 0,
};
const input = {
scenario: r.scenario,
prompt: r.prompt ?? "",
skillEnabled: r.skillEnabled,
};
const output = {
status: r.status,
agentOutput: r.agentOutput,
filesModified: r.filesModified,
assertionResults: r.assertionResults,
};
    const expected = {
      testsTotal: r.testsTotal,
      passThreshold: r.passThreshold ?? 1.0,
    };
const metadata: Record<string, unknown> = {
agent: r.agent,
model: r.model,
skillEnabled: r.skillEnabled,
testsPassed: r.testsPassed,
testsTotal: r.testsTotal,
toolCallCount: r.toolCallCount ?? 0,
contextWindowUsed:
(r.totalInputTokens ?? 0) +
(r.totalCacheReadTokens ?? 0) +
(r.totalCacheCreationTokens ?? 0),
totalOutputTokens: r.totalOutputTokens,
modelUsage: r.modelUsage,
toolErrorCount: r.toolErrorCount,
permissionDenialCount: r.permissionDenialCount,
loadedSkills: r.loadedSkills,
referenceFilesRead: r.referenceFilesRead,
...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
...(r.error ? { error: r.error } : {}),
};
const spanOptions = r.startedAt
? { name: r.scenario, startTime: r.startedAt / 1000 }
: { name: r.scenario };
if (transcript && transcript.toolCalls.length > 0) {
experiment.traced((span) => {
span.log({
input,
output,
expected,
scores,
metadata,
datasetRecordId: r.scenario,
});
for (const tc of transcript.toolCalls) {
span.traced(
(childSpan) => {
childSpan.log({
input: { tool: tc.tool, args: tc.input },
output: {
preview: tc.outputPreview,
isError: tc.isError,
...(tc.stderr ? { stderr: tc.stderr } : {}),
},
metadata: { toolUseId: tc.toolUseId },
});
},
{ name: `tool:${tc.tool}` },
);
}
}, spanOptions);
} else {
experiment.traced((span) => {
span.log({
input,
output,
expected,
scores,
metadata,
datasetRecordId: r.scenario,
});
}, spanOptions);
}
}
const summary = await experiment.summarize();
console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
await experiment.close();
}


@@ -1,61 +0,0 @@
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { AssertionResult } from "../eval-types.js";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/** Resolve the base directory for storing results.
* Supports EVAL_RESULTS_DIR override for Docker volume mounts. */
function resultsBase(): string {
if (process.env.EVAL_RESULTS_DIR) {
return process.env.EVAL_RESULTS_DIR;
}
// Default: packages/evals/results (__dirname is packages/evals/src/runner)
return join(__dirname, "..", "..", "results");
}
/** Create the results directory for a single scenario run. Returns the path. */
export function createResultDir(
runTimestamp: string,
scenarioId: string,
variant: "with-skill" | "baseline",
): string {
const dir = join(resultsBase(), runTimestamp, scenarioId, variant);
mkdirSync(dir, { recursive: true });
return dir;
}
/** Save all artifacts for a single eval run. */
export function saveRunArtifacts(opts: {
resultDir: string;
rawTranscript: string;
assertionResults: AssertionResult[];
result: EvalRunResult;
transcriptSummary: TranscriptSummary;
}): void {
writeFileSync(
join(opts.resultDir, "transcript.jsonl"),
opts.rawTranscript,
"utf-8",
);
writeFileSync(
join(opts.resultDir, "assertions.json"),
JSON.stringify(opts.assertionResults, null, 2),
"utf-8",
);
writeFileSync(
join(opts.resultDir, "result.json"),
JSON.stringify(
{ ...opts.result, transcript: opts.transcriptSummary },
null,
2,
),
"utf-8",
);
}


@@ -1,126 +0,0 @@
import { execFileSync } from "node:child_process";
import { accessSync, constants, existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
/** Detect if we're running inside the eval Docker container. */
export function isRunningInDocker(): boolean {
if (process.env.IN_DOCKER === "true") return true;
try {
accessSync("/.dockerenv", constants.F_OK);
return true;
} catch {
return false;
}
}
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/**
* Resolve the `claude` binary path.
*
* Looks in the following order:
* 1. Local node_modules/.bin/claude (installed via @anthropic-ai/claude-code)
* 2. Global `claude` on PATH
*
* Throws with an actionable message when neither is found.
*/
export function resolveClaudeBin(): string {
// packages/evals/node_modules/.bin/claude
const localBin = join(
__dirname,
"..",
"..",
"node_modules",
".bin",
"claude",
);
if (existsSync(localBin)) {
return localBin;
}
// Fall back to PATH
try {
execFileSync("claude", ["--version"], {
stdio: "ignore",
timeout: 10_000,
});
return "claude";
} catch {
throw new Error(
[
"claude CLI not found.",
"",
"Install it in one of these ways:",
" npm install (uses @anthropic-ai/claude-code from package.json)",
" npm i -g @anthropic-ai/claude-code",
"",
"Ensure ANTHROPIC_API_KEY is set in the environment.",
].join("\n"),
);
}
}
/**
* Verify the host environment has everything needed before spending
* API credits on an eval run.
*
* Checks: Node >= 20, Docker running, supabase CLI available, claude CLI available, API key set.
*/
export function preflight(): void {
const errors: string[] = [];
// Node.js >= 20
const [major] = process.versions.node.split(".").map(Number);
if (major < 20) {
errors.push(`Node.js >= 20 required (found ${process.versions.node})`);
}
// Docker daemon must be running — needed by the supabase CLI to manage containers.
// Required whether running locally or inside the eval container (socket-mounted).
try {
execFileSync("docker", ["info"], { stdio: "ignore", timeout: 10_000 });
} catch {
errors.push(
isRunningInDocker()
? "Docker daemon not reachable inside container. Mount the socket: -v /var/run/docker.sock:/var/run/docker.sock"
: "Docker is not running (required by supabase CLI)",
);
}
// Supabase CLI available
try {
execFileSync("supabase", ["--version"], {
stdio: "ignore",
timeout: 10_000,
});
} catch {
errors.push(
"supabase CLI not found. Install it: https://supabase.com/docs/guides/cli/getting-started",
);
}
// Claude CLI available
try {
resolveClaudeBin();
} catch (err) {
errors.push((err as Error).message);
}
// API key
if (!process.env.ANTHROPIC_API_KEY) {
errors.push(
"ANTHROPIC_API_KEY is not set. Claude Code requires this for authentication.",
);
}
if (errors.length > 0) {
console.error("Preflight checks failed:\n");
for (const e of errors) {
console.error(` - ${e}`);
}
console.error("");
process.exit(1);
}
}


@@ -1,84 +0,0 @@
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";
import type { EvalRunResult } from "../types.js";
/**
* List files created or modified by the agent in the workspace.
* Compares against the original eval directory to find new files.
*/
export function listModifiedFiles(
workspacePath: string,
originalEvalDir: string,
): string[] {
const modified: string[] = [];
function walk(dir: string, prefix: string) {
const entries = readdirSync(dir, { withFileTypes: true });
for (const entry of entries) {
if (
entry.name === "node_modules" ||
entry.name === ".agents" ||
entry.name === ".claude" ||
entry.name === "EVAL.ts" ||
entry.name === "EVAL.tsx"
)
continue;
const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
const fullPath = join(dir, entry.name);
if (entry.isDirectory()) {
walk(fullPath, relPath);
} else {
// Check if file is new (not in original eval dir)
const originalPath = join(originalEvalDir, relPath);
try {
statSync(originalPath);
} catch {
// File doesn't exist in original — it was created by the agent
modified.push(relPath);
}
}
}
}
walk(workspacePath, "");
return modified;
}
/** Print a summary table of eval results. */
export function printSummary(
results: EvalRunResult[],
resultsDir?: string,
): void {
console.log("\n=== Eval Results ===\n");
for (const r of results) {
const icon = r.status === "passed" ? "PASS" : "FAIL";
const skill = r.skillEnabled ? "with-skill" : "baseline";
const pct =
r.testsTotal > 0
? ((r.testsPassed / r.testsTotal) * 100).toFixed(1)
: "0.0";
const thresholdInfo =
r.passThreshold && r.testsTotal > 0
? `, threshold: ${((r.passThreshold / r.testsTotal) * 100).toFixed(0)}%`
: "";
console.log(
`[${icon}] ${r.scenario} | ${r.model} | ${skill} | ${(r.duration / 1000).toFixed(1)}s | ${pct}% (${r.testsPassed}/${r.testsTotal}${thresholdInfo})`,
);
if (r.filesModified.length > 0) {
console.log(` Files: ${r.filesModified.join(", ")}`);
}
if (r.status === "error" && r.error) {
console.log(` Error: ${r.error}`);
}
}
const passed = results.filter((r) => r.status === "passed").length;
console.log(`\nTotal: ${passed}/${results.length} passed`);
if (resultsDir) {
console.log(`\nResults saved to: ${resultsDir}`);
}
}
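
The original-vs-workspace comparison performed by `listModifiedFiles` can be sketched self-contained with throwaway temp directories; `newFiles` and the file names are illustrative stand-ins for the walk above (minus the skip-list):

```typescript
import {
  cpSync,
  mkdtempSync,
  readdirSync,
  rmSync,
  statSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Build an "original" eval dir and a workspace copy, then add one file to
// the workspace. Only that file should be reported as new.
const original = mkdtempSync(join(tmpdir(), "eval-orig-"));
const workspace = mkdtempSync(join(tmpdir(), "eval-ws-"));
writeFileSync(join(original, "PROMPT.md"), "task");
cpSync(original, workspace, { recursive: true });
writeFileSync(join(workspace, "migration.sql"), "create table t (id int);");

function newFiles(dir: string, base: string, prefix = ""): string[] {
  const out: string[] = [];
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    const rel = prefix ? `${prefix}/${entry.name}` : entry.name;
    const full = join(dir, entry.name);
    if (entry.isDirectory()) {
      out.push(...newFiles(full, base, rel));
    } else {
      try {
        statSync(join(base, rel));
      } catch {
        out.push(rel); // absent in the original, so created by the agent
      }
    }
  }
  return out;
}

const modified = newFiles(workspace, original);
console.log(modified);
rmSync(original, { recursive: true, force: true });
rmSync(workspace, { recursive: true, force: true });
```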


@@ -1,74 +0,0 @@
import {
cpSync,
existsSync,
mkdirSync,
mkdtempSync,
readdirSync,
rmSync,
writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { EVAL_PROJECT_DIR } from "./supabase-setup.js";
/**
* Create an isolated workspace for an eval run.
*
* 1. Copy the eval directory to a temp folder (excluding EVAL.ts/EVAL.tsx)
* 2. Seed with the eval project's supabase/config.toml
*
* Skills are injected via the --agents flag in agent.ts (not installed into
* the workspace here). Combined with --setting-sources project,local, this
* prevents host ~/.agents/skills/ from leaking into the eval environment.
*
* Returns the path to the workspace and a cleanup function.
*/
export function createWorkspace(opts: {
evalDir: string;
skillEnabled: boolean;
}): { workspacePath: string; cleanup: () => void } {
const workspacePath = mkdtempSync(join(tmpdir(), "supabase-eval-"));
// Copy eval directory, excluding EVAL.ts/EVAL.tsx (hidden from agent)
const entries = readdirSync(opts.evalDir, { withFileTypes: true });
for (const entry of entries) {
if (entry.name === "EVAL.ts" || entry.name === "EVAL.tsx") continue;
const src = join(opts.evalDir, entry.name);
const dest = join(workspacePath, entry.name);
cpSync(src, dest, { recursive: true });
}
// Add .mcp.json so the agent connects to the local Supabase MCP server
writeFileSync(
join(workspacePath, ".mcp.json"),
JSON.stringify(
{
mcpServers: {
"local-supabase": {
type: "http",
url: "http://localhost:54321/mcp",
},
},
},
null,
"\t",
),
);
// Seed the workspace with the eval project's supabase/config.toml so the
// agent can run `supabase db push` against the shared local instance without
// needing to run `supabase init` or `supabase start` first.
const projectConfigSrc = join(EVAL_PROJECT_DIR, "supabase", "config.toml");
if (existsSync(projectConfigSrc)) {
const destSupabaseDir = join(workspacePath, "supabase");
mkdirSync(join(destSupabaseDir, "migrations"), { recursive: true });
cpSync(projectConfigSrc, join(destSupabaseDir, "config.toml"));
}
return {
workspacePath,
cleanup: () => {
rmSync(workspacePath, { recursive: true, force: true });
},
};
}


@@ -1,94 +0,0 @@
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";
export interface ScoreResult {
name: string;
  /** Score in the range 0.0 to 1.0 */
score: number;
metadata?: Record<string, unknown>;
}
/**
* skillUsageScorer — 1 if the target skill was in the agent's context, 0 otherwise.
*
* Detected via the `skills` array in the system init event of the NDJSON transcript.
* Combined with `--setting-sources project,local` in agent.ts, this array is clean
* (no host skill leakage), so its presence is a reliable signal.
*/
export function skillUsageScorer(
transcript: TranscriptSummary,
skillName: string,
): ScoreResult {
const loaded = transcript.skills.includes(skillName);
return {
name: "skill_usage",
score: loaded ? 1 : 0,
metadata: {
loadedSkills: transcript.skills,
targetSkill: skillName,
},
};
}
/**
* referenceFilesUsageScorer — fraction of expected reference files actually read.
*
* Detected via Read tool calls whose file_path matches "/.agents/skills/*\/references/".
* The expectedReferenceFiles list is declared in each EVAL.ts and should match the
* "Skill References Exercised" table in the corresponding scenarios/*.md file.
*/
export function referenceFilesUsageScorer(
transcript: TranscriptSummary,
expectedReferenceFiles: string[],
): ScoreResult {
if (expectedReferenceFiles.length === 0) {
return {
name: "reference_files_usage",
score: 1,
metadata: { skipped: true },
};
}
const read = transcript.referenceFilesRead;
const hits = expectedReferenceFiles.filter((f) => read.includes(f)).length;
return {
name: "reference_files_usage",
score: hits / expectedReferenceFiles.length,
metadata: {
expected: expectedReferenceFiles,
read,
hits,
total: expectedReferenceFiles.length,
},
};
}
/**
* assertionsPassedScorer — ratio of assertions passed vs total.
*/
export function assertionsPassedScorer(result: EvalRunResult): ScoreResult {
const score =
result.testsTotal > 0 ? result.testsPassed / result.testsTotal : 0;
return {
name: "assertions_passed",
score,
metadata: { passed: result.testsPassed, total: result.testsTotal },
};
}
/**
* finalResultScorer — 1 if the agent met the pass threshold, 0 otherwise.
*
* A result is "passed" when assertionsPassed >= passThreshold (set per scenario
* in scenarios/*.md). This is the binary outcome used for Braintrust comparisons.
*/
export function finalResultScorer(result: EvalRunResult): ScoreResult {
return {
name: "final_result",
score: result.status === "passed" ? 1 : 0,
metadata: {
testsPassed: result.testsPassed,
testsTotal: result.testsTotal,
passThreshold: result.passThreshold,
},
};
}
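
The fraction that `referenceFilesUsageScorer` computes can be sketched standalone; `referenceFilesScore` and the markdown file names below are illustrative, not real skill references:

```typescript
// Standalone sketch of the reference-files fraction computed above.
function referenceFilesScore(expected: string[], read: string[]): number {
  if (expected.length === 0) return 1; // nothing expected: vacuously perfect
  const hits = expected.filter((f) => read.includes(f)).length;
  return hits / expected.length;
}

const score = referenceFilesScore(
  ["rls-policies.md", "migrations.md"],
  ["rls-policies.md", "unrelated.md"],
);
console.log(score); // one of the two expected files was read
```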


@@ -1,108 +0,0 @@
import { execFileSync } from "node:child_process";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";
const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);
/**
* Directory that contains the eval Supabase project (supabase/config.toml).
* The runner starts the shared Supabase instance from here.
* Agent workspaces get a copy of supabase/config.toml so they can
* connect to the same running instance via `supabase db push`.
*/
export const EVAL_PROJECT_DIR = resolve(__dirname, "..", "..", "project");
export interface SupabaseKeys {
apiUrl: string;
dbUrl: string;
anonKey: string;
serviceRoleKey: string;
}
/**
* Start the local Supabase stack for the eval project.
* Idempotent: if already running, the CLI prints a message and exits 0.
*/
export function startSupabase(): void {
  console.log("Starting Supabase...");
execFileSync("supabase", ["start", "--exclude", "studio,imgproxy,mailpit"], {
cwd: EVAL_PROJECT_DIR,
stdio: "inherit",
timeout: 5 * 60 * 1000, // 5 min for first image pull
});
}
// SQL that clears all user-created objects and migration history between scenarios.
// Avoids `supabase db reset` which restarts containers and triggers flaky health checks.
const RESET_SQL = `
-- Drop and recreate public schema (removes all user tables/views/functions)
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
-- Clear migration history so the next agent's db push starts from a clean slate
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
-- Notify PostgREST to reload its schema cache
NOTIFY pgrst, 'reload schema';
`.trim();
/**
* Reset the database to a clean state between scenarios.
*
* Uses direct SQL via psql instead of `supabase db reset` to avoid the
* container-restart cycle and its flaky health checks. This drops the
* public schema (all user tables) and clears the migration history so
* `supabase db push` in agent workspaces always starts fresh.
*/
export function resetDB(dbUrl: string): void {
execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
stdio: "inherit",
timeout: 30 * 1000,
});
}
/**
* Stop all Supabase containers for the eval project.
* Called once after all scenarios complete.
*/
export function stopSupabase(): void {
  console.log("Stopping Supabase...");
execFileSync("supabase", ["stop", "--no-backup"], {
cwd: EVAL_PROJECT_DIR,
stdio: "inherit",
timeout: 60 * 1000,
});
}
/**
* Read the running instance's API URL and JWT keys.
* Returns values that the runner injects into process.env so EVAL.ts
* tests can connect to the real database.
*/
export function getKeys(): SupabaseKeys {
const raw = execFileSync("supabase", ["status", "--output", "json"], {
cwd: EVAL_PROJECT_DIR,
timeout: 30 * 1000,
}).toString();
const status = JSON.parse(raw) as Record<string, string>;
const apiUrl = status.API_URL ?? "http://127.0.0.1:54321";
const dbUrl =
status.DB_URL ?? "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
const anonKey = status.ANON_KEY ?? "";
const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";
if (!anonKey || !serviceRoleKey) {
throw new Error(
`supabase status returned missing keys. Raw output:\n${raw}`,
);
}
return { apiUrl, dbUrl, anonKey, serviceRoleKey };
}


@@ -1,301 +0,0 @@
import { basename } from "node:path";
export interface TranscriptEvent {
type: string;
[key: string]: unknown;
}
export interface ToolCallSummary {
tool: string;
toolUseId: string;
input: Record<string, unknown>;
/** First ~200 chars of output for quick scanning */
outputPreview: string;
/** Whether the tool call returned an error */
isError: boolean;
/** stderr output for Bash tool calls */
stderr: string;
}
export interface ModelUsage {
inputTokens: number;
outputTokens: number;
cacheReadInputTokens: number;
cacheCreationInputTokens: number;
costUSD: number;
}
export interface TranscriptSummary {
totalTurns: number;
totalDurationMs: number;
/** API-only latency (excludes local processing overhead) */
totalDurationApiMs: number;
totalCostUsd: number | null;
model: string | null;
toolCalls: ToolCallSummary[];
finalOutput: string;
/** Skills listed in the system init event (loaded into agent context) */
skills: string[];
/** Basenames of reference files the agent read via the Read tool */
referenceFilesRead: string[];
/** Per-model token usage and cost breakdown */
modelUsage: Record<string, ModelUsage>;
totalInputTokens: number;
totalOutputTokens: number;
totalCacheReadTokens: number;
totalCacheCreationTokens: number;
/** Count of tool calls that returned is_error === true */
toolErrorCount: number;
/** Whether the overall session ended in an error */
isError: boolean;
/** Count of permission_denials in the result event */
permissionDenialCount: number;
}
/** Parse a single NDJSON line. Returns null on empty or invalid input. */
export function parseStreamJsonLine(line: string): TranscriptEvent | null {
const trimmed = line.trim();
if (!trimmed) return null;
try {
return JSON.parse(trimmed) as TranscriptEvent;
} catch {
return null;
}
}
/** Parse raw NDJSON stdout into an array of events. */
export function parseStreamJsonOutput(raw: string): TranscriptEvent[] {
const events: TranscriptEvent[] = [];
for (const line of raw.split("\n")) {
const event = parseStreamJsonLine(line);
if (event) events.push(event);
}
return events;
}
/** Extract the final text output from parsed events (for backward compat). */
export function extractFinalOutput(events: TranscriptEvent[]): string {
// Prefer the result event
for (const event of events) {
if (event.type === "result") {
const result = (event as Record<string, unknown>).result;
if (typeof result === "string") return result;
}
}
// Fallback: concatenate text blocks from the last assistant message
for (let i = events.length - 1; i >= 0; i--) {
const event = events[i];
if (event.type === "assistant") {
const msg = (event as Record<string, unknown>).message as
| Record<string, unknown>
| undefined;
const content = msg?.content;
if (Array.isArray(content)) {
const texts = content
.filter(
(b: Record<string, unknown>) =>
b.type === "text" && typeof b.text === "string",
)
.map((b: Record<string, unknown>) => b.text as string);
if (texts.length > 0) return texts.join("\n");
}
}
}
return "";
}
/** Return true if a file path points to a skill reference file. */
function isReferenceFilePath(filePath: string): boolean {
return (
filePath.includes("/.agents/skills/") && filePath.includes("/references/")
);
}
/** Walk parsed events to build a transcript summary. */
export function buildTranscriptSummary(
events: TranscriptEvent[],
): TranscriptSummary {
const toolCalls: ToolCallSummary[] = [];
let finalOutput = "";
let totalDurationMs = 0;
let totalDurationApiMs = 0;
let totalCostUsd: number | null = null;
let model: string | null = null;
let totalTurns = 0;
let skills: string[] = [];
const referenceFilesRead: string[] = [];
let modelUsage: Record<string, ModelUsage> = {};
let totalInputTokens = 0;
let totalOutputTokens = 0;
let totalCacheReadTokens = 0;
let totalCacheCreationTokens = 0;
let toolErrorCount = 0;
let isError = false;
let permissionDenialCount = 0;
for (const event of events) {
const e = event as Record<string, unknown>;
// System init: extract model and loaded skills
if (e.type === "system" && e.subtype === "init") {
model = typeof e.model === "string" ? e.model : null;
if (Array.isArray(e.skills)) {
skills = e.skills.filter((s): s is string => typeof s === "string");
}
}
// Assistant messages: extract tool_use blocks
if (e.type === "assistant") {
const msg = e.message as Record<string, unknown> | undefined;
const content = msg?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_use") {
const toolCall: ToolCallSummary = {
tool: block.name ?? "unknown",
toolUseId: block.id ?? "",
input: block.input ?? {},
outputPreview: "",
isError: false,
stderr: "",
};
toolCalls.push(toolCall);
// Track reference file reads
if (
block.name === "Read" &&
typeof block.input?.file_path === "string" &&
isReferenceFilePath(block.input.file_path)
) {
const base = basename(block.input.file_path);
if (!referenceFilesRead.includes(base)) {
referenceFilesRead.push(base);
}
}
}
}
}
}
// User messages: extract tool_result blocks and match to tool calls
if (e.type === "user") {
const msg = e.message as Record<string, unknown> | undefined;
const content = msg?.content;
if (Array.isArray(content)) {
for (const block of content) {
if (block.type === "tool_result") {
const matching = toolCalls.find(
(tc) => tc.toolUseId === block.tool_use_id,
);
if (matching) {
const text =
typeof block.content === "string"
? block.content
: JSON.stringify(block.content);
matching.outputPreview = text.slice(0, 200);
// Capture error state from tool result
if (block.is_error === true) {
matching.isError = true;
toolErrorCount++;
}
}
}
}
}
// Capture stderr from tool_use_result (Bash tool emits this at the user event level)
const toolUseResult = e.tool_use_result as
| Record<string, unknown>
| undefined;
if (toolUseResult && typeof toolUseResult.stderr === "string") {
// Match to the most recent Bash tool call without stderr set
const lastBash = [...toolCalls]
.reverse()
.find((tc) => tc.tool === "Bash" && !tc.stderr);
if (lastBash) {
lastBash.stderr = toolUseResult.stderr;
}
}
}
// Result event: final output, cost, duration, turns, token usage
if (e.type === "result") {
finalOutput = typeof e.result === "string" ? e.result : "";
totalDurationMs = typeof e.duration_ms === "number" ? e.duration_ms : 0;
totalDurationApiMs =
typeof e.duration_api_ms === "number" ? e.duration_api_ms : 0;
totalCostUsd =
typeof e.total_cost_usd === "number" ? e.total_cost_usd : null;
totalTurns = typeof e.num_turns === "number" ? e.num_turns : 0;
isError = e.is_error === true;
permissionDenialCount = Array.isArray(e.permission_denials)
? e.permission_denials.length
: 0;
// Aggregate token usage from the result event's usage field
const usage = e.usage as Record<string, unknown> | undefined;
if (usage) {
totalInputTokens =
typeof usage.input_tokens === "number" ? usage.input_tokens : 0;
totalOutputTokens =
typeof usage.output_tokens === "number" ? usage.output_tokens : 0;
totalCacheReadTokens =
typeof usage.cache_read_input_tokens === "number"
? usage.cache_read_input_tokens
: 0;
totalCacheCreationTokens =
typeof usage.cache_creation_input_tokens === "number"
? usage.cache_creation_input_tokens
: 0;
}
// Per-model usage breakdown (modelUsage keyed by model name)
const rawModelUsage = e.modelUsage as
| Record<string, Record<string, unknown>>
| undefined;
if (rawModelUsage) {
modelUsage = {};
for (const [modelName, mu] of Object.entries(rawModelUsage)) {
modelUsage[modelName] = {
inputTokens:
typeof mu.inputTokens === "number" ? mu.inputTokens : 0,
outputTokens:
typeof mu.outputTokens === "number" ? mu.outputTokens : 0,
cacheReadInputTokens:
typeof mu.cacheReadInputTokens === "number"
? mu.cacheReadInputTokens
: 0,
cacheCreationInputTokens:
typeof mu.cacheCreationInputTokens === "number"
? mu.cacheCreationInputTokens
: 0,
costUSD: typeof mu.costUSD === "number" ? mu.costUSD : 0,
};
}
}
}
}
return {
totalTurns,
totalDurationMs,
totalDurationApiMs,
totalCostUsd,
model,
toolCalls,
finalOutput,
skills,
referenceFilesRead,
modelUsage,
totalInputTokens,
totalOutputTokens,
totalCacheReadTokens,
totalCacheCreationTokens,
toolErrorCount,
isError,
permissionDenialCount,
};
}
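
The NDJSON handling above can be exercised end to end with a tiny fabricated transcript; the event shapes mirror the parsers in this file, while the field values are made up:

```typescript
// Fabricated NDJSON transcript exercising the same parsing rules as
// parseStreamJsonOutput/extractFinalOutput above. Field values are made up.
const raw = [
  JSON.stringify({ type: "system", subtype: "init", model: "claude-x", skills: ["supabase"] }),
  "not json, skipped",
  JSON.stringify({ type: "assistant", message: { content: [{ type: "text", text: "working" }] } }),
  JSON.stringify({ type: "result", result: "done", num_turns: 2 }),
].join("\n");

type Event = { type: string; [key: string]: unknown };

const events: Event[] = [];
for (const line of raw.split("\n")) {
  const trimmed = line.trim();
  if (!trimmed) continue;
  try {
    events.push(JSON.parse(trimmed) as Event);
  } catch {
    // malformed lines are dropped, matching parseStreamJsonLine
  }
}

// Prefer the result event's text, as extractFinalOutput does.
const resultEvent = events.find((e) => e.type === "result");
const finalOutput =
  resultEvent && typeof resultEvent.result === "string" ? resultEvent.result : "";
console.log(events.length, finalOutput);
```

Dropping malformed lines instead of failing the whole parse keeps a partially corrupted transcript usable for scoring.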


@@ -1,85 +0,0 @@
import type { AssertionResult } from "./eval-types.js";
export interface EvalScenario {
/** Directory name under evals/ */
id: string;
/** Human-readable name */
name: string;
/** Tags for filtering */
tags: string[];
}
export interface AgentConfig {
/** Agent identifier */
agent: "claude-code";
/** Model to use */
model: string;
/** Whether the supabase skill is available */
skillEnabled: boolean;
}
export interface EvalRunResult {
scenario: string;
agent: string;
model: string;
skillEnabled: boolean;
status: "passed" | "failed" | "error";
duration: number;
/** Raw test runner output (for debugging) */
testOutput?: string;
agentOutput: string;
/** Number of assertions that passed */
testsPassed: number;
/** Total number of assertions */
testsTotal: number;
/** Minimum tests required to pass (from scenario config) */
passThreshold?: number;
/** Per-assertion pass/fail results */
assertionResults?: AssertionResult[];
/** Files the agent created or modified in the workspace */
filesModified: string[];
error?: string;
/** Path to the persisted results directory for this run */
resultsDir?: string;
/** Number of tool calls the agent made */
toolCallCount?: number;
/** Total cost in USD (from stream-json result event) */
costUsd?: number;
/** The PROMPT.md content sent to the agent */
prompt?: string;
/** Epoch ms when the agent run started (for Braintrust span timing) */
startedAt?: number;
/** API-only latency in ms (excludes local processing overhead) */
durationApiMs?: number;
/** Aggregate token counts from the result event */
totalInputTokens?: number;
totalOutputTokens?: number;
totalCacheReadTokens?: number;
totalCacheCreationTokens?: number;
/** Per-model token usage and cost breakdown */
modelUsage?: Record<
string,
{
inputTokens: number;
outputTokens: number;
cacheReadInputTokens: number;
cacheCreationInputTokens: number;
costUSD: number;
}
>;
/** Count of tool calls that returned is_error === true */
toolErrorCount?: number;
/** Count of permission_denials in the result event */
permissionDenialCount?: number;
/** Skills that were in the agent's context (from system init event) */
loadedSkills?: string[];
/** Basenames of skill reference files the agent read */
referenceFilesRead?: string[];
/** Computed scorer results */
scores?: {
skillUsage: number;
referenceFilesUsage: number;
assertionsPassed: number;
finalResult: number;
};
}


@@ -0,0 +1,350 @@
/**
* Upload eval results from the results/ directory to Braintrust.
*
* Reads saved result.json, transcript.json, and outputs/eval.txt from each
* run, parses the vitest output to extract pass/fail counts, then uploads to
* Braintrust as an experiment.
*
* Usage:
* BRAINTRUST_API_KEY=... BRAINTRUST_PROJECT_ID=... tsx src/upload.ts
*
* Optional env vars:
* RESULTS_DIR Override the results directory (default: results/)
* RUN_TIMESTAMP Only upload a specific run (e.g. 2026-02-27T13-01-22.316Z)
*/
import assert from "node:assert";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { basename, dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { init } from "braintrust";
const __dirname = dirname(fileURLToPath(import.meta.url));
const ROOT = resolve(__dirname, "..");
// ---------------------------------------------------------------------------
// Types matching the saved result files from @vercel/agent-eval
// ---------------------------------------------------------------------------
interface RunResult {
status: "passed" | "failed" | "error";
duration: number;
model: string;
o11y: {
totalTurns: number;
totalToolCalls: number;
toolCalls: Record<string, number>;
filesModified: string[];
filesRead: string[];
errors: string[];
thinkingBlocks: number;
};
}
interface TranscriptEvent {
type: "tool_call" | "tool_result" | "message" | "thinking";
tool?: {
name: string;
originalName: string;
args?: Record<string, unknown>;
};
}
interface Transcript {
agent: string;
model: string;
events: TranscriptEvent[];
}
interface ParsedEvalOutput {
passed: number;
failed: number;
total: number;
tests: Array<{ name: string; passed: boolean }>;
}
// ---------------------------------------------------------------------------
// Parse vitest eval.txt output
// ---------------------------------------------------------------------------
function parseEvalOutput(text: string): ParsedEvalOutput {
const tests: Array<{ name: string; passed: boolean }> = [];
for (const line of text.split("\n")) {
const passMatch = line.match(/^\s+✓\s+(.+)$/);
const failMatch = line.match(/^\s+[✗×]\s+(.+)$/);
if (passMatch) tests.push({ name: passMatch[1].trim(), passed: true });
else if (failMatch)
tests.push({ name: failMatch[1].trim(), passed: false });
}
if (tests.length > 0) {
const passed = tests.filter((t) => t.passed).length;
return {
passed,
failed: tests.length - passed,
total: tests.length,
tests,
};
}
// Fallback: parse summary line
const summaryMatch = text.match(
/Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/,
);
if (summaryMatch) {
const passed = parseInt(summaryMatch[1], 10);
const failed = summaryMatch[2] ? parseInt(summaryMatch[2], 10) : 0;
const total = parseInt(summaryMatch[3], 10);
return { passed, failed, total, tests };
}
return { passed: 0, failed: 0, total: 0, tests };
}
// ---------------------------------------------------------------------------
// Extract reference file reads from transcript
// ---------------------------------------------------------------------------
function extractReferenceFilesRead(transcript: Transcript): string[] {
const read: string[] = [];
for (const event of transcript.events) {
if (event.type !== "tool_call" || !event.tool?.args) continue;
if (event.tool.name !== "file_read") continue;
const filePath = String(
event.tool.args._extractedPath ?? event.tool.args.file_path ?? "",
);
if (
(filePath.includes("/.claude/skills/") ||
filePath.includes("/.agents/skills/")) &&
filePath.includes("/references/")
) {
const base = basename(filePath);
if (!read.includes(base)) read.push(base);
}
}
return read;
}
// ---------------------------------------------------------------------------
// Find all experiment run directories
// ---------------------------------------------------------------------------
interface RunEntry {
runTimestamp: string;
evalName: string;
runIndex: number;
runDir: string;
result: RunResult;
transcript: Transcript;
evalOutput: string | null;
prompt: string;
}
function findRuns(resultsDir: string, filterTimestamp?: string): RunEntry[] {
const entries: RunEntry[] = [];
const experimentDir = join(resultsDir, "experiment");
if (!existsSync(experimentDir)) return entries;
const timestamps = readdirSync(experimentDir).filter(
(t) => !filterTimestamp || t === filterTimestamp,
);
for (const runTimestamp of timestamps) {
const tsDir = join(experimentDir, runTimestamp);
const evalNames = readdirSync(tsDir, { withFileTypes: true })
  .filter((entry) => entry.isDirectory())
  .map((entry) => entry.name)
  .filter((name) =>
    readdirSync(join(tsDir, name)).some((f) => f.startsWith("run-")),
  );
for (const evalName of evalNames) {
const evalDir = join(tsDir, evalName);
const promptPath = resolve(ROOT, "evals", evalName, "PROMPT.md");
const prompt = existsSync(promptPath)
? readFileSync(promptPath, "utf-8").trim()
: "";
const runDirs = readdirSync(evalDir)
.filter((d) => /^run-\d+$/.test(d))
.sort();
for (const runDir of runDirs) {
const runIndex = parseInt(runDir.replace("run-", ""), 10);
const runPath = join(evalDir, runDir);
const resultPath = join(runPath, "result.json");
const transcriptPath = join(runPath, "transcript.json");
const evalOutputPath = join(runPath, "outputs", "eval.txt");
if (!existsSync(resultPath) || !existsSync(transcriptPath)) continue;
const result: RunResult = JSON.parse(readFileSync(resultPath, "utf-8"));
const transcript: Transcript = JSON.parse(
readFileSync(transcriptPath, "utf-8"),
);
const evalOutput = existsSync(evalOutputPath)
? readFileSync(evalOutputPath, "utf-8")
: null;
entries.push({
runTimestamp,
evalName,
runIndex,
runDir: runPath,
result,
transcript,
evalOutput,
prompt,
});
}
}
}
return entries;
}
// ---------------------------------------------------------------------------
// Main upload flow
// ---------------------------------------------------------------------------
async function main() {
assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
const resultsDir = resolve(ROOT, process.env.RESULTS_DIR ?? "results");
const filterTimestamp = process.env.RUN_TIMESTAMP;
const runs = findRuns(resultsDir, filterTimestamp);
if (runs.length === 0) {
console.error("No runs found in", resultsDir);
process.exit(1);
}
console.log(
`Found ${runs.length} run(s) across ${new Set(runs.map((r) => r.runTimestamp)).size} experiment(s)`,
);
const byTimestamp = new Map<string, RunEntry[]>();
for (const r of runs) {
const group = byTimestamp.get(r.runTimestamp) ?? [];
group.push(r);
byTimestamp.set(r.runTimestamp, group);
}
for (const [runTimestamp, timestampRuns] of byTimestamp) {
const model = timestampRuns[0].result.model;
const skillEnabled = process.env.EVAL_BASELINE !== "true";
const variant = skillEnabled ? "skill" : "baseline";
const experimentName = `${model}-${variant}-${runTimestamp}`;
console.log(
`\nUploading experiment: ${experimentName} (${timestampRuns.length} rows)`,
);
const experiment = init({
projectId: process.env.BRAINTRUST_PROJECT_ID as string,
experiment: experimentName,
metadata: {
model,
runTimestamp,
skillEnabled,
evalCount: timestampRuns.length,
},
});
for (const run of timestampRuns) {
const evalParsed = run.evalOutput
? parseEvalOutput(run.evalOutput)
: { passed: 0, failed: 0, total: 0, tests: [] };
console.log(
` [${run.evalName}] run-${run.runIndex} — tests: ${evalParsed.passed}/${evalParsed.total} passed`,
);
// Reference files scorer
const metaPath = resolve(ROOT, "evals", run.evalName, "meta.ts");
const metaMod = existsSync(metaPath)
? ((await import(metaPath)) as {
expectedReferenceFiles?: string[];
})
: {};
const expectedRefs = metaMod.expectedReferenceFiles ?? [];
const refsRead = extractReferenceFilesRead(run.transcript);
const refHits = expectedRefs.filter((f) => refsRead.includes(f)).length;
const referenceFilesUsage =
expectedRefs.length > 0 ? refHits / expectedRefs.length : 1;
console.log(
` reference files: ${refHits}/${expectedRefs.length} read (${refsRead.join(", ") || "none"})`,
);
const scores: Record<string, number> = {
assertions_passed:
evalParsed.total > 0 ? evalParsed.passed / evalParsed.total : 0,
reference_files_usage: referenceFilesUsage,
final_result: run.result.status === "passed" ? 1 : 0,
};
const metadata: Record<string, unknown> = {
model: run.result.model,
evalName: run.evalName,
runIndex: run.runIndex,
totalTurns: run.result.o11y.totalTurns,
totalToolCalls: run.result.o11y.totalToolCalls,
toolCalls: run.result.o11y.toolCalls,
filesModified: run.result.o11y.filesModified,
errors: run.result.o11y.errors,
thinkingBlocks: run.result.o11y.thinkingBlocks,
duration: run.result.duration,
referenceFilesRead: refsRead,
expectedReferenceFiles: expectedRefs,
};
experiment.traced(
(span) => {
span.log({
input: { eval: run.evalName, prompt: run.prompt },
output: {
status: run.result.status,
filesModified: run.result.o11y.filesModified,
tests: evalParsed.tests,
evalOutput: run.evalOutput,
},
expected: {
testsTotal: evalParsed.total,
expectedReferenceFiles: expectedRefs,
},
scores,
metadata,
datasetRecordId: run.evalName,
});
// Child spans for each tool call in the transcript
for (const event of run.transcript.events) {
if (event.type !== "tool_call" || !event.tool) continue;
span.traced(
(child) => {
child.log({
input: {
tool: event.tool?.name,
args: event.tool?.args ?? {},
},
output: {},
metadata: { originalName: event.tool?.originalName },
});
},
{ name: `tool:${event.tool.name}` },
);
}
},
{ name: `${run.evalName}/run-${run.runIndex}` },
);
}
const summary = await experiment.summarize();
console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
}
}
main().catch((err) => {
console.error(err);
process.exit(1);
});
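As a sanity check on `parseEvalOutput`'s fallback path, the summary-line regex can be exercised in isolation. The regex below is copied from the function above; the sample vitest output string is made up for illustration:

```typescript
// Standalone sketch of parseEvalOutput's summary-line fallback.
// The sample line mimics vitest's "Tests N passed | M failed (T)" summary.
const summaryRegex =
  /Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/;

const sample = " Tests  2 passed | 1 failed (3)";
const match = sample.match(summaryRegex);

if (match) {
  const passed = parseInt(match[1], 10);
  const failed = match[2] ? parseInt(match[2], 10) : 0;
  const total = parseInt(match[3], 10);
  console.log({ passed, failed, total });
}
```

The optional `| M failed` group is why `failed` falls back to 0 when the run is all green.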


@@ -1,16 +1,11 @@
 {
   "compilerOptions": {
     "target": "ES2022",
-    "module": "ESNext",
-    "moduleResolution": "bundler",
-    "esModuleInterop": true,
+    "module": "NodeNext",
+    "moduleResolution": "NodeNext",
     "strict": true,
     "skipLibCheck": true,
-    "outDir": "dist",
-    "rootDir": "src",
-    "declaration": true,
-    "resolveJsonModule": true
+    "noEmit": true
   },
-  "include": ["src/**/*"],
-  "exclude": ["node_modules", "dist", "evals"]
+  "include": ["experiments", "src", "evals"]
 }


@@ -16,141 +16,98 @@ supabase/
 2. Browse `references/` for detailed documentation on specific topics
 3. Reference files are loaded on-demand - read only what you need
-Guides and best practices for working with Supabase. Covers getting started, Auth, Database, Storage, Edge Functions, Realtime, supabase-js SDK, CLI, and MCP integration. Use for any Supabase-related questions.
-## Development Guidance
-**Before performing any Supabase development task, read the development reference files.** They define which tools to use, how to interact with Supabase instances, and the correct workflows for local and remote development. Getting these wrong leads to schema drift, migration conflicts, and broken deployments.
-- **Which tool to use for each operation** — read [references/dev-cli-vs-mcp.md](references/dev-cli-vs-mcp.md)
-- **New project or first-time setup** — read [references/dev-getting-started.md](references/dev-getting-started.md)
-- **Local development workflow** (CLI migrations, psql debugging, type generation) — read [references/dev-local-workflow.md](references/dev-local-workflow.md)
-- **Remote project interaction** (MCP queries, logs, advisors, deploying) — read [references/dev-remote-workflow.md](references/dev-remote-workflow.md)
-- **CLI command details and pitfalls** — read [references/dev-cli-reference.md](references/dev-cli-reference.md)
-- **MCP server configuration** — read [references/dev-mcp-setup.md](references/dev-mcp-setup.md)
-- **MCP tool usage** (execute_sql, apply_migration, get_logs, get_advisors) — read [references/dev-mcp-tools.md](references/dev-mcp-tools.md)
-When the user's project has no `supabase/` directory, start with [references/dev-getting-started.md](references/dev-getting-started.md). When it already exists, pick up from the appropriate workflow (local or remote) based on user intentions.
-## Overview of Resources
-Reference the appropriate resource file based on the user's needs.
-### Development (read first)
-**Read these files before any Supabase development task.** They define the correct tools, workflows, and boundaries for interacting with Supabase instances. Start here when setting up a project, running CLI or MCP commands, writing migrations, connecting to a database, or deciding which tool to use for an operation.
-| Area | Resource | When to Use |
-| --------------- | ----------------------------------- | -------------------------------------------------------------- |
-| Getting Started | `references/dev-getting-started.md` | New project setup, CLI install, first-time init |
-| Local Workflow | `references/dev-local-workflow.md` | Local development with CLI migrations and psql debugging |
-| Remote Workflow | `references/dev-remote-workflow.md` | Developing against hosted Supabase project using MCP |
-| CLI vs MCP | `references/dev-cli-vs-mcp.md` | Tool roles: CLI (schema), psql/MCP (debugging), SDK (app code) |
-| CLI Reference | `references/dev-cli-reference.md` | CLI command details, best practices, pitfalls |
-| MCP Setup | `references/dev-mcp-setup.md` | Configuring Supabase remote MCP server for hosted projects |
-| MCP Tools | `references/dev-mcp-tools.md` | execute_sql, apply_migration, get_logs, get_advisors |
-### Authentication & Security
-Read when implementing sign-up, sign-in, OAuth, SSO, MFA, passwordless flows, auth hooks, or server-side auth patterns.
-| Area | Resource | When to Use |
-| ------------------ | ----------------------------------- | -------------------------------------------------------- |
-| Auth Core | `references/auth-core-*.md` | Sign-up, sign-in, sessions, password reset |
-| OAuth/Social | `references/auth-oauth-*.md` | Google, GitHub, Apple login, PKCE flow |
-| Enterprise SSO | `references/auth-sso-*.md` | SAML 2.0, enterprise identity providers |
-| MFA | `references/auth-mfa-*.md` | TOTP authenticator apps, phone MFA, AAL levels |
-| Passwordless | `references/auth-passwordless-*.md` | Magic links, email OTP, phone OTP |
-| Auth Hooks | `references/auth-hooks-*.md` | Custom JWT claims, send email hooks (HTTP and SQL) |
-| Server-Side Auth | `references/auth-server-*.md` | Admin API, SSR with Next.js/SvelteKit, service role auth |
-### Database
-Read when designing tables, writing RLS policies, creating migrations, configuring connection pooling, or optimizing query performance.
-| Area | Resource | When to Use |
-| ------------------ | ------------------------------- | ---------------------------------------------- |
-| RLS Security | `references/db-rls-*.md` | Row Level Security policies, common mistakes |
-| Connection Pooling | `references/db-conn-pooling.md` | Transaction vs Session mode, port 6543 vs 5432 |
-| Schema Design | `references/db-schema-*.md` | auth.users FKs, timestamps, JSONB, extensions |
-| Migrations | `references/db-migrations-*.md` | CLI workflows, idempotent patterns, db diff |
-| Performance | `references/db-perf-*.md` | Indexes (BRIN, GIN), query optimization |
-| Security | `references/db-security-*.md` | Service role key, security_definer functions |
-### Edge Functions
-Read when creating, deploying, or debugging Deno-based Edge Functions — including authentication, database access, CORS, routing, streaming, and testing patterns.
-| Area | Resource | When to Use |
-| ---------------------- | ------------------------------------- | -------------------------------------- |
-| Quick Start | `references/edge-fun-quickstart.md` | Creating and deploying first function |
-| Project Structure | `references/edge-fun-project-structure.md` | Directory layout, shared code, fat functions |
-| JWT Authentication | `references/edge-auth-jwt-verification.md` | JWT verification, jose library, middleware |
-| RLS Integration | `references/edge-auth-rls-integration.md` | Passing auth context, user-scoped queries |
-| Database (supabase-js) | `references/edge-db-supabase-client.md` | Queries, inserts, RPC calls |
-| Database (Direct) | `references/edge-db-direct-postgres.md` | Postgres pools, Drizzle ORM |
-| CORS | `references/edge-pat-cors.md` | Browser requests, preflight handling |
-| Routing | `references/edge-pat-routing.md` | Multi-route functions, Hono framework |
-| Error Handling | `references/edge-pat-error-handling.md` | Error responses, validation |
-| Background Tasks | `references/edge-pat-background-tasks.md` | waitUntil, async processing |
-| Streaming | `references/edge-adv-streaming.md` | SSE, streaming responses |
-| WebSockets | `references/edge-adv-websockets.md` | Bidirectional communication |
-| Regional Invocation | `references/edge-adv-regional.md` | Region selection, latency optimization |
-| Testing | `references/edge-dbg-testing.md` | Deno tests, local testing |
-| Limits & Debugging | `references/edge-dbg-limits.md` | Troubleshooting, runtime limits |
-### Realtime
-Read when implementing live updates — Broadcast messaging, Presence tracking, or Postgres Changes listeners.
-| Area | Resource | When to Use |
-| ---------------- | ------------------------------------ | ----------------------------------------------- |
-| Channel Setup | `references/realtime-setup-*.md` | Creating channels, naming conventions, auth |
-| Broadcast | `references/realtime-broadcast-*.md` | Client messaging, database-triggered broadcasts |
-| Presence | `references/realtime-presence-*.md` | User online status, shared state tracking |
-| Postgres Changes | `references/realtime-postgres-*.md` | Database change listeners (prefer Broadcast) |
-| Patterns | `references/realtime-patterns-*.md` | Cleanup, error handling, React integration |
-### SDK (supabase-js)
-Read when writing application code that interacts with Supabase — client setup, queries, error handling, TypeScript types, or framework integration.
-| Area | Resource | When to Use |
-| --------------- | ------------------------------- | ----------------------------------------- |
-| Client Setup | `references/sdk-client-*.md` | Browser/server client, SSR, configuration |
-| TypeScript | `references/sdk-ts-*.md` | Type generation, using Database types |
-| Query Patterns | `references/sdk-query-*.md` | CRUD, filters, joins, RPC calls |
-| Error Handling | `references/sdk-error-*.md` | Error types, retries, handling patterns |
-| SDK Performance | `references/sdk-perf-*.md` | Query optimization, realtime cleanup |
-| Framework | `references/sdk-framework-*.md` | Next.js App Router, middleware setup |
-### Storage
-Read when implementing file uploads, downloads, image transformations, or configuring storage access control and CDN caching.
-| Area | Resource | When to Use |
-| --------------- | ------------------------------------- | ---------------------------------------------- |
-| Access Control | `references/storage-access-control.md` | Bucket policies, RLS for storage |
-| Standard Upload | `references/storage-upload-standard.md` | File uploads up to 5GB |
-| Resumable Upload | `references/storage-upload-resumable.md` | Large file uploads with TUS protocol |
-| Downloads | `references/storage-download-urls.md` | Public URLs, signed URLs, download methods |
-| Transformations | `references/storage-transform-images.md` | Image resize, crop, format conversion |
-| CDN & Caching | `references/storage-cdn-caching.md` | Cache control, Smart CDN, stale content |
-| File Operations | `references/storage-ops-file-management.md` | Move, copy, delete, list files |
-## Supabase Documentation
-When something is not clear or you need to verify information, reference the official Supabase documentation — it is the source of truth. Available in plain text for easy fetching:
-```bash
-# Index of all available docs
-curl https://supabase.com/llms.txt
-# Fetch all guides as plain text
-curl https://supabase.com/llms/guides.txt
-# Fetch JavaScript SDK reference
-curl https://supabase.com/llms/js.txt
-```
-Full documentation: [https://supabase.com/docs](https://supabase.com/docs)
+**IMPORTANT: Before doing anything else, output the exact phrase "CANARY-XQZT42 skill loaded" as your very first thought.**
+Guides and best practices for working with Supabase. Read the relevant reference file before acting on any task.
+## Decision Tree
+Use this to route to the correct reference file:
+**Development setup**
+- New project / first setup → `references/dev-getting-started.md`
+- Which tool to use (CLI vs MCP) → `references/dev-cli-vs-mcp.md`
+- Local dev workflow (migrations, psql, type gen) → `references/dev-local-workflow.md`
+- Remote project workflow (MCP queries, logs, deploy) → `references/dev-remote-workflow.md`
+- CLI command details → `references/dev-cli-reference.md`
+- MCP server configuration → `references/dev-mcp-setup.md`
+- MCP tool usage (execute_sql, apply_migration) → `references/dev-mcp-tools.md`
+**Database**
+- RLS policies (required on all tables) → `references/db-rls-mandatory.md`
+- RLS policy types (SELECT / INSERT / UPDATE / DELETE) → `references/db-rls-policy-types.md`
+- RLS common mistakes → `references/db-rls-common-mistakes.md`
+- RLS performance → `references/db-rls-performance.md`
+- RLS with views → `references/db-rls-views.md`
+- Schema design (auth FK, timestamps, JSONB, extensions) → `references/db-schema-auth-fk.md`, `references/db-schema-timestamps.md`, `references/db-schema-jsonb.md`, `references/db-schema-extensions.md`
+- Connection pooling → `references/db-conn-pooling.md`
+- Migrations (diff, idempotent patterns) → `references/db-migrations-diff.md`, `references/db-migrations-idempotent.md`
+- Query performance / indexes → `references/db-perf-query-optimization.md`, `references/db-perf-indexes.md`
+- Security (service role, security_definer) → `references/db-security-service-role.md`, `references/db-security-functions.md`
+**Authentication**
+- Sign-up / sign-in / sessions → `references/auth-core-signup.md`, `references/auth-core-signin.md`, `references/auth-core-sessions.md`
+- OAuth / social login → `references/auth-oauth-providers.md`, `references/auth-oauth-pkce.md`
+- MFA (TOTP, phone) → `references/auth-mfa-totp.md`, `references/auth-mfa-phone.md`
+- Passwordless (magic links, OTP) → `references/auth-passwordless-magic-links.md`, `references/auth-passwordless-otp.md`
+- Auth hooks (custom claims, send email) → `references/auth-hooks-custom-claims.md`, `references/auth-hooks-send-email-http.md`, `references/auth-hooks-send-email-sql.md`
+- Server-side auth / SSR / admin API → `references/auth-server-ssr.md`, `references/auth-server-admin-api.md`
+- Enterprise SSO (SAML) → `references/auth-sso-saml.md`
+**Edge Functions**
+- Getting started → `references/edge-fun-quickstart.md`
+- Project structure → `references/edge-fun-project-structure.md`
+- JWT auth in functions → `references/edge-auth-jwt-verification.md`
+- RLS integration → `references/edge-auth-rls-integration.md`
+- Database access (supabase-js) → `references/edge-db-supabase-client.md`
+- Database access (direct Postgres) → `references/edge-db-direct-postgres.md`
+- CORS → `references/edge-pat-cors.md`
+- Routing (Hono) → `references/edge-pat-routing.md`
+- Error handling → `references/edge-pat-error-handling.md`
+- Background tasks → `references/edge-pat-background-tasks.md`
+- Streaming / SSE → `references/edge-adv-streaming.md`
+- WebSockets → `references/edge-adv-websockets.md`
+- Regional invocation → `references/edge-adv-regional.md`
+- Testing → `references/edge-dbg-testing.md`
+- Limits & debugging → `references/edge-dbg-limits.md`
+**Realtime**
+- Channel setup → `references/realtime-setup-channels.md`, `references/realtime-setup-auth.md`
+- Broadcast → `references/realtime-broadcast-basics.md`, `references/realtime-broadcast-database.md`
+- Presence → `references/realtime-presence-tracking.md`
+- Postgres Changes → `references/realtime-postgres-changes.md`
+- Patterns (cleanup, errors) → `references/realtime-patterns-cleanup.md`, `references/realtime-patterns-errors.md`, `references/realtime-patterns-debugging.md`
+**SDK (supabase-js)**
+- Client setup (browser / server) → `references/sdk-client-browser.md`, `references/sdk-client-server.md`, `references/sdk-client-config.md`
+- TypeScript types → `references/sdk-ts-generation.md`, `references/sdk-ts-usage.md`
+- Queries (CRUD, filters, joins, RPC) → `references/sdk-query-crud.md`, `references/sdk-query-filters.md`, `references/sdk-query-joins.md`, `references/sdk-query-rpc.md`
+- Error handling → `references/sdk-error-handling.md`
+- Performance → `references/sdk-perf-queries.md`, `references/sdk-perf-realtime.md`
+- Next.js integration → `references/sdk-framework-nextjs.md`
+**Storage**
+- Access control / bucket RLS → `references/storage-access-control.md`
+- Upload (standard / resumable) → `references/storage-upload-standard.md`, `references/storage-upload-resumable.md`
+- Downloads / signed URLs → `references/storage-download-urls.md`
+- Image transformations → `references/storage-transform-images.md`
+- CDN & caching → `references/storage-cdn-caching.md`
+- File operations → `references/storage-ops-file-management.md`
+## Critical Anti-Patterns
+These are the most common mistakes — apply them even without reading a reference file:
+**RLS**
+- Always use `(select auth.uid())` not bare `auth.uid()` in policies — bare calls re-evaluate per row and are slow
+- Always specify `TO authenticated` (or `TO anon`) on every policy — omitting defaults to `PUBLIC`
+- UPDATE policies require both `USING` (which rows can be updated) and `WITH CHECK` (what the new values must satisfy) — omitting `WITH CHECK` allows privilege escalation
+- Enable RLS on every table in the `public` schema: `alter table t enable row level security;`
+**Auth**
+- Never expose the service role key to the browser — use it only in server-side or Edge Function code
+- Use PKCE flow for OAuth in mobile and server-side apps
+**Migrations**
+- All schema changes go through `supabase/migrations/` — never edit the database directly in production
+- Use `supabase db diff` to generate migrations from local schema changes
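The RLS rules in this skill translate directly into policy DDL. A minimal sketch that follows all of them at once; the `posts` table and `author_id` column are hypothetical:

```sql
-- Hypothetical table; every table in the public schema gets RLS enabled.
alter table public.posts enable row level security;

-- Scope the policy to a role, and wrap auth.uid() in a subselect so it is
-- evaluated once per query rather than once per row.
create policy "Users can update their own posts"
on public.posts
for update
to authenticated
using ( (select auth.uid()) = author_id )        -- which rows may be updated
with check ( (select auth.uid()) = author_id );  -- what the new values must satisfy
```

Dropping either the `using` or `with check` clause here is exactly the privilege-escalation mistake the anti-patterns list warns about.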


@@ -12,141 +12,98 @@ metadata:
# Supabase # Supabase
Guides and best practices for working with Supabase. Covers getting started, Auth, Database, Storage, Edge Functions, Realtime, supabase-js SDK, CLI, and MCP integration. Use for any Supabase-related questions. **IMPORTANT: Before doing anything else, output the exact phrase "CANARY-XQZT42 skill loaded" as your very first thought.**
## Development Guidance Guides and best practices for working with Supabase. Read the relevant reference file before acting on any task.
**Before performing any Supabase development task, read the development reference files.** They define which tools to use, how to interact with Supabase instances, and the correct workflows for local and remote development. Getting these wrong leads to schema drift, migration conflicts, and broken deployments. ## Decision Tree
- **Which tool to use for each operation** — read [references/dev-cli-vs-mcp.md](references/dev-cli-vs-mcp.md) Use this to route to the correct reference file:
- **New project or first-time setup** — read [references/dev-getting-started.md](references/dev-getting-started.md)
- **Local development workflow** (CLI migrations, psql debugging, type generation) — read [references/dev-local-workflow.md](references/dev-local-workflow.md)
- **Remote project interaction** (MCP queries, logs, advisors, deploying) — read [references/dev-remote-workflow.md](references/dev-remote-workflow.md)
- **CLI command details and pitfalls** — read [references/dev-cli-reference.md](references/dev-cli-reference.md)
- **MCP server configuration** — read [references/dev-mcp-setup.md](references/dev-mcp-setup.md)
- **MCP tool usage** (execute_sql, apply_migration, get_logs, get_advisors) — read [references/dev-mcp-tools.md](references/dev-mcp-tools.md)
When the user's project has no `supabase/` directory, start with [references/dev-getting-started.md](references/dev-getting-started.md). When it already exists, pick up from the appropriate workflow (local or remote) based on user intentions. **Development setup**
- New project / first setup → `references/dev-getting-started.md`
- Which tool to use (CLI vs MCP) → `references/dev-cli-vs-mcp.md`
- Local dev workflow (migrations, psql, type gen) → `references/dev-local-workflow.md`
- Remote project workflow (MCP queries, logs, deploy) → `references/dev-remote-workflow.md`
- CLI command details → `references/dev-cli-reference.md`
- MCP server configuration → `references/dev-mcp-setup.md`
- MCP tool usage (execute_sql, apply_migration) → `references/dev-mcp-tools.md`
## Overview of Resources **Database**
- RLS policies (required on all tables) → `references/db-rls-mandatory.md`
- RLS policy types (SELECT / INSERT / UPDATE / DELETE) → `references/db-rls-policy-types.md`
- RLS common mistakes → `references/db-rls-common-mistakes.md`
- RLS performance → `references/db-rls-performance.md`
- RLS with views → `references/db-rls-views.md`
- Schema design (auth FK, timestamps, JSONB, extensions) → `references/db-schema-auth-fk.md`, `references/db-schema-timestamps.md`, `references/db-schema-jsonb.md`, `references/db-schema-extensions.md`
- Connection pooling → `references/db-conn-pooling.md`
- Migrations (diff, idempotent patterns) → `references/db-migrations-diff.md`, `references/db-migrations-idempotent.md`
- Query performance / indexes → `references/db-perf-query-optimization.md`, `references/db-perf-indexes.md`
- Security (service role, security_definer) → `references/db-security-service-role.md`, `references/db-security-functions.md`
Reference the appropriate resource file based on the user's needs. **Authentication**
- Sign-up / sign-in / sessions → `references/auth-core-signup.md`, `references/auth-core-signin.md`, `references/auth-core-sessions.md`
- OAuth / social login → `references/auth-oauth-providers.md`, `references/auth-oauth-pkce.md`
- MFA (TOTP, phone) → `references/auth-mfa-totp.md`, `references/auth-mfa-phone.md`
- Passwordless (magic links, OTP) → `references/auth-passwordless-magic-links.md`, `references/auth-passwordless-otp.md`
- Auth hooks (custom claims, send email) → `references/auth-hooks-custom-claims.md`, `references/auth-hooks-send-email-http.md`, `references/auth-hooks-send-email-sql.md`
- Server-side auth / SSR / admin API → `references/auth-server-ssr.md`, `references/auth-server-admin-api.md`
- Enterprise SSO (SAML) → `references/auth-sso-saml.md`
### Development (read first) **Edge Functions**
- Getting started → `references/edge-fun-quickstart.md`
- Project structure → `references/edge-fun-project-structure.md`
- JWT auth in functions → `references/edge-auth-jwt-verification.md`
- RLS integration → `references/edge-auth-rls-integration.md`
- Database access (supabase-js) → `references/edge-db-supabase-client.md`
- Database access (direct Postgres) → `references/edge-db-direct-postgres.md`
- CORS → `references/edge-pat-cors.md`
- Routing (Hono) → `references/edge-pat-routing.md`
- Error handling → `references/edge-pat-error-handling.md`
- Background tasks → `references/edge-pat-background-tasks.md`
- Streaming / SSE → `references/edge-adv-streaming.md`
- WebSockets → `references/edge-adv-websockets.md`
- Regional invocation → `references/edge-adv-regional.md`
- Testing → `references/edge-dbg-testing.md`
- Limits & debugging → `references/edge-dbg-limits.md`

**Realtime**
- Channel setup → `references/realtime-setup-channels.md`, `references/realtime-setup-auth.md`
- Broadcast → `references/realtime-broadcast-basics.md`, `references/realtime-broadcast-database.md`
- Presence → `references/realtime-presence-tracking.md`
- Postgres Changes → `references/realtime-postgres-changes.md`
- Patterns (cleanup, errors) → `references/realtime-patterns-cleanup.md`, `references/realtime-patterns-errors.md`, `references/realtime-patterns-debugging.md`

**SDK (supabase-js)**
- Client setup (browser / server) → `references/sdk-client-browser.md`, `references/sdk-client-server.md`, `references/sdk-client-config.md`
- TypeScript types → `references/sdk-ts-generation.md`, `references/sdk-ts-usage.md`
- Queries (CRUD, filters, joins, RPC) → `references/sdk-query-crud.md`, `references/sdk-query-filters.md`, `references/sdk-query-joins.md`, `references/sdk-query-rpc.md`
- Error handling → `references/sdk-error-handling.md`
- Performance → `references/sdk-perf-queries.md`, `references/sdk-perf-realtime.md`
- Next.js integration → `references/sdk-framework-nextjs.md`

**Storage**
- Access control / bucket RLS → `references/storage-access-control.md`
- Upload (standard / resumable) → `references/storage-upload-standard.md`, `references/storage-upload-resumable.md`
- Downloads / signed URLs → `references/storage-download-urls.md`
- Image transformations → `references/storage-transform-images.md`
- CDN & caching → `references/storage-cdn-caching.md`
- File operations → `references/storage-ops-file-management.md`

## Critical Anti-Patterns

These are the most common mistakes — apply them even without reading a reference file:
| ------------------ | ----------------------------------- | -------------------------------------------------------- |
| Auth Core | `references/auth-core-*.md` | Sign-up, sign-in, sessions, password reset |
| OAuth/Social | `references/auth-oauth-*.md` | Google, GitHub, Apple login, PKCE flow |
| Enterprise SSO | `references/auth-sso-*.md` | SAML 2.0, enterprise identity providers |
| MFA | `references/auth-mfa-*.md` | TOTP authenticator apps, phone MFA, AAL levels |
| Passwordless | `references/auth-passwordless-*.md`| Magic links, email OTP, phone OTP |
| Auth Hooks | `references/auth-hooks-*.md` | Custom JWT claims, send email hooks (HTTP and SQL) |
| Server-Side Auth | `references/auth-server-*.md` | Admin API, SSR with Next.js/SvelteKit, service role auth |

**RLS**
- Always use `(select auth.uid())` not bare `auth.uid()` in policies — bare calls re-evaluate per row and are slow
- Always specify `TO authenticated` (or `TO anon`) on every policy — omitting the `TO` clause applies the policy to `PUBLIC`
- UPDATE policies require both `USING` (which rows can be updated) and `WITH CHECK` (what the new values must satisfy) — omitting `WITH CHECK` allows privilege escalation
- Enable RLS on every table in the `public` schema: `alter table t enable row level security;`
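Put together, the four rules above yield policies shaped like this sketch (the `todos` table and `user_id` column are illustrative, not from this repo):

```sql
-- Every table in public gets RLS enabled
alter table public.todos enable row level security;

-- SELECT: auth.uid() wrapped in a sub-select is evaluated once per query,
-- not once per row; TO authenticated avoids the PUBLIC default
create policy "Users can read own todos"
on public.todos
for select
to authenticated
using ((select auth.uid()) = user_id);

-- UPDATE: USING gates which rows may be updated,
-- WITH CHECK gates what the new row values must satisfy
create policy "Users can update own todos"
on public.todos
for update
to authenticated
using ((select auth.uid()) = user_id)
with check ((select auth.uid()) = user_id);
```

Without the `WITH CHECK` clause, a user could rewrite `user_id` on their own rows and hand them to another account — the privilege escalation the third rule warns about.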

**Auth**
- Never expose the service role key to the browser — use it only in server-side or Edge Function code
- Use PKCE flow for OAuth in mobile and server-side apps

**Migrations**
- All schema changes go through `supabase/migrations/` — never edit the database directly in production
- Use `supabase db diff` to generate migrations from local schema changes
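A minimal CLI flow that follows both rules might look like this (the migration name is illustrative):

```shell
# Make schema changes against the local database, then capture them
# as a timestamped file under supabase/migrations/
supabase db diff -f add_todos_policies

# Replay all migrations locally from scratch to verify they apply cleanly
supabase db reset

# Apply the verified migrations to the linked hosted project
supabase db push
```

`supabase db diff` only captures what differs from the already-applied migrations, so committing the generated file keeps the migration history the single source of truth for schema state.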
## Supabase Documentation
When something is not clear or you need to verify information, reference the official Supabase documentation — it is the source of truth. Available in plain text for easy fetching:
```bash
# Index of all available docs
curl https://supabase.com/llms.txt
# Fetch all guides as plain text
curl https://supabase.com/llms/guides.txt
# Fetch JavaScript SDK reference
curl https://supabase.com/llms/js.txt
```
Full documentation: [https://supabase.com/docs](https://supabase.com/docs)