mirror of
https://github.com/supabase/agent-skills.git
synced 2026-03-27 10:09:26 +08:00
use agent-evals package
@@ -1,57 +1,56 @@
 # Evals — Agent Guide
 
 This package evaluates whether AI agents correctly implement Supabase tasks
-when using skill documentation. Modeled after
-[Vercel's next-evals-oss](https://github.com/vercel-labs/next-evals-oss): each
-eval is a self-contained project with a task prompt, the agent works on it, and
-hidden tests check the result. Binary pass/fail.
+when using skill documentation. Built on
+[@vercel/agent-eval](https://github.com/vercel/agent-eval): each eval is a
+self-contained scenario with a task prompt, the agent works in a Docker sandbox,
+and hidden vitest assertions check the result. Binary pass/fail.
 
 ## Architecture
 
 ```
-1. Create temp dir with project skeleton (PROMPT.md, supabase/ dir)
-2. Install skills via `skills add` CLI (or skip for baseline)
-3. Run: claude -p "prompt" --cwd /tmp/eval-xxx
-4. Agent reads skill, creates migrations/code in the workspace
-5. Copy hidden EVAL.ts into workspace, run vitest
-6. Capture pass/fail
+1. eval.sh starts Supabase, exports keys
+2. agent-eval reads experiments/experiment.ts
+3. For each scenario:
+   a. setup() resets DB, writes config + skills into Docker sandbox
+   b. Agent (Claude Code) runs PROMPT.md in the sandbox
+   c. EVAL.ts (vitest) asserts against agent output
+4. Results saved to results/experiment/{timestamp}/{scenario}/run-{N}/
+5. Optional: upload.ts pushes results to Braintrust
 ```
 
-The agent is **Claude Code** invoked via `claude -p` (print mode). It operates
-on a real filesystem in a temp directory and can read/write files freely.
+The agent is **Claude Code** running inside a Docker sandbox managed by
+`@vercel/agent-eval`. It operates on a real filesystem and can read/write files
+freely.
 
 **Important**: MCP servers are disabled via `--strict-mcp-config` with an empty
 config. This ensures the agent uses only local tools (Bash, Edit, Write, Read,
 Glob, Grep) and cannot access remote services like Supabase MCP or Neon. All
 work must happen on the local filesystem — e.g., creating migration files in
 `supabase/migrations/`, not applying them to a remote project.
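An empty MCP config of the kind passed alongside `--strict-mcp-config` might look like this (a sketch; the exact filename and wiring live in the experiment setup, and `mcpServers` as the key is an assumption based on Claude Code's usual MCP config shape, not something this diff shows):

```json
{
  "mcpServers": {}
}
```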
 
-## Eval Structure
+## File Structure
 
-Each eval lives in `evals/{scenario-name}/`:
 
 ```
-evals/auth-rls-new-project/
-  PROMPT.md          # Task description (visible to agent)
-  EVAL.ts            # Vitest assertions (hidden from agent during run)
-  package.json       # Minimal project manifest
-  supabase/
-    config.toml      # Pre-initialized supabase config
-    migrations/      # Empty — agent creates files here
+packages/evals/
+  experiments/
+    experiment.ts    # ExperimentConfig — agent, sandbox, setup() hook
+  scripts/
+    eval.sh          # Supabase lifecycle wrapper (start → eval → stop)
+  src/
+    upload.ts        # Standalone Braintrust result uploader
+  evals/
+    eval-utils.ts    # Shared helpers (findMigrationFiles, getMigrationSQL, etc.)
+    {scenario}/
+      PROMPT.md      # Task description (visible to agent)
+      EVAL.ts        # Vitest assertions (hidden from agent during run)
+      meta.ts        # expectedReferenceFiles for scoring
+      package.json   # Minimal manifest with vitest devDep
+  project/
+    supabase/
+      config.toml    # Shared Supabase config seeded into each sandbox
+  scenarios/         # Workflow scenario proposals
+  results/           # Output from eval runs (gitignored)
 ```
 
 **EVAL.ts** is never copied to the workspace until after the agent finishes.
 This prevents the agent from "teaching to the test."
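The "hidden until the end" mechanic can be pictured as a small post-run step. This is an illustrative sketch, not the actual `@vercel/agent-eval` internals; `copyHiddenEval` is a hypothetical helper name:

```typescript
import { copyFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Hypothetical sketch: only after the agent has finished is the scenario's
// hidden EVAL.ts copied into the workspace, where vitest can then run it.
export function copyHiddenEval(scenarioDir: string, workspace: string): string {
  const src = join(scenarioDir, "EVAL.ts");
  if (!existsSync(src)) {
    throw new Error(`No EVAL.ts found in ${scenarioDir}`);
  }
  const dest = join(workspace, "EVAL.ts");
  copyFileSync(src, dest); // the test file becomes visible only now
  return dest;
}
```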
 
 ## Running Evals
 
+Eval tasks in `mise.toml` have `sources` defined, so mise skips them when
+source files haven't changed. Use `--force` to bypass caching when you need
+to re-run evals regardless (e.g., after changing environment variables or
+re-running the same scenario):
+
 ```bash
-# Run all scenarios with skills (default)
+# Run all scenarios with skills
 mise run eval
 
+# Force re-run (bypass source caching)
@@ -66,64 +65,52 @@ EVAL_MODEL=claude-opus-4-6 mise run eval
 
 # Run without skills (baseline)
 EVAL_BASELINE=true mise run eval
 
-# Install only a specific skill
-EVAL_SKILL=supabase mise run eval
+# Dry run (no API calls)
+mise run eval:dry
+
+# Upload results to Braintrust
+mise run eval:upload
+
+# Force upload (bypass cache)
+mise run --force eval:upload
 ```
 
 ## Baseline Mode
 
-Set `EVAL_BASELINE=true` to run scenarios **without** skills. By default,
-scenarios run with skills installed via the `skills` CLI.
+Set `EVAL_BASELINE=true` to run scenarios **without** skills injected. By
+default, skill files from `skills/supabase/` are written into the sandbox.
 
-To compare with-skill vs baseline, run evals twice:
+Compare with-skill vs baseline:
 
 ```bash
 mise run eval                     # with skills
 EVAL_BASELINE=true mise run eval  # without skills (baseline)
 ```
 
 Compare the results to measure how much skills improve agent output.
 
 ## Adding Scenarios
 
-1. Create `evals/{scenario-name}/` with `PROMPT.md`, `EVAL.ts`, `package.json`
-2. Add any starter files the agent should see (e.g., `supabase/config.toml`)
-3. Write vitest assertions in `EVAL.ts` that check the agent's output files
-4. Document the scenario in `scenarios/SCENARIOS.md`
+1. Create `evals/{scenario-name}/` with:
+   - `PROMPT.md` — task description for the agent
+   - `EVAL.ts` — vitest assertions checking agent output
+   - `meta.ts` — export `expectedReferenceFiles` array for scoring
+   - `package.json` — `{ "private": true, "type": "module", "devDependencies": { "vitest": "^2.0.0" } }`
+2. Add any starter files the agent should see (they get copied via `setup()`)
+3. Assertions use helpers from `../eval-utils.ts` (e.g., `findMigrationFiles`, `getMigrationSQL`)
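The shared helpers might look roughly like this. This is a sketch under assumptions: the real `eval-utils.ts` is not part of this diff, and only the helper names and the `supabase/migrations/` convention come from the repo (the `cwd` parameter is added here for testability):

```typescript
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Sketch: locate migration files the agent wrote under supabase/migrations/.
export function findMigrationFiles(cwd: string = process.cwd()): string[] {
  const dir = join(cwd, "supabase", "migrations");
  if (!existsSync(dir)) return [];
  return readdirSync(dir)
    .filter((f) => f.endsWith(".sql"))
    .map((f) => join(dir, f));
}

// Sketch: concatenate all migration SQL so assertions can regex over it.
export function getMigrationSQL(cwd: string = process.cwd()): string {
  return findMigrationFiles(cwd)
    .map((f) => readFileSync(f, "utf-8"))
    .join("\n");
}
```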
 
 ## Environment
 
 ```
-ANTHROPIC_API_KEY=sk-ant-...     # Required: Claude Code authentication
-EVAL_MODEL=...                   # Optional: override model (default: claude-sonnet-4-5-20250929)
-EVAL_SCENARIO=...                # Optional: run single scenario
-EVAL_SKILL=...                   # Optional: install only this skill (e.g., "supabase")
-EVAL_BASELINE=true               # Optional: run without skills (baseline mode)
-BRAINTRUST_UPLOAD=true           # Optional: upload results to Braintrust
-BRAINTRUST_API_KEY=...           # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_PROJECT_ID=...        # Required when BRAINTRUST_UPLOAD=true
-BRAINTRUST_BASE_EXPERIMENT=...   # Optional: compare against a named experiment
+ANTHROPIC_API_KEY=sk-ant-...     # Required: Claude Code authentication
+EVAL_MODEL=...                   # Optional: override model (default: claude-sonnet-4-6)
+EVAL_SCENARIO=...                # Optional: run single scenario
+EVAL_BASELINE=true               # Optional: run without skills
+BRAINTRUST_API_KEY=...           # Required for eval:upload
+BRAINTRUST_PROJECT_ID=...        # Required for eval:upload
 ```
 
-## Key Files
+## Docker Evals
 
-```
-src/
-  runner.ts              # Main orchestrator
-  types.ts               # Core interfaces
-  runner/
-    scaffold.ts          # Creates temp workspace from eval template
-    agent.ts             # Invokes claude -p as subprocess
-    test.ts              # Runs vitest EVAL.ts against workspace
-    results.ts           # Collects results and prints summary
-evals/
-  auth-rls-new-project/  # Scenario 1
-scenarios/
-  SCENARIOS.md           # Scenario descriptions
-```
+Build and run evals inside Docker (e.g., for CI):
 
+```bash
+mise run eval:docker:build   # Build the eval Docker image
+mise run eval:docker         # Run evals in Docker
+mise run eval:docker:shell   # Debug shell in eval container
+```
@@ -1,85 +1,74 @@
-export const expectedReferenceFiles = [
-  "db-schema-auth-fk.md",
-  "db-security-functions.md",
-  "db-rls-mandatory.md",
-  "db-rls-common-mistakes.md",
-];
-
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { expect, test } from "vitest";
+
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
+
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates profiles table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /profiles/.test(sql);
-    },
-  },
-  {
-    name: "FK references auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "ON DELETE CASCADE present",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on profiles",
-    check: () =>
-      /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "trigger function uses SECURITY DEFINER",
-    check: () => /security\s+definer/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger function sets search_path",
-    check: () =>
-      /set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "trigger created on auth.users",
-    check: () =>
-      /create\s+trigger[\s\S]*?on\s+auth\.users/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "policies scoped to authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
-        /security\s+definer/.test(sql),
-        /set\s+search_path\s*=\s*''/.test(sql),
-        /create\s+trigger[\s\S]*?on\s+auth\.users/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates profiles table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /profiles/.test(sql)).toBe(true);
+});
+
+test("FK references auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("ON DELETE CASCADE present", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("RLS enabled on profiles", () => {
+  expect(
+    /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("trigger function uses SECURITY DEFINER", () => {
+  expect(/security\s+definer/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("trigger function sets search_path", () => {
+  expect(
+    /set\s+search_path\s*=\s*''/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("trigger created on auth.users", () => {
+  expect(
+    /create\s+trigger[\s\S]*?on\s+auth\.users/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("policies scoped to authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    /alter\s+table.*profiles.*enable\s+row\s+level\s+security/.test(sql),
+    /security\s+definer/.test(sql),
+    /set\s+search_path\s*=\s*''/.test(sql),
+    /create\s+trigger[\s\S]*?on\s+auth\.users/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/auth-fk-cascade-delete/meta.ts (new file, +6)
@@ -0,0 +1,6 @@
+export const expectedReferenceFiles = [
+  "db-schema-auth-fk.md",
+  "db-security-functions.md",
+  "db-rls-mandatory.md",
+  "db-rls-common-mistakes.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "auth-fk-cascade-delete",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,16 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "db-rls-mandatory.md",
-  "db-rls-policy-types.md",
-  "db-rls-common-mistakes.md",
-  "db-schema-auth-fk.md",
-  "db-schema-timestamps.md",
-  "db-migrations-idempotent.md",
-];
-
 import { existsSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 import {
   anonSeeesNoRows,
@@ -19,132 +9,116 @@ import {
   getSupabaseDir,
   queryTable,
   tableExists,
-} from "./eval-utils.ts";
+} from "../eval-utils.ts";
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "supabase project initialized (config.toml exists)",
-    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
-  },
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates tasks table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /tasks/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on tasks table",
-    check: () =>
-      /alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "has foreign key to auth.users",
-    check: () =>
-      /references\s+auth\.users/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses ON DELETE CASCADE for auth FK",
-    check: () => /on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for time columns",
-    check: () => {
-      const rawSql = getMigrationSQL().toLowerCase();
-      const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
-      if (
-        sql.includes("created_at") ||
-        sql.includes("updated_at") ||
-        sql.includes("due_date")
-      ) {
-        return !hasPlainTimestamp.test(sql);
-      }
-      return true;
-    },
-  },
-  {
-    name: "creates index on user_id column",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+index/.test(sql) && /user_id/.test(sql);
-    },
-  },
-  {
-    name: "does not use SERIAL or BIGSERIAL for primary key",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return !/\bserial\b/.test(sql) && !/\bbigserial\b/.test(sql);
-    },
-  },
-  {
-    name: "migration is idempotent (uses IF NOT EXISTS)",
-    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const signals = [
-        /enable\s+row\s+level\s+security/,
-        /\(select\s+auth\.uid\(\)\)/,
-        /to\s+authenticated/,
-        /on\s+delete\s+cascade/,
-        /create\s+index/,
-      ];
-      return signals.filter((r) => r.test(sql)).length >= 4;
-    },
-  },
-  {
-    name: "tasks table exists in the database after migration",
-    check: () => tableExists("tasks"),
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table is queryable with service role",
-    check: async () => {
-      const { error } = await queryTable("tasks", "service_role");
-      return error === null;
-    },
-    timeout: 10_000,
-  },
-  {
-    name: "tasks table returns no rows for anon (RLS is active)",
-    check: () => anonSeeesNoRows("tasks"),
-    timeout: 10_000,
-  },
-];
+test("supabase project initialized (config.toml exists)", () => {
+  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
+});
+
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
+test("creates tasks table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /tasks/.test(sql)).toBe(true);
+});
+
+test("enables RLS on tasks table", () => {
+  expect(
+    /alter\s+table.*tasks.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("has foreign key to auth.users", () => {
+  expect(/references\s+auth\.users/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses ON DELETE CASCADE for auth FK", () => {
+  expect(/on\s+delete\s+cascade/.test(getMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
+  const sql = getMigrationSQL();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  for (const policy of policyBlocks) {
+    if (
+      policy.includes("auth.uid()") &&
+      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
+    ) {
+      expect(false).toBe(true);
+      return;
+    }
+  }
+  expect(true).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for time columns", () => {
+  const rawSql = getMigrationSQL().toLowerCase();
+  const sql = rawSql.replace(/--[^\n]*/g, "");
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  if (
+    sql.includes("created_at") ||
+    sql.includes("updated_at") ||
+    sql.includes("due_date")
+  ) {
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
+  }
+});
+
+test("creates index on user_id column", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+index/.test(sql) && /user_id/.test(sql)).toBe(true);
+});
+
+test("does not use SERIAL or BIGSERIAL for primary key", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/\bserial\b/.test(sql)).toBe(false);
+  expect(/\bbigserial\b/.test(sql)).toBe(false);
+});
+
+test("migration is idempotent (uses IF NOT EXISTS)", () => {
+  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const signals = [
+    /enable\s+row\s+level\s+security/,
+    /\(select\s+auth\.uid\(\)\)/,
+    /to\s+authenticated/,
+    /on\s+delete\s+cascade/,
+    /create\s+index/,
+  ];
+  expect(signals.filter((r) => r.test(sql)).length >= 4).toBe(true);
+});
+
+test("tasks table exists in the database after migration", async () => {
+  expect(await tableExists("tasks")).toBe(true);
+}, 10_000);
+
+test("tasks table is queryable with service role", async () => {
+  const { error } = await queryTable("tasks", "service_role");
+  expect(error === null).toBe(true);
+}, 10_000);
+
+test("tasks table returns no rows for anon (RLS is active)", async () => {
+  expect(await anonSeeesNoRows("tasks")).toBe(true);
+}, 10_000);
packages/evals/evals/auth-rls-new-project/meta.ts (new file, +9)
@@ -0,0 +1,9 @@
+export const expectedReferenceFiles = [
+  "dev-getting-started.md",
+  "db-rls-mandatory.md",
+  "db-rls-policy-types.md",
+  "db-rls-common-mistakes.md",
+  "db-schema-auth-fk.md",
+  "db-schema-timestamps.md",
+  "db-migrations-idempotent.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "auth-rls-new-project",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,11 +1,6 @@
-export const expectedReferenceFiles = [
-  "dev-getting-started.md",
-  "edge-fun-quickstart.md",
-];
-
 import { readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const cwd = process.cwd();
 
@@ -27,102 +22,93 @@ function getReferenceContent(): string {
   return readFileSync(file, "utf-8");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "CLI_REFERENCE.md exists in project root",
-    check: () => findReferenceFile() !== null,
-  },
-  {
-    name: "no hallucinated functions log command",
-    check: () => {
-      const content = getReferenceContent();
-      return (
-        !/`supabase\s+functions\s+log`/.test(content) &&
-        !/^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) &&
-        !/^\s*supabase\s+functions\s+log\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "no hallucinated db query command",
-    check: () => {
-      const content = getReferenceContent();
-      return (
-        !/`supabase\s+db\s+query`/.test(content) &&
-        !/^\s*npx\s+supabase\s+db\s+query\b/m.test(content) &&
-        !/^\s*supabase\s+db\s+query\b/m.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase functions serve for local development",
-    check: () =>
-      /supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions supabase functions deploy",
-    check: () =>
-      /supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions psql or SQL Editor or connection string for ad-hoc SQL",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return (
-        /\bpsql\b/.test(content) ||
-        /sql\s+editor/.test(content) ||
-        /connection\s+string/.test(content) ||
-        /supabase\s+db\s+dump/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase db push or supabase db reset for migrations",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return (
-        /supabase\s+db\s+push/.test(content) ||
-        /supabase\s+db\s+reset/.test(content)
-      );
-    },
-  },
-  {
-    name: "mentions supabase start for local stack",
-    check: () => /supabase\s+start/.test(getReferenceContent().toLowerCase()),
-  },
-  {
-    name: "mentions Dashboard or Logs Explorer for production log viewing",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      return /\bdashboard\b/.test(content) || /logs\s+explorer/.test(content);
-    },
-  },
-  {
-    name: "overall quality: uses real CLI commands throughout",
-    check: () => {
-      const content = getReferenceContent().toLowerCase();
-      const signals = [
-        /supabase\s+start/,
-        /supabase\s+stop/,
-        /supabase\s+functions\s+serve/,
-        /supabase\s+functions\s+deploy/,
-        /supabase\s+db\s+(push|reset|diff)/,
-        /\bpsql\b|\bsql\s+editor\b|\bconnection\s+string\b/,
-        /\bdashboard\b|\blogs\s+explorer\b/,
-      ];
-      const hallucinations = [
-        /`supabase\s+functions\s+log`/,
-        /^\s*npx\s+supabase\s+functions\s+log\b/m,
-        /^\s*supabase\s+functions\s+log\b/m,
-        /`supabase\s+db\s+query`/,
-        /^\s*npx\s+supabase\s+db\s+query\b/m,
-        /^\s*supabase\s+db\s+query\b/m,
-      ];
-      const positiveMatches = signals.filter((r) => r.test(content)).length;
-      const hallucinationMatches = hallucinations.filter((r) =>
-        r.test(content),
-      ).length;
-      return positiveMatches >= 5 && hallucinationMatches === 0;
-    },
-  },
-];
+test("CLI_REFERENCE.md exists in project root", () => {
+  expect(findReferenceFile() !== null).toBe(true);
+});
+
+test("no hallucinated functions log command", () => {
+  const content = getReferenceContent();
+  expect(
+    /`supabase\s+functions\s+log`/.test(content) ||
+      /^\s*npx\s+supabase\s+functions\s+log\b/m.test(content) ||
+      /^\s*supabase\s+functions\s+log\b/m.test(content),
+  ).toBe(false);
+});
+
+test("no hallucinated db query command", () => {
+  const content = getReferenceContent();
+  expect(
+    /`supabase\s+db\s+query`/.test(content) ||
+      /^\s*npx\s+supabase\s+db\s+query\b/m.test(content) ||
+      /^\s*supabase\s+db\s+query\b/m.test(content),
+  ).toBe(false);
+});
+
+test("mentions supabase functions serve for local development", () => {
+  expect(
+    /supabase\s+functions\s+serve/.test(getReferenceContent().toLowerCase()),
+  ).toBe(true);
+});
+
+test("mentions supabase functions deploy", () => {
+  expect(
+    /supabase\s+functions\s+deploy/.test(getReferenceContent().toLowerCase()),
+  ).toBe(true);
+});
+
+test("mentions psql or SQL Editor or connection string for ad-hoc SQL", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(
+    /\bpsql\b/.test(content) ||
+      /sql\s+editor/.test(content) ||
+      /connection\s+string/.test(content) ||
+      /supabase\s+db\s+dump/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase db push or supabase db reset for migrations", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(
+    /supabase\s+db\s+push/.test(content) ||
+      /supabase\s+db\s+reset/.test(content),
+  ).toBe(true);
+});
+
+test("mentions supabase start for local stack", () => {
+  expect(/supabase\s+start/.test(getReferenceContent().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("mentions Dashboard or Logs Explorer for production log viewing", () => {
+  const content = getReferenceContent().toLowerCase();
+  expect(/\bdashboard\b/.test(content) || /logs\s+explorer/.test(content)).toBe(
+    true,
+  );
+});
+
+test("overall quality: uses real CLI commands throughout", () => {
+  const content = getReferenceContent().toLowerCase();
+  const signals = [
+    /supabase\s+start/,
+    /supabase\s+stop/,
+    /supabase\s+functions\s+serve/,
+    /supabase\s+functions\s+deploy/,
+    /supabase\s+db\s+(push|reset|diff)/,
+    /\bpsql\b|\bsql\s+editor\b|\bconnection\s+string\b/,
+    /\bdashboard\b|\blogs\s+explorer\b/,
+  ];
+  const hallucinations = [
+    /`supabase\s+functions\s+log`/,
+    /^\s*npx\s+supabase\s+functions\s+log\b/m,
+    /^\s*supabase\s+functions\s+log\b/m,
+    /`supabase\s+db\s+query`/,
+    /^\s*npx\s+supabase\s+db\s+query\b/m,
+    /^\s*supabase\s+db\s+query\b/m,
+  ];
+  const positiveMatches = signals.filter((r) => r.test(content)).length;
+  const hallucinationMatches = hallucinations.filter((r) =>
+    r.test(content),
+  ).length;
+  expect(positiveMatches >= 5 && hallucinationMatches === 0).toBe(true);
+});
packages/evals/evals/cli-hallucinated-commands/meta.ts (new file, +4)
@@ -0,0 +1,4 @@
+export const expectedReferenceFiles = [
+  "dev-getting-started.md",
+  "edge-fun-quickstart.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "cli-hallucinated-commands",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,354 +1,322 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-schema-realtime.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
  "realtime-setup-auth.md",
  "realtime-broadcast-database.md",
  "realtime-setup-channels.md",
];
import { expect, test } from "vitest";

import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";

import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
  expect(findMigrationFiles().length > 0).toBe(true);
});

export const assertions: EvalAssertion[] = [
  {
    name: "migration file exists",
    check: () => findMigrationFiles().length > 0,
  },
  {
    name: "creates rooms table",
    check: () =>
      /create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "creates room_members table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+table[\s\S]*?room_members/.test(sql) ||
        /create\s+table[\s\S]*?room_users/.test(sql) ||
        /create\s+table[\s\S]*?memberships/.test(sql)
      );
    },
  },
  {
    name: "creates content table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+table[\s\S]*?content/.test(sql) ||
        /create\s+table[\s\S]*?items/.test(sql) ||
        /create\s+table[\s\S]*?documents/.test(sql) ||
        /create\s+table[\s\S]*?posts/.test(sql) ||
        /create\s+table[\s\S]*?messages/.test(sql)
      );
    },
  },
  {
    name: "room_members has role column with owner/editor/viewer",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /role/.test(sql) &&
        /owner/.test(sql) &&
        /editor/.test(sql) &&
        /viewer/.test(sql)
      );
    },
  },
  {
    name: "enables RLS on all application tables",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const roomsRls =
        /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      const membershipRls =
        /alter\s+table[\s\S]*?room_members[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?room_users[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      const contentRls =
        /alter\s+table[\s\S]*?content[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?items[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?posts[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) ||
        /alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        );
      return roomsRls && membershipRls && contentRls;
    },
  },
  {
    name: "FK to auth.users with ON DELETE CASCADE",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /references\s+auth\.users/.test(sql) &&
        /on\s+delete\s+cascade/.test(sql)
      );
    },
  },
  {
    name: "content has room_id FK referencing rooms",
    check: () =>
      /room_id[\s\S]*?references[\s\S]*?rooms/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      if (policyBlocks.length === 0) return false;
      for (const policy of policyBlocks) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "policies use TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const appPolicies = policyBlocks.filter(
        (p) => !p.includes("realtime.messages"),
      );
      return (
        appPolicies.length > 0 &&
        appPolicies.every((p) => /to\s+authenticated/.test(p))
      );
    },
  },
  {
    name: "private schema with security_definer helper function",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+schema[\s\S]*?private/.test(sql) &&
        /private\./.test(sql) &&
        /security\s+definer/.test(sql) &&
        /set\s+search_path\s*=\s*''/.test(sql)
      );
    },
  },
  {
    name: "role-based write policies: content INSERT/UPDATE restricted to owner or editor",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const writePolicies = policyBlocks.filter(
        (p) =>
          (/for\s+(insert|update|all)/.test(p) || /insert|update/.test(p)) &&
          (p.includes("content") ||
            p.includes("items") ||
            p.includes("documents") ||
            p.includes("posts") ||
            p.includes("messages")),
      );
      return writePolicies.some(
        (p) => p.includes("owner") || p.includes("editor"),
      );
    },
  },
  {
    name: "viewer role is read-only (no write access to content)",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const contentWritePolicies = policyBlocks.filter(
        (p) =>
          /for\s+(insert|update|delete)/.test(p) &&
          (p.includes("content") ||
            p.includes("items") ||
            p.includes("documents") ||
            p.includes("posts") ||
            p.includes("messages")),
      );
      if (contentWritePolicies.length === 0) return true;
      return !contentWritePolicies.some((p) => {
        const mentionsRole =
          p.includes("owner") || p.includes("editor") || p.includes("viewer");
        if (!mentionsRole) return true;
        return (
          p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
        );
      });
    },
  },
  {
    name: "indexes on membership lookup columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/create\s+index/.test(sql)) return false;
      const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
      return (
        indexBlocks.filter(
          (idx) =>
            idx.toLowerCase().includes("user_id") ||
            idx.toLowerCase().includes("room_id"),
        ).length >= 1
      );
    },
  },
  {
    name: "uses timestamptz not plain timestamp",
    check: () => {
      const rawSql = getMigrationSQL().toLowerCase();
      const sql = rawSql.replace(/--[^\n]*/g, "");
      const hasPlainTimestamp =
        /(?:created_at|updated_at|invited_at|joined_at)\s+timestamp(?!\s*tz)(?!\s+with\s+time\s+zone)/;
      if (
        sql.includes("created_at") ||
        sql.includes("updated_at") ||
        sql.includes("_at ")
      ) {
        return !hasPlainTimestamp.test(sql);
      }
      return true;
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "realtime publication enabled for content table",
    check: () =>
      /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "broadcast trigger for content changes",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        (/realtime\.broadcast_changes/.test(sql) ||
          /realtime\.send/.test(sql)) &&
        /create\s+trigger/.test(sql)
      );
    },
  },
  {
    name: "broadcast trigger function uses security definer",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const functionBlocks =
        sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
      const realtimeFunctions = functionBlocks.filter(
        (f) =>
          f.toLowerCase().includes("realtime.broadcast_changes") ||
          f.toLowerCase().includes("realtime.send"),
      );
      if (realtimeFunctions.length === 0) return false;
      return realtimeFunctions.some(
        (f) =>
          /security\s+definer/.test(f.toLowerCase()) &&
          /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
      );
    },
  },
  {
    name: "RLS policies on realtime.messages",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const realtimePolicies = policyBlocks.filter((p) =>
        p.includes("realtime.messages"),
      );
      if (realtimePolicies.length === 0) return false;
      return realtimePolicies.some(
        (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
      );
    },
  },
  {
    name: "realtime policy checks extension column",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const realtimePolicies = policyBlocks.filter((p) =>
        p.includes("realtime.messages"),
      );
      return realtimePolicies.some(
        (p) =>
          p.includes("extension") &&
          (p.includes("broadcast") || p.includes("presence")),
      );
    },
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const signals = [
        /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /alter\s+table[\s\S]*?(room_members|room_users|memberships)[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ),
        /references\s+auth\.users/.test(sql) &&
          /on\s+delete\s+cascade/.test(sql),
        /create\s+schema[\s\S]*?private/.test(sql),
        /security\s+definer/.test(sql) &&
          /set\s+search_path\s*=\s*''/.test(sql),
        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
        policyBlocks.length > 0 &&
          policyBlocks.filter((p) => !p.includes("realtime.messages")).length >
            0 &&
          policyBlocks
            .filter((p) => !p.includes("realtime.messages"))
            .every((p) => /to\s+authenticated/.test(p)),
        /create\s+index/.test(sql),
        /timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
        /if\s+not\s+exists/.test(sql),
        sql.includes("owner") &&
          sql.includes("editor") &&
          sql.includes("viewer"),
        /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
        /realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
        /create\s+trigger/.test(sql),
        policyBlocks.some((p) => p.includes("realtime.messages")),
        policyBlocks
          .filter((p) => p.includes("realtime.messages"))
          .some((p) => p.includes("extension")),
        /room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
      ];
      return signals.filter(Boolean).length >= 13;
    },
  },
];
test("creates rooms table", () => {
  expect(
    /create\s+table[\s\S]*?rooms/.test(getMigrationSQL().toLowerCase()),
  ).toBe(true);
});

test("creates room_members table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+table[\s\S]*?room_members/.test(sql) ||
      /create\s+table[\s\S]*?room_users/.test(sql) ||
      /create\s+table[\s\S]*?memberships/.test(sql),
  ).toBe(true);
});

test("creates content table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+table[\s\S]*?content/.test(sql) ||
      /create\s+table[\s\S]*?items/.test(sql) ||
      /create\s+table[\s\S]*?documents/.test(sql) ||
      /create\s+table[\s\S]*?posts/.test(sql) ||
      /create\s+table[\s\S]*?messages/.test(sql),
  ).toBe(true);
});

test("room_members has role column with owner/editor/viewer", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /role/.test(sql) &&
      /owner/.test(sql) &&
      /editor/.test(sql) &&
      /viewer/.test(sql),
  ).toBe(true);
});

test("enables RLS on all application tables", () => {
  const sql = getMigrationSQL().toLowerCase();
  const roomsRls =
    /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  const membershipRls =
    /alter\s+table[\s\S]*?room_members[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?room_users[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  const contentRls =
    /alter\s+table[\s\S]*?content[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?items[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?posts[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) ||
    /alter\s+table[\s\S]*?messages[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    );
  expect(roomsRls && membershipRls && contentRls).toBe(true);
});

test("FK to auth.users with ON DELETE CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("content has room_id FK referencing rooms", () => {
  expect(
    /room_id[\s\S]*?references[\s\S]*?rooms/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  if (policyBlocks.length === 0) {
    expect(false).toBe(true);
    return;
  }
  for (const policy of policyBlocks) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("policies use TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const appPolicies = policyBlocks.filter(
    (p) => !p.includes("realtime.messages"),
  );
  expect(
    appPolicies.length > 0 &&
      appPolicies.every((p) => /to\s+authenticated/.test(p)),
  ).toBe(true);
});

test("private schema with security_definer helper function", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+schema[\s\S]*?private/.test(sql) &&
      /private\./.test(sql) &&
      /security\s+definer/.test(sql) &&
      /set\s+search_path\s*=\s*''/.test(sql),
  ).toBe(true);
});

test("role-based write policies: content INSERT/UPDATE restricted to owner or editor", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const writePolicies = policyBlocks.filter(
    (p) =>
      (/for\s+(insert|update|all)/.test(p) || /insert|update/.test(p)) &&
      (p.includes("content") ||
        p.includes("items") ||
        p.includes("documents") ||
        p.includes("posts") ||
        p.includes("messages")),
  );
  expect(
    writePolicies.some((p) => p.includes("owner") || p.includes("editor")),
  ).toBe(true);
});

test("viewer role is read-only (no write access to content)", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const contentWritePolicies = policyBlocks.filter(
    (p) =>
      /for\s+(insert|update|delete)/.test(p) &&
      (p.includes("content") ||
        p.includes("items") ||
        p.includes("documents") ||
        p.includes("posts") ||
        p.includes("messages")),
  );
  if (contentWritePolicies.length === 0) {
    expect(true).toBe(true);
    return;
  }
  const result = !contentWritePolicies.some((p) => {
    const mentionsRole =
      p.includes("owner") || p.includes("editor") || p.includes("viewer");
    if (!mentionsRole) return true;
    return (
      p.includes("viewer") && !p.includes("owner") && !p.includes("editor")
    );
  });
  expect(result).toBe(true);
});

test("indexes on membership lookup columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/create\s+index/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
  expect(
    indexBlocks.filter(
      (idx) =>
        idx.toLowerCase().includes("user_id") ||
        idx.toLowerCase().includes("room_id"),
    ).length >= 1,
  ).toBe(true);
});

test("uses timestamptz not plain timestamp", () => {
  const rawSql = getMigrationSQL().toLowerCase();
  const sql = rawSql.replace(/--[^\n]*/g, "");
  const hasPlainTimestamp =
    /(?:created_at|updated_at|invited_at|joined_at)\s+timestamp(?!\s*tz)(?!\s+with\s+time\s+zone)/;
  if (
    sql.includes("created_at") ||
    sql.includes("updated_at") ||
    sql.includes("_at ")
  ) {
    expect(hasPlainTimestamp.test(sql)).toBe(false);
  } else {
    expect(true).toBe(true);
  }
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("realtime publication enabled for content table", () => {
  expect(
    /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("broadcast trigger for content changes", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    (/realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql)) &&
      /create\s+trigger/.test(sql),
  ).toBe(true);
});

test("broadcast trigger function uses security definer", () => {
  const sql = getMigrationSQL().toLowerCase();
  const functionBlocks =
    sql.match(/create[\s\S]*?function[\s\S]*?\$\$[\s\S]*?\$\$/gi) ?? [];
  const realtimeFunctions = functionBlocks.filter(
    (f) =>
      f.toLowerCase().includes("realtime.broadcast_changes") ||
      f.toLowerCase().includes("realtime.send"),
  );
  if (realtimeFunctions.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    realtimeFunctions.some(
      (f) =>
        /security\s+definer/.test(f.toLowerCase()) &&
        /set\s+search_path\s*=\s*''/.test(f.toLowerCase()),
    ),
  ).toBe(true);
});

test("RLS policies on realtime.messages", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const realtimePolicies = policyBlocks.filter((p) =>
    p.includes("realtime.messages"),
  );
  if (realtimePolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    realtimePolicies.some(
      (p) => /to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p),
    ),
  ).toBe(true);
});

test("realtime policy checks extension column", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const realtimePolicies = policyBlocks.filter((p) =>
    p.includes("realtime.messages"),
  );
  expect(
    realtimePolicies.some(
      (p) =>
        p.includes("extension") &&
        (p.includes("broadcast") || p.includes("presence")),
    ),
  ).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const signals = [
    /alter\s+table[\s\S]*?rooms[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /alter\s+table[\s\S]*?(room_members|room_users|memberships)[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /alter\s+table[\s\S]*?(content|items|documents|posts|messages)[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ),
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
    /create\s+schema[\s\S]*?private/.test(sql),
    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
    policyBlocks.length > 0 &&
      policyBlocks.filter((p) => !p.includes("realtime.messages")).length > 0 &&
      policyBlocks
        .filter((p) => !p.includes("realtime.messages"))
        .every((p) => /to\s+authenticated/.test(p)),
    /create\s+index/.test(sql),
    /timestamptz/.test(sql) || /timestamp\s+with\s+time\s+zone/.test(sql),
    /if\s+not\s+exists/.test(sql),
    sql.includes("owner") && sql.includes("editor") && sql.includes("viewer"),
    /alter\s+publication\s+supabase_realtime\s+add\s+table/.test(sql),
    /realtime\.broadcast_changes/.test(sql) || /realtime\.send/.test(sql),
    /create\s+trigger/.test(sql),
    policyBlocks.some((p) => p.includes("realtime.messages")),
    policyBlocks
      .filter((p) => p.includes("realtime.messages"))
      .some((p) => p.includes("extension")),
    /room_id[\s\S]*?references[\s\S]*?rooms/.test(sql),
  ];
  expect(signals.filter(Boolean).length >= 13).toBe(true);
});

14
packages/evals/evals/collaborative-rooms-realtime/meta.ts
Normal file
@@ -0,0 +1,14 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-schema-realtime.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
  "realtime-setup-auth.md",
  "realtime-broadcast-database.md",
  "realtime-setup-channels.md",
];
@@ -1,5 +1,8 @@
{
  "name": "collaborative-rooms-realtime",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,12 +1,6 @@
export const expectedReferenceFiles = [
  "db-conn-pooling.md",
  "db-migrations-idempotent.md",
  "db-schema-auth-fk.md",
];

import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

const cwd = process.cwd();

@@ -65,70 +59,60 @@ function getAllOutputContent(): string {
  return parts.join("\n");
}

export const assertions: EvalAssertion[] = [
  {
    name: "prisma schema file exists",
    check: () => findPrismaSchema() !== null,
  },
  {
    name: "prisma schema references pooler port 6543",
    check: () => /6543/.test(getAllOutputContent()),
  },
  {
    name: "pgbouncer=true param present",
    check: () =>
      /pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase()),
  },
  {
    name: "DIRECT_URL provided for migrations",
    check: () => {
      const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
      return /directurl/.test(allContent) || /direct_url/.test(allContent);
    },
  },
  {
    name: "datasource block references directUrl or DIRECT_URL env var",
    check: () => {
      const schema = getPrismaSchema().toLowerCase();
      const datasourceBlock =
        schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
      return (
        /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock)
      );
    },
  },
  {
    name: "connection limit set to 1 for serverless",
    check: () => {
      const content = getAllOutputContent().toLowerCase();
      return (
        /connection_limit\s*=\s*1/.test(content) ||
        /connection_limit:\s*1/.test(content) ||
        /connectionlimit\s*=\s*1/.test(content)
      );
    },
  },
  {
    name: "explanation distinguishes port 6543 vs 5432",
    check: () => {
      const content = getAllOutputContent();
      return /6543/.test(content) && /5432/.test(content);
    },
  },
  {
    name: "overall quality: demonstrates correct Prisma + Supabase pooler setup",
    check: () => {
      const schema = getPrismaSchema().toLowerCase();
      const envContent = getAllEnvContent().toLowerCase();
      const allContent = `${schema}\n${envContent}`;
      const signals = [
        /6543/,
        /pgbouncer\s*=\s*true/,
        /directurl|direct_url/,
        /connection_limit\s*=\s*1|connection_limit:\s*1/,
        /5432/,
      ];
      return signals.filter((r) => r.test(allContent)).length >= 4;
    },
  },
];
test("prisma schema file exists", () => {
  expect(findPrismaSchema() !== null).toBe(true);
});

test("prisma schema references pooler port 6543", () => {
  expect(/6543/.test(getAllOutputContent())).toBe(true);
});

test("pgbouncer=true param present", () => {
  expect(/pgbouncer\s*=\s*true/.test(getAllOutputContent().toLowerCase())).toBe(
    true,
  );
});

test("DIRECT_URL provided for migrations", () => {
  const allContent = `${getPrismaSchema().toLowerCase()}\n${getAllEnvContent().toLowerCase()}`;
  expect(/directurl/.test(allContent) || /direct_url/.test(allContent)).toBe(
    true,
  );
});

test("datasource block references directUrl or DIRECT_URL env var", () => {
  const schema = getPrismaSchema().toLowerCase();
  const datasourceBlock =
    schema.match(/datasource\s+\w+\s*\{[\s\S]*?\}/)?.[0] ?? "";
  expect(
    /directurl/.test(datasourceBlock) || /direct_url/.test(datasourceBlock),
  ).toBe(true);
});

test("connection limit set to 1 for serverless", () => {
  const content = getAllOutputContent().toLowerCase();
  expect(
    /connection_limit\s*=\s*1/.test(content) ||
      /connection_limit:\s*1/.test(content) ||
      /connectionlimit\s*=\s*1/.test(content),
  ).toBe(true);
});

test("explanation distinguishes port 6543 vs 5432", () => {
  const content = getAllOutputContent();
  expect(/6543/.test(content) && /5432/.test(content)).toBe(true);
});

test("overall quality: demonstrates correct Prisma + Supabase pooler setup", () => {
  const schema = getPrismaSchema().toLowerCase();
  const envContent = getAllEnvContent().toLowerCase();
  const allContent = `${schema}\n${envContent}`;
  const signals = [
    /6543/,
    /pgbouncer\s*=\s*true/,
    /directurl|direct_url/,
    /connection_limit\s*=\s*1|connection_limit:\s*1/,
    /5432/,
  ];
  expect(signals.filter((r) => r.test(allContent)).length >= 4).toBe(true);
});

5
packages/evals/evals/connection-pooling-prisma/meta.ts
Normal file
@@ -0,0 +1,5 @@
export const expectedReferenceFiles = [
  "db-conn-pooling.md",
  "db-migrations-idempotent.md",
  "db-schema-auth-fk.md",
];
@@ -1,5 +1,8 @@
{
  "name": "connection-pooling-prisma",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,14 +1,6 @@
export const expectedReferenceFiles = [
  "edge-fun-quickstart.md",
  "edge-fun-project-structure.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
  "dev-getting-started.md",
];

import { existsSync, readdirSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

import {
  findFunctionFile,
@@ -17,7 +9,7 @@ import {
  getFunctionsDir,
  getSharedCode,
  getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";

const FUNCTION_NAME = "hello-world";

@@ -33,125 +25,113 @@ function getCatchBlockCode(): string {
  return code.slice(catchIndex);
}

export const assertions: EvalAssertion[] = [
  {
    name: "supabase project initialized",
    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
  },
  {
    name: "function directory exists",
    check: () => existsSync(join(getFunctionsDir(), FUNCTION_NAME)),
  },
  {
    name: "function index file exists",
    check: () => findFunctionFile(FUNCTION_NAME) !== null,
  },
  {
    name: "uses Deno.serve",
    check: () => /Deno\.serve/.test(getFunctionCode(FUNCTION_NAME)),
  },
  {
    name: "returns JSON response",
    check: () => {
      const allCode = getAllCode();
      return (
        /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
        /Response\.json/i.test(allCode) ||
        /JSON\.stringify/i.test(allCode)
      );
    },
  },
  {
    name: "handles OPTIONS preflight",
    check: () => {
      const allCode = getAllCode();
      return /['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode);
    },
  },
  {
    name: "defines CORS headers",
    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
  },
  {
    name: "CORS allows required headers",
    check: () => {
      const allCode = getAllCode().toLowerCase();
      return (
        /access-control-allow-headers/.test(allCode) &&
        /authorization/.test(allCode) &&
        /apikey/.test(allCode)
      );
    },
  },
  {
    name: "error response has CORS headers",
    check: () => {
      const catchCode = getCatchBlockCode();
      if (catchCode.length === 0) return false;
      const sharedCode = getSharedCode();
      const directCors =
        /corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
      const callsSharedHelper =
        /errorResponse|jsonResponse|json_response|error_response/i.test(
          catchCode,
        ) && /Access-Control-Allow-Origin/i.test(sharedCode);
      return directCors || callsSharedHelper;
    },
  },
  {
    name: "has try-catch for error handling",
    check: () => {
      const code = getFunctionCode(FUNCTION_NAME);
      return /\btry\s*\{/.test(code) && /\bcatch\b/.test(code);
    },
  },
  {
    name: "returns proper error status code",
    check: () => {
      const catchCode = getCatchBlockCode();
      if (catchCode.length === 0) return false;
      return (
        /status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
        /[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode)
      );
    },
  },
  {
    name: "shared CORS module exists",
    check: () => findSharedCorsFile() !== null,
  },
  {
    name: "function imports from shared",
    check: () =>
      /from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
  },
  {
    name: "function uses hyphenated name",
    check: () => {
      const dirs = existsSync(getFunctionsDir())
        ? readdirSync(getFunctionsDir())
        : [];
      const helloDir = dirs.find(
        (d) => d.includes("hello") && d.includes("world"),
      );
      return helloDir !== undefined && /^hello-world$/.test(helloDir);
    },
  },
  {
    name: "overall quality: demonstrates Edge Function best practices",
    check: () => {
      const allCode = getAllCode().toLowerCase();
      const signals = [
        /deno\.serve/,
        /['"]options['"]/,
        /access-control-allow-origin/,
        /\btry\s*\{/,
        /status:\s*(400|500|4\d{2}|5\d{2})|[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/,
        /from\s+['"]\.\.\/(_shared|_utils)/,
|
||||
/authorization/,
|
||||
/apikey/,
|
||||
];
|
||||
return signals.filter((r) => r.test(allCode)).length >= 6;
|
||||
},
|
||||
},
|
||||
];
|
||||
test("supabase project initialized", () => {
|
||||
expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
|
||||
});
|
||||
|
||||
test("function directory exists", () => {
|
||||
expect(existsSync(join(getFunctionsDir(), FUNCTION_NAME))).toBe(true);
|
||||
});
|
||||
|
||||
test("function index file exists", () => {
|
||||
expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
|
||||
});
|
||||
|
||||
test("uses Deno.serve", () => {
|
||||
expect(/Deno\.serve/.test(getFunctionCode(FUNCTION_NAME))).toBe(true);
|
||||
});
|
||||
|
||||
test("returns JSON response", () => {
|
||||
const allCode = getAllCode();
|
||||
expect(
|
||||
/content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
|
||||
/Response\.json/i.test(allCode) ||
|
||||
/JSON\.stringify/i.test(allCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("handles OPTIONS preflight", () => {
|
||||
const allCode = getAllCode();
|
||||
expect(/['"]OPTIONS['"]/.test(allCode) && /\.method/.test(allCode)).toBe(
|
||||
true,
|
||||
);
|
||||
});
|
||||
|
||||
test("defines CORS headers", () => {
|
||||
expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
|
||||
});
|
||||
|
||||
test("CORS allows required headers", () => {
|
||||
const allCode = getAllCode().toLowerCase();
|
||||
expect(
|
||||
/access-control-allow-headers/.test(allCode) &&
|
||||
/authorization/.test(allCode) &&
|
||||
/apikey/.test(allCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("error response has CORS headers", () => {
|
||||
const catchCode = getCatchBlockCode();
|
||||
if (catchCode.length === 0) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
const sharedCode = getSharedCode();
|
||||
const directCors =
|
||||
/corsHeaders|cors_headers|Access-Control-Allow-Origin/i.test(catchCode);
|
||||
const callsSharedHelper =
|
||||
/errorResponse|jsonResponse|json_response|error_response/i.test(
|
||||
catchCode,
|
||||
) && /Access-Control-Allow-Origin/i.test(sharedCode);
|
||||
expect(directCors || callsSharedHelper).toBe(true);
|
||||
});
|
||||
|
||||
test("has try-catch for error handling", () => {
|
||||
const code = getFunctionCode(FUNCTION_NAME);
|
||||
expect(/\btry\s*\{/.test(code) && /\bcatch\b/.test(code)).toBe(true);
|
||||
});
|
||||
|
||||
test("returns proper error status code", () => {
|
||||
const catchCode = getCatchBlockCode();
|
||||
if (catchCode.length === 0) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
expect(
|
||||
/status:\s*(400|500|4\d{2}|5\d{2})/.test(catchCode) ||
|
||||
/[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/.test(catchCode),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("shared CORS module exists", () => {
|
||||
expect(findSharedCorsFile() !== null).toBe(true);
|
||||
});
|
||||
|
||||
test("function imports from shared", () => {
|
||||
expect(
|
||||
/from\s+['"]\.\.\/(_shared|_utils)/.test(getFunctionCode(FUNCTION_NAME)),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("function uses hyphenated name", () => {
|
||||
const dirs = existsSync(getFunctionsDir())
|
||||
? readdirSync(getFunctionsDir())
|
||||
: [];
|
||||
const helloDir = dirs.find((d) => d.includes("hello") && d.includes("world"));
|
||||
expect(helloDir !== undefined && /^hello-world$/.test(helloDir)).toBe(true);
|
||||
});
|
||||
|
||||
test("overall quality: demonstrates Edge Function best practices", () => {
|
||||
const allCode = getAllCode().toLowerCase();
|
||||
const signals = [
|
||||
/deno\.serve/,
|
||||
/['"]options['"]/,
|
||||
/access-control-allow-origin/,
|
||||
/\btry\s*\{/,
|
||||
/status:\s*(400|500|4\d{2}|5\d{2})|[,(]\s*(400|500|4\d{2}|5\d{2})\s*[),]/,
|
||||
/from\s+['"]\.\.\/(_shared|_utils)/,
|
||||
/authorization/,
|
||||
/apikey/,
|
||||
];
|
||||
expect(signals.filter((r) => r.test(allCode)).length >= 6).toBe(true);
|
||||
});
|
||||
|
||||
packages/evals/evals/edge-function-hello-world/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "edge-fun-quickstart.md",
+  "edge-fun-project-structure.md",
+  "edge-pat-cors.md",
+  "edge-pat-error-handling.md",
+  "dev-getting-started.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "edge-function-hello-world",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,100 +1,84 @@
-export const expectedReferenceFiles = [
-  "db-schema-extensions.md",
-  "db-rls-mandatory.md",
-  "db-migrations-idempotent.md",
-  "db-schema-auth-fk.md",
-  "db-rls-common-mistakes.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "extension installed in extensions schema",
-    check: () =>
-      /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "IF NOT EXISTS on extension creation",
-    check: () =>
-      /create\s+extension\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "vector column with correct dimensions",
-    check: () =>
-      /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "HNSW index used instead of IVFFlat",
-    check: () => /using\s+hnsw/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
-      /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "idempotent table creation (IF NOT EXISTS)",
-    check: () =>
-      /create\s+table\s+if\s+not\s+exists/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "overall quality: demonstrates pgvector best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(sql),
-        /create\s+extension\s+if\s+not\s+exists/.test(sql),
-        /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(sql),
-        /using\s+hnsw/.test(sql),
-        /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
-          sql,
-        ),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /if\s+not\s+exists/.test(sql),
-      ];
-      return signals.filter(Boolean).length >= 6;
-    },
-  },
-];
+test("extension installed in extensions schema", () => {
+  expect(
+    /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("IF NOT EXISTS on extension creation", () => {
+  expect(
+    /create\s+extension\s+if\s+not\s+exists/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("vector column with correct dimensions", () => {
+  expect(
+    /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("HNSW index used instead of IVFFlat", () => {
+  expect(/using\s+hnsw/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
+    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("idempotent table creation (IF NOT EXISTS)", () => {
+  expect(
+    /create\s+table\s+if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates pgvector best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /create\s+extension[\s\S]*?with\s+schema\s+extensions/.test(sql),
+    /create\s+extension\s+if\s+not\s+exists/.test(sql),
+    /(?:extensions\.)?vector\s*\(\s*1536\s*\)/.test(sql),
+    /using\s+hnsw/.test(sql),
+    /alter\s+table[\s\S]*?documents[\s\S]*?enable\s+row\s+level\s+security/.test(
+      sql,
+    ),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+    /if\s+not\s+exists/.test(sql),
+  ];
+  expect(signals.filter(Boolean).length >= 6).toBe(true);
+});
packages/evals/evals/extension-wrong-schema/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-schema-extensions.md",
+  "db-rls-mandatory.md",
+  "db-migrations-idempotent.md",
+  "db-schema-auth-fk.md",
+  "db-rls-common-mistakes.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "extension-wrong-schema",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,14 +1,6 @@
-export const expectedReferenceFiles = [
-  "db-rls-views.md",
-  "db-migrations-idempotent.md",
-  "db-rls-mandatory.md",
-  "db-rls-performance.md",
-  "db-schema-timestamps.md",
-];
-
 import { existsSync, readdirSync, readFileSync } from "node:fs";
 import { join } from "node:path";
-import type { EvalAssertion } from "../../src/eval-types.js";
+import { expect, test } from "vitest";
 
 const migrationsDir = join(process.cwd(), "supabase", "migrations");
 const STARTER_MIGRATION = "20240101000000_create_products.sql";
@@ -29,86 +21,83 @@ function getAgentMigrationSQL(): string {
   return files.map((f) => readFileSync(f, "utf-8")).join("\n");
 }
 
-export const assertions: EvalAssertion[] = [
-  {
-    name: "new migration file exists",
-    check: () => findAgentMigrationFiles().length > 0,
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for description",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+description/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "ADD COLUMN IF NOT EXISTS for published_at",
-    check: () =>
-      /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "published_at uses timestamptz not plain timestamp",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      return (
-        /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
-          sql,
-        ) &&
-        !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
-          sql,
-        )
-      );
-    },
-  },
-  {
-    name: "view public_products is created",
-    check: () =>
-      /create\s+(or\s+replace\s+)?view\s+public_products/.test(
-        getAgentMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "view uses security_invoker = true",
-    check: () =>
-      /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "SELECT policy on products for authenticated role",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some(
-        (p) =>
-          p.includes("select") &&
-          p.includes("products") &&
-          /to\s+authenticated/.test(p),
-      );
-    },
-  },
-  {
-    name: "NOTIFY pgrst reload schema is present",
-    check: () => /notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "overall quality: demonstrates PostgREST and schema best practices",
-    check: () => {
-      const sql = getAgentMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /add\s+column\s+if\s+not\s+exists/.test(sql),
-        /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
-          sql,
-        ),
-        /create\s+(or\s+replace\s+)?view\s+public_products/.test(sql),
-        /security_invoker\s*=\s*true/.test(sql),
-        policyBlocks.some(
-          (p) => p.includes("select") && /to\s+authenticated/.test(p),
-        ),
-        /notify\s+pgrst/.test(sql),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("new migration file exists", () => {
+  expect(findAgentMigrationFiles().length > 0).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for description", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+description/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("ADD COLUMN IF NOT EXISTS for published_at", () => {
+  expect(
+    /add\s+column\s+if\s+not\s+exists\s+published_at/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("published_at uses timestamptz not plain timestamp", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  expect(
+    /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
+      sql,
+    ) &&
+      !/published_at\s+timestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(sql),
+  ).toBe(true);
+});
+
+test("view public_products is created", () => {
+  expect(
+    /create\s+(or\s+replace\s+)?view\s+public_products/.test(
+      getAgentMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("view uses security_invoker = true", () => {
+  expect(
+    /security_invoker\s*=\s*true/.test(getAgentMigrationSQL().toLowerCase()),
+  ).toBe(true);
+});
+
+test("SELECT policy on products for authenticated role", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.some(
+      (p) =>
+        p.includes("select") &&
+        p.includes("products") &&
+        /to\s+authenticated/.test(p),
+    ),
+  ).toBe(true);
+});
+
+test("NOTIFY pgrst reload schema is present", () => {
+  expect(/notify\s+pgrst/.test(getAgentMigrationSQL().toLowerCase())).toBe(
+    true,
+  );
+});
+
+test("overall quality: demonstrates PostgREST and schema best practices", () => {
+  const sql = getAgentMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /add\s+column\s+if\s+not\s+exists/.test(sql),
+    /published_at\s+timestamptz|published_at\s+timestamp\s+with\s+time\s+zone/.test(
+      sql,
+    ),
+    /create\s+(or\s+replace\s+)?view\s+public_products/.test(sql),
+    /security_invoker\s*=\s*true/.test(sql),
+    policyBlocks.some(
+      (p) => p.includes("select") && /to\s+authenticated/.test(p),
+    ),
+    /notify\s+pgrst/.test(sql),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/postgrest-schema-cache/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-rls-views.md",
+  "db-migrations-idempotent.md",
+  "db-rls-mandatory.md",
+  "db-rls-performance.md",
+  "db-schema-timestamps.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "postgrest-schema-cache",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,122 +1,97 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-timestamps.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates orders table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /orders/.test(sql);
-    },
-  },
-  {
-    name: "enables RLS on orders table",
-    check: () =>
-      /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "has SELECT policy on orders",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.some((p) => p.includes("for select"));
-    },
-  },
-  {
-    name: "has UPDATE policy with WITH CHECK on orders",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
-      return updatePolicy !== undefined && /with\s+check/.test(updatePolicy);
-    },
-  },
-  {
-    name: "all policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) not bare auth.uid() in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "uses timestamptz not plain timestamp for created_at",
-    check: () => {
-      const rawSql = getMigrationSQL().toLowerCase();
-      const sql = rawSql.replace(/--[^\n]*/g, "");
-      const hasPlainTimestamp =
-        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
-      if (sql.includes("created_at")) {
-        return !hasPlainTimestamp.test(sql);
-      }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(sql),
-        policyBlocks.some((p) => p.includes("for select")),
-        policyBlocks.some(
-          (p) => p.includes("for update") && /with\s+check/.test(p),
-        ),
-        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
-          sql.replace(/--[^\n]*/g, ""),
-        ),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates orders table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /orders/.test(sql)).toBe(true);
+});
+
+test("enables RLS on orders table", () => {
+  expect(
+    /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("has SELECT policy on orders", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(policyBlocks.some((p) => p.includes("for select"))).toBe(true);
+});
+
+test("has UPDATE policy with WITH CHECK on orders", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const updatePolicy = policyBlocks.find((p) => p.includes("for update"));
+  expect(updatePolicy !== undefined && /with\s+check/.test(updatePolicy)).toBe(
+    true,
+  );
+});
+
+test("all policies use TO authenticated", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  expect(
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+  ).toBe(true);
+});
+
+test("uses (select auth.uid()) not bare auth.uid() in policies", () => {
+  const sql = getMigrationSQL();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  for (const policy of policyBlocks) {
+    if (
+      policy.includes("auth.uid()") &&
+      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
+    ) {
+      expect(false).toBe(true);
+      return;
+    }
+  }
+  expect(true).toBe(true);
+});
+
+test("uses timestamptz not plain timestamp for created_at", () => {
+  const rawSql = getMigrationSQL().toLowerCase();
+  const sql = rawSql.replace(/--[^\n]*/g, "");
+  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
+  if (sql.includes("created_at")) {
+    expect(hasPlainTimestamp.test(sql)).toBe(false);
+  } else {
+    expect(true).toBe(true);
+  }
+});
+
+test("FK to auth.users with ON DELETE CASCADE", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+  ).toBe(true);
+});
+
+test("overall quality: demonstrates Supabase best practices", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
+  const signals = [
+    /alter\s+table.*orders.*enable\s+row\s+level\s+security/.test(sql),
+    policyBlocks.some((p) => p.includes("for select")),
+    policyBlocks.some(
+      (p) => p.includes("for update") && /with\s+check/.test(p),
+    ),
+    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
+    policyBlocks.length > 0 &&
+      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
+    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
+    !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
+      sql.replace(/--[^\n]*/g, ""),
+    ),
+  ];
+  expect(signals.filter(Boolean).length >= 5).toBe(true);
+});
packages/evals/evals/rls-update-needs-select/meta.ts (new file, 7 lines)
@@ -0,0 +1,7 @@
+export const expectedReferenceFiles = [
+  "db-rls-common-mistakes.md",
+  "db-rls-policy-types.md",
+  "db-rls-performance.md",
+  "db-rls-mandatory.md",
+  "db-schema-timestamps.md",
+];
@@ -1,5 +1,8 @@
 {
   "name": "rls-update-needs-select",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }
@@ -1,123 +1,92 @@
-export const expectedReferenceFiles = [
-  "db-rls-common-mistakes.md",
-  "db-rls-policy-types.md",
-  "db-rls-performance.md",
-  "db-rls-mandatory.md",
-  "db-schema-auth-fk.md",
-];
+import { expect, test } from "vitest";
 
-import type { EvalAssertion } from "../../src/eval-types.js";
-import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
+import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
 
+test("migration file exists in supabase/migrations/", () => {
+  expect(findMigrationFiles().length > 0).toBe(true);
+});
+
-export const assertions: EvalAssertion[] = [
-  {
-    name: "migration file exists in supabase/migrations/",
-    check: () => findMigrationFiles().length > 0,
-  },
-  {
-    name: "creates documents table",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return /create\s+table/.test(sql) && /documents/.test(sql);
-    },
-  },
-  {
-    name: "RLS enabled on documents table",
-    check: () =>
-      /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
-        getMigrationSQL().toLowerCase(),
-      ),
-  },
-  {
-    name: "uses app_metadata not user_metadata for role check",
-    check: () => /app_metadata/.test(getMigrationSQL().toLowerCase()),
-  },
-  {
-    name: "user_metadata does not appear in policy USING clauses",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return policyBlocks.every((p) => !p.includes("user_metadata"));
-    },
-  },
-  {
-    name: "has at least two SELECT policies (owner and admin)",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const hasOwnerPolicy = policyBlocks.some(
-        (p) =>
-          (p.includes("select") || !p.includes("insert")) &&
-          (p.includes("user_id") ||
-            p.includes("owner") ||
-            p.includes("auth.uid")),
-      );
-      const hasAdminPolicy = policyBlocks.some((p) =>
-        p.includes("app_metadata"),
-      );
-      return hasOwnerPolicy && hasAdminPolicy;
-    },
-  },
-  {
-    name: "policies use TO authenticated",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      return (
-        policyBlocks.length > 0 &&
-        policyBlocks.every((p) => /to\s+authenticated/.test(p))
-      );
-    },
-  },
-  {
-    name: "uses (select auth.uid()) subselect form in policies",
-    check: () => {
-      const sql = getMigrationSQL();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      for (const policy of policyBlocks) {
-        if (
-          policy.includes("auth.uid()") &&
-          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
-        ) {
-          return false;
-        }
-      }
-      return true;
-    },
-  },
-  {
-    name: "FK to auth.users with ON DELETE CASCADE",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      return (
-        /references\s+auth\.users/.test(sql) &&
-        /on\s+delete\s+cascade/.test(sql)
-      );
-    },
-  },
-  {
-    name: "overall quality: demonstrates Supabase best practices",
-    check: () => {
-      const sql = getMigrationSQL().toLowerCase();
-      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
-      const signals = [
-        /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(sql),
-        /app_metadata/.test(sql),
-        policyBlocks.every((p) => !p.includes("user_metadata")),
-        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
-        policyBlocks.length > 0 &&
-          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
-        /references\s+auth\.users/.test(sql) &&
-          /on\s+delete\s+cascade/.test(sql),
-        policyBlocks.some(
-          (p) =>
-            p.includes("user_id") ||
-            p.includes("owner") ||
-            p.includes("auth.uid"),
-        ) && policyBlocks.some((p) => p.includes("app_metadata")),
-      ];
-      return signals.filter(Boolean).length >= 5;
-    },
-  },
-];
+test("creates documents table", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  expect(/create\s+table/.test(sql) && /documents/.test(sql)).toBe(true);
+});
+
+test("RLS enabled on documents table", () => {
+  expect(
+    /alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(
+      getMigrationSQL().toLowerCase(),
+    ),
+  ).toBe(true);
+});
+
+test("uses app_metadata not user_metadata for role check", () => {
+  expect(/app_metadata/.test(getMigrationSQL().toLowerCase())).toBe(true);
+});
+
+test("user_metadata does not appear in policy USING clauses", () => {
+  const sql = getMigrationSQL().toLowerCase();
+  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
expect(policyBlocks.every((p) => !p.includes("user_metadata"))).toBe(true);
|
||||
});
|
||||
|
||||
test("has at least two SELECT policies (owner and admin)", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
const hasOwnerPolicy = policyBlocks.some(
|
||||
(p) =>
|
||||
(p.includes("select") || !p.includes("insert")) &&
|
||||
(p.includes("user_id") || p.includes("owner") || p.includes("auth.uid")),
|
||||
);
|
||||
const hasAdminPolicy = policyBlocks.some((p) => p.includes("app_metadata"));
|
||||
expect(hasOwnerPolicy && hasAdminPolicy).toBe(true);
|
||||
});
|
||||
|
||||
test("policies use TO authenticated", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
expect(
|
||||
policyBlocks.length > 0 &&
|
||||
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("uses (select auth.uid()) subselect form in policies", () => {
|
||||
const sql = getMigrationSQL();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
for (const policy of policyBlocks) {
|
||||
if (
|
||||
policy.includes("auth.uid()") &&
|
||||
!/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
|
||||
) {
|
||||
expect(false).toBe(true);
|
||||
return;
|
||||
}
|
||||
}
|
||||
expect(true).toBe(true);
|
||||
});
|
||||
|
||||
test("FK to auth.users with ON DELETE CASCADE", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
expect(
|
||||
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("overall quality: demonstrates Supabase best practices", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
|
||||
const signals = [
|
||||
/alter\s+table.*documents.*enable\s+row\s+level\s+security/.test(sql),
|
||||
/app_metadata/.test(sql),
|
||||
policyBlocks.every((p) => !p.includes("user_metadata")),
|
||||
/\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
|
||||
policyBlocks.length > 0 &&
|
||||
policyBlocks.every((p) => /to\s+authenticated/.test(p)),
|
||||
/references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
|
||||
policyBlocks.some(
|
||||
(p) =>
|
||||
p.includes("user_id") || p.includes("owner") || p.includes("auth.uid"),
|
||||
) && policyBlocks.some((p) => p.includes("app_metadata")),
|
||||
];
|
||||
expect(signals.filter(Boolean).length >= 5).toBe(true);
|
||||
});
|
||||
|
||||
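The tests above repeat one extraction idiom: split the migration SQL into individual `CREATE POLICY` statements with a non-greedy regex, then inspect each block. A standalone sketch of that idiom (illustrative only; `extractPolicyBlocks` and the sample SQL are not part of the repository):

```typescript
// Illustrative helper, not repository code: the same regex the eval tests use
// inline, extracted into a function for demonstration.
function extractPolicyBlocks(sql: string): string[] {
  // Each block runs from "create policy" to its terminating semicolon.
  return sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
}

// Hypothetical migration SQL shaped like the output these evals grade.
const sample = `
create policy "owner read" on documents for select to authenticated
  using ((select auth.uid()) = user_id);
create policy "admin read" on documents for select to authenticated
  using ((auth.jwt() -> 'app_metadata' ->> 'role') = 'admin');
`;

console.log(extractPolicyBlocks(sample).length); // 2
```

Because the match is non-greedy, a semicolon inside a policy body (for example in a quoted string) would truncate the block early; the checks tolerate that since they only look for substrings.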
@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
  "db-rls-common-mistakes.md",
  "db-rls-policy-types.md",
  "db-rls-performance.md",
  "db-rls-mandatory.md",
  "db-schema-auth-fk.md",
];
@@ -1,5 +1,8 @@
{
  "name": "rls-user-metadata-role-check",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,21 +1,13 @@
export const expectedReferenceFiles = [
  "db-security-service-role.md",
  "edge-fun-quickstart.md",
  "edge-db-supabase-client.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
];

import { existsSync } from "node:fs";
import { join } from "node:path";
import type { EvalAssertion } from "../../src/eval-types.js";
import { expect, test } from "vitest";

import {
  findFunctionFile,
  getFunctionCode,
  getSharedCode,
  getSupabaseDir,
} from "../eval-utils.ts";
} from "./eval-utils.ts";

const FUNCTION_NAME = "admin-reports";

@@ -24,79 +16,71 @@ function getAllCode(): string {
  return `${code}\n${getSharedCode()}`;
}

export const assertions: EvalAssertion[] = [
  {
    name: "supabase project initialized (config.toml exists)",
    check: () => existsSync(join(getSupabaseDir(), "config.toml")),
  },
  {
    name: "edge function file exists",
    check: () => findFunctionFile(FUNCTION_NAME) !== null,
  },
  {
    name: "uses Deno.env.get for service role key",
    check: () =>
test("supabase project initialized (config.toml exists)", () => {
  expect(existsSync(join(getSupabaseDir(), "config.toml"))).toBe(true);
});

test("edge function file exists", () => {
  expect(findFunctionFile(FUNCTION_NAME) !== null).toBe(true);
});

test("uses Deno.env.get for service role key", () => {
  expect(
    /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
      getAllCode(),
    ),
  ).toBe(true);
});

test("no hardcoded service role key", () => {
  const allCode = getAllCode();
  const lines = allCode.split("\n");
  const nonCommentLines = lines.filter(
    (line) => !line.trimStart().startsWith("//"),
  );
  expect(
    nonCommentLines.some((line) =>
      /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
    ),
  ).toBe(false);
});

test("createClient called with service role env var as second argument", () => {
  const allCode = getAllCode();
  expect(
    /createClient/i.test(allCode) &&
      /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
        getAllCode(),
        allCode,
      ),
  },
  {
    name: "no hardcoded service role key",
    check: () => {
      const allCode = getAllCode();
      const lines = allCode.split("\n");
      const nonCommentLines = lines.filter(
        (line) => !line.trimStart().startsWith("//"),
      );
      return !nonCommentLines.some((line) =>
        /(['"`])eyJ[A-Za-z0-9_-]+\.\1?|(['"`])eyJ[A-Za-z0-9_-]+/.test(line),
      );
    },
  },
  {
    name: "createClient called with service role env var as second argument",
    check: () => {
      const allCode = getAllCode();
      return (
        /createClient/i.test(allCode) &&
        /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i.test(
          allCode,
        )
      );
    },
  },
  {
    name: "service role key env var name does not use NEXT_PUBLIC_ prefix",
    check: () => !/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode()),
  },
  {
    name: "CORS headers present",
    check: () => /Access-Control-Allow-Origin/.test(getAllCode()),
  },
  {
    name: "returns JSON response",
    check: () => {
      const allCode = getAllCode();
      return (
        /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
        /Response\.json/i.test(allCode) ||
        /JSON\.stringify/i.test(allCode)
      );
    },
  },
  {
    name: "overall quality: demonstrates service role Edge Function best practices",
    check: () => {
      const allCode = getAllCode();
      const signals: RegExp[] = [
        /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
        /Access-Control-Allow-Origin/,
        /createClient/i,
        /\btry\s*\{/,
        /Response\.json|JSON\.stringify/,
        /Deno\.serve/,
      ];
      return signals.filter((r) => r.test(allCode)).length >= 5;
    },
  },
];
  ).toBe(true);
});

test("service role key env var name does not use NEXT_PUBLIC_ prefix", () => {
  expect(/NEXT_PUBLIC_[^'"]*service[_-]?role/i.test(getAllCode())).toBe(false);
});

test("CORS headers present", () => {
  expect(/Access-Control-Allow-Origin/.test(getAllCode())).toBe(true);
});

test("returns JSON response", () => {
  const allCode = getAllCode();
  expect(
    /content-type['"]\s*:\s*['"]application\/json/i.test(allCode) ||
      /Response\.json/i.test(allCode) ||
      /JSON\.stringify/i.test(allCode),
  ).toBe(true);
});

test("overall quality: demonstrates service role Edge Function best practices", () => {
  const allCode = getAllCode();
  const signals: RegExp[] = [
    /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
    /Access-Control-Allow-Origin/,
    /createClient/i,
    /\btry\s*\{/,
    /Response\.json|JSON\.stringify/,
    /Deno\.serve/,
  ];
  expect(signals.filter((r) => r.test(allCode)).length >= 5).toBe(true);
});
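The quality gate above is a signal count: match a list of regexes against the agent's function source and pass at a threshold. A self-contained sketch of that pattern (the `scoreSignals` helper and the sample function source are hypothetical, not repository code):

```typescript
// Illustrative helper, not repository code: count regex "signals" in a source
// string and pass when a threshold is met, as the overall-quality test does.
function scoreSignals(code: string, signals: RegExp[], threshold: number): boolean {
  return signals.filter((r) => r.test(code)).length >= threshold;
}

// Hypothetical Edge Function source, as a string, shaped to hit the signals.
const fnSource = `
Deno.serve(async (req) => {
  try {
    const client = createClient(url, Deno.env.get("SUPABASE_SERVICE_ROLE_KEY"));
    return new Response(JSON.stringify({ ok: true }), {
      headers: { "Access-Control-Allow-Origin": "*" },
    });
  } catch (e) {
    return new Response(JSON.stringify({ error: String(e) }), { status: 500 });
  }
});
`;

// The same six signals the eval checks for.
const signals: RegExp[] = [
  /Deno\.env\.get\(\s*['"][^'"]*service[_-]?role[^'"]*['"]\s*\)/i,
  /Access-Control-Allow-Origin/,
  /createClient/i,
  /\btry\s*\{/,
  /Response\.json|JSON\.stringify/,
  /Deno\.serve/,
];

console.log(scoreSignals(fnSource, signals, 5)); // true
```

The threshold (5 of 6) leaves one signal of slack, so a solution can miss a single convention and still pass.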
packages/evals/evals/service-role-edge-function/meta.ts (new file, +7)
@@ -0,0 +1,7 @@
export const expectedReferenceFiles = [
  "db-security-service-role.md",
  "edge-fun-quickstart.md",
  "edge-db-supabase-client.md",
  "edge-pat-cors.md",
  "edge-pat-error-handling.md",
];
@@ -1,5 +1,8 @@
{
  "name": "service-role-edge-function",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,253 +1,240 @@
export const expectedReferenceFiles = [
  "storage-access-control.md",
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
import { expect, test } from "vitest";

import type { EvalAssertion } from "../../src/eval-types.js";
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";

import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
test("migration file exists", () => {
  expect(findMigrationFiles().length > 0).toBe(true);
});

export const assertions: EvalAssertion[] = [
  {
    name: "migration file exists",
    check: () => findMigrationFiles().length > 0,
  },
  {
    name: "creates avatars bucket",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (
        !/storage\.buckets/.test(sql) ||
        !/avatars/.test(sql) ||
        !/public/.test(sql)
      )
        return false;
      const avatarsBlock = sql.match(
        /insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
      );
      return avatarsBlock !== null && /true/.test(avatarsBlock[0]);
    },
  },
  {
    name: "creates documents bucket",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/documents/.test(sql)) return false;
      const documentsBlock = sql.match(
        /insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
      );
      return documentsBlock !== null && /false/.test(documentsBlock[0]);
    },
  },
  {
    name: "avatars bucket has mime type restriction",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /allowed_mime_types/.test(sql) &&
        /image\/jpeg/.test(sql) &&
        /image\/png/.test(sql) &&
        /image\/webp/.test(sql)
      );
    },
  },
  {
    name: "avatars bucket has file size limit",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/file_size_limit/.test(sql)) return false;
      return (
        /2097152/.test(sql) ||
        /2\s*m/i.test(sql) ||
        /2\s*\*\s*1024\s*\*\s*1024/.test(sql)
      );
    },
  },
  {
    name: "storage policy uses foldername or path for user isolation",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
      const usesPathMatch =
        /\(\s*storage\.foldername\s*\(/.test(sql) ||
        /\bname\b.*auth\.uid\(\)/.test(sql);
      return (
        (usesFoldername || usesPathMatch) &&
        /auth\.uid\(\)\s*::\s*text/.test(sql)
      );
    },
  },
  {
    name: "storage policy uses TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const storagePolicies = policyBlocks.filter((p) =>
        p.toLowerCase().includes("storage.objects"),
      );
      const hasAuthenticatedPolicy = storagePolicies.some(
        (p) =>
          /to\s+(authenticated|public)/.test(p.toLowerCase()) ||
          /auth\.uid\(\)/.test(p.toLowerCase()),
      );
      if (!hasAuthenticatedPolicy) return false;
      const insertPolicies = storagePolicies.filter((p) =>
        /for\s+insert/.test(p.toLowerCase()),
      );
      return insertPolicies.every(
        (p) =>
          /to\s+authenticated/.test(p.toLowerCase()) ||
          /auth\.uid\(\)/.test(p.toLowerCase()),
      );
    },
  },
  {
    name: "public read policy for avatars",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const avatarSelectPolicies = policyBlocks.filter(
        (p) =>
          p.toLowerCase().includes("storage.objects") &&
          /for\s+select/.test(p.toLowerCase()) &&
          p.toLowerCase().includes("avatars"),
      );
      if (avatarSelectPolicies.length === 0) return false;
      return avatarSelectPolicies.some((p) => {
        const lower = p.toLowerCase();
        const hasExplicitPublic =
          /to\s+public/.test(lower) || /to\s+anon/.test(lower);
        const hasNoToClause = !/\bto\s+\w+/.test(lower);
        const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
        return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
      });
    },
  },
  {
    name: "documents bucket is fully private",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const documentPolicies = policyBlocks.filter(
        (p) =>
          p.toLowerCase().includes("storage.objects") &&
          p.toLowerCase().includes("documents"),
      );
      if (documentPolicies.length === 0) return false;
      return documentPolicies.every(
        (p) =>
          !/to\s+public/.test(p) &&
          !/to\s+anon/.test(p) &&
          (/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
      );
    },
  },
  {
    name: "creates file_metadata table",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return /create\s+table/.test(sql) && /file_metadata/.test(sql);
    },
  },
  {
    name: "file_metadata has FK to auth.users with CASCADE",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /references\s+auth\.users/.test(sql) &&
        /on\s+delete\s+cascade/.test(sql)
      );
    },
  },
  {
    name: "RLS enabled on file_metadata",
    check: () =>
      /alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
        getMigrationSQL().toLowerCase(),
      ),
  },
  {
    name: "file_metadata policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const metadataPolicies = policyBlocks.filter((p) =>
        p.toLowerCase().includes("file_metadata"),
      );
      for (const policy of metadataPolicies) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "uses timestamptz for time columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (
        !sql.includes("created_at") &&
        !sql.includes("updated_at") &&
        !sql.includes("uploaded_at")
      ) {
        return true;
      }
      const columnDefs = sql.match(
        /(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
      );
      if (!columnDefs) return true;
      return columnDefs.every((def) =>
        /timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
      );
    },
  },
  {
    name: "index on file_metadata user_id",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /create\s+index/.test(sql) &&
        /file_metadata/.test(sql) &&
        /user_id/.test(sql)
      );
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const signals = [
        /insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
        /insert\s+into\s+storage\.buckets[\s\S]*?documents/,
        /allowed_mime_types/,
        /file_size_limit/,
        /storage\.foldername/,
        /auth\.uid\(\)\s*::\s*text/,
        /to\s+authenticated/,
        /to\s+(public|anon)/,
        /enable\s+row\s+level\s+security/,
        /on\s+delete\s+cascade/,
        /\(select\s+auth\.uid\(\)\)/,
        /create\s+index/,
        /timestamptz/,
        /if\s+not\s+exists/,
        /create\s+table[\s\S]*?file_metadata/,
      ];
      return signals.filter((r) => r.test(sql)).length >= 11;
    },
  },
];
test("creates avatars bucket", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (
    !/storage\.buckets/.test(sql) ||
    !/avatars/.test(sql) ||
    !/public/.test(sql)
  ) {
    expect(false).toBe(true);
    return;
  }
  const avatarsBlock = sql.match(
    /insert\s+into\s+storage\.buckets[\s\S]*?avatars[\s\S]*?;/,
  );
  expect(avatarsBlock !== null && /true/.test(avatarsBlock[0])).toBe(true);
});

test("creates documents bucket", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/documents/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const documentsBlock = sql.match(
    /insert\s+into\s+storage\.buckets[\s\S]*?documents[\s\S]*?;/,
  );
  expect(documentsBlock !== null && /false/.test(documentsBlock[0])).toBe(true);
});

test("avatars bucket has mime type restriction", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /allowed_mime_types/.test(sql) &&
      /image\/jpeg/.test(sql) &&
      /image\/png/.test(sql) &&
      /image\/webp/.test(sql),
  ).toBe(true);
});

test("avatars bucket has file size limit", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/file_size_limit/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  expect(
    /2097152/.test(sql) ||
      /2\s*m/i.test(sql) ||
      /2\s*\*\s*1024\s*\*\s*1024/.test(sql),
  ).toBe(true);
});

test("storage policy uses foldername or path for user isolation", () => {
  const sql = getMigrationSQL().toLowerCase();
  const usesFoldername = /storage\.foldername\s*\(\s*name\s*\)/.test(sql);
  const usesPathMatch =
    /\(\s*storage\.foldername\s*\(/.test(sql) ||
    /\bname\b.*auth\.uid\(\)/.test(sql);
  expect(
    (usesFoldername || usesPathMatch) && /auth\.uid\(\)\s*::\s*text/.test(sql),
  ).toBe(true);
});

test("storage policy uses TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const storagePolicies = policyBlocks.filter((p) =>
    p.toLowerCase().includes("storage.objects"),
  );
  const hasAuthenticatedPolicy = storagePolicies.some(
    (p) =>
      /to\s+(authenticated|public)/.test(p.toLowerCase()) ||
      /auth\.uid\(\)/.test(p.toLowerCase()),
  );
  if (!hasAuthenticatedPolicy) {
    expect(false).toBe(true);
    return;
  }
  const insertPolicies = storagePolicies.filter((p) =>
    /for\s+insert/.test(p.toLowerCase()),
  );
  expect(
    insertPolicies.every(
      (p) =>
        /to\s+authenticated/.test(p.toLowerCase()) ||
        /auth\.uid\(\)/.test(p.toLowerCase()),
    ),
  ).toBe(true);
});

test("public read policy for avatars", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const avatarSelectPolicies = policyBlocks.filter(
    (p) =>
      p.toLowerCase().includes("storage.objects") &&
      /for\s+select/.test(p.toLowerCase()) &&
      p.toLowerCase().includes("avatars"),
  );
  if (avatarSelectPolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    avatarSelectPolicies.some((p) => {
      const lower = p.toLowerCase();
      const hasExplicitPublic =
        /to\s+public/.test(lower) || /to\s+anon/.test(lower);
      const hasNoToClause = !/\bto\s+\w+/.test(lower);
      const hasNoAuthRestriction = !/auth\.uid\(\)/.test(lower);
      return hasExplicitPublic || (hasNoToClause && hasNoAuthRestriction);
    }),
  ).toBe(true);
});

test("documents bucket is fully private", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const documentPolicies = policyBlocks.filter(
    (p) =>
      p.toLowerCase().includes("storage.objects") &&
      p.toLowerCase().includes("documents"),
  );
  if (documentPolicies.length === 0) {
    expect(false).toBe(true);
    return;
  }
  expect(
    documentPolicies.every(
      (p) =>
        !/to\s+public/.test(p) &&
        !/to\s+anon/.test(p) &&
        (/to\s+authenticated/.test(p) || /auth\.uid\(\)/.test(p)),
    ),
  ).toBe(true);
});

test("creates file_metadata table", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(/create\s+table/.test(sql) && /file_metadata/.test(sql)).toBe(true);
});

test("file_metadata has FK to auth.users with CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("RLS enabled on file_metadata", () => {
  expect(
    /alter\s+table.*file_metadata.*enable\s+row\s+level\s+security/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("file_metadata policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const metadataPolicies = policyBlocks.filter((p) =>
    p.toLowerCase().includes("file_metadata"),
  );
  for (const policy of metadataPolicies) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("uses timestamptz for time columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (
    !sql.includes("created_at") &&
    !sql.includes("updated_at") &&
    !sql.includes("uploaded_at")
  ) {
    expect(true).toBe(true);
    return;
  }
  const columnDefs = sql.match(
    /(?:created_at|updated_at|uploaded_at)\s+timestamp\b/g,
  );
  if (!columnDefs) {
    expect(true).toBe(true);
    return;
  }
  expect(
    columnDefs.every((def) =>
      /timestamptz|timestamp\s+with\s+time\s+zone/.test(def),
    ),
  ).toBe(true);
});

test("index on file_metadata user_id", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /create\s+index/.test(sql) &&
      /file_metadata/.test(sql) &&
      /user_id/.test(sql),
  ).toBe(true);
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const signals = [
    /insert\s+into\s+storage\.buckets[\s\S]*?avatars/,
    /insert\s+into\s+storage\.buckets[\s\S]*?documents/,
    /allowed_mime_types/,
    /file_size_limit/,
    /storage\.foldername/,
    /auth\.uid\(\)\s*::\s*text/,
    /to\s+authenticated/,
    /to\s+(public|anon)/,
    /enable\s+row\s+level\s+security/,
    /on\s+delete\s+cascade/,
    /\(select\s+auth\.uid\(\)\)/,
    /create\s+index/,
    /timestamptz/,
    /if\s+not\s+exists/,
    /create\s+table[\s\S]*?file_metadata/,
  ];
  expect(signals.filter((r) => r.test(sql)).length >= 11).toBe(true);
});
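The folder-isolation checks above encode the convention that an object's first path segment must equal the uploader's user id, the effect of SQL along the lines of `(storage.foldername(name))[1] = auth.uid()::text`. A hypothetical TypeScript model of that rule (illustrative only; `isUserScopedPath` is not repository code):

```typescript
// Illustrative only: model the storage path-isolation rule the eval looks for.
// storage.foldername(name) yields the folder segments of an object name, and
// Postgres arrays are 1-indexed, so segment [1] is the top-level folder.
function isUserScopedPath(objectName: string, userId: string): boolean {
  const firstFolder = objectName.split("/")[0];
  return firstFolder === userId;
}

console.log(isUserScopedPath("9f8b/avatar.png", "9f8b")); // true
console.log(isUserScopedPath("other-user/avatar.png", "9f8b")); // false
```

Uploads therefore pass the policy only when the client writes to `<auth.uid()>/...`, which is why the tests also require the `auth.uid()::text` cast.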
packages/evals/evals/storage-rls-user-folders/meta.ts (new file, +10)
@@ -0,0 +1,10 @@
export const expectedReferenceFiles = [
  "storage-access-control.md",
  "db-rls-mandatory.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
@@ -1,5 +1,8 @@
{
  "name": "storage-rls-user-folders",
  "private": true,
  "type": "module"
  "type": "module",
  "devDependencies": {
    "vitest": "^2.0.0"
  }
}
@@ -1,216 +1,193 @@
|
||||
export const expectedReferenceFiles = [
|
||||
"db-rls-mandatory.md",
|
||||
"db-rls-policy-types.md",
|
||||
"db-rls-common-mistakes.md",
|
||||
"db-rls-performance.md",
|
||||
"db-security-functions.md",
|
||||
"db-schema-auth-fk.md",
|
||||
"db-schema-timestamps.md",
|
||||
"db-perf-indexes.md",
|
||||
"db-migrations-idempotent.md",
|
||||
];
|
||||
import { expect, test } from "vitest";
|
||||
|
||||
import type { EvalAssertion } from "../../src/eval-types.js";
|
||||
import { findMigrationFiles, getMigrationSQL } from "./eval-utils.ts";
|
||||
|
||||
import { findMigrationFiles, getMigrationSQL } from "../eval-utils.ts";
|
||||
test("migration file exists", () => {
|
||||
expect(findMigrationFiles().length > 0).toBe(true);
|
||||
});
|
||||
|
||||
export const assertions: EvalAssertion[] = [
|
||||
{
|
||||
name: "migration file exists",
|
||||
check: () => findMigrationFiles().length > 0,
|
||||
},
|
||||
{
|
||||
name: "creates organizations table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?organizations/.test(
|
||||
getMigrationSQL().toLowerCase(),
|
||||
test("creates organizations table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?organizations/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("creates memberships table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("creates projects table", () => {
|
||||
expect(
|
||||
/create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("enables RLS on all tables", () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
expect(
|
||||
/alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
),
|
||||
},
|
||||
{
|
||||
name: "creates memberships table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?memberships/.test(getMigrationSQL().toLowerCase()),
|
||||
},
|
||||
{
|
||||
name: "creates projects table",
|
||||
check: () =>
|
||||
/create\s+table[\s\S]*?projects/.test(getMigrationSQL().toLowerCase()),
|
||||
},
|
||||
{
|
||||
name: "enables RLS on all tables",
|
||||
check: () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
return (
|
||||
/alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
) &&
|
||||
/alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
|
||||
sql,
|
||||
)
|
||||
);
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "FK to auth.users with ON DELETE CASCADE",
|
||||
check: () => {
|
||||
const sql = getMigrationSQL().toLowerCase();
|
||||
return (
|
||||
/references\s+auth\.users/.test(sql) &&
|
||||
/on\s+delete\s+cascade/.test(sql)
|
||||
);
|
||||
},
|
||||
},
|
||||
{
|
||||
name: "org_id FK on projects",
|
||||
check: () =>
|
||||
/org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
|
||||
getMigrationSQL().toLowerCase(),
|
||||
).toBe(true);
|
||||
});
|
||||
|
||||
test("FK to auth.users with ON DELETE CASCADE", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
  ).toBe(true);
});

test("org_id FK on projects", () => {
  expect(
    /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(
      getMigrationSQL().toLowerCase(),
    ),
  ).toBe(true);
});

test("private schema created", () => {
  expect(
    /create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
  ).toBe(true);
});

test("security_definer helper function", () => {
  const sql = getMigrationSQL().toLowerCase();
  expect(
    /private\./.test(sql) &&
      /security\s+definer/.test(sql) &&
      /set\s+search_path\s*=\s*''/.test(sql),
  ).toBe(true);
});

test("policies use (select auth.uid())", () => {
  const sql = getMigrationSQL();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  if (policyBlocks.length === 0) {
    expect(false).toBe(true);
    return;
  }
  for (const policy of policyBlocks) {
    if (
      policy.includes("auth.uid()") &&
      !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
    ) {
      expect(false).toBe(true);
      return;
    }
  }
  expect(true).toBe(true);
});

test("policies use TO authenticated", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  expect(
    policyBlocks.length > 0 &&
      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
  ).toBe(true);
});

test("index on membership lookup columns", () => {
  const sql = getMigrationSQL().toLowerCase();
  if (!/create\s+index/.test(sql)) {
    expect(false).toBe(true);
    return;
  }
  const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
  expect(
    indexBlocks.filter(
      (idx) =>
        idx.includes("user_id") ||
        idx.includes("org_id") ||
        idx.includes("organization_id"),
    ).length >= 1,
  ).toBe(true);
});

test("uses timestamptz", () => {
  const rawSql = getMigrationSQL().toLowerCase();
  const sql = rawSql.replace(/--[^\n]*/g, "");
  const hasPlainTimestamp = /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
  if (
    sql.includes("created_at") ||
    sql.includes("updated_at") ||
    sql.includes("_at ")
  ) {
    expect(hasPlainTimestamp.test(sql)).toBe(false);
  } else {
    expect(true).toBe(true);
  }
});

test("idempotent DDL", () => {
  expect(/if\s+not\s+exists/.test(getMigrationSQL().toLowerCase())).toBe(true);
});

test("stable or immutable on helper function", () => {
  expect(/\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase())).toBe(
    true,
  );
});

test("delete policy restricted to owner role", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const deletePolicy = policyBlocks.find(
    (p) =>
      p.toLowerCase().includes("delete") && p.toLowerCase().includes("project"),
  );
  if (!deletePolicy) {
    expect(false).toBe(true);
    return;
  }
  expect(/owner|admin/.test(deletePolicy.toLowerCase())).toBe(true);
});

test("overall quality score", () => {
  const sql = getMigrationSQL().toLowerCase();
  const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
  const signals = [
    /alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
      sql,
    ) &&
      /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
        sql,
      ) &&
      /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
        sql,
      ),
|
||||
  },
  {
    name: "private schema created",
    check: () =>
      /create\s+schema[\s\S]*?private/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "security_definer helper function",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      return (
        /private\./.test(sql) &&
        /security\s+definer/.test(sql) &&
        /set\s+search_path\s*=\s*''/.test(sql)
      );
    },
  },
  {
    name: "policies use (select auth.uid())",
    check: () => {
      const sql = getMigrationSQL();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      if (policyBlocks.length === 0) return false;
      for (const policy of policyBlocks) {
        if (
          policy.includes("auth.uid()") &&
          !/\(\s*select\s+auth\.uid\(\)\s*\)/i.test(policy)
        ) {
          return false;
        }
      }
      return true;
    },
  },
  {
    name: "policies use TO authenticated",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      return (
        policyBlocks.length > 0 &&
        policyBlocks.every((p) => /to\s+authenticated/.test(p))
      );
    },
  },
  {
    name: "index on membership lookup columns",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      if (!/create\s+index/.test(sql)) return false;
      const indexBlocks = sql.match(/create\s+index[\s\S]*?;/gi) ?? [];
      return (
        indexBlocks.filter(
          (idx) =>
            idx.includes("user_id") ||
            idx.includes("org_id") ||
            idx.includes("organization_id"),
        ).length >= 1
      );
    },
  },
  {
    name: "uses timestamptz",
    check: () => {
      const rawSql = getMigrationSQL().toLowerCase();
      const sql = rawSql.replace(/--[^\n]*/g, "");
      const hasPlainTimestamp =
        /\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/;
      if (
        sql.includes("created_at") ||
        sql.includes("updated_at") ||
        sql.includes("_at ")
      ) {
        return !hasPlainTimestamp.test(sql);
      }
      return true;
    },
  },
  {
    name: "idempotent DDL",
    check: () => /if\s+not\s+exists/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "stable or immutable on helper function",
    check: () =>
      /\bstable\b|\bimmutable\b/.test(getMigrationSQL().toLowerCase()),
  },
  {
    name: "delete policy restricted to owner role",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const deletePolicy = policyBlocks.find(
        (p) =>
          p.toLowerCase().includes("delete") &&
          p.toLowerCase().includes("project"),
      );
      if (!deletePolicy) return false;
      return /owner|admin/.test(deletePolicy.toLowerCase());
    },
  },
  {
    name: "overall quality score",
    check: () => {
      const sql = getMigrationSQL().toLowerCase();
      const policyBlocks = sql.match(/create\s+policy[\s\S]*?;/gi) ?? [];
      const signals = [
        /alter\s+table[\s\S]*?organizations[\s\S]*?enable\s+row\s+level\s+security/.test(
          sql,
        ) &&
          /alter\s+table[\s\S]*?memberships[\s\S]*?enable\s+row\s+level\s+security/.test(
            sql,
          ) &&
          /alter\s+table[\s\S]*?projects[\s\S]*?enable\s+row\s+level\s+security/.test(
            sql,
          ),
        /references\s+auth\.users/.test(sql) &&
          /on\s+delete\s+cascade/.test(sql),
        /create\s+schema[\s\S]*?private/.test(sql),
        /security\s+definer/.test(sql) &&
          /set\s+search_path\s*=\s*''/.test(sql),
        /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
        policyBlocks.length > 0 &&
          policyBlocks.every((p) => /to\s+authenticated/.test(p)),
        /create\s+index/.test(sql),
        !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
          sql.replace(/--[^\n]*/g, ""),
        ),
        /if\s+not\s+exists/.test(sql),
        policyBlocks.some(
          (p) =>
            p.toLowerCase().includes("delete") &&
            p.toLowerCase().includes("project") &&
            /owner|admin/.test(p.toLowerCase()),
        ),
        /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(sql),
        policyBlocks.length >= 3,
        /role/.test(sql),
        /private\./.test(sql),
        /\bstable\b|\bimmutable\b/.test(sql),
      ];
      return signals.filter(Boolean).length >= 11;
    },
  },
];
    /references\s+auth\.users/.test(sql) && /on\s+delete\s+cascade/.test(sql),
    /create\s+schema[\s\S]*?private/.test(sql),
    /security\s+definer/.test(sql) && /set\s+search_path\s*=\s*''/.test(sql),
    /\(\s*select\s+auth\.uid\(\)\s*\)/.test(sql),
    policyBlocks.length > 0 &&
      policyBlocks.every((p) => /to\s+authenticated/.test(p)),
    /create\s+index/.test(sql),
    !/\btimestamp\b(?!\s*tz)(?!\s+with\s+time\s+zone)/.test(
      sql.replace(/--[^\n]*/g, ""),
    ),
    /if\s+not\s+exists/.test(sql),
    policyBlocks.some(
      (p) =>
        p.toLowerCase().includes("delete") &&
        p.toLowerCase().includes("project") &&
        /owner|admin/.test(p.toLowerCase()),
    ),
    /org[anization_]*id[\s\S]*?references[\s\S]*?organizations/.test(sql),
    policyBlocks.length >= 3,
    /role/.test(sql),
    /private\./.test(sql),
    /\bstable\b|\bimmutable\b/.test(sql),
  ];
  expect(signals.filter(Boolean).length >= 11).toBe(true);
});

11 packages/evals/evals/team-rls-security-definer/meta.ts Normal file
@@ -0,0 +1,11 @@
export const expectedReferenceFiles = [
  "db-rls-mandatory.md",
  "db-rls-policy-types.md",
  "db-rls-common-mistakes.md",
  "db-rls-performance.md",
  "db-security-functions.md",
  "db-schema-auth-fk.md",
  "db-schema-timestamps.md",
  "db-perf-indexes.md",
  "db-migrations-idempotent.md",
];
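meta.ts lists the reference files the agent is expected to open while solving the scenario. A scorer in the spirit of the deleted `referenceFilesUsageScorer` (the exact shape below is an illustrative assumption, not the real scorer) would compute the fraction of expected files the agent actually read:

```typescript
// Hypothetical scorer: fraction of expected reference files the agent
// read during the run. Paths read by the agent are reduced to their
// basenames so ".agents/skills/supabase/references/x.md" matches "x.md".
export function referenceFilesScore(
  expected: string[],
  filesRead: string[],
): number {
  if (expected.length === 0) return 1; // nothing expected, full credit
  const read = new Set(filesRead.map((f) => f.split("/").pop() ?? f));
  const hits = expected.filter((f) => read.has(f)).length;
  return hits / expected.length;
}
```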

@@ -1,5 +1,8 @@
 {
   "name": "team-rls-security-definer",
   "private": true,
-  "type": "module"
+  "type": "module",
+  "devDependencies": {
+    "vitest": "^2.0.0"
+  }
 }

125 packages/evals/experiments/experiment.ts Normal file
@@ -0,0 +1,125 @@
import { execFileSync } from "node:child_process";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import type { ExperimentConfig } from "@vercel/agent-eval";

const __dirname = dirname(fileURLToPath(import.meta.url));
const EVALS_ROOT = resolve(__dirname, "..");
const REPO_ROOT = resolve(EVALS_ROOT, "..", "..");
const PROJECT_DIR = join(EVALS_ROOT, "project");

const SKILL_NAME = process.env.EVAL_SKILL ?? "supabase";
const SKILL_DIR = join(REPO_ROOT, "skills", SKILL_NAME);

const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
const isBaseline = process.env.EVAL_BASELINE === "true";

// ---------------------------------------------------------------------------
// Skill file loader — reads all skill files to inject into the sandbox
// ---------------------------------------------------------------------------

function readSkillFiles(): Record<string, string> {
  const files: Record<string, string> = {};

  for (const name of ["SKILL.md", "AGENTS.md"]) {
    const src = join(SKILL_DIR, name);
    if (existsSync(src)) {
      const content = readFileSync(src, "utf-8");
      files[`.agents/skills/${SKILL_NAME}/${name}`] = content;
      files[`.claude/skills/${SKILL_NAME}/${name}`] = content;
    }
  }

  const refsDir = join(SKILL_DIR, "references");
  if (existsSync(refsDir)) {
    for (const f of readdirSync(refsDir)) {
      const content = readFileSync(join(refsDir, f), "utf-8");
      files[`.agents/skills/${SKILL_NAME}/references/${f}`] = content;
      files[`.claude/skills/${SKILL_NAME}/references/${f}`] = content;
    }
  }

  return files;
}

// ---------------------------------------------------------------------------
// DB reset — clears all user-created objects between scenarios
// ---------------------------------------------------------------------------

const RESET_SQL = `
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;
NOTIFY pgrst, 'reload schema';
`.trim();

function resetDB(): void {
  const dbUrl =
    process.env.SUPABASE_DB_URL ??
    "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
  execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
    stdio: "inherit",
    timeout: 30_000,
  });
}

// ---------------------------------------------------------------------------
// Experiment configuration
// ---------------------------------------------------------------------------

const config: ExperimentConfig = {
  agent: "claude-code",
  model: "claude-sonnet-4-6",
  runs: 1,
  earlyExit: true,
  timeout: 1800,
  sandbox: "docker",
  evals: process.env.EVAL_SCENARIO ?? "*",

  setup: async (sandbox) => {
    // 1. Reset DB for a clean slate
    resetDB();

    // 2. Seed supabase config so the agent can run `supabase db push`
    const configPath = join(PROJECT_DIR, "supabase", "config.toml");
    if (existsSync(configPath)) {
      await sandbox.writeFiles({
        "supabase/config.toml": readFileSync(configPath, "utf-8"),
      });
    }

    // 3. Write MCP config pointing to host Supabase instance
    await sandbox.writeFiles({
      ".mcp.json": JSON.stringify(
        {
          mcpServers: {
            supabase: { type: "http", url: `${supabaseUrl}/mcp` },
          },
        },
        null,
        "\t",
      ),
    });

    // 4. Write eval-utils.ts into the workspace so EVAL.ts can import it
    //    (agent-eval only copies the fixture's own directory into the sandbox)
    const evalUtilsPath = join(EVALS_ROOT, "evals", "eval-utils.ts");
    if (existsSync(evalUtilsPath)) {
      await sandbox.writeFiles({
        "eval-utils.ts": readFileSync(evalUtilsPath, "utf-8"),
      });
    }

    // 5. Install skill files (unless baseline mode)
    if (!isBaseline) {
      await sandbox.writeFiles(readSkillFiles());
    }
  },
};

export default config;
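Step 3 of `setup()` serializes the `.mcp.json` payload with plain `JSON.stringify` and tab indentation. Extracted as a standalone function for a quick sanity check (the function name is mine; the payload shape is taken verbatim from `setup()` above):

```typescript
// Reproduce the .mcp.json payload written into the sandbox for a given
// host Supabase URL, exactly as setup() builds it (tab-indented JSON).
export function mcpConfig(supabaseUrl: string): string {
  return JSON.stringify(
    {
      mcpServers: {
        supabase: { type: "http", url: `${supabaseUrl}/mcp` },
      },
    },
    null,
    "\t",
  );
}
```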

2356 packages/evals/package-lock.json generated
File diff suppressed because it is too large
@@ -6,17 +6,19 @@
   "license": "MIT",
   "description": "Agent evaluation system for Supabase skills",
   "scripts": {
-    "eval": "tsx src/runner.ts",
-    "eval:upload": "BRAINTRUST_UPLOAD=true tsx src/runner.ts"
+    "eval": "agent-eval",
+    "eval:dry": "agent-eval --dry",
+    "eval:smoke": "agent-eval --smoke",
+    "eval:upload": "tsx src/upload.ts"
   },
   "dependencies": {
-    "@anthropic-ai/claude-code": "^2.1.49",
-    "braintrust": "^3.0.0",
-    "skills": "^1.4.0"
+    "@vercel/agent-eval": "^0.9.2",
+    "braintrust": "^3.0.0"
   },
   "devDependencies": {
     "@types/node": "^20.10.0",
     "tsx": "^4.7.0",
-    "typescript": "^5.3.0"
+    "typescript": "^5.3.0",
+    "vitest": "^4.0.18"
   }
 }

55 packages/evals/scripts/eval.sh Executable file
@@ -0,0 +1,55 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
EVALS_DIR="$(cd "$SCRIPT_DIR/.." && pwd)"
PROJECT_DIR="$EVALS_DIR/project"

# ---------------------------------------------------------------------------
# Parse CLI arguments
# ---------------------------------------------------------------------------

AGENT_EVAL_ARGS=()
UPLOAD=true # Always upload to Braintrust by default

while [[ $# -gt 0 ]]; do
  case "$1" in
    --skill)
      export EVAL_SKILL="$2"
      shift 2
      ;;
    --scenario)
      export EVAL_SCENARIO="$2"
      shift 2
      ;;
    *)
      AGENT_EVAL_ARGS+=("$1")
      shift
      ;;
  esac
done

echo "Starting Supabase..."
supabase start --exclude studio,imgproxy,mailpit --workdir "$PROJECT_DIR"

# Export keys so experiment.ts and vitest assertions can connect
eval "$(supabase status --output json --workdir "$PROJECT_DIR" | \
  node -e "
    const s = JSON.parse(require('fs').readFileSync('/dev/stdin','utf-8'));
    console.log('export SUPABASE_URL=' + (s.API_URL || 'http://127.0.0.1:54321'));
    console.log('export SUPABASE_ANON_KEY=' + s.ANON_KEY);
    console.log('export SUPABASE_SERVICE_ROLE_KEY=' + s.SERVICE_ROLE_KEY);
    console.log('export SUPABASE_DB_URL=' + (s.DB_URL || 'postgresql://postgres:postgres@127.0.0.1:54322/postgres'));
  ")"

trap 'echo "Stopping Supabase..."; supabase stop --no-backup --workdir "$PROJECT_DIR"' EXIT

echo "Running agent-eval..."
cd "$EVALS_DIR"
npx agent-eval "${AGENT_EVAL_ARGS[@]+"${AGENT_EVAL_ARGS[@]}"}"

# Upload results to Braintrust (default: true, skip with --no-upload)
if [ "$UPLOAD" = "true" ]; then
  echo "Uploading results to Braintrust..."
  npx tsx src/upload.ts
fi
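The inline `node -e` snippet in eval.sh maps `supabase status --output json` fields to shell export lines, with hard-coded local-stack fallbacks. The same mapping as a testable TypeScript function (field names and fallbacks copied from the script; the function itself is illustrative):

```typescript
// Mirror of the node -e snippet in eval.sh: turn `supabase status
// --output json` fields into `export KEY=value` lines, with the same
// localhost fallbacks for URL and DB connection string.
export function exportLines(s: Record<string, string | undefined>): string[] {
  return [
    `export SUPABASE_URL=${s.API_URL || "http://127.0.0.1:54321"}`,
    `export SUPABASE_ANON_KEY=${s.ANON_KEY}`,
    `export SUPABASE_SERVICE_ROLE_KEY=${s.SERVICE_ROLE_KEY}`,
    `export SUPABASE_DB_URL=${s.DB_URL || "postgresql://postgres:postgres@127.0.0.1:54322/postgres"}`,
  ];
}
```

Feeding the four lines through `eval` in bash is what makes the keys visible to `experiment.ts` and the vitest assertions later in the script.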

@@ -1,21 +0,0 @@
/**
 * A single assertion to run against the agent's workspace output.
 *
 * Used by EVAL.ts files to declare what the agent's work should produce.
 * The runner executes these in-process (no test framework required).
 */
export interface EvalAssertion {
  /** Human-readable name shown in Braintrust and local output */
  name: string;
  /** Return true = pass, false/throw = fail */
  check: () => boolean | Promise<boolean>;
  /** Timeout in ms for async checks (default: no timeout) */
  timeout?: number;
}

/** Result of running a single EvalAssertion */
export interface AssertionResult {
  name: string;
  passed: boolean;
  error?: string;
}
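Before the switch to vitest, an EVAL.ts exported `EvalAssertion` objects like these. A self-contained example of declaring one and resolving it to an `AssertionResult` (the mini-runner here is illustrative, not the deleted runner):

```typescript
interface EvalAssertion {
  name: string;
  check: () => boolean | Promise<boolean>;
  timeout?: number;
}

// Example assertion in the old style: a regex check against SQL text.
const assertion: EvalAssertion = {
  name: "migration enables RLS",
  check: () =>
    /enable\s+row\s+level\s+security/.test(
      "alter table t enable row level security;",
    ),
};

// Minimal illustrative runner: resolve the check, treat throws as failure.
async function run(a: EvalAssertion): Promise<{ name: string; passed: boolean }> {
  try {
    return { name: a.name, passed: Boolean(await a.check()) };
  } catch {
    return { name: a.name, passed: false };
  }
}
```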

@@ -1,372 +0,0 @@
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { join, resolve } from "node:path";
import type { AssertionResult, EvalAssertion } from "./eval-types.js";
import { runAgent } from "./runner/agent.js";
import {
  seedBraintrustDataset,
  uploadToBraintrust,
} from "./runner/braintrust.js";
import { createResultDir, saveRunArtifacts } from "./runner/persist.js";
import { preflight } from "./runner/preflight.js";
import { listModifiedFiles, printSummary } from "./runner/results.js";
import { createWorkspace } from "./runner/scaffold.js";
import {
  assertionsPassedScorer,
  finalResultScorer,
  referenceFilesUsageScorer,
  skillUsageScorer,
} from "./runner/scorers.js";
import {
  getKeys,
  resetDB,
  startSupabase,
  stopSupabase,
} from "./runner/supabase-setup.js";
import {
  buildTranscriptSummary,
  type TranscriptSummary,
} from "./runner/transcript.js";
import type { EvalRunResult, EvalScenario } from "./types.js";

// ---------------------------------------------------------------------------
// Configuration from environment
// ---------------------------------------------------------------------------

const DEFAULT_MODEL = "claude-sonnet-4-5-20250929";
const DEFAULT_SKILL = "supabase";
const AGENT_TIMEOUT = 30 * 60 * 1000; // 30 minutes

const model = process.env.EVAL_MODEL ?? DEFAULT_MODEL;
const skillName = process.env.EVAL_SKILL ?? DEFAULT_SKILL;
const scenarioFilter = process.env.EVAL_SCENARIO;
const isBaseline = process.env.EVAL_BASELINE === "true";
const skillEnabled = !isBaseline;

// Run-level timestamp shared across all scenarios in a single invocation
const runTimestamp = new Date()
  .toISOString()
  .replace(/[:.]/g, "-")
  .replace("Z", "");

// ---------------------------------------------------------------------------
// Discover scenarios
// ---------------------------------------------------------------------------

function findEvalsDir(): string {
  let dir = process.cwd();
  for (let i = 0; i < 10; i++) {
    const candidate = join(dir, "packages", "evals", "evals");
    if (existsSync(candidate)) return candidate;
    const parent = resolve(dir, "..");
    if (parent === dir) break;
    dir = parent;
  }
  throw new Error("Could not find packages/evals/evals/ directory");
}

function discoverScenarios(): EvalScenario[] {
  const evalsDir = findEvalsDir();
  const dirs = readdirSync(evalsDir, { withFileTypes: true }).filter(
    (d) => d.isDirectory() && existsSync(join(evalsDir, d.name, "PROMPT.md")),
  );

  return dirs.map((d) => ({
    id: d.name,
    name: d.name,
    tags: [],
  }));
}

// ---------------------------------------------------------------------------
// Scenario threshold
// ---------------------------------------------------------------------------

function getPassThreshold(scenarioId: string): number | null {
  const scenariosDir = join(findEvalsDir(), "..", "scenarios");
  const scenarioFile = join(scenariosDir, `${scenarioId}.md`);
  if (!existsSync(scenarioFile)) return null;

  const content = readFileSync(scenarioFile, "utf-8");
  const match = content.match(/\*\*pass_threshold:\*\*\s*(\d+)/);
  return match ? Number.parseInt(match[1], 10) : null;
}

// ---------------------------------------------------------------------------
// In-process assertion runner (replaces vitest subprocess)
// ---------------------------------------------------------------------------

async function runAssertions(
  assertions: EvalAssertion[],
): Promise<AssertionResult[]> {
  return Promise.all(
    assertions.map(async (a) => {
      try {
        let result: boolean;
        if (a.timeout) {
          const timeoutPromise = new Promise<never>((_, reject) =>
            setTimeout(
              () =>
                reject(new Error(`Assertion timed out after ${a.timeout}ms`)),
              a.timeout,
            ),
          );
          result = await Promise.race([
            Promise.resolve(a.check()),
            timeoutPromise,
          ]);
        } else {
          result = await Promise.resolve(a.check());
        }
        return { name: a.name, passed: Boolean(result) };
      } catch (e) {
        return { name: a.name, passed: false, error: String(e) };
      }
    }),
  );
}

// ---------------------------------------------------------------------------
// Run a single eval
// ---------------------------------------------------------------------------

async function runEval(
  scenario: EvalScenario,
  skillEnabled: boolean,
): Promise<{
  result: EvalRunResult;
  transcript?: TranscriptSummary;
  expectedReferenceFiles: string[];
}> {
  const evalsDir = findEvalsDir();
  const evalDir = join(evalsDir, scenario.id);
  const variant = skillEnabled ? "with-skill" : "baseline";

  console.log(`\n--- ${scenario.id} (${variant}) ---`);

  // Load assertions and expected reference files from EVAL.ts
  const evalFilePath = existsSync(join(evalDir, "EVAL.tsx"))
    ? join(evalDir, "EVAL.tsx")
    : join(evalDir, "EVAL.ts");

  const {
    assertions = [] as EvalAssertion[],
    expectedReferenceFiles = [] as string[],
  } = await import(evalFilePath).catch(() => ({
    assertions: [] as EvalAssertion[],
    expectedReferenceFiles: [] as string[],
  }));

  const passThreshold = getPassThreshold(scenario.id);
  const prompt = readFileSync(join(evalDir, "PROMPT.md"), "utf-8").trim();

  // 1. Create isolated workspace
  const { workspacePath, cleanup } = createWorkspace({ evalDir, skillEnabled });
  console.log(`  Workspace: ${workspacePath}`);

  try {
    // 2. Run the agent
    console.log(`  Running agent (${model})...`);
    const startedAt = Date.now();
    const agentResult = await runAgent({
      cwd: workspacePath,
      prompt,
      model,
      timeout: AGENT_TIMEOUT,
      skillEnabled,
      skillName: skillEnabled ? skillName : undefined,
    });
    console.log(
      `  Agent finished in ${(agentResult.duration / 1000).toFixed(1)}s`,
    );

    // 3. Run assertions in-process from the workspace directory so that
    //    eval-utils.ts helpers resolve paths relative to the workspace.
    console.log("  Running assertions...");
    const prevCwd = process.cwd();
    process.chdir(workspacePath);
    const assertionResults = await runAssertions(assertions).finally(() => {
      process.chdir(prevCwd);
    });
    const passedCount = assertionResults.filter((a) => a.passed).length;
    const totalCount = assertionResults.length;

    const passed = passThreshold
      ? totalCount > 0 && passedCount >= passThreshold
      : totalCount > 0 && passedCount === totalCount;

    const pct =
      totalCount > 0 ? ((passedCount / totalCount) * 100).toFixed(1) : "0.0";
    const thresholdInfo = passThreshold
      ? `, threshold: ${((passThreshold / totalCount) * 100).toFixed(0)}%`
      : "";
    console.log(
      `  Assertions: ${passedCount}/${totalCount} passed (${pct}%${thresholdInfo})`,
    );

    // 4. Collect modified files
    const filesModified = listModifiedFiles(workspacePath, evalDir);

    // 5. Build transcript summary
    const summary = buildTranscriptSummary(agentResult.events);

    // 6. Run scorers
    const skillScore = skillUsageScorer(summary, skillName);
    const refScore = referenceFilesUsageScorer(summary, expectedReferenceFiles);
    const assertScore = assertionsPassedScorer({
      testsPassed: passedCount,
      testsTotal: totalCount,
      status: passed ? "passed" : "failed",
    } as EvalRunResult);
    const finalScore = finalResultScorer({
      status: passed ? "passed" : "failed",
      testsPassed: passedCount,
      testsTotal: totalCount,
      passThreshold: passThreshold ?? undefined,
    } as EvalRunResult);

    const result: EvalRunResult = {
      scenario: scenario.id,
      agent: "claude-code",
      model,
      skillEnabled,
      status: passed ? "passed" : "failed",
      duration: agentResult.duration,
      agentOutput: agentResult.output,
      testsPassed: passedCount,
      testsTotal: totalCount,
      passThreshold: passThreshold ?? undefined,
      assertionResults,
      filesModified,
      toolCallCount: summary.toolCalls.length,
      costUsd: summary.totalCostUsd ?? undefined,
      prompt,
      startedAt,
      durationApiMs: summary.totalDurationApiMs,
      totalInputTokens: summary.totalInputTokens,
      totalOutputTokens: summary.totalOutputTokens,
      totalCacheReadTokens: summary.totalCacheReadTokens,
      totalCacheCreationTokens: summary.totalCacheCreationTokens,
      modelUsage: summary.modelUsage,
      toolErrorCount: summary.toolErrorCount,
      permissionDenialCount: summary.permissionDenialCount,
      loadedSkills: summary.skills,
      referenceFilesRead: summary.referenceFilesRead,
      scores: {
        skillUsage: skillScore.score,
        referenceFilesUsage: refScore.score,
        assertionsPassed: assertScore.score,
        finalResult: finalScore.score,
      },
    };

    // 7. Persist results
    const resultDir = createResultDir(runTimestamp, scenario.id, variant);
    result.resultsDir = resultDir;
    saveRunArtifacts({
      resultDir,
      rawTranscript: agentResult.rawTranscript,
      assertionResults,
      result,
      transcriptSummary: summary,
    });

    return { result, transcript: summary, expectedReferenceFiles };
  } catch (error) {
    const err = error as Error;
    return {
      result: {
        scenario: scenario.id,
        agent: "claude-code",
        model,
        skillEnabled,
        status: "error",
        duration: 0,
        agentOutput: "",
        testsPassed: 0,
        testsTotal: 0,
        filesModified: [],
        error: err.message,
      },
      expectedReferenceFiles: [],
    };
  } finally {
    cleanup();
  }
}

// ---------------------------------------------------------------------------
// Main
// ---------------------------------------------------------------------------

async function main() {
  preflight();

  console.log("Supabase Skills Evals");
  console.log(`Model: ${model}`);
  console.log(`Mode: ${isBaseline ? "baseline (no skills)" : "with skills"}`);

  let scenarios = discoverScenarios();

  if (scenarioFilter) {
    scenarios = scenarios.filter((s) => s.id === scenarioFilter);
    if (scenarios.length === 0) {
      console.error(`Scenario not found: ${scenarioFilter}`);
      process.exit(1);
    }
  }

  console.log(`Scenarios: ${scenarios.map((s) => s.id).join(", ")}`);

  // Start the shared Supabase instance once for all scenarios.
  startSupabase();
  const keys = getKeys();

  // Inject keys into process.env so assertions can connect to the real DB.
  process.env.SUPABASE_URL = keys.apiUrl;
  process.env.SUPABASE_ANON_KEY = keys.anonKey;
  process.env.SUPABASE_SERVICE_ROLE_KEY = keys.serviceRoleKey;
  process.env.SUPABASE_DB_URL = keys.dbUrl;

  const results: EvalRunResult[] = [];
  const transcripts = new Map<string, TranscriptSummary>();
  const expectedRefFiles = new Map<string, string[]>();

  try {
    for (const scenario of scenarios) {
      // Reset the database before each scenario for a clean slate.
      console.log(`\n  Resetting DB for ${scenario.id}...`);
      resetDB(keys.dbUrl);

      const { result, transcript, expectedReferenceFiles } = await runEval(
        scenario,
        skillEnabled,
      );
      results.push(result);
      if (transcript) {
        transcripts.set(result.scenario, transcript);
      }
      expectedRefFiles.set(result.scenario, expectedReferenceFiles);
    }
  } finally {
    stopSupabase();
  }

  // Use the results dir from the first result (all share the same timestamp)
  const resultsDir = results.find((r) => r.resultsDir)?.resultsDir;
  printSummary(results, resultsDir);

  console.log("\nUploading to Braintrust...");
  await seedBraintrustDataset(results, expectedRefFiles);
  await uploadToBraintrust(results, {
    model,
    skillEnabled,
    runTimestamp,
    transcripts,
    expectedRefFiles,
  });
}

main().catch((err) => {
  console.error("Fatal error:", err);
  process.exit(1);
});
|
||||
@@ -1,145 +0,0 @@
import { spawn } from "node:child_process";
import { resolveClaudeBin } from "./preflight.js";
import {
  extractFinalOutput,
  parseStreamJsonOutput,
  type TranscriptEvent,
} from "./transcript.js";

export interface AgentRunResult {
  /** Extracted final text output (backward-compatible). */
  output: string;
  duration: number;
  /** Raw NDJSON transcript string from stream-json. */
  rawTranscript: string;
  /** Parsed transcript events. */
  events: TranscriptEvent[];
}

/**
 * Invoke Claude Code in print mode as a subprocess.
 *
 * Uses --output-format stream-json to capture structured NDJSON events
 * including tool calls, results, and reasoning steps.
 *
 * The agent operates in the workspace directory and can read/write files,
 * and has access to the local Supabase MCP server so it can apply migrations
 * and query the real database. --strict-mcp-config ensures only the local
 * Supabase instance is reachable — no host MCP servers leak in.
 *
 * --setting-sources project,local prevents skills from the user's global
 * ~/.agents/skills/ from leaking into the eval environment.
 *
 * When skillEnabled, --agents injects the target skill directly into the
 * agent's context, guaranteeing it is present (not just discoverable).
 */
export async function runAgent(opts: {
  cwd: string;
  prompt: string;
  model: string;
  timeout: number;
  skillEnabled: boolean;
  /** Skill name to inject via --agents (e.g. "supabase"). Used when skillEnabled. */
  skillName?: string;
}): Promise<AgentRunResult> {
  const start = Date.now();

  // Point the agent's MCP config at the shared local Supabase instance.
  // --strict-mcp-config ensures host .mcp.json is ignored entirely.
  const supabaseUrl = process.env.SUPABASE_URL ?? "http://127.0.0.1:54321";
  const mcpConfig = JSON.stringify({
    mcpServers: {
      supabase: {
        type: "http",
        url: `${supabaseUrl}/mcp`,
      },
    },
  });

  const args = [
    "-p", // Print mode (non-interactive)
    "--verbose",
    "--output-format",
    "stream-json",
    "--model",
    opts.model,
    "--no-session-persistence",
    "--dangerously-skip-permissions",
    "--tools",
    "Edit,Write,Bash,Read,Glob,Grep",
    "--mcp-config",
    mcpConfig,
    "--strict-mcp-config",
    // Prevent skills from the user's global ~/.agents/skills/ from leaking
    // into the eval environment. Only workspace (project) and local sources
    // are loaded, so the eval sees only what was explicitly installed.
    "--setting-sources",
    "project,local",
  ];

  if (opts.skillEnabled && opts.skillName) {
    // Inject the target skill directly into the agent context via --agents.
    // This guarantees the skill is embedded in the subagent's context at
    // startup (not just available as a slash command).
    const agentsDef = JSON.stringify({
      main: {
        description: `Supabase developer agent with ${opts.skillName} skill`,
        skills: [opts.skillName],
      },
    });
    args.push("--agents", agentsDef);
  } else if (!opts.skillEnabled) {
    // Baseline runs: disable all skills so the agent relies on innate knowledge
    args.push("--disable-slash-commands");
  }

  const env = { ...process.env };
  // Remove all Claude-related env vars to avoid nested-session detection
  for (const key of Object.keys(env)) {
    if (key === "CLAUDECODE" || key.startsWith("CLAUDE_")) {
      delete env[key];
    }
  }

  const claudeBin = resolveClaudeBin();

  return new Promise<AgentRunResult>((resolve) => {
    const child = spawn(claudeBin, args, {
      cwd: opts.cwd,
      env,
      stdio: ["pipe", "pipe", "pipe"],
    });

    // Pipe prompt via stdin and close — this is the standard way to
    // pass multi-line prompts to `claude -p`.
    child.stdin.write(opts.prompt);
    child.stdin.end();

    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (d: Buffer) => {
      stdout += d.toString();
    });
    child.stderr.on("data", (d: Buffer) => {
      stderr += d.toString();
    });

    const timer = setTimeout(() => {
      child.kill();
    }, opts.timeout);

    child.on("close", () => {
      clearTimeout(timer);
      const rawTranscript = stdout || stderr;
      const events = parseStreamJsonOutput(rawTranscript);
      const output = extractFinalOutput(events) || rawTranscript;

      resolve({
        output,
        duration: Date.now() - start,
        rawTranscript,
        events,
      });
    });
  });
}
@@ -1,295 +0,0 @@
import assert from "node:assert";
import { init, initDataset, initLogger, type Logger } from "braintrust";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

/**
 * Initialize a Braintrust project logger for real-time per-scenario logging.
 * Call this once at startup and pass the logger to logScenarioToLogger().
 */
export function initBraintrustLogger(): Logger<true> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");
  return initLogger({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    asyncFlush: true,
  });
}

/**
 * Log a single scenario result to the Braintrust project logger in real-time.
 * This runs alongside the experiment upload, giving immediate visibility in
 * the project log as each scenario completes.
 */
export function logScenarioToLogger(
  logger: Logger<true>,
  r: EvalRunResult,
  transcript?: TranscriptSummary,
): void {
  const scores: Record<string, number> = {
    skill_usage: r.scores?.skillUsage ?? 0,
    reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
    assertions_passed: r.scores?.assertionsPassed ?? 0,
    final_result: r.scores?.finalResult ?? 0,
  };

  const metadata: Record<string, unknown> = {
    agent: r.agent,
    model: r.model,
    skillEnabled: r.skillEnabled,
    testsPassed: r.testsPassed,
    testsTotal: r.testsTotal,
    toolCallCount: r.toolCallCount ?? 0,
    contextWindowUsed:
      (r.totalInputTokens ?? 0) +
      (r.totalCacheReadTokens ?? 0) +
      (r.totalCacheCreationTokens ?? 0),
    totalOutputTokens: r.totalOutputTokens,
    modelUsage: r.modelUsage,
    toolErrorCount: r.toolErrorCount,
    permissionDenialCount: r.permissionDenialCount,
    loadedSkills: r.loadedSkills,
    referenceFilesRead: r.referenceFilesRead,
    ...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
    ...(r.error ? { error: r.error } : {}),
  };

  const spanOptions = r.startedAt
    ? { name: r.scenario, startTime: r.startedAt / 1000 }
    : { name: r.scenario };

  if (transcript && transcript.toolCalls.length > 0) {
    logger.traced((span) => {
      span.log({
        input: {
          scenario: r.scenario,
          prompt: r.prompt ?? "",
          skillEnabled: r.skillEnabled,
        },
        output: {
          status: r.status,
          agentOutput: r.agentOutput,
          filesModified: r.filesModified,
          assertionResults: r.assertionResults,
        },
        expected: { testsTotal: r.testsTotal },
        scores,
        metadata,
      });

      for (const tc of transcript.toolCalls) {
        span.traced(
          (childSpan) => {
            childSpan.log({
              input: { tool: tc.tool, args: tc.input },
              output: {
                preview: tc.outputPreview,
                isError: tc.isError,
                ...(tc.stderr ? { stderr: tc.stderr } : {}),
              },
              metadata: { toolUseId: tc.toolUseId },
            });
          },
          { name: `tool:${tc.tool}` },
        );
      }
    }, spanOptions);
  } else {
    logger.traced((span) => {
      span.log({
        input: {
          scenario: r.scenario,
          prompt: r.prompt ?? "",
          skillEnabled: r.skillEnabled,
        },
        output: {
          status: r.status,
          agentOutput: r.agentOutput,
          filesModified: r.filesModified,
          assertionResults: r.assertionResults,
        },
        expected: { testsTotal: r.testsTotal },
        scores,
        metadata,
      });
    }, spanOptions);
  }
}

/**
 * Seed a Braintrust dataset with one row per scenario.
 *
 * Uses scenario.id as the stable row ID so re-seeding is idempotent.
 * Each row stores the prompt and expected assertions/reference files,
 * giving Braintrust a stable baseline to track per-scenario score trends
 * across experiment runs.
 */
export async function seedBraintrustDataset(
  results: EvalRunResult[],
  expectedRefFiles: Map<string, string[]>,
): Promise<void> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const dataset = initDataset({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    dataset: "supabase-skill-scenarios",
  });

  for (const r of results) {
    dataset.insert({
      id: r.scenario,
      input: {
        scenario: r.scenario,
        prompt: r.prompt ?? "",
      },
      expected: {
        testsTotal: r.testsTotal,
        passThreshold: r.passThreshold ?? 1.0,
        expectedReferenceFiles: expectedRefFiles.get(r.scenario) ?? [],
      },
      metadata: { scenario: r.scenario },
    });
  }

  await dataset.flush();
  console.log("Braintrust dataset seeded: supabase-skill-scenarios");
}

/**
 * Upload eval results to Braintrust as an experiment.
 *
 * Each EvalRunResult becomes a row in the experiment with:
 * - input: scenario ID, prompt content, skillEnabled flag
 * - output: status, agent output, files modified, assertion results
 * - expected: total tests, pass threshold
 * - scores: skill_usage, reference_files_usage, assertions_passed, final_result
 * - metadata: agent, model, skillEnabled, test counts, tool calls, context window, output tokens, model usage, errors, cost
 * - spans: one child span per agent tool call (when transcript available)
 * - datasetRecordId: links this row to the dataset row for per-scenario tracking
 */
export async function uploadToBraintrust(
  results: EvalRunResult[],
  opts: {
    model: string;
    skillEnabled: boolean;
    runTimestamp: string;
    transcripts: Map<string, TranscriptSummary>;
    expectedRefFiles: Map<string, string[]>;
  },
): Promise<void> {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const variant = opts.skillEnabled ? "skill" : "baseline";
  const experiment = await init({
    projectId: process.env.BRAINTRUST_PROJECT_ID,
    experiment: `${opts.model}-${variant}-${opts.runTimestamp}`,
    baseExperiment: process.env.BRAINTRUST_BASE_EXPERIMENT ?? undefined,
    metadata: {
      model: opts.model,
      skillEnabled: opts.skillEnabled,
      runTimestamp: opts.runTimestamp,
      scenarioCount: results.length,
    },
  });

  for (const r of results) {
    const transcript = opts.transcripts.get(r.scenario);

    const scores: Record<string, number> = {
      skill_usage: r.scores?.skillUsage ?? 0,
      reference_files_usage: r.scores?.referenceFilesUsage ?? 0,
      assertions_passed: r.scores?.assertionsPassed ?? 0,
      final_result: r.scores?.finalResult ?? 0,
    };

    const input = {
      scenario: r.scenario,
      prompt: r.prompt ?? "",
      skillEnabled: r.skillEnabled,
    };

    const output = {
      status: r.status,
      agentOutput: r.agentOutput,
      filesModified: r.filesModified,
      assertionResults: r.assertionResults,
    };

    const expected = {
      testsTotal: r.testsTotal,
      passThreshold: 1.0,
    };

    const metadata: Record<string, unknown> = {
      agent: r.agent,
      model: r.model,
      skillEnabled: r.skillEnabled,
      testsPassed: r.testsPassed,
      testsTotal: r.testsTotal,
      toolCallCount: r.toolCallCount ?? 0,
      contextWindowUsed:
        (r.totalInputTokens ?? 0) +
        (r.totalCacheReadTokens ?? 0) +
        (r.totalCacheCreationTokens ?? 0),
      totalOutputTokens: r.totalOutputTokens,
      modelUsage: r.modelUsage,
      toolErrorCount: r.toolErrorCount,
      permissionDenialCount: r.permissionDenialCount,
      loadedSkills: r.loadedSkills,
      referenceFilesRead: r.referenceFilesRead,
      ...(r.costUsd != null ? { costUsd: r.costUsd } : {}),
      ...(r.error ? { error: r.error } : {}),
    };

    const spanOptions = r.startedAt
      ? { name: r.scenario, startTime: r.startedAt / 1000 }
      : { name: r.scenario };

    if (transcript && transcript.toolCalls.length > 0) {
      experiment.traced((span) => {
        span.log({
          input,
          output,
          expected,
          scores,
          metadata,
          datasetRecordId: r.scenario,
        });

        for (const tc of transcript.toolCalls) {
          span.traced(
            (childSpan) => {
              childSpan.log({
                input: { tool: tc.tool, args: tc.input },
                output: {
                  preview: tc.outputPreview,
                  isError: tc.isError,
                  ...(tc.stderr ? { stderr: tc.stderr } : {}),
                },
                metadata: { toolUseId: tc.toolUseId },
              });
            },
            { name: `tool:${tc.tool}` },
          );
        }
      }, spanOptions);
    } else {
      experiment.traced((span) => {
        span.log({
          input,
          output,
          expected,
          scores,
          metadata,
          datasetRecordId: r.scenario,
        });
      }, spanOptions);
    }
  }

  const summary = await experiment.summarize();
  console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
  await experiment.close();
}
@@ -1,61 +0,0 @@
import { mkdirSync, writeFileSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";
import type { AssertionResult } from "../eval-types.js";
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/** Resolve the base directory for storing results.
 * Supports EVAL_RESULTS_DIR override for Docker volume mounts. */
function resultsBase(): string {
  if (process.env.EVAL_RESULTS_DIR) {
    return process.env.EVAL_RESULTS_DIR;
  }
  // Default: packages/evals/results (__dirname is packages/evals/src/runner)
  return join(__dirname, "..", "..", "results");
}

/** Create the results directory for a single scenario run. Returns the path. */
export function createResultDir(
  runTimestamp: string,
  scenarioId: string,
  variant: "with-skill" | "baseline",
): string {
  const dir = join(resultsBase(), runTimestamp, scenarioId, variant);
  mkdirSync(dir, { recursive: true });
  return dir;
}

/** Save all artifacts for a single eval run. */
export function saveRunArtifacts(opts: {
  resultDir: string;
  rawTranscript: string;
  assertionResults: AssertionResult[];
  result: EvalRunResult;
  transcriptSummary: TranscriptSummary;
}): void {
  writeFileSync(
    join(opts.resultDir, "transcript.jsonl"),
    opts.rawTranscript,
    "utf-8",
  );

  writeFileSync(
    join(opts.resultDir, "assertions.json"),
    JSON.stringify(opts.assertionResults, null, 2),
    "utf-8",
  );

  writeFileSync(
    join(opts.resultDir, "result.json"),
    JSON.stringify(
      { ...opts.result, transcript: opts.transcriptSummary },
      null,
      2,
    ),
    "utf-8",
  );
}
@@ -1,126 +0,0 @@
import { execFileSync } from "node:child_process";
import { accessSync, constants, existsSync } from "node:fs";
import { dirname, join } from "node:path";
import { fileURLToPath } from "node:url";

/** Detect if we're running inside the eval Docker container. */
export function isRunningInDocker(): boolean {
  if (process.env.IN_DOCKER === "true") return true;
  try {
    accessSync("/.dockerenv", constants.F_OK);
    return true;
  } catch {
    return false;
  }
}

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/**
 * Resolve the `claude` binary path.
 *
 * Looks in the following order:
 * 1. Local node_modules/.bin/claude (installed via @anthropic-ai/claude-code)
 * 2. Global `claude` on PATH
 *
 * Throws with an actionable message when neither is found.
 */
export function resolveClaudeBin(): string {
  // packages/evals/node_modules/.bin/claude
  const localBin = join(
    __dirname,
    "..",
    "..",
    "node_modules",
    ".bin",
    "claude",
  );
  if (existsSync(localBin)) {
    return localBin;
  }

  // Fall back to PATH
  try {
    execFileSync("claude", ["--version"], {
      stdio: "ignore",
      timeout: 10_000,
    });
    return "claude";
  } catch {
    throw new Error(
      [
        "claude CLI not found.",
        "",
        "Install it in one of these ways:",
        "  npm install (uses @anthropic-ai/claude-code from package.json)",
        "  npm i -g @anthropic-ai/claude-code",
        "",
        "Ensure ANTHROPIC_API_KEY is set in the environment.",
      ].join("\n"),
    );
  }
}

/**
 * Verify the host environment has everything needed before spending
 * API credits on an eval run.
 *
 * Checks: Node >= 20, Docker running, supabase CLI available, claude CLI available, API key set.
 */
export function preflight(): void {
  const errors: string[] = [];

  // Node.js >= 20
  const [major] = process.versions.node.split(".").map(Number);
  if (major < 20) {
    errors.push(`Node.js >= 20 required (found ${process.versions.node})`);
  }

  // Docker daemon must be running — needed by the supabase CLI to manage containers.
  // Required whether running locally or inside the eval container (socket-mounted).
  try {
    execFileSync("docker", ["info"], { stdio: "ignore", timeout: 10_000 });
  } catch {
    errors.push(
      isRunningInDocker()
        ? "Docker daemon not reachable inside container. Mount the socket: -v /var/run/docker.sock:/var/run/docker.sock"
        : "Docker is not running (required by supabase CLI)",
    );
  }

  // Supabase CLI available
  try {
    execFileSync("supabase", ["--version"], {
      stdio: "ignore",
      timeout: 10_000,
    });
  } catch {
    errors.push(
      "supabase CLI not found. Install it: https://supabase.com/docs/guides/cli/getting-started",
    );
  }

  // Claude CLI available
  try {
    resolveClaudeBin();
  } catch (err) {
    errors.push((err as Error).message);
  }

  // API key
  if (!process.env.ANTHROPIC_API_KEY) {
    errors.push(
      "ANTHROPIC_API_KEY is not set. Claude Code requires this for authentication.",
    );
  }

  if (errors.length > 0) {
    console.error("Preflight checks failed:\n");
    for (const e of errors) {
      console.error(`  - ${e}`);
    }
    console.error("");
    process.exit(1);
  }
}
@@ -1,84 +0,0 @@
import { readdirSync, statSync } from "node:fs";
import { join } from "node:path";
import type { EvalRunResult } from "../types.js";

/**
 * List files created or modified by the agent in the workspace.
 * Compares against the original eval directory to find new files.
 */
export function listModifiedFiles(
  workspacePath: string,
  originalEvalDir: string,
): string[] {
  const modified: string[] = [];

  function walk(dir: string, prefix: string) {
    const entries = readdirSync(dir, { withFileTypes: true });
    for (const entry of entries) {
      if (
        entry.name === "node_modules" ||
        entry.name === ".agents" ||
        entry.name === ".claude" ||
        entry.name === "EVAL.ts" ||
        entry.name === "EVAL.tsx"
      )
        continue;

      const relPath = prefix ? `${prefix}/${entry.name}` : entry.name;
      const fullPath = join(dir, entry.name);

      if (entry.isDirectory()) {
        walk(fullPath, relPath);
      } else {
        // Check if file is new (not in original eval dir)
        const originalPath = join(originalEvalDir, relPath);
        try {
          statSync(originalPath);
        } catch {
          // File doesn't exist in original — it was created by the agent
          modified.push(relPath);
        }
      }
    }
  }

  walk(workspacePath, "");
  return modified;
}

/** Print a summary table of eval results. */
export function printSummary(
  results: EvalRunResult[],
  resultsDir?: string,
): void {
  console.log("\n=== Eval Results ===\n");

  for (const r of results) {
    const icon = r.status === "passed" ? "PASS" : "FAIL";
    const skill = r.skillEnabled ? "with-skill" : "baseline";
    const pct =
      r.testsTotal > 0
        ? ((r.testsPassed / r.testsTotal) * 100).toFixed(1)
        : "0.0";
    const thresholdInfo =
      r.passThreshold && r.testsTotal > 0
        ? `, threshold: ${((r.passThreshold / r.testsTotal) * 100).toFixed(0)}%`
        : "";
    console.log(
      `[${icon}] ${r.scenario} | ${r.model} | ${skill} | ${(r.duration / 1000).toFixed(1)}s | ${pct}% (${r.testsPassed}/${r.testsTotal}${thresholdInfo})`,
    );
    if (r.filesModified.length > 0) {
      console.log(`  Files: ${r.filesModified.join(", ")}`);
    }
    if (r.status === "error" && r.error) {
      console.log(`  Error: ${r.error}`);
    }
  }

  const passed = results.filter((r) => r.status === "passed").length;
  console.log(`\nTotal: ${passed}/${results.length} passed`);

  if (resultsDir) {
    console.log(`\nResults saved to: ${resultsDir}`);
  }
}
@@ -1,74 +0,0 @@
import {
  cpSync,
  existsSync,
  mkdirSync,
  mkdtempSync,
  readdirSync,
  rmSync,
  writeFileSync,
} from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { EVAL_PROJECT_DIR } from "./supabase-setup.js";

/**
 * Create an isolated workspace for an eval run.
 *
 * 1. Copy the eval directory to a temp folder (excluding EVAL.ts/EVAL.tsx)
 * 2. Seed with the eval project's supabase/config.toml
 *
 * Skills are injected via the --agents flag in agent.ts (not installed into
 * the workspace here). Combined with --setting-sources project,local, this
 * prevents host ~/.agents/skills/ from leaking into the eval environment.
 *
 * Returns the path to the workspace and a cleanup function.
 */
export function createWorkspace(opts: {
  evalDir: string;
  skillEnabled: boolean;
}): { workspacePath: string; cleanup: () => void } {
  const workspacePath = mkdtempSync(join(tmpdir(), "supabase-eval-"));

  // Copy eval directory, excluding EVAL.ts/EVAL.tsx (hidden from agent)
  const entries = readdirSync(opts.evalDir, { withFileTypes: true });
  for (const entry of entries) {
    if (entry.name === "EVAL.ts" || entry.name === "EVAL.tsx") continue;
    const src = join(opts.evalDir, entry.name);
    const dest = join(workspacePath, entry.name);
    cpSync(src, dest, { recursive: true });
  }

  // Add .mcp.json so the agent connects to the local Supabase MCP server
  writeFileSync(
    join(workspacePath, ".mcp.json"),
    JSON.stringify(
      {
        mcpServers: {
          "local-supabase": {
            type: "http",
            url: "http://localhost:54321/mcp",
          },
        },
      },
      null,
      "\t",
    ),
  );

  // Seed the workspace with the eval project's supabase/config.toml so the
  // agent can run `supabase db push` against the shared local instance without
  // needing to run `supabase init` or `supabase start` first.
  const projectConfigSrc = join(EVAL_PROJECT_DIR, "supabase", "config.toml");
  if (existsSync(projectConfigSrc)) {
    const destSupabaseDir = join(workspacePath, "supabase");
    mkdirSync(join(destSupabaseDir, "migrations"), { recursive: true });
    cpSync(projectConfigSrc, join(destSupabaseDir, "config.toml"));
  }

  return {
    workspacePath,
    cleanup: () => {
      rmSync(workspacePath, { recursive: true, force: true });
    },
  };
}
@@ -1,94 +0,0 @@
import type { EvalRunResult } from "../types.js";
import type { TranscriptSummary } from "./transcript.js";

export interface ScoreResult {
  name: string;
  /** 0.0 – 1.0 */
  score: number;
  metadata?: Record<string, unknown>;
}

/**
 * skillUsageScorer — 1 if the target skill was in the agent's context, 0 otherwise.
 *
 * Detected via the `skills` array in the system init event of the NDJSON transcript.
 * Combined with `--setting-sources project,local` in agent.ts, this array is clean
 * (no host skill leakage), so its presence is a reliable signal.
 */
export function skillUsageScorer(
  transcript: TranscriptSummary,
  skillName: string,
): ScoreResult {
  const loaded = transcript.skills.includes(skillName);
  return {
    name: "skill_usage",
    score: loaded ? 1 : 0,
    metadata: {
      loadedSkills: transcript.skills,
      targetSkill: skillName,
    },
  };
}

/**
 * referenceFilesUsageScorer — fraction of expected reference files actually read.
 *
 * Detected via Read tool calls whose file_path matches "/.agents/skills/*\/references/".
 * The expectedReferenceFiles list is declared in each EVAL.ts and should match the
 * "Skill References Exercised" table in the corresponding scenarios/*.md file.
 */
export function referenceFilesUsageScorer(
  transcript: TranscriptSummary,
  expectedReferenceFiles: string[],
): ScoreResult {
  if (expectedReferenceFiles.length === 0) {
    return {
      name: "reference_files_usage",
      score: 1,
      metadata: { skipped: true },
    };
  }
  const read = transcript.referenceFilesRead;
  const hits = expectedReferenceFiles.filter((f) => read.includes(f)).length;
  return {
    name: "reference_files_usage",
    score: hits / expectedReferenceFiles.length,
    metadata: {
      expected: expectedReferenceFiles,
      read,
      hits,
      total: expectedReferenceFiles.length,
    },
  };
}

/**
 * assertionsPassedScorer — ratio of assertions passed vs total.
 */
export function assertionsPassedScorer(result: EvalRunResult): ScoreResult {
  const score =
    result.testsTotal > 0 ? result.testsPassed / result.testsTotal : 0;
  return {
    name: "assertions_passed",
    score,
    metadata: { passed: result.testsPassed, total: result.testsTotal },
  };
}

/**
 * finalResultScorer — 1 if the agent met the pass threshold, 0 otherwise.
 *
 * A result is "passed" when assertionsPassed >= passThreshold (set per scenario
 * in scenarios/*.md). This is the binary outcome used for Braintrust comparisons.
 */
export function finalResultScorer(result: EvalRunResult): ScoreResult {
  return {
    name: "final_result",
    score: result.status === "passed" ? 1 : 0,
    metadata: {
      testsPassed: result.testsPassed,
      testsTotal: result.testsTotal,
      passThreshold: result.passThreshold,
    },
  };
}
@@ -1,108 +0,0 @@
import { execFileSync } from "node:child_process";
import { dirname, resolve } from "node:path";
import { fileURLToPath } from "node:url";

const __filename = fileURLToPath(import.meta.url);
const __dirname = dirname(__filename);

/**
 * Directory that contains the eval Supabase project (supabase/config.toml).
 * The runner starts the shared Supabase instance from here.
 * Agent workspaces get a copy of supabase/config.toml so they can
 * connect to the same running instance via `supabase db push`.
 */
export const EVAL_PROJECT_DIR = resolve(__dirname, "..", "..", "project");

export interface SupabaseKeys {
  apiUrl: string;
  dbUrl: string;
  anonKey: string;
  serviceRoleKey: string;
}

/**
 * Start the local Supabase stack for the eval project.
 * Idempotent: if already running, the CLI prints a message and exits 0.
 */
export function startSupabase(): void {
  console.log(" Starting Supabase...");
  execFileSync("supabase", ["start", "--exclude", "studio,imgproxy,mailpit"], {
    cwd: EVAL_PROJECT_DIR,
    stdio: "inherit",
    timeout: 5 * 60 * 1000, // 5 min for first image pull
  });
}

// SQL that clears all user-created objects and migration history between scenarios.
// Avoids `supabase db reset` which restarts containers and triggers flaky health checks.
const RESET_SQL = `
-- Drop and recreate public schema (removes all user tables/views/functions)
DROP SCHEMA public CASCADE;
CREATE SCHEMA public;
GRANT ALL ON SCHEMA public TO postgres;
GRANT ALL ON SCHEMA public TO anon;
GRANT ALL ON SCHEMA public TO authenticated;
GRANT ALL ON SCHEMA public TO service_role;

-- Clear migration history so the next agent's db push starts from a clean slate
DROP SCHEMA IF EXISTS supabase_migrations CASCADE;

-- Notify PostgREST to reload its schema cache
NOTIFY pgrst, 'reload schema';
`.trim();

/**
 * Reset the database to a clean state between scenarios.
 *
 * Uses direct SQL via psql instead of `supabase db reset` to avoid the
 * container-restart cycle and its flaky health checks. This drops the
 * public schema (all user tables) and clears the migration history so
 * `supabase db push` in agent workspaces always starts fresh.
 */
export function resetDB(dbUrl: string): void {
  execFileSync("psql", [dbUrl, "--no-psqlrc", "-c", RESET_SQL], {
    stdio: "inherit",
    timeout: 30 * 1000,
  });
}

/**
 * Stop all Supabase containers for the eval project.
 * Called once after all scenarios complete.
 */
export function stopSupabase(): void {
  console.log(" Stopping Supabase...");
  execFileSync("supabase", ["stop", "--no-backup"], {
    cwd: EVAL_PROJECT_DIR,
    stdio: "inherit",
    timeout: 60 * 1000,
  });
}

/**
 * Read the running instance's API URL and JWT keys.
 * Returns values that the runner injects into process.env so EVAL.ts
 * tests can connect to the real database.
 */
export function getKeys(): SupabaseKeys {
  const raw = execFileSync("supabase", ["status", "--output", "json"], {
    cwd: EVAL_PROJECT_DIR,
    timeout: 30 * 1000,
  }).toString();

  const status = JSON.parse(raw) as Record<string, string>;

  const apiUrl = status.API_URL ?? "http://127.0.0.1:54321";
  const dbUrl =
    status.DB_URL ?? "postgresql://postgres:postgres@127.0.0.1:54322/postgres";
  const anonKey = status.ANON_KEY ?? "";
  const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";

  if (!anonKey || !serviceRoleKey) {
    throw new Error(
      `supabase status returned missing keys. Raw output:\n${raw}`,
    );
  }

  return { apiUrl, dbUrl, anonKey, serviceRoleKey };
}
||||
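The shape getKeys parses can be sketched against a synthetic `supabase status --output json` payload. This is a self-contained sketch: the key names (API_URL, DB_URL, ANON_KEY, SERVICE_ROLE_KEY) are the ones getKeys reads above, but the values here are made up.

```typescript
// Synthetic payload mimicking `supabase status --output json`.
// Values are hypothetical; only the key names match what getKeys expects.
const raw = JSON.stringify({
  API_URL: "http://127.0.0.1:54321",
  DB_URL: "postgresql://postgres:postgres@127.0.0.1:54322/postgres",
  ANON_KEY: "eyJ...anon",
  SERVICE_ROLE_KEY: "eyJ...service",
});

// Same defensive parsing as getKeys above.
const status = JSON.parse(raw) as Record<string, string>;
const anonKey = status.ANON_KEY ?? "";
const serviceRoleKey = status.SERVICE_ROLE_KEY ?? "";

if (!anonKey || !serviceRoleKey) {
  throw new Error("supabase status returned missing keys");
}
console.log(anonKey); // "eyJ...anon"
```

The `??` fallbacks mean a partially populated status payload still yields usable defaults for URL fields, while missing keys fail loudly.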
@@ -1,301 +0,0 @@
import { basename } from "node:path";

export interface TranscriptEvent {
  type: string;
  [key: string]: unknown;
}

export interface ToolCallSummary {
  tool: string;
  toolUseId: string;
  input: Record<string, unknown>;
  /** First ~200 chars of output for quick scanning */
  outputPreview: string;
  /** Whether the tool call returned an error */
  isError: boolean;
  /** stderr output for Bash tool calls */
  stderr: string;
}

export interface ModelUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadInputTokens: number;
  cacheCreationInputTokens: number;
  costUSD: number;
}

export interface TranscriptSummary {
  totalTurns: number;
  totalDurationMs: number;
  /** API-only latency (excludes local processing overhead) */
  totalDurationApiMs: number;
  totalCostUsd: number | null;
  model: string | null;
  toolCalls: ToolCallSummary[];
  finalOutput: string;
  /** Skills listed in the system init event (loaded into agent context) */
  skills: string[];
  /** Basenames of reference files the agent read via the Read tool */
  referenceFilesRead: string[];
  /** Per-model token usage and cost breakdown */
  modelUsage: Record<string, ModelUsage>;
  totalInputTokens: number;
  totalOutputTokens: number;
  totalCacheReadTokens: number;
  totalCacheCreationTokens: number;
  /** Count of tool calls that returned is_error === true */
  toolErrorCount: number;
  /** Whether the overall session ended in an error */
  isError: boolean;
  /** Count of permission_denials in the result event */
  permissionDenialCount: number;
}

/** Parse a single NDJSON line. Returns null on empty or invalid input. */
export function parseStreamJsonLine(line: string): TranscriptEvent | null {
  const trimmed = line.trim();
  if (!trimmed) return null;
  try {
    return JSON.parse(trimmed) as TranscriptEvent;
  } catch {
    return null;
  }
}

/** Parse raw NDJSON stdout into an array of events. */
export function parseStreamJsonOutput(raw: string): TranscriptEvent[] {
  const events: TranscriptEvent[] = [];
  for (const line of raw.split("\n")) {
    const event = parseStreamJsonLine(line);
    if (event) events.push(event);
  }
  return events;
}

/** Extract the final text output from parsed events (for backward compat). */
export function extractFinalOutput(events: TranscriptEvent[]): string {
  // Prefer the result event
  for (const event of events) {
    if (event.type === "result") {
      const result = (event as Record<string, unknown>).result;
      if (typeof result === "string") return result;
    }
  }

  // Fallback: concatenate text blocks from the last assistant message
  for (let i = events.length - 1; i >= 0; i--) {
    const event = events[i];
    if (event.type === "assistant") {
      const msg = (event as Record<string, unknown>).message as
        | Record<string, unknown>
        | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        const texts = content
          .filter(
            (b: Record<string, unknown>) =>
              b.type === "text" && typeof b.text === "string",
          )
          .map((b: Record<string, unknown>) => b.text as string);
        if (texts.length > 0) return texts.join("\n");
      }
    }
  }

  return "";
}

/** Return true if a file path points to a skill reference file. */
function isReferenceFilePath(filePath: string): boolean {
  return (
    filePath.includes("/.agents/skills/") && filePath.includes("/references/")
  );
}

/** Walk parsed events to build a transcript summary. */
export function buildTranscriptSummary(
  events: TranscriptEvent[],
): TranscriptSummary {
  const toolCalls: ToolCallSummary[] = [];
  let finalOutput = "";
  let totalDurationMs = 0;
  let totalDurationApiMs = 0;
  let totalCostUsd: number | null = null;
  let model: string | null = null;
  let totalTurns = 0;
  let skills: string[] = [];
  const referenceFilesRead: string[] = [];
  let modelUsage: Record<string, ModelUsage> = {};
  let totalInputTokens = 0;
  let totalOutputTokens = 0;
  let totalCacheReadTokens = 0;
  let totalCacheCreationTokens = 0;
  let toolErrorCount = 0;
  let isError = false;
  let permissionDenialCount = 0;

  for (const event of events) {
    const e = event as Record<string, unknown>;

    // System init: extract model and loaded skills
    if (e.type === "system" && e.subtype === "init") {
      model = typeof e.model === "string" ? e.model : null;
      if (Array.isArray(e.skills)) {
        skills = e.skills.filter((s): s is string => typeof s === "string");
      }
    }

    // Assistant messages: extract tool_use blocks
    if (e.type === "assistant") {
      const msg = e.message as Record<string, unknown> | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "tool_use") {
            const toolCall: ToolCallSummary = {
              tool: block.name ?? "unknown",
              toolUseId: block.id ?? "",
              input: block.input ?? {},
              outputPreview: "",
              isError: false,
              stderr: "",
            };
            toolCalls.push(toolCall);

            // Track reference file reads
            if (
              block.name === "Read" &&
              typeof block.input?.file_path === "string" &&
              isReferenceFilePath(block.input.file_path)
            ) {
              const base = basename(block.input.file_path);
              if (!referenceFilesRead.includes(base)) {
                referenceFilesRead.push(base);
              }
            }
          }
        }
      }
    }

    // User messages: extract tool_result blocks and match to tool calls
    if (e.type === "user") {
      const msg = e.message as Record<string, unknown> | undefined;
      const content = msg?.content;
      if (Array.isArray(content)) {
        for (const block of content) {
          if (block.type === "tool_result") {
            const matching = toolCalls.find(
              (tc) => tc.toolUseId === block.tool_use_id,
            );
            if (matching) {
              const text =
                typeof block.content === "string"
                  ? block.content
                  : JSON.stringify(block.content);
              matching.outputPreview = text.slice(0, 200);

              // Capture error state from tool result
              if (block.is_error === true) {
                matching.isError = true;
                toolErrorCount++;
              }
            }
          }
        }
      }

      // Capture stderr from tool_use_result (Bash tool emits this at the user event level)
      const toolUseResult = e.tool_use_result as
        | Record<string, unknown>
        | undefined;
      if (toolUseResult && typeof toolUseResult.stderr === "string") {
        // Match to the most recent Bash tool call without stderr set
        const lastBash = [...toolCalls]
          .reverse()
          .find((tc) => tc.tool === "Bash" && !tc.stderr);
        if (lastBash) {
          lastBash.stderr = toolUseResult.stderr;
        }
      }
    }

    // Result event: final output, cost, duration, turns, token usage
    if (e.type === "result") {
      finalOutput = typeof e.result === "string" ? e.result : "";
      totalDurationMs = typeof e.duration_ms === "number" ? e.duration_ms : 0;
      totalDurationApiMs =
        typeof e.duration_api_ms === "number" ? e.duration_api_ms : 0;
      totalCostUsd =
        typeof e.total_cost_usd === "number" ? e.total_cost_usd : null;
      totalTurns = typeof e.num_turns === "number" ? e.num_turns : 0;
      isError = e.is_error === true;
      permissionDenialCount = Array.isArray(e.permission_denials)
        ? e.permission_denials.length
        : 0;

      // Aggregate token usage from the result event's usage field
      const usage = e.usage as Record<string, unknown> | undefined;
      if (usage) {
        totalInputTokens =
          typeof usage.input_tokens === "number" ? usage.input_tokens : 0;
        totalOutputTokens =
          typeof usage.output_tokens === "number" ? usage.output_tokens : 0;
        totalCacheReadTokens =
          typeof usage.cache_read_input_tokens === "number"
            ? usage.cache_read_input_tokens
            : 0;
        totalCacheCreationTokens =
          typeof usage.cache_creation_input_tokens === "number"
            ? usage.cache_creation_input_tokens
            : 0;
      }

      // Per-model usage breakdown (modelUsage keyed by model name)
      const rawModelUsage = e.modelUsage as
        | Record<string, Record<string, unknown>>
        | undefined;
      if (rawModelUsage) {
        modelUsage = {};
        for (const [modelName, mu] of Object.entries(rawModelUsage)) {
          modelUsage[modelName] = {
            inputTokens:
              typeof mu.inputTokens === "number" ? mu.inputTokens : 0,
            outputTokens:
              typeof mu.outputTokens === "number" ? mu.outputTokens : 0,
            cacheReadInputTokens:
              typeof mu.cacheReadInputTokens === "number"
                ? mu.cacheReadInputTokens
                : 0,
            cacheCreationInputTokens:
              typeof mu.cacheCreationInputTokens === "number"
                ? mu.cacheCreationInputTokens
                : 0,
            costUSD: typeof mu.costUSD === "number" ? mu.costUSD : 0,
          };
        }
      }
    }
  }

  return {
    totalTurns,
    totalDurationMs,
    totalDurationApiMs,
    totalCostUsd,
    model,
    toolCalls,
    finalOutput,
    skills,
    referenceFilesRead,
    modelUsage,
    totalInputTokens,
    totalOutputTokens,
    totalCacheReadTokens,
    totalCacheCreationTokens,
    toolErrorCount,
    isError,
    permissionDenialCount,
  };
}
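The NDJSON parsing in this (now removed) module tolerated blank and malformed lines rather than failing the whole transcript. That behavior can be sanity-checked with a self-contained sketch that inlines the same logic against a tiny synthetic transcript (event shapes here are illustrative):

```typescript
// Minimal inline copy of the parseStreamJsonLine logic above,
// exercised against a synthetic two-event transcript.
interface Event {
  type: string;
  [key: string]: unknown;
}

function parseLine(line: string): Event | null {
  const trimmed = line.trim();
  if (!trimmed) return null;
  try {
    return JSON.parse(trimmed) as Event;
  } catch {
    return null; // malformed lines are skipped, not fatal
  }
}

const raw = [
  '{"type":"system","subtype":"init","model":"claude-sonnet"}',
  "not json", // dropped
  "", // dropped
  '{"type":"result","result":"done","num_turns":3}',
].join("\n");

const events = raw
  .split("\n")
  .map(parseLine)
  .filter((e): e is Event => e !== null);

console.log(events.length); // 2
```

Swallowing bad lines is deliberate: a streaming agent process can emit partial or interleaved output, and one corrupt line should not discard the rest of the transcript.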
@@ -1,85 +0,0 @@
import type { AssertionResult } from "./eval-types.js";

export interface EvalScenario {
  /** Directory name under evals/ */
  id: string;
  /** Human-readable name */
  name: string;
  /** Tags for filtering */
  tags: string[];
}

export interface AgentConfig {
  /** Agent identifier */
  agent: "claude-code";
  /** Model to use */
  model: string;
  /** Whether the supabase skill is available */
  skillEnabled: boolean;
}

export interface EvalRunResult {
  scenario: string;
  agent: string;
  model: string;
  skillEnabled: boolean;
  status: "passed" | "failed" | "error";
  duration: number;
  /** Raw test runner output (for debugging) */
  testOutput?: string;
  agentOutput: string;
  /** Number of assertions that passed */
  testsPassed: number;
  /** Total number of assertions */
  testsTotal: number;
  /** Minimum tests required to pass (from scenario config) */
  passThreshold?: number;
  /** Per-assertion pass/fail results */
  assertionResults?: AssertionResult[];
  /** Files the agent created or modified in the workspace */
  filesModified: string[];
  error?: string;
  /** Path to the persisted results directory for this run */
  resultsDir?: string;
  /** Number of tool calls the agent made */
  toolCallCount?: number;
  /** Total cost in USD (from stream-json result event) */
  costUsd?: number;
  /** The PROMPT.md content sent to the agent */
  prompt?: string;
  /** Epoch ms when the agent run started (for Braintrust span timing) */
  startedAt?: number;
  /** API-only latency in ms (excludes local processing overhead) */
  durationApiMs?: number;
  /** Aggregate token counts from the result event */
  totalInputTokens?: number;
  totalOutputTokens?: number;
  totalCacheReadTokens?: number;
  totalCacheCreationTokens?: number;
  /** Per-model token usage and cost breakdown */
  modelUsage?: Record<
    string,
    {
      inputTokens: number;
      outputTokens: number;
      cacheReadInputTokens: number;
      cacheCreationInputTokens: number;
      costUSD: number;
    }
  >;
  /** Count of tool calls that returned is_error === true */
  toolErrorCount?: number;
  /** Count of permission_denials in the result event */
  permissionDenialCount?: number;
  /** Skills that were in the agent's context (from system init event) */
  loadedSkills?: string[];
  /** Basenames of skill reference files the agent read */
  referenceFilesRead?: string[];
  /** Computed scorer results */
  scores?: {
    skillUsage: number;
    referenceFilesUsage: number;
    assertionsPassed: number;
    finalResult: number;
  };
}
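The `scores.referenceFilesUsage` field in the interface above is a simple hit ratio, matching the computation the upload script performs (function name and file names here are illustrative):

```typescript
// Hit-ratio scorer for the referenceFilesUsage field: the fraction of
// expected skill reference files that the agent actually read.
function referenceFilesUsage(expected: string[], read: string[]): number {
  if (expected.length === 0) return 1; // nothing expected => full score
  const hits = expected.filter((f) => read.includes(f)).length;
  return hits / expected.length;
}

console.log(referenceFilesUsage(["rls.md", "schema.md"], ["rls.md"])); // 0.5
console.log(referenceFilesUsage([], ["anything.md"])); // 1
```

Defaulting to 1 when no reference files are expected keeps scenarios without reference material from being penalized.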
packages/evals/src/upload.ts (new file, 350 lines)
@@ -0,0 +1,350 @@
/**
 * Upload eval results from the results/ directory to Braintrust.
 *
 * Reads saved result.json, transcript.json, and outputs/eval.txt from each
 * run, parses the vitest output to extract pass/fail counts, then uploads to
 * Braintrust as an experiment.
 *
 * Usage:
 *   BRAINTRUST_API_KEY=... BRAINTRUST_PROJECT_ID=... tsx src/upload.ts
 *
 * Optional env vars:
 *   RESULTS_DIR    Override the results directory (default: results/)
 *   RUN_TIMESTAMP  Only upload a specific run (e.g. 2026-02-27T13-01-22.316Z)
 */

import assert from "node:assert";
import { existsSync, readdirSync, readFileSync } from "node:fs";
import { basename, dirname, join, resolve } from "node:path";
import { fileURLToPath } from "node:url";
import { init } from "braintrust";

const __dirname = dirname(fileURLToPath(import.meta.url));
const ROOT = resolve(__dirname, "..");

// ---------------------------------------------------------------------------
// Types matching the saved result files from @vercel/agent-eval
// ---------------------------------------------------------------------------

interface RunResult {
  status: "passed" | "failed" | "error";
  duration: number;
  model: string;
  o11y: {
    totalTurns: number;
    totalToolCalls: number;
    toolCalls: Record<string, number>;
    filesModified: string[];
    filesRead: string[];
    errors: string[];
    thinkingBlocks: number;
  };
}

interface TranscriptEvent {
  type: "tool_call" | "tool_result" | "message" | "thinking";
  tool?: {
    name: string;
    originalName: string;
    args?: Record<string, unknown>;
  };
}

interface Transcript {
  agent: string;
  model: string;
  events: TranscriptEvent[];
}

interface ParsedEvalOutput {
  passed: number;
  failed: number;
  total: number;
  tests: Array<{ name: string; passed: boolean }>;
}

// ---------------------------------------------------------------------------
// Parse vitest eval.txt output
// ---------------------------------------------------------------------------

function parseEvalOutput(text: string): ParsedEvalOutput {
  const tests: Array<{ name: string; passed: boolean }> = [];

  for (const line of text.split("\n")) {
    const passMatch = line.match(/^\s+✓\s+(.+)$/);
    const failMatch = line.match(/^\s+[✗×]\s+(.+)$/);
    if (passMatch) tests.push({ name: passMatch[1].trim(), passed: true });
    else if (failMatch)
      tests.push({ name: failMatch[1].trim(), passed: false });
  }

  if (tests.length > 0) {
    const passed = tests.filter((t) => t.passed).length;
    return {
      passed,
      failed: tests.length - passed,
      total: tests.length,
      tests,
    };
  }

  // Fallback: parse summary line
  const summaryMatch = text.match(
    /Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/,
  );
  if (summaryMatch) {
    const passed = parseInt(summaryMatch[1], 10);
    const failed = summaryMatch[2] ? parseInt(summaryMatch[2], 10) : 0;
    const total = parseInt(summaryMatch[3], 10);
    return { passed, failed, total, tests };
  }

  return { passed: 0, failed: 0, total: 0, tests };
}

// ---------------------------------------------------------------------------
// Extract reference file reads from transcript
// ---------------------------------------------------------------------------

function extractReferenceFilesRead(transcript: Transcript): string[] {
  const read: string[] = [];
  for (const event of transcript.events) {
    if (event.type !== "tool_call" || !event.tool?.args) continue;
    if (event.tool.name !== "file_read") continue;
    const filePath = String(
      event.tool.args._extractedPath ?? event.tool.args.file_path ?? "",
    );
    if (
      (filePath.includes("/.claude/skills/") ||
        filePath.includes("/.agents/skills/")) &&
      filePath.includes("/references/")
    ) {
      const base = basename(filePath);
      if (!read.includes(base)) read.push(base);
    }
  }
  return read;
}

// ---------------------------------------------------------------------------
// Find all experiment run directories
// ---------------------------------------------------------------------------

interface RunEntry {
  runTimestamp: string;
  evalName: string;
  runIndex: number;
  runDir: string;
  result: RunResult;
  transcript: Transcript;
  evalOutput: string | null;
  prompt: string;
}

function findRuns(resultsDir: string, filterTimestamp?: string): RunEntry[] {
  const entries: RunEntry[] = [];
  const experimentDir = join(resultsDir, "experiment");
  if (!existsSync(experimentDir)) return entries;

  const timestamps = readdirSync(experimentDir).filter(
    (t) => !filterTimestamp || t === filterTimestamp,
  );

  for (const runTimestamp of timestamps) {
    const tsDir = join(experimentDir, runTimestamp);
    const evalNames = readdirSync(tsDir).filter((name) =>
      readdirSync(join(tsDir, name)).some((f) => f.startsWith("run-")),
    );

    for (const evalName of evalNames) {
      const evalDir = join(tsDir, evalName);
      const promptPath = resolve(ROOT, "evals", evalName, "PROMPT.md");
      const prompt = existsSync(promptPath)
        ? readFileSync(promptPath, "utf-8").trim()
        : "";

      const runDirs = readdirSync(evalDir)
        .filter((d) => /^run-\d+$/.test(d))
        .sort();

      for (const runDir of runDirs) {
        const runIndex = parseInt(runDir.replace("run-", ""), 10);
        const runPath = join(evalDir, runDir);
        const resultPath = join(runPath, "result.json");
        const transcriptPath = join(runPath, "transcript.json");
        const evalOutputPath = join(runPath, "outputs", "eval.txt");

        if (!existsSync(resultPath) || !existsSync(transcriptPath)) continue;

        const result: RunResult = JSON.parse(readFileSync(resultPath, "utf-8"));
        const transcript: Transcript = JSON.parse(
          readFileSync(transcriptPath, "utf-8"),
        );
        const evalOutput = existsSync(evalOutputPath)
          ? readFileSync(evalOutputPath, "utf-8")
          : null;

        entries.push({
          runTimestamp,
          evalName,
          runIndex,
          runDir: runPath,
          result,
          transcript,
          evalOutput,
          prompt,
        });
      }
    }
  }

  return entries;
}

// ---------------------------------------------------------------------------
// Main upload flow
// ---------------------------------------------------------------------------

async function main() {
  assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
  assert(process.env.BRAINTRUST_PROJECT_ID, "BRAINTRUST_PROJECT_ID is not set");

  const resultsDir = resolve(ROOT, process.env.RESULTS_DIR ?? "results");
  const filterTimestamp = process.env.RUN_TIMESTAMP;

  const runs = findRuns(resultsDir, filterTimestamp);
  if (runs.length === 0) {
    console.error("No runs found in", resultsDir);
    process.exit(1);
  }

  console.log(
    `Found ${runs.length} run(s) across ${new Set(runs.map((r) => r.runTimestamp)).size} experiment(s)`,
  );

  const byTimestamp = new Map<string, RunEntry[]>();
  for (const r of runs) {
    const group = byTimestamp.get(r.runTimestamp) ?? [];
    group.push(r);
    byTimestamp.set(r.runTimestamp, group);
  }

  for (const [runTimestamp, timestampRuns] of byTimestamp) {
    const model = timestampRuns[0].result.model;
    const skillEnabled = process.env.EVAL_BASELINE !== "true";
    const variant = skillEnabled ? "skill" : "baseline";
    const experimentName = `${model}-${variant}-${runTimestamp}`;

    console.log(
      `\nUploading experiment: ${experimentName} (${timestampRuns.length} rows)`,
    );

    const experiment = init({
      projectId: process.env.BRAINTRUST_PROJECT_ID as string,
      experiment: experimentName,
      metadata: {
        model,
        runTimestamp,
        skillEnabled,
        evalCount: timestampRuns.length,
      },
    });

    for (const run of timestampRuns) {
      const evalParsed = run.evalOutput
        ? parseEvalOutput(run.evalOutput)
        : { passed: 0, failed: 0, total: 0, tests: [] };

      console.log(
        `  [${run.evalName}] run-${run.runIndex} — tests: ${evalParsed.passed}/${evalParsed.total} passed`,
      );

      // Reference files scorer
      const metaPath = resolve(ROOT, "evals", run.evalName, "meta.ts");
      const metaMod = existsSync(metaPath)
        ? ((await import(metaPath)) as {
            expectedReferenceFiles?: string[];
          })
        : {};
      const expectedRefs = metaMod.expectedReferenceFiles ?? [];
      const refsRead = extractReferenceFilesRead(run.transcript);
      const refHits = expectedRefs.filter((f) => refsRead.includes(f)).length;
      const referenceFilesUsage =
        expectedRefs.length > 0 ? refHits / expectedRefs.length : 1;

      console.log(
        `    reference files: ${refHits}/${expectedRefs.length} read (${refsRead.join(", ") || "none"})`,
      );

      const scores: Record<string, number> = {
        assertions_passed:
          evalParsed.total > 0 ? evalParsed.passed / evalParsed.total : 0,
        reference_files_usage: referenceFilesUsage,
        final_result: run.result.status === "passed" ? 1 : 0,
      };

      const metadata: Record<string, unknown> = {
        model: run.result.model,
        evalName: run.evalName,
        runIndex: run.runIndex,
        totalTurns: run.result.o11y.totalTurns,
        totalToolCalls: run.result.o11y.totalToolCalls,
        toolCalls: run.result.o11y.toolCalls,
        filesModified: run.result.o11y.filesModified,
        errors: run.result.o11y.errors,
        thinkingBlocks: run.result.o11y.thinkingBlocks,
        duration: run.result.duration,
        referenceFilesRead: refsRead,
        expectedReferenceFiles: expectedRefs,
      };

      experiment.traced(
        (span) => {
          span.log({
            input: { eval: run.evalName, prompt: run.prompt },
            output: {
              status: run.result.status,
              filesModified: run.result.o11y.filesModified,
              tests: evalParsed.tests,
              evalOutput: run.evalOutput,
            },
            expected: {
              testsTotal: evalParsed.total,
              expectedReferenceFiles: expectedRefs,
            },
            scores,
            metadata,
            datasetRecordId: run.evalName,
          });

          // Child spans for each tool call in the transcript
          for (const event of run.transcript.events) {
            if (event.type !== "tool_call" || !event.tool) continue;
            span.traced(
              (child) => {
                child.log({
                  input: {
                    tool: event.tool?.name,
                    args: event.tool?.args ?? {},
                  },
                  output: {},
                  metadata: { originalName: event.tool?.originalName },
                });
              },
              { name: `tool:${event.tool.name}` },
            );
          }
        },
        { name: `${run.evalName}/run-${run.runIndex}` },
      );
    }

    const summary = await experiment.summarize();
    console.log(`\nBraintrust experiment: ${summary.experimentUrl}`);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
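The fallback summary-line regex in parseEvalOutput can be exercised against typical vitest summary lines. The regex is copied verbatim from the function above; the sample strings are illustrative:

```typescript
// Same fallback regex as parseEvalOutput: matches "Tests N passed (T)"
// and "Tests N passed | M failed (T)".
const summaryRe =
  /Tests\s+(\d+)\s+passed(?:\s*\|\s*(\d+)\s+failed)?\s+\((\d+)\)/;

const allPass = "Tests  5 passed (5)".match(summaryRe);
const mixed = "Tests  3 passed | 2 failed (5)".match(summaryRe);

console.log(allPass?.[1], allPass?.[3]); // "5" "5" (failed group is undefined)
console.log(mixed?.[2]); // "2"

// The per-test line regexes work the same way on indented ✓/✗ lines.
const passLine = "  ✓ creates the todos table".match(/^\s+✓\s+(.+)$/);
console.log(passLine?.[1]); // "creates the todos table"
```

Per-test ✓/✗ lines are preferred when present; the summary line is only a fallback, so a reporter that collapses test names still yields pass/fail counts.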
@@ -1,16 +1,11 @@
 {
   "compilerOptions": {
     "target": "ES2022",
-    "module": "ESNext",
-    "moduleResolution": "bundler",
     "esModuleInterop": true,
+    "module": "NodeNext",
+    "moduleResolution": "NodeNext",
     "strict": true,
     "skipLibCheck": true,
-    "outDir": "dist",
-    "rootDir": "src",
-    "declaration": true,
-    "resolveJsonModule": true
+    "noEmit": true
   },
-  "include": ["src/**/*"],
-  "exclude": ["node_modules", "dist", "evals"]
+  "include": ["experiments", "src", "evals"]
 }