Multi-model testing

This commit is contained in:
Pedro Rodrigues
2026-02-18 13:28:42 +00:00
parent 27d7af255d
commit 082eac2a01
8 changed files with 315 additions and 107 deletions

View File

@@ -46,11 +46,15 @@ sources = ["test/**", "skills/**"]
 # ── Eval tasks ────────────────────────────────────────────────────────
 [tasks.eval]
-description = "Run code-fix evals (local, no upload)"
+description = "Run code-fix evals for all configured models (local, no upload)"
 run = "npm --prefix packages/evals run eval"
 sources = ["packages/evals/src/**", "skills/**/references/**"]
 
+[tasks."eval:model"]
+description = "Run code-fix eval for a single model (local, no upload)"
+run = "EVAL_MODEL={{arg(name='model')}} npm --prefix packages/evals run eval"
+
 [tasks."eval:upload"]
-description = "Run code-fix evals and upload to Braintrust"
+description = "Run code-fix evals for all models and upload to Braintrust"
 run = "npm --prefix packages/evals run eval:upload"
 sources = ["packages/evals/src/**", "skills/**/references/**"]

View File

@@ -1,3 +1,5 @@
-ANTHROPIC_API_KEY=
 BRAINTRUST_API_KEY=
 BRAINTRUST_PROJECT_ID=
+ANTHROPIC_API_KEY=
+# Provider API keys for eval models are configured in the Braintrust dashboard
+# under Settings → AI providers (not needed locally).

View File

@@ -9,19 +9,26 @@ using skill documentation as context. It uses
 Two-step **LLM-as-judge** pattern powered by Braintrust's `Eval()`:
 
-1. The **eval model** (default: `claude-sonnet-4-5-20250929`) receives a prompt
-   with skill context and produces a code fix.
-2. Three independent **judge scorers** (default: `claude-opus-4-6`) evaluate the
-   fix via structured output (Zod schemas via AI SDK's `Output.object()`).
+1. The **eval model** receives a prompt with skill context and produces a code
+   fix. All eval model calls go through the **Braintrust AI proxy** — a single
+   OpenAI-compatible endpoint that routes to any provider (Anthropic, OpenAI,
+   Google, etc.).
+2. Five independent **judge scorers** (`claude-opus-4-6` via direct Anthropic
+   API) evaluate the fix via structured output (Zod schemas via AI SDK's
+   `Output.object()`).
+
+The eval runs once per model in the model matrix, creating a separate Braintrust
+experiment per model for side-by-side comparison.
 
 Key files:
 
 ```
 src/
-  code-fix.eval.ts   # Braintrust Eval() entry point
+  code-fix.eval.ts   # Braintrust Eval() entry point (loops over models)
   dataset.ts         # Maps extracted test cases to EvalCase format
-  scorer.ts          # Three AI SDK-based scorers (Correctness, Completeness, Best Practice)
-  models.ts          # Model provider factory (Anthropic / OpenAI)
+  scorer.ts          # Five AI SDK-based scorers (quality, safety, minimality)
+  models.ts          # Braintrust proxy + direct Anthropic provider
+  models.config.ts   # Model matrix (add/remove models here)
   dataset/
     types.ts         # CodeFixTestCase interface
     extract.ts       # Auto-extracts test cases from skill references
@@ -35,14 +42,18 @@ src/
 (`dataset/extract.ts`) finds consecutive `**Incorrect:**` / `**Correct:**` code
 block pairs under `##` sections. Each pair becomes one test case.
 
-Three independent scorers evaluate each fix (0–1 scale):
+Five independent scorers evaluate each fix (0–1 scale):
 
 - **Correctness** — does the fix address the core issue?
 - **Completeness** — does the fix include all necessary changes?
 - **Best Practice** — does the fix follow Supabase conventions?
+- **Regression Safety** — does the fix avoid introducing new problems (broken
+  functionality, removed security measures, new vulnerabilities)?
+- **Minimality** — is the fix tightly scoped to the identified issue without
+  unnecessary rewrites or over-engineering?
 
-Braintrust aggregates the scores and provides a dashboard for tracking
-regressions over time.
+Each model in the matrix generates a separate Braintrust experiment. The
+dashboard supports side-by-side comparison of experiments.
 
 ## Adding Test Cases
@@ -81,6 +92,67 @@ Rules:
 - Files prefixed with `_` (like `_sections.md`, `_template.md`) are skipped
 - Each pair gets an ID like `supabase/db-rls-mandatory#0` (skill/filename#index)
 
+## Adding/Removing Models
+
+Edit the `EVAL_MODELS` array in `src/models.config.ts`:
+
+```typescript
+export const EVAL_MODELS: EvalModelConfig[] = [
+  { id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
+  { id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
+  // Add new models here
+];
+```
+
+Provider API keys must be configured in the Braintrust dashboard under
+Settings → AI providers.
+
+## Running Evals
+
+```bash
+# Run all models locally (no Braintrust upload)
+mise run eval
+
+# Run a single model
+mise run eval:model model=claude-sonnet-4-5-20250929
+
+# Run and upload to Braintrust dashboard
+mise run eval:upload
+```
+
+Or directly:
+
+```bash
+cd packages/evals
+
+# Local run (all models)
+npx braintrust eval --no-send-logs src/code-fix.eval.ts
+
+# Single model
+EVAL_MODEL=claude-sonnet-4-5-20250929 npx braintrust eval --no-send-logs src/code-fix.eval.ts
+
+# Filter to one test case (across all models)
+npx braintrust eval --no-send-logs src/code-fix.eval.ts --filter 'input.testCase.id=db-migrations-idempotent'
+```
+
+## Environment
+
+API keys are loaded by mise from `packages/evals/.env` (configured in root
+`mise.toml`). Copy `.env.example` to `.env` and fill in the keys.
+
+```
+BRAINTRUST_API_KEY=...           # Required: proxy routing + dashboard upload
+BRAINTRUST_PROJECT_ID=...       # Required: Braintrust project identifier
+ANTHROPIC_API_KEY=sk-ant-...    # Required: judge model (Claude Opus 4.6)
+```
+
+Optional overrides:
+
+```
+EVAL_MODEL=claude-sonnet-4-5-20250929   # Run only this model (skips matrix)
+EVAL_JUDGE_MODEL=claude-opus-4-6        # Judge model for scorers
+```
 
 ## Modifying Prompts
 
 - `src/prompts/code-fix.ts` — what the eval model sees
@@ -97,46 +169,3 @@ Each scorer in `src/scorer.ts` is independent. To add a new dimension:
 1. Create a new `EvalScorer` function in `scorer.ts`
 2. Add it to the `scores` array in `code-fix.eval.ts`
 
-## Running Evals
-
-```bash
-# Run locally (no Braintrust upload)
-mise run eval
-
-# Run and upload to Braintrust dashboard
-mise run eval:upload
-```
-
-Or directly:
-
-```bash
-cd packages/evals
-
-# Local run
-npx braintrust eval --no-send-logs src/code-fix.eval.ts
-
-# Upload to Braintrust
-npx braintrust eval src/code-fix.eval.ts
-```
-
-In CI, evals run via `braintrustdata/eval-action@v1` and are gated by the
-`run-evals` PR label.
-
-## Environment
-
-API keys are loaded by mise from `packages/evals/.env` (configured in root
-`mise.toml`). Copy `.env.example` to `.env` and fill in the keys.
-
-```
-ANTHROPIC_API_KEY=sk-ant-...   # Required: eval model + judge model
-BRAINTRUST_API_KEY=...         # Required for upload to Braintrust dashboard
-BRAINTRUST_PROJECT_ID=...      # Required for upload to Braintrust dashboard
-```
-
-Optional overrides:
-
-```
-EVAL_MODEL=claude-sonnet-4-5-20250929   # Model under test
-EVAL_JUDGE_MODEL=claude-opus-4-6        # Judge model for scorers
-```

View File

@@ -2,7 +2,9 @@ import assert from "node:assert";
 import { generateText } from "ai";
 import { Eval } from "braintrust";
 import { dataset } from "./dataset.js";
-import { getModel } from "./models.js";
+import type { EvalModelConfig } from "./models.config.js";
+import { EVAL_MODELS } from "./models.config.js";
+import { getProxyModel } from "./models.js";
 import {
   buildCodeFixPrompt,
   buildCodeFixSystemPrompt,
@@ -11,26 +13,70 @@ import {
   bestPracticeScorer,
   completenessScorer,
   correctnessScorer,
+  minimalityScorer,
+  regressionSafetyScorer,
 } from "./scorer.js";
 
+assert(process.env.BRAINTRUST_API_KEY, "BRAINTRUST_API_KEY is not set");
 assert(process.env.ANTHROPIC_API_KEY, "ANTHROPIC_API_KEY is not set");
 
-const modelId = process.env.EVAL_MODEL || "claude-sonnet-4-5-20250929";
+/**
+ * Resolve which models to run based on environment:
+ * - EVAL_MODEL set → single model only (local dev / cost control)
+ * - CI without EVAL_ALL_MODELS → ci:true models only
+ * - Otherwise → all models
+ */
+function getModelsToRun(): EvalModelConfig[] {
+  const singleModel = process.env.EVAL_MODEL;
+  if (singleModel) {
+    const found = EVAL_MODELS.find((m) => m.id === singleModel);
+    return [
+      found ?? {
+        id: singleModel,
+        label: singleModel,
+        provider: "unknown",
+        ci: false,
+      },
+    ];
+  }
+
+  if (process.env.CI && !process.env.EVAL_ALL_MODELS) {
+    return EVAL_MODELS.filter((m) => m.ci);
+  }
+
+  return EVAL_MODELS;
+}
+
+const models = getModelsToRun();
 
-Eval("CodeFix", {
-  projectId: process.env.BRAINTRUST_PROJECT_ID,
-  trialCount: process.env.CI ? 3 : 1,
-  data: () => dataset(),
-  task: async (input) => {
-    const model = getModel(modelId);
-    const response = await generateText({
-      model,
-      system: buildCodeFixSystemPrompt(),
-      prompt: buildCodeFixPrompt(input.testCase),
-      temperature: 0.2,
-      maxRetries: 2,
-    });
-    return { llmOutput: response.text };
-  },
-  scores: [correctnessScorer, completenessScorer, bestPracticeScorer],
-});
+for (const modelConfig of models) {
+  Eval("CodeFix", {
+    experimentName: modelConfig.id,
+    projectId: process.env.BRAINTRUST_PROJECT_ID,
+    trialCount: process.env.CI ? 3 : 1,
+    metadata: {
+      model: modelConfig.id,
+      label: modelConfig.label,
+      provider: modelConfig.provider,
+    },
+    data: () => dataset(),
+    task: async (input) => {
+      const model = getProxyModel(modelConfig.id);
+      const response = await generateText({
+        model,
+        system: buildCodeFixSystemPrompt(),
+        prompt: buildCodeFixPrompt(input.testCase),
+        temperature: 0.2,
+        maxRetries: 2,
+      });
+      return { llmOutput: response.text };
+    },
+    scores: [
+      correctnessScorer,
+      completenessScorer,
+      bestPracticeScorer,
+      regressionSafetyScorer,
+      minimalityScorer,
+    ],
+  });
+}
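The model-resolution precedence in `getModelsToRun()` (explicit `EVAL_MODEL` first, then the CI subset, then the full matrix) can be sketched as a pure function. This is a hypothetical standalone version for illustration only; the `env` parameter stands in for the `process.env` reads in the real entry point:

```typescript
interface ModelConfig {
  id: string;
  label: string;
  provider: string;
  ci: boolean;
}

interface EnvFlags {
  evalModel?: string;  // EVAL_MODEL
  ci?: boolean;        // CI
  allModels?: boolean; // EVAL_ALL_MODELS
}

// Same precedence as the eval entry point: single model > CI subset > full matrix.
function resolveModels(matrix: ModelConfig[], env: EnvFlags): ModelConfig[] {
  if (env.evalModel) {
    const found = matrix.find((m) => m.id === env.evalModel);
    // Unknown IDs still run, with a placeholder config.
    return [
      found ?? { id: env.evalModel, label: env.evalModel, provider: "unknown", ci: false },
    ];
  }
  if (env.ci && !env.allModels) {
    return matrix.filter((m) => m.ci);
  }
  return matrix;
}

const matrix: ModelConfig[] = [
  { id: "a", label: "A", provider: "p1", ci: true },
  { id: "b", label: "B", provider: "p2", ci: false },
];

console.log(resolveModels(matrix, { evalModel: "b" }).map((m) => m.id)); // logs ["b"]
console.log(resolveModels(matrix, { ci: true }).map((m) => m.id));      // logs ["a"]
console.log(resolveModels(matrix, {}).map((m) => m.id));                // logs ["a", "b"]
```

Keeping the branching pure like this makes the precedence trivially testable without touching the environment.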

View File

@@ -10,23 +10,41 @@ export type Expected = {
 };
 
 export type Metadata = {
+  name: string;
   skillName: string;
   section: string;
+  referenceFile: string;
   tags: string[];
 };
 
 export type Output = { llmOutput: string };
 
+/**
+ * Extract the feature category from a reference filename.
+ * e.g. "db-migrations-idempotent.md" → "db"
+ *      "auth-core-sessions.md" → "auth"
+ */
+function featureCategory(filename: string): string {
+  return filename.replace(/\.md$/, "").split("-")[0];
+}
+
 export function dataset(): EvalCase<Input, Expected, Metadata>[] {
   return extractCodeFixDataset().map((tc) => ({
+    id: tc.id,
     input: { testCase: tc },
+    tags: [
+      featureCategory(tc.referenceFilename),
+      tc.referenceFilename.replace(/\.md$/, ""),
+    ],
     expected: {
       correctCode: tc.goodExample.code,
       correctLanguage: tc.goodExample.language,
     },
     metadata: {
+      name: tc.title,
       skillName: tc.skillName,
       section: tc.section,
+      referenceFile: tc.referenceFilename,
       tags: tc.tags,
     },
   }));
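The tag derivation added above can be checked in isolation. This standalone sketch mirrors `featureCategory` and the tag construction (the helper name is duplicated here purely for illustration):

```typescript
// Mirrors dataset.ts: strip the ".md" suffix, take the first hyphen-separated token.
function featureCategory(filename: string): string {
  return filename.replace(/\.md$/, "").split("-")[0];
}

// Tags per case: the coarse category plus the full reference name.
function tagsFor(referenceFilename: string): string[] {
  return [
    featureCategory(referenceFilename),
    referenceFilename.replace(/\.md$/, ""),
  ];
}

console.log(tagsFor("db-migrations-idempotent.md"));
// logs ["db", "db-migrations-idempotent"]
```

The coarse category lets the Braintrust UI slice scores by feature area ("db", "auth") while the full name pins down the exact reference file.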

View File

@@ -0,0 +1,47 @@
export interface EvalModelConfig {
  /** Model ID passed to the Braintrust proxy */
  id: string;
  /** Human-readable label for dashboards */
  label: string;
  /** Provider name for display/grouping */
  provider: string;
  /** Whether to include in CI runs by default */
  ci: boolean;
}

/**
 * Models to evaluate. Add/remove entries to change the eval matrix.
 * Set `ci: false` to exclude expensive models from automated CI runs.
 */
export const EVAL_MODELS: EvalModelConfig[] = [
  {
    id: "claude-sonnet-4-5-20250929",
    label: "Claude Sonnet 4.5",
    provider: "anthropic",
    ci: true,
  },
  {
    id: "gpt-5.3",
    label: "GPT 5.3",
    provider: "openai",
    ci: true,
  },
  {
    id: "gpt-5.2",
    label: "GPT 5.2",
    provider: "openai",
    ci: true,
  },
  {
    id: "gemini-3-pro",
    label: "Gemini 3.0 Pro",
    provider: "google",
    ci: true,
  },
  {
    id: "claude-opus-4-6",
    label: "Claude Opus 4.6",
    provider: "anthropic",
    ci: false,
  },
];
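One property worth guarding when editing the matrix is ID uniqueness, since the eval entry point keys `experimentName` on `id`. A hypothetical sanity check (not part of the commit) could look like this:

```typescript
interface EvalModelConfig {
  id: string;
  label: string;
  provider: string;
  ci: boolean;
}

// Throws if two entries share an id — duplicate ids would collapse two
// model runs into a single Braintrust experiment name.
function assertUniqueIds(models: EvalModelConfig[]): void {
  const seen = new Set<string>();
  for (const m of models) {
    if (seen.has(m.id)) throw new Error(`Duplicate model id: ${m.id}`);
    seen.add(m.id);
  }
}

const matrix: EvalModelConfig[] = [
  { id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
  { id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
];

assertUniqueIds(matrix); // passes silently
```

Such a check could run at module load so a bad edit fails fast rather than producing confusing dashboard results.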

View File

@@ -1,48 +1,49 @@
 import type { AnthropicProvider } from "@ai-sdk/anthropic";
 import { anthropic } from "@ai-sdk/anthropic";
-import type { OpenAIProvider } from "@ai-sdk/openai";
-import { openai } from "@ai-sdk/openai";
+import { createOpenAI } from "@ai-sdk/openai";
 import type { LanguageModel } from "ai";
 
 /** Model ID accepted by the Anthropic provider (string literal union + string). */
 export type AnthropicModelId = Parameters<AnthropicProvider["chat"]>[0];
 
-/** Model ID accepted by the OpenAI provider (string literal union + string). */
-export type OpenAIModelId = Parameters<OpenAIProvider["chat"]>[0];
-
-/** Any model ID accepted by the eval harness. */
-export type SupportedModelId = AnthropicModelId | OpenAIModelId;
-
-const MODEL_MAP: Record<string, () => LanguageModel> = {
-  "claude-opus-4-6": () => anthropic("claude-opus-4-6"),
-  "claude-sonnet-4-5-20250929": () => anthropic("claude-sonnet-4-5-20250929"),
-  "claude-haiku-4-5-20251001": () => anthropic("claude-haiku-4-5-20251001"),
-  "claude-opus-4-5-20251101": () => anthropic("claude-opus-4-5-20251101"),
-  "claude-sonnet-4-20250514": () => anthropic("claude-sonnet-4-20250514"),
-  "gpt-4o": () => openai("gpt-4o"),
-  "gpt-4o-mini": () => openai("gpt-4o-mini"),
-  "o3-mini": () => openai("o3-mini"),
-};
-
-export function getModel(modelId: SupportedModelId): LanguageModel {
-  const factory = MODEL_MAP[modelId];
-  if (factory) return factory();
-
-  // Fall back to provider detection from model ID prefix
+/**
+ * Braintrust AI proxy — routes to any provider (Anthropic, OpenAI, Google)
+ * via a single OpenAI-compatible endpoint.
+ *
+ * Provider API keys are configured in the Braintrust dashboard at
+ * project or org level. The x-bt-parent header scopes the request to
+ * the project so project-level keys are resolved.
+ */
+const braintrustProxy = createOpenAI({
+  baseURL: "https://api.braintrust.dev/v1/proxy",
+  apiKey: process.env.BRAINTRUST_API_KEY ?? "",
+  headers: process.env.BRAINTRUST_PROJECT_ID
+    ? { "x-bt-parent": `project_id:${process.env.BRAINTRUST_PROJECT_ID}` }
+    : undefined,
+});
+
+/**
+ * Get a model for the eval task. Claude models use the Anthropic SDK
+ * directly (via ANTHROPIC_API_KEY). All other models route through the
+ * Braintrust proxy (keys configured at the org level in Braintrust).
+ */
+export function getProxyModel(modelId: string): LanguageModel {
   if (modelId.startsWith("claude")) {
     return anthropic(modelId as AnthropicModelId);
   }
-  if (
-    modelId.startsWith("gpt") ||
-    modelId.startsWith("o1") ||
-    modelId.startsWith("o3")
-  ) {
-    return openai(modelId as OpenAIModelId);
+  return braintrustProxy(modelId);
+}
+
+/**
+ * Get a model using direct provider SDKs. Used for the judge model which
+ * is always Claude and uses ANTHROPIC_API_KEY directly (no proxy).
+ */
+export function getModel(modelId: string): LanguageModel {
+  if (modelId.startsWith("claude")) {
+    return anthropic(modelId as AnthropicModelId);
   }
-  throw new Error(
-    `Unknown model: ${modelId}. Available: ${Object.keys(MODEL_MAP).join(", ")}`,
-  );
+  return getProxyModel(modelId);
 }
 
 export function getJudgeModel(): LanguageModel {
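The routing rule (Claude IDs go to the direct Anthropic SDK, everything else to the proxy) reduces to a single prefix check. A dependency-free sketch, with the two destinations represented as plain strings for illustration:

```typescript
type Route = "anthropic-direct" | "braintrust-proxy";

// Mirrors getProxyModel's dispatch: only the model-id prefix decides the route.
function routeFor(modelId: string): Route {
  return modelId.startsWith("claude") ? "anthropic-direct" : "braintrust-proxy";
}

console.log(routeFor("claude-opus-4-6")); // logs "anthropic-direct"
console.log(routeFor("gpt-5.3"));         // logs "braintrust-proxy"
console.log(routeFor("gemini-3-pro"));    // logs "braintrust-proxy"
```

Because the proxy is OpenAI-compatible, no per-provider branching is needed beyond this one check; new providers only require keys in the Braintrust dashboard.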

View File

@@ -124,3 +124,64 @@ Score 1 if the fix follows best practices, 0.5 if it mostly follows best practic
     metadata: { reasoning: result.reasoning },
   };
 };
+
+export const regressionSafetyScorer: EvalScorer<
+  Input,
+  TaskOutput,
+  Expected
+> = async ({ input, output }) => {
+  const context = buildContext(input.testCase, output.llmOutput);
+  const result = await judge(`${context}
+
+## Task
+
+Evaluate **regression safety**: Does the LLM's fix avoid introducing new problems?
+
+Carefully check whether the fix:
+
+- Breaks existing functionality that was working in the original code
+- Removes security measures (RLS policies, auth checks, input validation) that were already present
+- Changes function signatures, return types, or column names in ways that would break callers
+- Introduces SQL injection, XSS, or other security vulnerabilities not present in the original
+- Drops data, removes columns, or alters schemas destructively without necessity
+- Changes behavior beyond the scope of the identified issue
+
+The fix should repair the identified problem WITHOUT creating new ones. A fix that solves the original issue but breaks something else is dangerous in production.
+
+Score 1 if the fix introduces no new problems. Score 0.5 if the fix introduces minor issues (e.g., slightly different naming that could confuse but not break). Score 0 if the fix introduces a new bug, security vulnerability, or breaking change.`);
+
+  return {
+    name: "Regression Safety",
+    score: result.score,
+    metadata: { reasoning: result.reasoning },
+  };
+};
+
+export const minimalityScorer: EvalScorer<
+  Input,
+  TaskOutput,
+  Expected
+> = async ({ input, output }) => {
+  const context = buildContext(input.testCase, output.llmOutput);
+  const result = await judge(`${context}
+
+## Task
+
+Evaluate **minimality**: Does the LLM's fix make only the changes necessary to address the identified issue?
+
+Check whether the fix:
+
+- Rewrites or restructures code beyond what is needed to fix the problem
+- Adds features, abstractions, or utilities not present in the reference solution
+- Changes formatting, variable names, or style in unrelated parts of the code
+- Adds excessive comments, logging, or error handling not required by the fix
+- Over-engineers the solution (e.g., adding configuration options, generalization, or layers of abstraction when a simple targeted fix suffices)
+
+Compare the scope of changes in the LLM's fix against the reference. The reference represents the ideal minimal fix. The LLM's fix should be similarly focused.
+
+Score 1 if the fix is tightly scoped to the identified issue (similar scope to the reference). Score 0.5 if the fix includes some unnecessary changes but the core fix is present. Score 0 if the fix significantly over-reaches — rewriting large portions of code, adding unrelated features, or restructuring beyond what is needed.`);
+
+  return {
+    name: "Minimality",
+    score: result.score,
+    metadata: { reasoning: result.reasoning },
+  };
+};
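Each scorer follows the same shape: build context, ask the judge, wrap `{score, reasoning}` into a named result. A hypothetical helper capturing that last step (the defensive clamp is an illustration added here; the real scorers rely on the judge's structured-output schema to bound the score):

```typescript
interface JudgeResult {
  score: number; // expected in [0, 1]
  reasoning: string;
}

interface ScorerResult {
  name: string;
  score: number;
  metadata: { reasoning: string };
}

// Wrap a judge result into the scorer output shape, clamping the score
// to [0, 1] in case a judge ever returns an out-of-range value.
function toScorerResult(name: string, result: JudgeResult): ScorerResult {
  const score = Math.min(1, Math.max(0, result.score));
  return { name, score, metadata: { reasoning: result.reasoning } };
}

const r = toScorerResult("Minimality", { score: 0.5, reasoning: "core fix present, some extras" });
console.log(r.name, r.score); // logs "Minimality 0.5"
```

Factoring the wrapper out would let a new scoring dimension be just a name plus a prompt, which matches the "add a new `EvalScorer`, append to `scores`" workflow the README describes.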