multi-model testing

This commit is contained in:
Pedro Rodrigues
2026-02-18 13:28:42 +00:00
parent 27d7af255d
commit 082eac2a01
8 changed files with 315 additions and 107 deletions


using skill documentation as context. It uses a two-step **LLM-as-judge**
pattern powered by Braintrust's `Eval()`:
1. The **eval model** receives a prompt with skill context and produces a code
fix. All eval model calls go through the **Braintrust AI proxy** — a single
OpenAI-compatible endpoint that routes to any provider (Anthropic, OpenAI,
Google, etc.).
2. Five independent **judge scorers** (`claude-opus-4-6` via direct Anthropic
API) evaluate the fix via structured output (Zod schemas via AI SDK's
`Output.object()`).
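The verdict each judge returns might look like the following sketch. The field names (`rationale`, `score`) and the 1–5 rubric are illustrative assumptions, not the actual schemas, which live in `src/scorer.ts` as Zod objects passed to `Output.object()`:

```typescript
// Hypothetical shape of a judge verdict. The real schema is a Zod object;
// asking for the rationale before the score encourages grounded judgments.
interface JudgeVerdict {
  rationale: string;          // judge's reasoning, requested first
  score: 1 | 2 | 3 | 4 | 5;   // coarse rubric grade (assumed scale)
}

// Normalize an assumed 1-5 rubric grade onto the 0-1 scale Braintrust expects.
function toUnitScore(verdict: JudgeVerdict): number {
  return (verdict.score - 1) / 4;
}
```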
The eval runs once per model in the model matrix, creating a separate Braintrust
experiment per model for side-by-side comparison.
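The per-model loop can be pictured as follows. This is a sketch: the real code in `code-fix.eval.ts` hands each entry to Braintrust's `Eval()`, and the `code-fix-` experiment-name scheme here is an assumption, not the actual convention:

```typescript
// Sketch: derive one experiment descriptor per model in the matrix.
// The real loop passes each descriptor to Braintrust's Eval().
function experimentPlans(
  models: { id: string }[],
): { experimentName: string; model: string }[] {
  return models.map((m) => ({
    experimentName: `code-fix-${m.id}`, // hypothetical naming scheme
    model: m.id,
  }));
}
```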
Key files:
```
src/
code-fix.eval.ts # Braintrust Eval() entry point (loops over models)
dataset.ts # Maps extracted test cases to EvalCase format
scorer.ts # Five AI SDK-based scorers (quality, safety, minimality)
models.ts # Braintrust proxy + direct Anthropic provider
models.config.ts # Model matrix (add/remove models here)
dataset/
types.ts # CodeFixTestCase interface
extract.ts # Auto-extracts test cases from skill references
```
The extractor (`dataset/extract.ts`) finds consecutive `**Incorrect:**` /
`**Correct:**` code block pairs under `##` sections. Each pair becomes one test
case.
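The pairing logic can be sketched roughly as follows. This is simplified: the real extractor also tracks which `##` section each pair sits under and skips `_`-prefixed files:

```typescript
// Simplified sketch of the pairing step in dataset/extract.ts:
// match each **Incorrect:** fenced block with the **Correct:** block
// that immediately follows it.
const FENCE = "`".repeat(3);
const PAIR_RE = new RegExp(
  String.raw`\*\*Incorrect:\*\*\s*` + FENCE + String.raw`[^\n]*\n([\s\S]*?)` + FENCE +
  String.raw`\s*\*\*Correct:\*\*\s*` + FENCE + String.raw`[^\n]*\n([\s\S]*?)` + FENCE,
  "g",
);

function extractPairs(markdown: string): { incorrect: string; correct: string }[] {
  const pairs: { incorrect: string; correct: string }[] = [];
  let m: RegExpExecArray | null;
  while ((m = PAIR_RE.exec(markdown)) !== null) {
    pairs.push({ incorrect: m[1].trim(), correct: m[2].trim() });
  }
  return pairs;
}
```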
Five independent scorers evaluate each fix (0–1 scale):
- **Correctness** — does the fix address the core issue?
- **Completeness** — does the fix include all necessary changes?
- **Best Practice** — does the fix follow Supabase conventions?
- **Regression Safety** — does the fix avoid introducing new problems (broken
functionality, removed security measures, new vulnerabilities)?
- **Minimality** — is the fix tightly scoped to the identified issue without
unnecessary rewrites or over-engineering?
Braintrust aggregates the scores and provides a dashboard for tracking
regressions over time.
Each model in the matrix generates a separate Braintrust experiment. The
dashboard supports side-by-side comparison of experiments.
## Adding Test Cases
Rules:
- Files prefixed with `_` (like `_sections.md`, `_template.md`) are skipped
- Each pair gets an ID like `supabase/db-rls-mandatory#0` (skill/filename#index)
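Based on the example ID above, the scheme can be sketched as follows (`testCaseId` is an illustrative helper name, not necessarily the real function):

```typescript
// Build a test-case ID like "supabase/db-rls-mandatory#0" from the
// skill name, the markdown filename, and the pair's index in that file.
function testCaseId(skill: string, filename: string, index: number): string {
  const base = filename.replace(/\.md$/, "");
  return `${skill}/${base}#${index}`;
}
```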
## Adding/Removing Models
Edit the `EVAL_MODELS` array in `src/models.config.ts`:
```typescript
export const EVAL_MODELS: EvalModelConfig[] = [
{ id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
{ id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
// Add new models here
];
```
Provider API keys must be configured in the Braintrust dashboard under
Settings → AI providers.
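One plausible use of the `ci` flag is filtering the matrix for CI runs. That purpose is an assumption, and the `example-local-model` entry below is hypothetical; only the first two entries come from the config above:

```typescript
interface EvalModelConfig {
  id: string;
  label: string;
  provider: "anthropic" | "openai";
  ci: boolean;
}

const EVAL_MODELS: EvalModelConfig[] = [
  { id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
  { id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
  // Hypothetical local-only entry, excluded from CI:
  { id: "example-local-model", label: "Example", provider: "openai", ci: false },
];

// Keep only the models flagged for CI runs.
const ciModels = EVAL_MODELS.filter((m) => m.ci);
```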
## Running Evals
```bash
# Run all models locally (no Braintrust upload)
mise run eval
# Run a single model
mise run eval:model model=claude-sonnet-4-5-20250929
# Run and upload to Braintrust dashboard
mise run eval:upload
```
Or directly:
```bash
cd packages/evals
# Local run (all models)
npx braintrust eval --no-send-logs src/code-fix.eval.ts
# Single model
EVAL_MODEL=claude-sonnet-4-5-20250929 npx braintrust eval --no-send-logs src/code-fix.eval.ts
# Filter to one test case (across all models)
npx braintrust eval --no-send-logs src/code-fix.eval.ts --filter 'input.testCase.id=db-migrations-idempotent'
```
## Environment
API keys are loaded by mise from `packages/evals/.env` (configured in root
`mise.toml`). Copy `.env.example` to `.env` and fill in the keys.
```
BRAINTRUST_API_KEY=... # Required: proxy routing + dashboard upload
BRAINTRUST_PROJECT_ID=... # Required: Braintrust project identifier
ANTHROPIC_API_KEY=sk-ant-... # Required: judge model (Claude Opus 4.6)
```
Optional overrides:
```
EVAL_MODEL=claude-sonnet-4-5-20250929 # Run only this model (skips matrix)
EVAL_JUDGE_MODEL=claude-opus-4-6 # Judge model for scorers
```
## Modifying Prompts
- `src/prompts/code-fix.ts` — what the eval model sees
Each scorer in `src/scorer.ts` is independent. To add a new dimension:
1. Create a new `EvalScorer` function in `scorer.ts`
2. Add it to the `scores` array in `code-fix.eval.ts`
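A new dimension might follow this skeleton. It is illustrative only: the real scorers are async, call the judge model through the AI SDK, and follow Braintrust's `EvalScorer` signature, which the existing functions in `scorer.ts` show exactly:

```typescript
// Illustrative scorer skeleton for a hypothetical "API Stability" dimension.
// The real scorers obtain the verdict from an LLM judge via structured
// output; here a trivial placeholder heuristic stands in for that call.
interface ScorerArgs {
  input: { testCase: { id: string } };
  output: string; // the eval model's proposed fix
}
interface Score {
  name: string;
  score: number; // 0-1
}

function apiStabilityScorer(args: ScorerArgs): Score {
  // Placeholder judgment: a non-empty fix scores 1, an empty one 0.
  const verdict = args.output.trim().length > 0 ? 1 : 0;
  return { name: "API Stability", score: verdict };
}
```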
In CI, evals run via `braintrustdata/eval-action@v1` and are gated by the
`run-evals` PR label.