multi-model testing

This commit is contained in:
Pedro Rodrigues
2026-02-18 13:28:42 +00:00
parent 27d7af255d
commit 082eac2a01
8 changed files with 315 additions and 107 deletions


using skill documentation as context. It uses a two-step **LLM-as-judge**
pattern powered by Braintrust's `Eval()`:
1. The **eval model** receives a prompt with skill context and produces a code
fix. All eval model calls go through the **Braintrust AI proxy** — a single
OpenAI-compatible endpoint that routes to any provider (Anthropic, OpenAI,
Google, etc.).
2. Five independent **judge scorers** (`claude-opus-4-6` via direct Anthropic
API) evaluate the fix via structured output (Zod schemas via AI SDK's
`Output.object()`).
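The verdict each judge returns might look like the following sketch. The field names (`rationale`, `score`) and the 1–5 rubric are illustrative assumptions, not the actual schemas, which live in `src/scorer.ts` as Zod objects passed to `Output.object()`:

```typescript
// Hypothetical shape of a judge verdict. The real schema is a Zod object;
// asking for the rationale before the score encourages grounded judgments.
interface JudgeVerdict {
  rationale: string;          // judge's reasoning, requested first
  score: 1 | 2 | 3 | 4 | 5;   // coarse rubric grade (assumed scale)
}

// Normalize an assumed 1-5 rubric grade onto the 0-1 scale Braintrust expects.
function toUnitScore(verdict: JudgeVerdict): number {
  return (verdict.score - 1) / 4;
}
```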
The eval runs once per model in the model matrix, creating a separate Braintrust
experiment per model for side-by-side comparison.
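The per-model loop can be pictured as follows. This is a sketch: the real code in `code-fix.eval.ts` hands each entry to Braintrust's `Eval()`, and the `code-fix-` experiment-name scheme here is an assumption, not the actual convention:

```typescript
// Sketch: derive one experiment descriptor per model in the matrix.
// The real loop passes each descriptor to Braintrust's Eval().
function experimentPlans(
  models: { id: string }[],
): { experimentName: string; model: string }[] {
  return models.map((m) => ({
    experimentName: `code-fix-${m.id}`, // hypothetical naming scheme
    model: m.id,
  }));
}
```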
Key files:
```
src/
code-fix.eval.ts # Braintrust Eval() entry point (loops over models)
dataset.ts # Maps extracted test cases to EvalCase format
scorer.ts # Five AI SDK-based scorers (quality, safety, minimality)
models.ts # Braintrust proxy + direct Anthropic provider
models.config.ts # Model matrix (add/remove models here)
dataset/
types.ts # CodeFixTestCase interface
extract.ts # Auto-extracts test cases from skill references
```
The extractor (`dataset/extract.ts`) finds consecutive `**Incorrect:**` /
`**Correct:**` code block pairs under `##` sections. Each pair becomes one test
case.
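The pairing logic can be sketched roughly as follows. This is simplified: the real extractor also tracks which `##` section each pair sits under and skips `_`-prefixed files:

```typescript
// Simplified sketch of the pairing step in dataset/extract.ts:
// match each **Incorrect:** fenced block with the **Correct:** block
// that immediately follows it.
const FENCE = "`".repeat(3);
const PAIR_RE = new RegExp(
  String.raw`\*\*Incorrect:\*\*\s*` + FENCE + String.raw`[^\n]*\n([\s\S]*?)` + FENCE +
  String.raw`\s*\*\*Correct:\*\*\s*` + FENCE + String.raw`[^\n]*\n([\s\S]*?)` + FENCE,
  "g",
);

function extractPairs(markdown: string): { incorrect: string; correct: string }[] {
  const pairs: { incorrect: string; correct: string }[] = [];
  let m: RegExpExecArray | null;
  while ((m = PAIR_RE.exec(markdown)) !== null) {
    pairs.push({ incorrect: m[1].trim(), correct: m[2].trim() });
  }
  return pairs;
}
```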
Five independent scorers evaluate each fix (0–1 scale):
- **Correctness** — does the fix address the core issue?
- **Completeness** — does the fix include all necessary changes?
- **Best Practice** — does the fix follow Supabase conventions?
- **Regression Safety** — does the fix avoid introducing new problems (broken
functionality, removed security measures, new vulnerabilities)?
- **Minimality** — is the fix tightly scoped to the identified issue without
unnecessary rewrites or over-engineering?
Braintrust aggregates the scores and provides a dashboard for tracking
regressions over time.
Each model in the matrix generates a separate Braintrust experiment. The
dashboard supports side-by-side comparison of experiments.
## Adding Test Cases
Rules:
- Files prefixed with `_` (like `_sections.md`, `_template.md`) are skipped
- Each pair gets an ID like `supabase/db-rls-mandatory#0` (skill/filename#index)
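Based on the example ID above, the scheme can be sketched as follows (`testCaseId` is an illustrative helper name, not necessarily the real function):

```typescript
// Build a test-case ID like "supabase/db-rls-mandatory#0" from the
// skill name, the markdown filename, and the pair's index in that file.
function testCaseId(skill: string, filename: string, index: number): string {
  const base = filename.replace(/\.md$/, "");
  return `${skill}/${base}#${index}`;
}
```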
## Adding/Removing Models
Edit the `EVAL_MODELS` array in `src/models.config.ts`:
```typescript
export const EVAL_MODELS: EvalModelConfig[] = [
{ id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
{ id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
// Add new models here
];
```
Provider API keys must be configured in the Braintrust dashboard under
Settings → AI providers.
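One plausible use of the `ci` flag is filtering the matrix for CI runs. That purpose is an assumption, and the `example-local-model` entry below is hypothetical; only the first two entries come from the config above:

```typescript
interface EvalModelConfig {
  id: string;
  label: string;
  provider: "anthropic" | "openai";
  ci: boolean;
}

const EVAL_MODELS: EvalModelConfig[] = [
  { id: "claude-sonnet-4-5-20250929", label: "Claude Sonnet 4.5", provider: "anthropic", ci: true },
  { id: "gpt-5.3", label: "GPT 5.3", provider: "openai", ci: true },
  // Hypothetical local-only entry, excluded from CI:
  { id: "example-local-model", label: "Example", provider: "openai", ci: false },
];

// Keep only the models flagged for CI runs.
const ciModels = EVAL_MODELS.filter((m) => m.ci);
```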
## Running Evals
```bash
# Run all models locally (no Braintrust upload)
mise run eval
# Run a single model
mise run eval:model model=claude-sonnet-4-5-20250929
# Run and upload to Braintrust dashboard
mise run eval:upload
```
Or directly:
```bash
cd packages/evals
# Local run (all models)
npx braintrust eval --no-send-logs src/code-fix.eval.ts
# Single model
EVAL_MODEL=claude-sonnet-4-5-20250929 npx braintrust eval --no-send-logs src/code-fix.eval.ts
# Filter to one test case (across all models)
npx braintrust eval --no-send-logs src/code-fix.eval.ts --filter 'input.testCase.id=db-migrations-idempotent'
```
## Environment
API keys are loaded by mise from `packages/evals/.env` (configured in root
`mise.toml`). Copy `.env.example` to `.env` and fill in the keys.
```
BRAINTRUST_API_KEY=... # Required: proxy routing + dashboard upload
BRAINTRUST_PROJECT_ID=... # Required: Braintrust project identifier
ANTHROPIC_API_KEY=sk-ant-... # Required: judge model (Claude Opus 4.6)
```
Optional overrides:
```
EVAL_MODEL=claude-sonnet-4-5-20250929 # Run only this model (skips matrix)
EVAL_JUDGE_MODEL=claude-opus-4-6 # Judge model for scorers
```
## Modifying Prompts
- `src/prompts/code-fix.ts` — what the eval model sees
Each scorer in `src/scorer.ts` is independent. To add a new dimension:
1. Create a new `EvalScorer` function in `scorer.ts`
2. Add it to the `scores` array in `code-fix.eval.ts`
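A new dimension might follow this skeleton. It is illustrative only: the real scorers are async, call the judge model through the AI SDK, and follow Braintrust's `EvalScorer` signature, which the existing functions in `scorer.ts` show exactly:

```typescript
// Illustrative scorer skeleton for a hypothetical "API Stability" dimension.
// The real scorers obtain the verdict from an LLM judge via structured
// output; here a trivial placeholder heuristic stands in for that call.
interface ScorerArgs {
  input: { testCase: { id: string } };
  output: string; // the eval model's proposed fix
}
interface Score {
  name: string;
  score: number; // 0-1
}

function apiStabilityScorer(args: ScorerArgs): Score {
  // Placeholder judgment: a non-empty fix scores 1, an empty one 0.
  const verdict = args.output.trim().length > 0 ? 1 : 0;
  return { name: "API Stability", score: verdict };
}
```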
In CI, evals run via `braintrustdata/eval-action@v1` and are gated by the
`run-evals` PR label.