clerk-evals

Evaluation suites for testing how LLMs perform at writing Clerk code. 27 evals across 8 categories (Quickstarts, Auth, User Management, UI Components, Organizations, Webhooks, Upgrades, Billing) covering Next.js, React, iOS, and Android. 16 models from OpenAI, Anthropic, Google, and Vercel.


Quickstart

Install Bun >=1.3.0, then gather the required API keys (see .env.example):

cp .env.example .env
bun i
bun start

Add a new evaluation

For detailed, copy-pastable steps see docs/ADDING_EVALS.md. In short:

  • Create src/evals/your-eval/ with PROMPT.md and graders.ts.
  • Implement graders that return booleans using defineGraders(...) and shared judges in @/src/graders/catalog.
  • Append an entry to the evaluations array in src/config/evaluations.ts with framework, category, and path (e.g., evals/waitlist).
  • Run bun start --eval "your-eval" --smoke --debug to test with one model.
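As a sketch of step three, an entry appended to the evaluations array might look roughly like this (the Evaluation type shown here is an assumption for illustration; mirror the existing entries in src/config/evaluations.ts rather than this shape verbatim):

```typescript
// Hypothetical sketch of an evaluations.ts entry. Only framework,
// category, and path are named in the docs above; the type itself is
// an assumption -- check src/config/evaluations.ts for the real shape.
type Evaluation = {
  framework: string
  category: string
  path: string
}

const exampleEntry: Evaluation = {
  framework: 'Next.js',
  category: 'Auth',
  path: 'evals/your-eval',
}

console.log(exampleEntry.path)
```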
Example scores
[
  {
    "model": "claude-sonnet-4-5",
    "framework": "Next.js",
    "category": "Auth",
    "value": 0.8333333333333334,
    "updatedAt": "2026-01-06T17:51:27.901Z"
  },
  {
    "model": "gpt-5-chat-latest",
    "framework": "Next.js",
    "category": "Auth",
    "value": 0.6666666666666666,
    "updatedAt": "2026-01-06T17:51:30.871Z"
  },
  {
    "model": "claude-opus-4-5",
    "framework": "Next.js",
    "category": "Billing",
    "value": 1.0,
    "updatedAt": "2026-01-06T17:51:56.370Z"
  }
]
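These records can be aggregated however you like; as an illustrative sketch (this helper is not part of the repo), averaging the value field per model gives a quick summary:

```typescript
// Illustrative helper (not part of clerk-evals): average the `value`
// field of score records, grouped by model. Field names match the
// example scores shown above.
type Score = { model: string; framework: string; category: string; value: number }

function averageByModel(scores: Score[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {}
  for (const s of scores) {
    const acc = (sums[s.model] ??= { total: 0, count: 0 })
    acc.total += s.value
    acc.count += 1
  }
  return Object.fromEntries(
    Object.entries(sums).map(([model, { total, count }]) => [model, total / count]),
  )
}

const example: Score[] = [
  { model: 'claude-sonnet-4-5', framework: 'Next.js', category: 'Auth', value: 0.8333333333333334 },
  { model: 'claude-sonnet-4-5', framework: 'Next.js', category: 'Billing', value: 1.0 },
]
console.log(averageByModel(example)['claude-sonnet-4-5']) // ≈ 0.9167
```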

Debugging

# Run a single evaluation with debug output
bun start --eval "auth/routes" --debug

# Smoke test (one model, one eval)
bun start --eval "auth/routes" --smoke --debug

CLI Usage

bun start [options]
| Flag | Description |
| --- | --- |
| `--mcp` | Enable MCP tools (uses mcp.clerk.dev by default) |
| `--skills` | Enable skills tools (loads from ../skills/skills/) |
| `--model "claude-sonnet-4-0"` | Filter by exact model name (case-insensitive) |
| `--provider "anthropic"` | Filter by provider (openai, anthropic, google, vercel) |
| `--eval "protect"` | Filter evals by category or path |
| `--debug` | Save outputs to debug-runs/ |
| `--dry` | Print task summary without running |
| `--smoke` | Run only the first task (quick validation) |
| `--fail-under 70` | CI gate: fail if average score < threshold % |

# Baseline (no tools)
bun start --model "claude-sonnet-4-0" --eval "protect"

# With MCP tools
bun start --mcp --model "claude-sonnet-4-0" --eval "protect"

# With skills
bun start --skills --model "claude-sonnet-4-5"

# Local MCP server
MCP_SERVER_URL_OVERRIDE=http://localhost:8787/mcp bun start --mcp

# Dry run (see what would execute)
bun start --dry

Batch Runner

Run all 16 models sequentially with timeout and retry:

./run-evals.sh                              # All models, baseline + MCP
./run-evals.sh --models "gpt-5,claude-sonnet-4-5"  # Specific models
./run-evals.sh --baseline-only              # Skip MCP
./run-evals.sh --mcp-only                   # Skip baseline
./run-evals.sh --list                       # List available models

Braintrust Integration

Set BRAINTRUST_API_KEY to enable experiment logging and tracing:

# Single run: creates experiment automatically
BRAINTRUST_API_KEY=sk-... bun start --mcp

# Batch run: defers reporting, consolidates into one experiment per mode
BRAINTRUST_API_KEY=sk-... ./run-evals.sh

# Manual consolidated report from recent results
BRAINTRUST_API_KEY=sk-... bun report:braintrust --since "2026-03-19T17:00:00Z"

The eval runner uses wrapAISDK to auto-trace all generateText calls (inputs, outputs, tool invocations, token usage). Traces flow to Braintrust even during batch runs.

Agent Evals

Run evaluations using AI coding agents (Claude Code, Codex) instead of direct LLM calls.

Prerequisites

Agent evals spawn CLI tools as child processes; install them globally before running.

Both API keys must be set in your .env.

Usage

bun start:agent --agent claude-code [options]
| Flag | Description |
| --- | --- |
| `--agent, -a` | Agent type (required): claude-code, cursor |
| `--mcp` | Enable MCP tools |
| `--eval, -e` | Filter evals by path |
| `--debug, -d` | Save outputs to debug-runs/ |
| `--timeout, -t` | Timeout per eval (ms) |

Shortcuts:

bun agent:claude        # claude-code baseline
bun agent:claude:mcp    # claude-code with MCP

Examples:

# Run all evals with Claude Code
bun start:agent --agent claude-code

# Run specific eval with debug output
bun start:agent -a claude-code -e auth/protect -d

# Run with MCP tools enabled
bun start:agent --agent claude-code --mcp

Output Files

| Runner | Output | Description |
| --- | --- | --- |
| `bun start` | scores.json | Baseline scores (no tools) |
| `bun start --mcp` | scores-mcp.json | MCP scores (with tools) |
| `bun start --skills` | scores-skills.json | Skills scores |
| `bun start:agent` | agent-scores.json | Agent evaluation scores |
| `bun merge-scores` | llm-scores.json | Combined for llm-leaderboard |
| `bun report:braintrust` | Braintrust UI | Consolidated experiment per mode |

Workflow for llm-leaderboard

bun start              # 1. Baseline -> scores.json
bun start --mcp        # 2. MCP -> scores-mcp.json
bun merge-scores       # 3. Merge -> llm-scores.json

The merge script combines both score files and calculates improvement metrics:

{
  "model": "claude-sonnet-4-5",
  "label": "Claude Sonnet 4.5",
  "framework": "Next.js",
  "category": "Auth",
  "value": 0.83,
  "provider": "anthropic",
  "mcpScore": 0.95,
  "improvement": 0.12
}
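A minimal sketch of that improvement calculation, assuming the fields shown above (this is illustrative; the real logic lives in the merge script):

```typescript
// Illustrative sketch (not the repo's actual merge script): pair a
// baseline score with its MCP counterpart and compute the improvement
// as mcpScore - value, rounded to two decimals as in the example above.
type MergedScore = {
  model: string
  category: string
  value: number       // baseline score
  mcpScore: number    // score with MCP tools
  improvement: number // mcpScore - value
}

function mergeScores(model: string, category: string, baseline: number, mcp: number): MergedScore {
  return {
    model,
    category,
    value: baseline,
    mcpScore: mcp,
    improvement: Number((mcp - baseline).toFixed(2)),
  }
}

console.log(mergeScores('claude-sonnet-4-5', 'Auth', 0.83, 0.95).improvement) // 0.12
```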

Overview

This project is broken up into a few core pieces:

  • src/index.ts: The main entrypoint of the project. Models, reporters, and the runner are registered and executed here. Evaluations are defined in src/config/evaluations.ts.
  • /evals: Folders that contain a prompt and grading expectations. Runners currently assume that eval folders contain two files: graders.ts and PROMPT.md.
  • /runners: The primary logic responsible for loading evaluations, calling provider LLMs, and outputting scores.
  • /reporters: The primary logic responsible for sending scores somewhere — stdout, a file, etc.

Running

A runner takes a simple object as an argument:

{
  "provider": "openai",
  "model": "gpt-5",
  "evalPath": "/absolute/path/to/clerk-evals/src/evals/auth/protect"
}

It will resolve the provider and model to the respective SDK.

It will load the designated evaluation, generate LLM text from the prompt, and pass the result to graders.

Evaluations

At the moment, evaluations are simply folders that contain:

  • PROMPT.md: the instruction against which the model's output is evaluated
  • graders.ts: a module of grader functions that return true/false, signalling whether the model's output passed or failed. These are effectively our acceptance criteria.

Graders

Shared grader primitives live in src/graders/index.ts. Use them to declare new checks with a consistent, terse shape:

import { contains, defineGraders, judge } from '@/src/graders'
import { llmChecks } from '@/src/graders/catalog'

export const graders = defineGraders({
  references_middleware: contains('middleware.ts'),
  package_json: llmChecks.packageJsonClerkVersion,
  custom_flow_description: judge(
    'Does the answer walk through protecting a Next.js API route with Clerk auth() and explain the response states?',
  ),
})

  • contains / containsAny: case-insensitive substring checks by default
  • matches: regex checks
  • judge: thin wrappers around the LLM-as-judge scorer. Shared prompts live in src/graders/catalog.ts; add new reusable prompts there.
  • defineGraders: preserves type inference for the exported graders record.
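As a rough illustration of the kind of check these primitives perform (a sketch only; see src/graders/index.ts for the real implementation), contains might look like:

```typescript
// Sketch of a contains-style grader (assumed shape, not the actual
// implementation): a case-insensitive substring check, matching the
// default behavior described above.
const contains =
  (needle: string) =>
  (output: string): boolean =>
    output.toLowerCase().includes(needle.toLowerCase())

console.log(contains('middleware.ts')('Create a Middleware.TS file at the project root')) // true
```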

Score

For a given model and evaluation, the score is a value from 0 to 1: the fraction of grader functions that passed. For example, 5 of 6 graders passing yields 0.8333.
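The computation can be sketched like this (illustrative; grader shapes here are assumptions, not the repo's actual types):

```typescript
// Illustrative scoring sketch: run every grader against the model's
// output and take the fraction that returned true. The Grader type is
// an assumption for this example.
type Grader = (output: string) => boolean

function score(output: string, graders: Record<string, Grader>): number {
  const results = Object.values(graders).map((g) => g(output))
  const passed = results.filter(Boolean).length
  return passed / results.length
}

const demoGraders: Record<string, Grader> = {
  references_middleware: (o) => o.includes('middleware.ts'),
  mentions_auth: (o) => o.includes('auth()'),
}

console.log(score('Add middleware.ts and call auth()', demoGraders)) // 1
```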

Reporting

Three reporters:

  • console: color-coded ASCII table (category x model matrix)
  • file: saves scores to scores.json / scores-mcp.json / scores-skills.json
  • braintrust: logs experiments to Braintrust (opt-in via BRAINTRUST_API_KEY)

For batch runs, src/report-braintrust.ts consolidates all per-model results from SQLite into a single experiment per mode.

Interfaces

For the notable interfaces, see /interfaces.
