Evaluateur Walkthrough¶
This notebook provides a comprehensive walkthrough of evaluateur's capabilities for generating synthetic evaluation queries for LLM applications.
What is Evaluateur?¶
Evaluateur follows the dimensions → tuples → queries workflow (from Hamel Husain's evaluation FAQ):
- Dimensions: Define the axes of variation for your queries using Pydantic models
- Options: Generate diverse values for each dimension
- Tuples: Create combinations of options
- Queries: Convert tuples into natural language queries
This approach helps you systematically generate diverse, representative test queries for your LLM application.
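As a purely illustrative (hypothetical) example of what each stage produces, consider a model with two dimensions, topic and difficulty:
# Illustration only -- not library output. Shows the shape of each stage
# for a hypothetical model with "topic" and "difficulty" dimensions.
dimensions = {"topic": "the subject area", "difficulty": "complexity level"}
options = {
    "topic": ["photosynthesis", "linear algebra", "the French Revolution"],
    "difficulty": ["beginner", "advanced"],
}
tuples = [
    {"topic": "photosynthesis", "difficulty": "beginner"},
    {"topic": "linear algebra", "difficulty": "advanced"},
]
queries = [
    "Can you explain photosynthesis in simple terms?",
    "Walk me through a proof of the rank-nullity theorem.",
]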
Setup¶
Installation¶
pip install evaluateur
# or
uv add evaluateur
Environment Configuration¶
Evaluateur requires an LLM API key. Set the key for your chosen provider:
export OPENAI_API_KEY=sk-your-key-here
# or
export ANTHROPIC_API_KEY=sk-ant-...
Or create a .env file in your project root with:
OPENAI_API_KEY=sk-your-key-here
You can also override the default model (openai/gpt-4.1-mini):
export EVALUATEUR_MODEL=anthropic/claude-3-5-sonnet-latest
# Load environment variables from .env file
from dotenv import load_dotenv
load_dotenv()
Basic Workflow¶
Let's start with the simplest usage pattern: define a dimension model and generate queries.
from pydantic import BaseModel, Field
from evaluateur import Evaluator
class Query(BaseModel):
"""Dimensions for educational content queries."""
topic: str = Field(..., description="the subject area")
difficulty: str = Field(..., description="complexity level")
# Create evaluator with your dimension model
evaluator = Evaluator(Query)
# Generate queries using the complete pipeline
async for q in evaluator.run(
instructions="Generate diverse educational topics",
tuple_count=5,
):
print(f"Query: {q.query}")
print(f" From: {q.source_tuple.model_dump()}")
print()
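Top-level await works in notebooks and IPython. In a plain Python script, wrap the loop in an async function and drive it with asyncio.run, for example:
import asyncio

async def main() -> None:
    # Same pipeline as above, wrapped for use outside a notebook
    async for q in evaluator.run(
        instructions="Generate diverse educational topics",
        tuple_count=5,
    ):
        print(q.query)

asyncio.run(main())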
Step-by-Step Control¶
For more control, you can call each method separately. This lets you inspect and customize the output at each stage.
from pydantic import BaseModel, Field
from evaluateur import Evaluator, TupleStrategy
class CustomerQuery(BaseModel):
"""Dimensions for customer support queries."""
product: str = Field(..., description="product category")
issue_type: str = Field(..., description="type of customer issue")
sentiment: str = Field(..., description="customer emotional state")
evaluator = Evaluator(CustomerQuery)
# Step 1: Generate options for each dimension
options = await evaluator.options(
instructions="Focus on e-commerce scenarios",
count_per_field=4,
)
print("Generated options:")
for field, values in options.model_dump().items():
print(f" {field}: {values}")
# Step 2: Generate tuples (combinations of options)
tuples = []
async for t in evaluator.tuples(
options,
strategy=TupleStrategy.CROSS_PRODUCT,
count=8,
seed=42,
):
tuples.append(t)
print(f"Tuple: {t.model_dump()}")
print(f"\nGenerated {len(tuples)} tuples")
# Step 3: Convert tuples to natural language queries
print("Generated queries:\n")
async for q in evaluator.queries(
tuples=tuples,
instructions="Write as if you're a frustrated customer",
):
print(f"Query: {q.query}")
print(f" From: {q.source_tuple.model_dump()}")
print()
Fixed vs Generated Options¶
You can mix fixed options (using list[str]) with dynamically generated ones (using str).
- Fixed options: Define as list[str] with explicit values - these won't be modified
- Generated options: Define as str with a description - the LLM generates diverse values
from pydantic import BaseModel, Field
from evaluateur import Evaluator
class SupportTicket(BaseModel):
# Fixed options - these values are preserved exactly
priority: list[str] = ["low", "medium", "high", "critical"]
channel: list[str] = ["email", "chat", "phone"]
# Dynamic options - generated by the LLM
product_area: str = Field(..., description="part of the product")
issue_category: str = Field(..., description="type of technical issue")
evaluator = Evaluator(SupportTicket)
# Only generates options for product_area and issue_category
options = await evaluator.options(count_per_field=4)
print("Priority (fixed):", options.priority)
print("Channel (fixed):", options.channel)
print("Product area (generated):", options.product_area)
print("Issue category (generated):", options.issue_category)
Tuple Generation Strategies¶
Evaluateur supports two strategies for generating tuples:
- CROSS_PRODUCT (default): Samples from the Cartesian product of all options
- AI: Uses an LLM to generate coherent, realistic combinations
from pydantic import BaseModel, Field
from evaluateur import Evaluator, TupleStrategy
class Query(BaseModel):
domain: str = Field(..., description="knowledge domain")
audience: str = Field(..., description="target audience")
evaluator = Evaluator(Query)
options = await evaluator.options(count_per_field=5)
print("Options:")
print(f" Domains: {options.domain}")
print(f" Audiences: {options.audience}")
print()
# Cross-product strategy: random sampling from all combinations
print("CROSS_PRODUCT strategy:")
async for t in evaluator.tuples(
options,
strategy=TupleStrategy.CROSS_PRODUCT,
count=5,
seed=42,
):
print(f" {t.model_dump()}")
# AI strategy: LLM picks coherent combinations
print("AI strategy:")
async for t in evaluator.tuples(
options,
strategy=TupleStrategy.AI,
count=5,
):
print(f" {t.model_dump()}")
When to use each:
- CROSS_PRODUCT: Good for exhaustive coverage and reproducibility. Efficient for large option spaces (see the size check after this list).
- AI: Better for semantically coherent combinations where some pairs make more sense together.
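The "large option spaces" point is easy to see with a quick count using only the standard library: with five options in each of two dimensions, the full Cartesian product already holds 25 combinations, and CROSS_PRODUCT samples count of them rather than enumerating everything.
import itertools

# 5 options per dimension x 2 dimensions = 25 possible combinations
domains = [f"domain_{i}" for i in range(5)]
audiences = [f"audience_{i}" for i in range(5)]
print(len(list(itertools.product(domains, audiences))))  # 25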
Reproducibility with Seeds¶
Use the seed parameter for reproducible tuple sampling. The same seed produces the same tuples.
from pydantic import BaseModel, Field
from evaluateur import Evaluator, TupleStrategy
class Query(BaseModel):
category: list[str] = ["tech", "health", "finance", "education", "entertainment"]
tone: list[str] = ["formal", "casual", "technical", "friendly"]
evaluator = Evaluator(Query)
options = await evaluator.options()
# Run 1: seed=42
print("Seed 42 (run 1):")
async for t in evaluator.tuples(options, count=3, seed=42):
print(f" {t.model_dump()}")
# Run 2: same seed = same tuples
print("\nSeed 42 (run 2) - identical:")
async for t in evaluator.tuples(options, count=3, seed=42):
print(f" {t.model_dump()}")
# Run 3: different seed = different tuples
print("\nSeed 123 - different:")
async for t in evaluator.tuples(options, count=3, seed=123):
print(f" {t.model_dump()}")
Goal-Guided Optimization¶
Evaluateur supports goal-guided query generation. Goals are flat, flexible, and optionally categorized using the CTO framework:
- Components: What system parts should be tested? (e.g., freshness checks, citation accuracy)
- Trajectories: What user journeys should be covered? (e.g., conflict handling, multi-step workflows)
- Outcomes: What output qualities matter? (e.g., checklist-ready, actionable recommendations)
Categories are optional -- you can use any string or skip them entirely.
Structured Goals with GoalSpec¶
from pydantic import BaseModel, Field
from evaluateur import Evaluator, Goal, GoalSpec
class PriorAuthQuery(BaseModel):
"""Dimensions for prior authorization queries."""
payer: str = Field(..., description="Insurance payer")
procedure_type: str = Field(..., description="Type of medical procedure")
patient_context: str = Field(..., description="Patient situation")
# Define structured goals
goals = GoalSpec(goals=[
Goal(
name="freshness checks",
text="Test that responses use current policy information with effective dates",
category="components",
),
Goal(
name="citation accuracy",
text="Ensure responses cite specific policy sections and references",
category="components",
),
Goal(
name="conflict handling",
text="Test behavior when payer policy conflicts with clinical guidelines",
category="trajectories",
),
Goal(
name="checklist-ready",
text="Produce responses that list requirements and documents needed",
category="outcomes",
),
])
evaluator = Evaluator(PriorAuthQuery)
print("Goal-guided queries:\n")
async for q in evaluator.run(
goals=goals,
tuple_count=6,
seed=42,
):
print(f"[{q.metadata.goal_focus}] {q.query}")
print()
Free-Form Goals¶
For quick iteration, you can provide goals as plain text. Evaluateur parses structured lists (numbered/bulleted) directly without an LLM call. CTO section headers are auto-detected.
from pydantic import BaseModel, Field
from evaluateur import Evaluator
class Query(BaseModel):
topic: str = Field(..., description="subject area")
complexity: str = Field(..., description="question difficulty")
evaluator = Evaluator(Query)
# Free-form text goals with CTO headers (parsed without LLM)
goals = """
Components:
- Prioritize freshness checks and citation accuracy
Trajectories:
- Include conflict handling when sources disagree
Outcomes:
- Produce checklist-ready outputs that are easy to verify
"""
print("Free-form goal-guided queries:\n")
async for q in evaluator.run(
goals=goals,
tuple_count=4,
):
print(f"[{q.metadata.goal_focus}] {q.query}")
print()
Goal Weights¶
Control sampling probability with weights. Higher weight = more likely to be selected.
from collections import Counter
from pydantic import BaseModel, Field
from evaluateur import Evaluator, Goal, GoalSpec
class Query(BaseModel):
topic: str = Field(..., description="subject")
# Weighted goals: freshness is 3x more likely than others
weighted_goals = GoalSpec(goals=[
Goal(name="freshness", text="Test data currency", category="components", weight=3.0),
Goal(name="conflicts", text="Test conflict handling", category="trajectories", weight=1.0),
Goal(name="checklists", text="Request structured output", category="outcomes", weight=1.0),
])
evaluator = Evaluator(Query)
# Count goal focus across many queries
focus_counts: Counter[str] = Counter()
async for q in evaluator.run(
goals=weighted_goals,
tuple_count=30,
seed=42,
):
focus_counts[q.metadata.goal_focus or "none"] += 1
print("Goal focus distribution:")
for goal, count in sorted(focus_counts.items()):
print(f" {goal}: {count} ({count/30*100:.0f}%)")
Goal Modes¶
Evaluateur supports three goal modes:
- sample (default): Each query focuses on one goal (weighted random)
- cycle: Rotates through goals consecutively (even coverage)
- full: All goals are included in every query prompt
from pydantic import BaseModel, Field
from evaluateur import Evaluator, Goal, GoalSpec
class Query(BaseModel):
topic: str = Field(..., description="subject area")
goals = GoalSpec(goals=[
Goal(name="accuracy", text="Test factual accuracy", category="components"),
Goal(name="error recovery", text="Test error handling", category="trajectories"),
Goal(name="actionable", text="Request clear next steps", category="outcomes"),
])
evaluator = Evaluator(Query)
# Sample mode: one goal per query
print("SAMPLE mode (one goal per query):")
async for q in evaluator.run(
goals=goals,
goal_mode="sample",
tuple_count=3,
):
print(f" Focus: {q.metadata.goal_focus} (category: {q.metadata.goal_category})")
# Full mode: all goals in every query
print("FULL mode (all goals in every query):")
async for q in evaluator.run(
goals=goals,
goal_mode="full",
tuple_count=3,
):
print(f" Focus: {q.metadata.goal_focus} (all goals applied)")
print(f" Query: {q.query[:80]}...")
print()
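The cycle mode is not shown above; it rotates through the goals in order so each goal gets even coverage. A quick sketch, passing goal_mode="cycle" the same way as the other modes:
# Cycle mode: goals are assigned in rotation for even coverage
print("CYCLE mode (rotating through goals):")
async for q in evaluator.run(
    goals=goals,
    goal_mode="cycle",
    tuple_count=3,
):
    print(f"  Focus: {q.metadata.goal_focus} (category: {q.metadata.goal_category})")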
Inspecting Query Metadata¶
Each generated query includes rich metadata for traceability.
from pydantic import BaseModel, Field
from evaluateur import Evaluator, Goal, GoalSpec
class Query(BaseModel):
domain: str = Field(..., description="knowledge domain")
difficulty: str = Field(..., description="question difficulty")
goals = GoalSpec(goals=[
Goal(name="accuracy", text="Test factual accuracy", category="components"),
Goal(name="clarity", text="Request clear explanations", category="outcomes"),
])
evaluator = Evaluator(Query)
async for q in evaluator.run(
goals=goals,
tuple_count=2,
seed=42,
):
print("Query:", q.query)
print("Source tuple:", q.source_tuple.model_dump())
print("Metadata:")
print(f" - goal_guided: {q.metadata.goal_guided}")
print(f" - goal_mode: {q.metadata.goal_mode}")
print(f" - goal_focus: {q.metadata.goal_focus}")
print(f" - goal_category: {q.metadata.goal_category}")
print()
Collecting and Serializing Results¶
Store generated queries for later analysis or use in your evaluation pipeline.
import json
from pydantic import BaseModel, Field
from evaluateur import Evaluator
class Query(BaseModel):
topic: str = Field(..., description="subject area")
style: str = Field(..., description="writing style")
evaluator = Evaluator(Query)
# Collect all results
results = []
async for q in evaluator.run(tuple_count=5, seed=42):
results.append(
{
"query": q.query,
"tuple": q.source_tuple.model_dump(),
"metadata": q.metadata.model_dump(),
}
)
# Pretty print the results
print(json.dumps(results, indent=2))
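To hand the collected queries to an evaluation pipeline, writing them out as JSON Lines (one object per line) is a common pattern. A small sketch, with queries.jsonl as an arbitrary filename:
# Persist the results as JSONL for downstream evaluation tooling
with open("queries.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")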
Provider Configuration¶
Evaluateur uses Instructor under the hood, so any provider that Instructor supports will work.
Using Different Models¶
from pydantic import BaseModel, Field
from evaluateur import Evaluator
class Query(BaseModel):
topic: str = Field(..., description="subject")
# Use a specific model
evaluator = Evaluator(Query, llm="openai/gpt-4o")
async for q in evaluator.run(tuple_count=2):
print(f"Query: {q.query}")
Using Other Providers¶
from evaluateur import Evaluator
# Anthropic
evaluator = Evaluator(Query, llm="anthropic/claude-3-5-sonnet-latest")
# Ollama (local)
evaluator = Evaluator(Query, llm="ollama/llama3.2")
# Advanced: bring your own Instructor client
import instructor
from anthropic import AsyncAnthropic
inst = instructor.from_anthropic(AsyncAnthropic())
evaluator = Evaluator(Query, client=inst, model_name="claude-3-5-sonnet-latest")
See the Provider Configuration guide for more examples.
Summary¶
This walkthrough covered the main evaluateur capabilities:
- Basic workflow: dimensions → options → tuples → queries
- Step-by-step control: Call options(), tuples(), and queries() separately
- Fixed vs generated options: Mix list[str] (fixed) with str (generated)
- Tuple strategies: CROSS_PRODUCT for coverage, AI for coherence
- Reproducibility: Use seeds for deterministic sampling
- Goal-guided optimization: Shape queries with flat, categorizable goals
- Goal modes: sample for diversity, cycle for even coverage, full for all goals at once
- Metadata inspection: Track source tuples, goal focus, and goal categories
- Provider configuration: Use any LLM provider via Instructor
For more details, see:
- Dimensions, Tuples, Queries - Core concepts
- Goal-Guided Optimization - Goals in depth
- Context Builders - Advanced customization
- Provider Configuration - LLM provider setup