| Capability | Corresponding Component |
|---|---|
| Implement task execution logic | Workflow Function |
| Evaluate model outputs objectively | Judge Function, JudgeOutput |
| Run automated selection over a dataset | DatasetConfig, select_model |
## Core Components

The model selection process involves three core components that work together:

- Workflow Function: An async function that executes your agent logic with a given model and returns the result.
- Judge Function: Evaluates the workflow output and returns a reward indicating performance (higher is better).
- Task Dataset: A collection of tasks for evaluating and comparing models.
`WorkflowOutput` and `JudgeOutput` are framework-provided data classes. Your workflow and judge functions must return instances of these types; do not define your own output classes.

## Prerequisites
Before running the examples, install the required dependencies (e.g., `pip install agentscope`) and set your model provider's API key as an environment variable.

## Setup & Configuration
Define the candidate models to be evaluated; these become the `candidate_models` list passed to `select_model`.

## Defining the Workflow Function
The workflow function executes your agent logic with a given model and returns a standardized result. It must:

- Accept a `task` (e.g., a question or input) and a `model` instance
- Run inference using that model
- Return a `WorkflowOutput` object containing the model's response
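A minimal sketch of such a workflow function. The `WorkflowOutput` dataclass and `echo_model` below are illustrative stand-ins so the snippet runs on its own; in real code you would import the framework's `WorkflowOutput` (per the note above) and receive a real model instance:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class WorkflowOutput:
    """Stand-in for the framework-provided class; do not redefine this in real code."""
    response: str
    metadata: dict = field(default_factory=dict)

async def workflow_func(task: dict, model) -> WorkflowOutput:
    """Run one task through the given model and wrap the raw reply."""
    # Assumes the model exposes an awaitable chat-style call.
    reply = await model(task["question"])
    return WorkflowOutput(response=reply)

async def echo_model(prompt: str) -> str:
    """Toy model used only to make the sketch runnable."""
    return f"answer to: {prompt}"

out = asyncio.run(workflow_func({"question": "2 + 2?"}, echo_model))
print(out.response)  # answer to: 2 + 2?
```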
Your function must return `WorkflowOutput`; do not define custom output classes.

## Implementing the Judge Function
The judge function evaluates the output of the workflow and assigns a numerical reward (higher = better), along with optional diagnostic metrics. It must:

- Accept the original `task` and the `response` from the workflow
- Compute a scalar `reward` (e.g., accuracy, BLEU score, or inverse latency)
- Return a `JudgeOutput` object with `reward` and `metrics`
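A sketch of an exact-match judge; the `JudgeOutput` dataclass below is only a stand-in for the framework-provided class, and the reward/metrics shape follows the requirements listed above:

```python
import asyncio
from dataclasses import dataclass, field

@dataclass
class JudgeOutput:
    """Stand-in for the framework-provided class; import the real one in practice."""
    reward: float
    metrics: dict = field(default_factory=dict)

async def judge_func(task: dict, response: str) -> JudgeOutput:
    """Exact-match accuracy: reward 1.0 if the gold answer appears in the response."""
    correct = task["answer"].strip().lower() in response.strip().lower()
    return JudgeOutput(
        reward=1.0 if correct else 0.0,
        metrics={"exact_match": correct, "response_chars": len(response)},
    )

result = asyncio.run(judge_func({"question": "2 + 2?", "answer": "4"}, "The answer is 4."))
print(result.reward)  # 1.0
```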
## Using Built-in Judges
AgentScope provides built-in judge functions for common efficiency metrics.

## Running Model Selection
With your components defined, run the model selection process by calling `select_model` with:

- `workflow_func`: The workflow function that executes tasks with different models.
- `judge_func`: The judge function that evaluates performance.
- `train_dataset`: Configuration for the evaluation dataset.
- `candidate_models`: List of models to compare.
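Conceptually, `select_model` runs every candidate over the dataset, scores each run with the judge, and keeps the model with the best mean reward. The following minimal stand-in illustrates that loop with toy components; it is not the real AgentScope API:

```python
import asyncio

async def select_model(workflow_func, judge_func, train_dataset, candidate_models):
    """Illustrative selection loop: mean reward per candidate, best wins."""
    best_model, best_reward = None, float("-inf")
    for model in candidate_models:
        rewards = []
        for task in train_dataset:
            response = await workflow_func(task, model)
            rewards.append(await judge_func(task, response))
        mean_reward = sum(rewards) / len(rewards)
        if mean_reward > best_reward:
            best_model, best_reward = model, mean_reward
    return best_model, best_reward

# Toy components to exercise the loop.
async def wf(task, model):
    return model["answers"].get(task["question"], "")

async def jf(task, response):
    return 1.0 if response == task["answer"] else 0.0

dataset = [{"question": "2+2", "answer": "4"}, {"question": "3+3", "answer": "6"}]
models = [
    {"name": "weak", "answers": {"2+2": "4"}},
    {"name": "strong", "answers": {"2+2": "4", "3+3": "6"}},
]
best, reward = asyncio.run(select_model(wf, jf, dataset, models))
print(best["name"], reward)  # strong 1.0
```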
## Supported Dataset Formats

`DatasetConfig` supports multiple data sources:
| Type | Example path | Format |
|---|---|---|
| Hugging Face Dataset | "openai/gsm8k" | Must specify name and split |
| Local JSON File | "./data/tasks.json" | Array of objects with question/answer |
| Local JSONL File | "./data/tasks.jsonl" | One JSON object per line |
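To make the table concrete, here is a hypothetical configuration sketch. The field names (`path`, `name`, `split`) are assumptions inferred from the table, not verified AgentScope signatures; in real code, import `DatasetConfig` from AgentScope and check its actual parameters:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetConfig:
    """Hypothetical stand-in mirroring the table above; real fields may differ."""
    path: str                    # HF dataset id or local file path
    name: Optional[str] = None   # HF config name (required for HF datasets)
    split: Optional[str] = None  # HF split (required for HF datasets)

local_dataset = DatasetConfig(path="./data/tasks.json")
hf_dataset = DatasetConfig(path="openai/gsm8k", name="main", split="train")
```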
### Minimal JSON Example (tasks.json)
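A minimal `tasks.json` in the array-of-objects format described above (the `question`/`answer` field names follow the table):

```json
[
  {"question": "What is the capital of France?", "answer": "Paris"},
  {"question": "What is 2 + 2?", "answer": "4"}
]
```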
## Complete Examples
### Example 1: Token Usage Optimization
This example selects the best model based on token consumption.

### Example 2: Translation Quality with BLEU Score
This example selects the best model for translation tasks based on BLEU score.

## Key Benefits
- Performance optimization: Identify the model that achieves the highest accuracy on your specific task.
- Cost efficiency: Select models that achieve the desired performance with lower computational costs.
- Latency control: Choose models that meet your speed constraints without sacrificing quality.
- Resource awareness: Find the best model that fits within your infrastructure limitations.
## Best Practices
- Choose appropriate metrics: Align your judge function with your actual goals (accuracy, efficiency, cost, etc.).
- Monitor detailed metrics: Inspect the returned metrics to understand the trade-offs between models.
- Validate results: Manually check a few outputs from your selected model to ensure quality meets expectations.