Reinforcement Learning

This tutorial focuses on the Model Weights Tuning method, which uses Reinforcement Learning (RL) to optimize your agent’s model parameters. This method can significantly enhance agent performance, but it requires substantial resources, including GPUs.

Prerequisites

Before proceeding, ensure you have:

AgentScope v1.0.14 or higher
Linux system
At least one Nvidia GPU with compute capability 8.0+ (e.g., Nvidia A100, H100, RTX 3090, RTX 4090, RTX 5090, etc.)
CUDA 12.8 or higher

Install the required dependencies:

pip install trinity-rft>=0.5.1
pip install flash-attn==2.8.1 --no-build-isolation

For installation issues, see the Trinity-RFT installation guide.

Start a Ray cluster to manage distributed tuning:

ray start --head

Refine Workflow Function

Continuing from Overview, assume you have a task dataset and judge function ready. Now, refine your workflow function for model weights tuning:

async def example_workflow_function(
    task: Dict,
    model: ChatModelBase,
) -> WorkflowOutput:
    agent = ReActAgent(
        name="react_agent",
        sys_prompt="You are a helpful math assistant.",
        model=model,  # use the model passed in
        formatter=OpenAIChatFormatter(),
    )

    response = await agent.reply(
        msg=Msg(
            "user",
            task["question"],  # extract question from task
            role="user",
        ),
    )

    return WorkflowOutput(  # wrap the response in WorkflowOutput
        response=response,
    )

This refined function now accepts model as a input parameter, and uses it to initialize the agent. The model has the same interface as OpenAIChatModel, but its weights will be tuned during the tuning process.

Configuration & Tuning

After refining the workflow function, you can set up the tuning configuration and start the tuning process.

from agentscope.tuner import tune, AlgorithmConfig, DatasetConfig, TunerModelConfig

# your workflow / judge function here...

if __name__ == "__main__":
    dataset = DatasetConfig(path="my_dataset", split="train")
    model = TunerModelConfig(model_path="Qwen/Qwen3-0.6B", max_model_len=16384)
    algorithm = AlgorithmConfig(
        algorithm_type="multi_step_grpo",
        group_size=8,
        batch_size=32,
        learning_rate=1e-6,
    )
    tune(
        workflow_func=run_react_agent,
        judge_func=judge_function,
        model=model,
        train_dataset=dataset,
        algorithm=algorithm,
    )

Key configurations include:

DatasetConfig: Specifies the task dataset for tuning.
- path: The path to the dataset, which can be a local path or a dataset from Hugging Face.
- split: The split of the dataset to be used for tuning (e.g., “train”, “test”, etc.).
- name: (Optional) The name of the dataset to be used for tuning. Some Huggingface datasets have multiple subsets, you can specify the subset name to select the specific subset for tuning. If not specified, it will use the default subset.
TunerModelConfig: Specifies the model to be tuned.
- model_path: The path to the model, which can be a local path or a model from Hugging Face.
- max_model_len: The maximum sequence length the model can handle. Longer lengths will cost more GPU memory.
- inference_engine_num: The number of inference engines to use for parallel inference. This can help speed up the inference process during tuning. Default is 1.
- tensor_parallel_size: Use tensor parallelism to split the model across multiple GPUs. Set this value to the number of GPUs you want to use for tensor parallelism. Default is 1 (no tensor parallelism).
- ulysses_sequence_parallel_size: Use Ulysses sequence parallelism to split the input sequences across multiple GPUs. Set this value to the number of GPUs you want to use for sequence parallelism. Default is 1 (no sequence parallelism).

When using tensor or Ulysses sequence parallelism, ensure that your cluster has enough GPUs to accommodate the model. The total number of GPUs used will be:inference_engine_num * tensor_parallel_size + ulysses_sequence_parallel_sizeIf you only have one GPU, please set inference_engine_num, tensor_parallel_size, and ulysses_sequence_parallel_size to 1.

AlgorithmConfig: Specifies the tuning algorithm and its hyperparameters.
- algorithm_type: The type of tuning algorithm to use. Currently, we only recommend using "multi_step_grpo" for model weights tuning, which is a variant of GRPO algorithm designed for agentic scenarios.
- group_size: Each task will be executed group_size times to collect multiple responses in GRPO series algorithms. A larger group size can provide more stable reward estimates but will increase the tuning time.
- batch_size: The number of tasks to be processed in each tuning iteration.
- learning_rate: The learning rate for updating the model weights during tuning.
- save_interval_steps: The interval (in number of iterations) at which to save the model checkpoints. Default is 100.

Then you can save the above code into a Python file (e.g., main.py) and run it:

python main.py

Monitor Tuning Process

Each run creates directory under the current workspace.

current_working_directory/
    └── checkpoints/
        └──AgentScope/
            └── Experiment-20260104185355/  # each run saved in a sub-directory with timestamp
                ├── monitor/          # Monitor (e.g., Tensorboard) cache directory
                ├── log/              # tuning logs
                └── global_step_x/    # the model checkpoint saved at step x

You can monitor tuning progress with TensorBoard:

tensorboard --logdir checkpoints/AgentScope/<your-experiment-subdir>/monitor/tensorboard

Get Tuned Model

Checkpoints are saved every save_interval_steps steps as introduced in AlgorithmConfig. You can find the saved model checkpoints in the corresponding global_step_x directory. The saved checkpoints are in pytorch format, which is hard to use directly. You can use Trinity-RFT CLI to convert them into Hugging Face format for easier usage:

trinity convert --checkpoint-dir checkpoints/AgentScope/Experiment-20260104185355/global_step_100

This command will convert the checkpoint saved at step 100 into Hugging Face format and save it in the same directory. Then you copy the converted model to your desired location for later use.

Advanced Features

This section introduces some advanced features not covered in the above example, which can further improve the tuning performance or provide more insights during the tuning process.

LLM-as-a-Judge

In addition to using custom judge functions, you can also leverage powerful LLMs as judges to evaluate the agent’s performance during tuning. This can be particularly useful when designing a custom judge function is challenging. However, the tuning process requires frequent calls to the judge model, which can easily reach the rate limits of public LLM APIs. To address this, tuner provides auxiliary_models to deploy your own judge model in the training cluster. auxiliary_models are different from the main model being tuned, and they are only used within the workflow function and judge function for inference. Below are the steps to set up and use LLM-as-a-Judge in the tuning process:

Add `auxiliary_models` parameter

Modifiy the signature of your workflow function and judge function and add configurations for the auxiliary models.

async def workflow_function(
    task: Dict,
    model: ChatModelBase,
    auxiliary_models: Optional[Dict[str, ChatModelBase]],
) -> WorkflowOutput:
    # your workflow logic here...

async def judge_function(
    task: Dict,
    response: Any,
    auxiliary_models: Optional[Dict[str, ChatModelBase]],
) -> JudgeOutput:
    # your judge logic here...

The auxiliary_models parameter is a dictionary that contains the auxiliary models used in the tuning process. You can access the model using auxiliary_models[model_name] within your workflow function and judge function. Auxiliary models also have the same interface as OpenAIChatModel, so you can use them just like others.

Configure auxiliary models

When calling the tune function, you can configure the auxiliary models with a dictionary of TunerModelConfig. And the keys of the dictionary should match the names used in the auxiliary_models parameter of your workflow function and judge function.

from agentscope.tuner import TunerModelConfig

auxiliary_models = {
    "judge_model": TunerModelConfig(
        model_path="your-judge-model-path",
        max_model_len=8192,
    ),
}

tune(
    workflow_func=workflow_function,
    judge_func=judge_function,
    model=model,
    train_dataset=dataset,
    algorithm=algorithm,
    auxiliary_models=auxiliary_models,
)

The auxiliary_models requires extra GPUs during tuning. Each auxiliary model will occupy inference_engine_num * tensor_parallel_size GPUs.

Runtime Monitoring

During the tuning process, it’s often helpful to monitor the running status of workflow / judge functions in real time. You can use the metrics parameter in the WorkflowOutput and JudgeOutput to log metrics you want to monitor during tuning. These metrics will be automatically logged to TensorBoard for visualization. For example:

async def workflow_function(
    task: Dict,
    model: ChatModelBase,
) -> WorkflowOutput:
    start_time = time.time()
    # your workflow logic here...
    end_time = time.time()
    return WorkflowOutput(
        response=response,
        metrics={
            "workflow_response_time": end_time - start_time,  # log response time for monitoring
            # you can log any other metrics you want to monitor here...
        },
    )


async def judge_function(
    task: Dict,
    response: Any,
) -> JudgeOutput:
    start_time = time.time()
    # your judge logic here...
    end_time = time.time()
    return JudgeOutput(
        reward=reward,
        metrics={
            "judge_response_time": end_time - start_time,  # log response time for monitoring
            # make sure the metric keys in workflow function and judge function are different to avoid conflicts in TensorBoard
            # you can log any other metrics you want to monitor here...
        },
    )

Except for using TensorBoard to visualize the tuning metrics, you can also enable logger-based runtime monitoring to log the runtime information of each function execution. To enable logger-based runtime monitoring, you only need to add a logger parameter to your workflow function and judge function:

from logging import Logger

async def workflow_function(
    task: Dict,
    model: ChatModelBase,
    logger: Logger,
) -> WorkflowOutput:
    logger.info(f"Processing task: {task['id']}")
    # your workflow logic here...
    if error_occurred:
        logger.error(f"Error occurred while processing task: {task['id']}")
    logger.info(f"Successfully processed task: {task['id']}")
    # your workflow logic here...

The logger parameter is an instance of Python’s built-in Logger class, but it is pre-configured to print logs to both console and log files in the directory introduced in Monitor Tuning Process. You can use this logger to log any information you want during the execution of the workflow function and judge function, which can help you better understand the tuning process and debug if necessary. Because the tuning process will run multiple parallel workers to execute the workflow function and judge function, each worker will have its own log file in the log directory. The log files are named as explorer_runner_{worker_id}.log, where {worker_id} is the ID of the worker. Below is an example structure of the log directory:

path/to/log/
    ├── explorer_runner_0.log
    ├── explorer_runner_1.log
    ├── explorer_runner_2.log
    └── ...

You can check these log files to see the runtime information of each worker during the tuning process.

Tuning without Local GPU

Model weights tuning usually requires GPUs for training. If you lack local GPUs, you can still perform tuning by leveraging remote training APIs such as TuFT and Tinker. To use these APIs, set the required environment variables before starting the Ray cluster:

# For TuFT, set your server address:
# export TINKER_BASE_URL=http://your-tuft-server-address
export TINKER_API_KEY=your-api-key
ray start --head

When configuring the model for tuning, specify a model_path supported by your remote training API (e.g., TuFT or Tinker), and provide a TinkerConfig with your desired LoRA rank:

from agentscope.tuner import TunerModelConfig, TinkerConfig

model = TunerModelConfig(
    model_path="Qwen/Qwen3-4B-Instruct-2507",  # must be a model supported by the remote training API you are using
    max_model_len=16384,
    tinker_config=TinkerConfig(
        rank=16  # LoRA rank for tuning; higher values may improve performance but increase resource usage and tuning time
    )
)

Other tuning configurations remain as described in previous sections. For more details, see the TuFT documentation and Tinker documentation.

Configuration using YAML File

The examples above show how to configure tuning using Python code. Alternatively, you can use a YAML file for configuration, which is often more convenient for managing complex setups. The following YAML configuration corresponds to the previous Python example:

checkpoint_root_dir: ./checkpoints  # Root directory for checkpoints and logs
project: AgentScope                 # Project name (sub-directory under checkpoint_root_dir)
name: Experiment-timestamp          # Experiment name (sub-directory under project)
model:
  model_path: Qwen/Qwen3-0.6B       # Model path (local or Hugging Face)
  max_model_len: 16384              # Maximum sequence length
  max_response_tokens: 4096
algorithm:
  algorithm_type: multi_step_grpo   # Tuning algorithm type (recommended: "multi_step_grpo")
  repeat_times: 8                   # Number of repeats per task (corresponds to group_size)
  optimizer:
    lr: 1e-6                        # Learning rate
buffer:
  total_epochs: 1                   # Total epochs to process the dataset
  batch_size: 32                    # Batch size per tuning iteration
  explorer_input:
    taskset:
      path: '/path/to/my_dataset'   # Dataset path (local or Hugging Face)
      split: 'train'                # Dataset split for tuning
explorer:
  rollout_model:
    engine_num: 1                   # Number of inference engines for parallel inference
    tensor_parallel_size: 1         # Tensor parallelism size
synchronizer:
  sync_style: explorer_driven
  sync_method: 'nccl'
trainer:
  save_interval: 100                # Checkpoint save interval (steps)
  ulysses_sequence_parallel_size: 1 # Ulysses sequence parallelism size
monitor:
  monitor_type: tensorboard         # Monitoring tool (tensorboard, wandb, mlflow, swanlab)
log:
  level: INFO                       # Logging level (DEBUG, INFO, WARNING, ERROR)

Save this YAML configuration to a file (e.g., config.yaml) and load it in your Python code when calling the tune function:

# Define your workflow and judge functions ...

if __name__ == "__main__":
    tune(workflow_func=workflow_function, judge_func=judge_function, config_path="config.yaml")

With this approach, all tuning configurations are loaded from the YAML file, making it easier to manage and modify settings for different runs.

For Trinity-RFT, YAML configuration is the only supported method. Python code configuration classes in tuner are converted to YAML files and passed to Trinity-RFT. For more details, see the Trinity-RFT Configuration Guide.If both Python code and YAML configurations are set, Python code takes precedence, allowing you to override specific YAML settings. However, to avoid confusion, it’s best to use only one configuration method for consistency.

Get Started

Tutorial

Out-of-box Agents

Basic Concepts

Building Blocks

Observe & Evaluate

Tune Agent

Deploy & Serve

Others

Reinforcement Learning

Prerequisites

Refine Workflow Function

Configuration & Tuning

Monitor Tuning Process

Get Tuned Model

Advanced Features

LLM-as-a-Judge

Runtime Monitoring

Tuning without Local GPU

Configuration using YAML File

Get Started

Tutorial

Out-of-box Agents

Basic Concepts

Building Blocks

Observe & Evaluate

Tune Agent

Deploy & Serve

Others

Documentation Index

​Prerequisites

​Refine Workflow Function

​Configuration & Tuning

​Monitor Tuning Process

​Get Tuned Model

​Advanced Features

​LLM-as-a-Judge

​Runtime Monitoring

​Tuning without Local GPU

​Configuration using YAML File

Prerequisites

Refine Workflow Function

Configuration & Tuning

Monitor Tuning Process

Get Tuned Model

Advanced Features

LLM-as-a-Judge

Runtime Monitoring

Tuning without Local GPU

Configuration using YAML File