Agent middleware is the mechanism for injecting custom logic (logging, tracing, input rewriting, access control) into key points of the agent execution pipeline without modifying the agent or model code.

AgentScope exposes 6 hook positions plus a tool-provider hook, covering the full path from the outer reply process down to the raw model API call:
| Position | Type | Description |
| --- | --- | --- |
| on_reply | Onion | Wraps a complete reply, covering all ReAct rounds, tool executions, and the final output |
| on_reasoning | Onion | Wraps a single ReAct round's reasoning step (input assembly → model call → stream decoding) |
| on_acting | Onion | Wraps a single tool call execution |
| on_model_call | Onion | Wraps the underlying ChatModel API call; the hook closest to the model |
| on_compress_context | Onion | Wraps Agent.compress_context(); fires before each reasoning step, when the agent decides whether to compress its context |
| on_system_prompt | Transformer | Fires every time the system prompt is assembled; multiple middlewares chain in sequence, each transforming the previous one's output |
| list_tools | Tool source | Optional. Returns a list[ToolBase] that the middleware contributes. Not invoked automatically: the caller assembling the agent's toolkit decides whether to call it and how to merge the result. |
These hooks operate at the agent level. For per-tool onion hooks that fire on every invocation of a specific tool — regardless of whether it’s called inside or outside an agent — see Tool Middleware.
The three types differ as follows:
Onion — middleware wraps the next handler, allowing logic before/after next_handler() and observation of the intermediate event stream.
Transformer — middlewares form a pipeline; the previous one’s output feeds into the next one. There is no “inner layer” concept.
Tool source — not a hook on the runtime path. Agent.__init__ does not call list_tools(); you opt in explicitly by collecting the tools from your middlewares and passing them into the toolkit yourself.
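The difference between the onion and transformer types can be illustrated with a small standalone sketch. It does not use AgentScope: `core`, `wrap`, `build_onion`, and `run_transformers` are hypothetical names, and the assumption that the first middleware in the list ends up outermost is made for illustration only.

```python
import asyncio
from typing import AsyncGenerator, Callable, List

Handler = Callable[..., AsyncGenerator[str, None]]


async def core(**kwargs) -> AsyncGenerator[str, None]:
    # Innermost layer: stands in for the wrapped agent step.
    yield f"core({kwargs['inputs']})"


def wrap(name: str, next_handler: Handler) -> Handler:
    # Onion: run logic before and after the inner handler and
    # observe every event it yields on the way out.
    async def handler(**kwargs) -> AsyncGenerator[str, None]:
        yield f"{name}:before"
        async for event in next_handler(**kwargs):
            yield event
        yield f"{name}:after"

    return handler


def build_onion(names: List[str]) -> Handler:
    handler: Handler = core
    # Wrap from the last name inward so the first name is outermost
    # (an assumption of this sketch).
    for name in reversed(names):
        handler = wrap(name, handler)
    return handler


def run_transformers(prompt: str, transformers) -> str:
    # Transformer: a flat pipeline, each step receives the previous output.
    for transform in transformers:
        prompt = transform(prompt)
    return prompt


async def collect(handler: Handler, **kwargs) -> List[str]:
    return [event async for event in handler(**kwargs)]


onion_events = asyncio.run(collect(build_onion(["outer", "inner"]), inputs="hi"))
final_prompt = run_transformers(
    "You are a helpful assistant.",
    [lambda p: p + " Be concise.", lambda p: p + " Cite sources."],
)
print(onion_events)
print(final_prompt)
```

In the onion case each layer sees the inner layer's events pass through it, whereas the transformer steps never call one another; they only hand a value along.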
The diagram below shows how these hooks nest within the agent lifecycle. on_system_prompt is embedded inside on_reasoning because it fires when the reasoning step assembles the system prompt; on_compress_context sits at the top of each ReAct round, before reasoning:
on_acting currently wraps only tool execution inside the agent runtime; tools dispatched outside the agent via external execution are not tracked by on_acting.
AgentScope packages a set of hooks into a class — a single middleware class can implement any subset of the 6 hook positions (plus the optional list_tools tool-provider hook) at the same time. Pass instances to Agent(middlewares=[...]) to equip them:
from agentscope.agent import Agent
from agentscope.middleware import TracingMiddleware

agent = Agent(
    name="assistant",
    system_prompt="You are a helpful assistant.",
    model=model,
    toolkit=toolkit,
    middlewares=[TracingMiddleware()],
)
At construction time the agent scans each middleware instance, checks which hooks it actually implements, and routes it into the matching position-specific execution lists. Unimplemented positions are skipped automatically with no call overhead.
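One plausible way to perform this kind of scan is to check whether a subclass overrides each hook of the base class. The sketch below is an assumption about the technique, not AgentScope's actual implementation; the `MiddlewareBase` defined here is a hypothetical stand-in with only three hooks, and `route` is an illustrative helper.

```python
class MiddlewareBase:
    """Hypothetical stand-in base class, not the AgentScope one."""

    async def on_reply(self, agent, input_kwargs, next_handler): ...

    async def on_model_call(self, agent, input_kwargs, next_handler): ...

    async def on_system_prompt(self, agent, current_prompt): ...


HOOKS = ("on_reply", "on_model_call", "on_system_prompt")


def route(middlewares):
    """Group middleware instances by the hooks they actually override."""
    routed = {hook: [] for hook in HOOKS}
    for mw in middlewares:
        for hook in HOOKS:
            # A hook counts as implemented only if the subclass replaced
            # the function inherited from the base class.
            if getattr(type(mw), hook) is not getattr(MiddlewareBase, hook):
                routed[hook].append(mw)
    return routed


class Logger(MiddlewareBase):
    async def on_reply(self, agent, input_kwargs, next_handler):
        async for item in next_handler(**input_kwargs):
            yield item


class PromptSuffix(MiddlewareBase):
    async def on_system_prompt(self, agent, current_prompt):
        return current_prompt + " Be brief."


logger, suffix = Logger(), PromptSuffix()
routed = route([logger, suffix])
print({hook: [type(m).__name__ for m in mws] for hook, mws in routed.items()})
```

Because the lists are built once, a position with no implementers is simply an empty list at run time, which is consistent with the "no call overhead" behavior described above.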
TracingMiddleware wires the full agent lifecycle to OpenTelemetry tracing. It instruments on_reply, on_model_call, and on_acting, producing hierarchical spans.

Before using it, register a TracerProvider and an OTLP exporter in the process:
from agentscope.agent import Agent
from agentscope.middleware import TracingMiddleware

agent = Agent(
    name="assistant",
    system_prompt="You are a helpful assistant.",
    model=model,
    toolkit=toolkit,
    middlewares=[TracingMiddleware()],
)
Each reply produces a nested span tree. The key attributes captured at each level are:
Agent Reply Span (from on_reply):
- Agent name, session ID, reply ID
- Input messages and the final output message
- HITL pending tool calls
- External execution pending tool calls

Model Call Span (from on_model_call):
- Model name, provider, input/output token counts
- Request and response message content
- Wraps streaming responses, writing attributes onto the final chunk

Tool Execution Span (from on_acting):
- Tool name, call ID, input arguments
- Tool execution result
When no TracerProvider is configured, every hook short-circuits directly to next_handler() — no spans are created, no attributes are computed — making the overhead negligible.
When the agent receives an ExternalExecutionResultEvent (a tool executed outside the agent), TracingMiddleware synthesizes a compensating span for each external execution result, preserving full observability for tools run by external systems.
To trace custom operations within the agent lifecycle, use the standard OpenTelemetry Python SDK directly. Obtain a tracer scoped to AgentScope and wrap any target code in a span:
from opentelemetry import trace

from agentscope import __version__

tracer = trace.get_tracer("agentscope", __version__)

with tracer.start_as_current_span(
    name="your_span_name",
    attributes={
        # Optional key-value pairs attached to the span,
        # e.g. function name, input arguments, or any custom metadata.
    },
    end_on_exit=True,
) as span:
    # your code here
    pass
These custom spans are emitted alongside AgentScope’s built-in spans and delivered to the same OTLP collector configured in the TracerProvider.
ReplyBudgetControlMiddleware enforces a weighted token budget per reply. It tracks cumulative token usage across all reasoning steps within a single reply and, once the budget is exhausted, instructs the agent to wrap up immediately without invoking any further tools. This is useful for capping the cost or latency of long, tool-heavy ReAct loops.

The weighted cost is computed on every model call from that call's input and output token counts, each multiplied by its weight (input_token_weight and output_token_weight), and added to the running total for the reply.
Once the accumulated cost reaches token_budget, the middleware:
Appends a HintBlock to the last assistant message in the agent’s context (or creates a new AssistantMsg if needed), reminding the model to produce a final concluding response.
Overrides tool_choice to ToolChoice(mode="none") for the next reasoning step, preventing any further tool calls.
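The budget check can be sketched in a few lines. This is a minimal illustration, assuming the weighted cost of a call is input tokens times input_token_weight plus output tokens times output_token_weight; `BudgetTracker` and `record` are hypothetical names, and the plain string `"none"` stands in for `ToolChoice(mode="none")`.

```python
from dataclasses import dataclass


@dataclass
class BudgetTracker:
    """Hypothetical helper that accumulates weighted token cost for one reply."""

    token_budget: int
    input_token_weight: float = 1.0
    output_token_weight: float = 1.0
    spent: float = 0.0

    def record(self, input_tokens: int, output_tokens: int) -> bool:
        """Add one model call's weighted cost; return True once exhausted."""
        self.spent += (
            input_tokens * self.input_token_weight
            + output_tokens * self.output_token_weight
        )
        # "Reaches" the budget is read here as greater than or equal.
        return self.spent >= self.token_budget


tracker = BudgetTracker(
    token_budget=10000, input_token_weight=1.0, output_token_weight=2.0
)
# First call: 3000 * 1.0 + 500 * 2.0 = 4000, still under budget.
first = tracker.record(input_tokens=3000, output_tokens=500)
# Second call: 4000 * 1.0 + 1000 * 2.0 = 6000, total 10000, budget reached.
second = tracker.record(input_tokens=4000, output_tokens=1000)

# Once exhausted, the next reasoning step would run with tools disabled.
tool_choice_mode = "none" if second else None
print(first, second, tracker.spent, tool_choice_mode)
```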
Attach it like any other middleware:
from agentscope.agent import Agent
from agentscope.middleware import ReplyBudgetControlMiddleware

agent = Agent(
    name="assistant",
    system_prompt="You are a helpful assistant.",
    model=model,
    toolkit=toolkit,
    middlewares=[
        ReplyBudgetControlMiddleware(
            token_budget=10000,
            input_token_weight=1.0,
            output_token_weight=2.0,
        ),
    ],
)
token_budget: Maximum weighted token cost allowed per reply. Once the accumulated cost reaches this threshold, the agent is instructed to wrap up without calling any more tools.
output_token_weight: Multiplier applied to output tokens when computing the weighted cost. Set this higher than input_token_weight to reflect that output tokens are typically more expensive.
The message injected into the agent’s context when the budget is exceeded. Defaults to a built-in wrap-up prompt that asks the model to provide a final concluding response without invoking any tools.
The middleware is stateless on the instance itself — all runtime state lives in agent.state.middle_context, keyed by the middleware key and the current reply_id. This means the same middleware instance can safely be shared across multiple agents, and budget state persists across human-in-the-loop (HITL) interruptions and resumptions. State is automatically cleaned up when the reply ends.
The budget is scoped per reply, not per agent lifetime. Each new reply starts with a fresh counter, so the limit applies independently to every call to agent(...).
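The following sketch illustrates why keeping state on the agent rather than on the middleware instance makes sharing safe. The nested-dict layout, the `AgentState` class, and the `add_cost` / `end_reply` helpers are all hypothetical; only the idea of keying by middleware key and reply_id comes from the description above.

```python
class AgentState:
    """Hypothetical stand-in for an agent's state container."""

    def __init__(self) -> None:
        self.middle_context = {}


def add_cost(state: AgentState, middleware_key: str, reply_id: str, cost: float) -> float:
    """Accumulate cost under (middleware_key, reply_id) and return the total."""
    per_reply = state.middle_context.setdefault(middleware_key, {})
    per_reply[reply_id] = per_reply.get(reply_id, 0.0) + cost
    return per_reply[reply_id]


def end_reply(state: AgentState, middleware_key: str, reply_id: str) -> None:
    """Drop the entry for a finished reply so nothing leaks into the next one."""
    state.middle_context.get(middleware_key, {}).pop(reply_id, None)


KEY = "reply_budget"  # illustrative middleware key
state_a, state_b = AgentState(), AgentState()

add_cost(state_a, KEY, "reply-1", 4000.0)
total_a = add_cost(state_a, KEY, "reply-1", 6000.0)  # same reply keeps adding up
total_b = add_cost(state_b, KEY, "reply-9", 500.0)   # other agent is independent

end_reply(state_a, KEY, "reply-1")
fresh = add_cost(state_a, KEY, "reply-2", 100.0)     # new reply starts from zero
print(total_a, total_b, fresh)
```

Because the counter travels with the agent's state, an interrupted reply that resumes with the same reply_id would pick up the same running total.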
TTSMiddleware intercepts the agent’s text output and synthesizes speech audio, injecting DataBlockStartEvent / DataBlockDeltaEvent / DataBlockEndEvent into the event stream alongside the text. It hooks into on_reply to observe every TextBlockDeltaEvent and TextBlockEndEvent.
The middleware automatically adapts to the TTS model’s mode:
| TTS Mode | Behavior |
| --- | --- |
| Non-realtime (realtime=False) | Accumulates text until TextBlockEndEvent, then calls synthesize(text) and emits the full audio as one data block |
| Realtime (realtime=True) | Pushes each TextBlockDeltaEvent.delta via push(), emitting audio chunks as they arrive; calls synthesize() on TextBlockEndEvent to flush remaining audio |
The output event stream differs by mode.

Non-realtime: audio follows after the text block completes:

TextBlockStartEvent
TextBlockDeltaEvent (text)
TextBlockDeltaEvent (text)
TextBlockEndEvent
DataBlockStartEvent      ← audio synthesized after text ends
DataBlockDeltaEvent (audio)
DataBlockDeltaEvent (audio)
DataBlockEndEvent

Realtime: audio chunks arrive interleaved with text as synthesis runs concurrently:

TextBlockStartEvent
TextBlockDeltaEvent (text)
DataBlockStartEvent      ← audio begins during text stream
DataBlockDeltaEvent (audio)
TextBlockDeltaEvent (text)
DataBlockDeltaEvent (audio)
TextBlockEndEvent
DataBlockDeltaEvent (audio)      ← remaining audio from synthesize()
DataBlockEndEvent
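The two stream shapes above can be reproduced with a toy model. This sketch does not call AgentScope: `FakeTTS` is a hypothetical stand-in whose "audio" is just upper-cased text, its `push()` / `synthesize()` signatures are assumptions, and events are plain `(name, payload)` tuples.

```python
from typing import Iterator, List, Optional, Tuple

Event = Tuple[str, Optional[str]]


class FakeTTS:
    """Hypothetical TTS model: 'audio' is the upper-cased text."""

    def __init__(self, realtime: bool) -> None:
        self.realtime = realtime
        self._pending = ""

    def push(self, delta: str) -> List[str]:
        # Emit audio for all but the last character and hold that one
        # back, so synthesize() has something left to flush.
        self._pending += delta
        ready, self._pending = self._pending[:-1], self._pending[-1:]
        return [ready.upper()] if ready else []

    def synthesize(self, text: Optional[str] = None) -> List[str]:
        if text is not None:  # non-realtime: whole text at once
            return [text.upper()]
        rest, self._pending = self._pending, ""
        return [rest.upper()] if rest else []


def tts_events(deltas: List[str], tts: FakeTTS) -> Iterator[Event]:
    audio_started = False
    text_parts: List[str] = []

    def emit_audio(chunks: List[str]) -> Iterator[Event]:
        nonlocal audio_started
        for chunk in chunks:
            if not audio_started:  # open the data block on first audio
                audio_started = True
                yield ("DataBlockStartEvent", None)
            yield ("DataBlockDeltaEvent", chunk)

    yield ("TextBlockStartEvent", None)
    for delta in deltas:
        yield ("TextBlockDeltaEvent", delta)
        if tts.realtime:
            yield from emit_audio(tts.push(delta))
        else:
            text_parts.append(delta)
    yield ("TextBlockEndEvent", None)
    final = tts.synthesize() if tts.realtime else tts.synthesize("".join(text_parts))
    yield from emit_audio(final)
    if audio_started:
        yield ("DataBlockEndEvent", None)


realtime_kinds = [k for k, _ in tts_events(["Hel", "lo"], FakeTTS(realtime=True))]
batch_kinds = [k for k, _ in tts_events(["Hel", "lo"], FakeTTS(realtime=False))]
print(realtime_kinds)
print(batch_kinds)
```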
Each DataBlockDeltaEvent.data carries an incremental base64-encoded audio chunk; the full audio is the concatenation of every delta’s decoded bytes, keyed by block_id.
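A consumer can rebuild the audio by decoding each delta on its own and appending the bytes under the delta's block_id. The sketch below uses plain `(block_id, data)` tuples as hypothetical stand-ins for the event fields, and the byte string is placeholder data rather than real audio.

```python
import base64
from collections import defaultdict

# Placeholder payload split into three chunks, each base64-encoded
# separately, the way incremental deltas would arrive.
raw = b"RIFF-fake-audio-bytes"
deltas = [
    ("block-1", base64.b64encode(raw[:7]).decode()),
    ("block-1", base64.b64encode(raw[7:15]).decode()),
    ("block-1", base64.b64encode(raw[15:]).decode()),
]

audio = defaultdict(bytearray)
for block_id, data in deltas:
    # Decode chunk by chunk: individually encoded chunks may carry their
    # own padding, so the base64 strings should not be joined first.
    audio[block_id] += base64.b64decode(data)

full_audio = bytes(audio["block-1"])
print(len(full_audio))
```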
AgentScope implements long-term memory as middleware, so an agent can persist and recall durable facts across sessions. A memory backend hooks on_reply (pre-reply search + post-reply write-back), on_system_prompt (advertise memory tools), and contributes agent-callable memory tools (such as search_memory / add_memory) via list_tools. The currently available backend is Mem0Middleware, powered by mem0. See Long-Term Memory for installation, control modes, and construction paths.
Subclass MiddlewareBase and implement only the hooks you need; leave the rest alone.

The example below implements on_reply, on_reasoning, on_model_call, on_compress_context, on_system_prompt, and list_tools in a single middleware (on_acting is not shown). Each onion hook receives an input_kwargs dict carrying the fields that flow into the wrapped layer; forward it with next_handler(**input_kwargs), or pass keyword arguments to override specific fields:
from typing import AsyncGenerator, Awaitable, Callable

from agentscope.agent import Agent
from agentscope.event import AgentEvent
from agentscope.message import Msg
from agentscope.middleware import MiddlewareBase
from agentscope.model import ChatResponse
from agentscope.tool import ToolBase


class FullObservabilityMiddleware(MiddlewareBase):
    """Observe every middleware position at once, plus contribute a tool."""

    async def on_reply(
        self,
        agent: Agent,
        # {"inputs": Msg | list[Msg] | UserConfirmResultEvent | ExternalExecutionResultEvent | None}
        input_kwargs: dict,
        next_handler: Callable[..., AsyncGenerator[AgentEvent | Msg, None]],
    ) -> AsyncGenerator[AgentEvent | Msg, None]:
        print(f"[reply] start for {agent.name}")
        async for item in next_handler(**input_kwargs):
            yield item
        print(f"[reply] end for {agent.name}")

    async def on_reasoning(
        self,
        agent: Agent,
        # {"tool_choice": ToolChoice | None}
        input_kwargs: dict,
        next_handler: Callable[..., AsyncGenerator[AgentEvent, None]],
    ) -> AsyncGenerator[AgentEvent, None]:
        print("[reasoning] start")
        async for event in next_handler(**input_kwargs):
            yield event
        print("[reasoning] end")

    async def on_model_call(
        self,
        agent: Agent,
        # {"messages": list[Msg], "tools": list[dict], "tool_choice": ToolChoice | None, "current_model": ChatModelBase}
        input_kwargs: dict,
        next_handler: Callable[
            ..., Awaitable[ChatResponse | AsyncGenerator[ChatResponse, None]]
        ],
    ) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
        print(f"[model_call] {input_kwargs['current_model'].model}")
        result = await next_handler(**input_kwargs)
        print("[model_call] done")
        return result

    async def on_compress_context(
        self,
        agent: Agent,
        # {"context_config": ContextConfig | None}
        input_kwargs: dict,
        next_handler: Callable[..., Awaitable[None]],
    ) -> None:
        print(f"[compress_context] checking context for {agent.name}")
        await next_handler(**input_kwargs)
        print("[compress_context] done")

    async def on_system_prompt(
        self,
        agent: Agent,
        current_prompt: str,
    ) -> str:
        print(f"[system_prompt] length={len(current_prompt)}")
        return current_prompt

    async def list_tools(self) -> list[ToolBase]:
        # Optional hook. Not invoked automatically by ``Agent.__init__``;
        # if you want these tools available to the agent, collect them
        # from your middlewares yourself and pass them into the toolkit.
        return []
The overall execution order of all hooks within a single reply follows the agent lifecycle:
on_reply
└── per ReAct round:
    ├── on_compress_context → compress_context()
    │   └── on_system_prompt (token counting before compression)
    ├── on_reasoning
    │   ├── _prepare_model_input() → on_system_prompt
    │   └── on_model_call
    └── on_acting (once per tool call in this round)
list_tools is not part of the per-reply execution path and is not invoked automatically by the agent — it is a convenience interface so a middleware can advertise its own tools. The caller assembling the toolkit decides whether to collect them.