All chat model classes share a unified `__call__` interface. The input to `__call__` is the formatted messages (the result of applying a formatter to `Msg` objects), which matches the exact format expected by the underlying API provider.

Method signature:
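The exact signature varies slightly by provider; the sketch below is illustrative (parameter names are assumptions, not the verbatim AgentScope definition), following the call pattern described above:

```python
async def __call__(
    self,
    messages: list[dict],  # formatted messages produced by a Formatter
    **kwargs,              # provider-specific options (e.g. tools, temperature)
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
    ...
```

With `stream=False` the call resolves to a single `ChatResponse`; with `stream=True` it yields an async generator of `ChatResponse` chunks, as detailed in the streaming section below.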
In AgentScope, agents communicate by passing `Msg` objects. When calling a model directly (outside an agent), the typical workflow is:
1. Build `Msg` objects with `name`, `role`, and `content` (text or content blocks)
2. Use a `Formatter` to convert `[Msg]` into the provider-specific message format
3. Call the `ChatModel` with the formatted messages to get a `ChatResponse`
When using an agent (e.g., ReActAgent), steps 2 and 3 are handled automatically: the agent internally manages the Msg → Formatter → Model → ChatResponse pipeline.

Example workflow:
```python
import asyncio
import os

from agentscope.formatter import DashScopeChatFormatter
from agentscope.model import DashScopeChatModel
from agentscope.message import Msg


async def example_model_call():
    # Step 1: Create model and formatter
    model = DashScopeChatModel(
        model_name="qwen-max",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        stream=False,
    )
    formatter = DashScopeChatFormatter()

    # Step 2: Build Msg objects
    user_msg = Msg(name="user", content="Hi!", role="user")

    # Step 3: Format messages (convert to provider-specific format)
    formatted_messages = await formatter.format([user_msg])

    # Step 4: Call model with formatted messages and get ChatResponse
    res = await model(formatted_messages)
    print("Response:", res.content)
    print("Usage:", res.usage)


asyncio.run(example_model_call())
```
The key point: ChatModel accepts formatted messages (the output of a formatter), not raw Msg objects. This design allows each model to receive input in its native API format. The model returns a ChatResponse object containing the generated content and usage information.
To enable streaming, set stream=True in the constructor. When streaming is enabled, __call__ returns an async generator that yields ChatResponse instances.
Streaming in AgentScope is accumulative — each chunk contains all previous content plus newly generated content, not just the delta. This simplifies consumption since you always have the complete current state without tracking deltas.
```python
async def example_streaming():
    model = DashScopeChatModel(
        model_name="qwen-max",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        stream=True,
    )
    formatter = DashScopeChatFormatter()

    user_msg = Msg(name="user", content="Count from 1 to 5.", role="user")
    formatted_messages = await formatter.format([user_msg])

    # Get async generator
    generator = await model(formatted_messages)

    # Iterate through chunks (each contains accumulated content)
    async for chunk in generator:
        print(chunk.content)  # Accumulated content up to this point


asyncio.run(example_streaming())
```
Example output (each line shows accumulative text):
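(Illustrative for the counting prompt above; the actual wording varies by model.)

```
1
1, 2
1, 2, 3
1, 2, 3, 4
1, 2, 3, 4, 5
```

Note how each chunk repeats all earlier text rather than carrying only the new delta.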
AgentScope supports reasoning models (chain-of-thought) via ThinkingBlock. When `enable_thinking=True`, the model's response includes both the thinking process and the final answer.
```python
async def example_reasoning():
    model = DashScopeChatModel(
        model_name="qwen-turbo",
        api_key=os.environ["DASHSCOPE_API_KEY"],
        enable_thinking=True,  # Enable reasoning
        stream=True,
    )
    formatter = DashScopeChatFormatter()

    user_msg = Msg(name="user", content="What is 17 * 23?", role="user")
    formatted_messages = await formatter.format([user_msg])

    res = await model(formatted_messages)

    # Collect final chunk
    last_chunk = None
    async for chunk in res:
        last_chunk = chunk

    # Response contains both ThinkingBlock and TextBlock
    for block in last_chunk.content:
        block_type = block["type"]
        content = block.get("thinking") or block.get("text")
        print(f"[{block_type}] {content[:80]}...")


asyncio.run(example_reasoning())
```
The thinking content is streamed alongside text content in accumulative mode.
AgentScope provides a unified tools interface across all providers. Tools are defined using a standardized JSON schema format and passed to the model via the tools parameter.
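As a sketch, a tool definition in this JSON-schema style might look like the following. The `get_weather` function, its parameters, and the exact nesting are illustrative assumptions for this example, not part of AgentScope:

```python
# An illustrative tool definition in JSON-schema style.
# The "get_weather" function and its parameters are made up for this example.
get_weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Query the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

# The schema would then be passed to the model alongside the formatted
# messages, e.g.:
# res = await model(formatted_messages, tools=[get_weather_tool])
```

Because the schema is plain JSON, the same tool definition can be reused across providers; AgentScope translates it to each provider's native tool format internally.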
AgentScope provides a token counter module under agentscope.token to estimate the number of tokens in a set of messages before sending them to a model. This is useful for managing context window budgets and implementing prompt truncation strategies.
The formatter module integrates token counters to support automatic prompt truncation. When a token budget is configured, the formatter uses the corresponding counter to trim messages before they are sent to the model.
Supported providers:

| Provider | Class | Image Data | Tools |
|---|---|---|---|
| Anthropic | `AnthropicTokenCounter` | ✅ | ✅ |
| OpenAI | `OpenAITokenCounter` | ✅ | ✅ |
| Gemini | `GeminiTokenCounter` | ✅ | ✅ |
| HuggingFace | `HuggingFaceTokenCounter` | Depends on the model | Depends on the model |
DashScope does not provide a token-counting API. For DashScope (Qwen) models, use HuggingFaceTokenCounter with the corresponding Qwen tokenizer instead.
```python
import asyncio

from agentscope.token import OpenAITokenCounter


async def example_token_counting():
    messages = [
        {"role": "user", "content": "Hello!"},
        {"role": "assistant", "content": "Hi, how can I help you?"},
    ]
    counter = OpenAITokenCounter(model_name="gpt-4.1")
    n_tokens = await counter.count(messages)
    print(f"Number of tokens: {n_tokens}")


asyncio.run(example_token_counting())
```
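The budget-and-trim behavior described for the formatter module can be sketched in isolation. The following is a simplified illustration of the mechanism, not AgentScope's implementation, and a naive whitespace word count stands in for a real token counter:

```python
def truncate_messages(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest messages until the estimated token count fits the budget.

    A naive whitespace word count stands in for a real token counter here.
    """
    def estimate(msgs: list[dict]) -> int:
        return sum(len(m["content"].split()) for m in msgs)

    trimmed = list(messages)
    while trimmed and estimate(trimmed) > max_tokens:
        trimmed.pop(0)  # discard the oldest message first
    return trimmed


history = [
    {"role": "user", "content": "first question with several extra words here"},
    {"role": "assistant", "content": "a fairly long answer to the first question"},
    {"role": "user", "content": "second question"},
]
print(truncate_messages(history, max_tokens=10))
```

AgentScope's formatters use the provider-specific counters above instead of a word count, but the idea is the same: estimate, compare against the budget, and trim before the messages reach the model.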
Non-realtime TTS models require complete text before synthesis. The core method is synthesize(), which accepts a Msg object and returns a TTSResponse containing audio data.
```python
async def synthesize(self, msg: Msg) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
    """
    Synthesize speech from text.

    Args:
        msg: A Msg object containing text content

    Returns:
        - TTSResponse: when stream=False (complete audio)
        - AsyncGenerator[TTSResponse, None]: when stream=True (audio chunks)
    """
```
Basic usage:
```python
import asyncio
import os

from agentscope.tts import DashScopeTTSModel
from agentscope.message import Msg


async def example_non_realtime_tts():
    tts_model = DashScopeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash",
        voice="Cherry",
        stream=False,
    )

    msg = Msg(name="assistant", content="Hello, this is a TTS demo.", role="assistant")
    tts_response = await tts_model.synthesize(msg)

    # tts_response.content contains an AudioBlock with base64-encoded audio
    print("Audio data length:", len(tts_response.content["source"]["data"]))


asyncio.run(example_non_realtime_tts())
```
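If the audio needs to be persisted, the base64 payload can be decoded and written to disk. A minimal sketch, assuming the AudioBlock content layout shown above; the file extension depends on the audio format the model actually returns:

```python
import base64


def save_audio(audio_b64: str, path: str) -> int:
    """Decode a base64 audio payload and write the raw bytes to disk.

    Returns the number of bytes written.
    """
    audio_bytes = base64.b64decode(audio_b64)
    with open(path, "wb") as f:
        f.write(audio_bytes)
    return len(audio_bytes)


# Usage with the response above (the .wav extension is an assumption;
# check the format your TTS model returns):
# save_audio(tts_response.content["source"]["data"], "output.wav")
```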
Realtime TTS models accept streaming text input: text chunks can be fed incrementally as they become available (e.g., from a streaming chat model), which enables the lowest possible latency.

Core methods:
```python
async def push(self, msg: Msg) -> TTSResponse:
    """
    Non-blocking. Submit a text chunk and return any audio received so far.
    """

async def synthesize(self, msg: Msg) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
    """
    Blocking. Finalize the session and return all remaining audio.
    """
```
Key concepts:
- Stateful processing: only one streaming session can be active at a time, identified by `msg.id`
- Incremental input: use `push()` to submit text chunks as they arrive
- Finalization: use `synthesize()` to complete the session and get the remaining audio
Usage example:
```python
import asyncio
import os

from agentscope.tts import DashScopeRealtimeTTSModel


async def example_realtime_tts():
    tts_model = DashScopeRealtimeTTSModel(
        api_key=os.environ.get("DASHSCOPE_API_KEY", ""),
        model_name="qwen3-tts-flash-realtime",
        voice="Cherry",
        stream=False,
    )

    async with tts_model:
        # Push accumulative text chunks (non-blocking)
        res = await tts_model.push(msg_chunk_1)
        res = await tts_model.push(msg_chunk_2)
        # ...

        # Finalize and get all remaining audio (blocking)
        res = await tts_model.synthesize(final_msg)
```
Integration with an agent: AgentScope agents can automatically synthesize speech when provided with a TTS model; the agent handles the streaming text → TTS pipeline internally.
Realtime models provide bidirectional, persistent communication over WebSocket, designed primarily for voice agent scenarios where the user speaks and the model responds with speech in real-time.
- Tool support: some providers support function calling in realtime (e.g., OpenAI, Gemini)
The key difference from traditional chat models is that realtime models handle the entire voice interaction pipeline (ASR + LLM + TTS) in a single, optimized connection, minimizing latency.
AgentScope provides RealtimeAgent to work with realtime models. The agent handles the WebSocket connection, audio streaming, and message exchange automatically.
Embedding models generate vector representations for text, images, and other data types. These embeddings are used for retrieval, similarity search, and as input features for downstream tasks.
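To make the retrieval use concrete, here is a provider-independent sketch of similarity search over embedding vectors. The vectors below are made up for illustration; real embedding models return vectors with hundreds or thousands of dimensions:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" keyed by document name
corpus = {
    "cats": [0.9, 0.1, 0.0],
    "dogs": [0.8, 0.2, 0.1],
    "stocks": [0.0, 0.1, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of the user's query

# Retrieve the document whose embedding is most similar to the query
best = max(corpus, key=lambda k: cosine_similarity(query, corpus[k]))
print(best)
```

In practice the corpus embeddings come from an embedding model and are stored in a vector index; the ranking step is the same cosine comparison shown here.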
For image and video inputs, use ImageBlock with URLSource (for publicly accessible URLs) or Base64Source (for base64-encoded data); plain text is the simplest input case.
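As a sketch, the two source shapes can be written as plain dicts. The field names below are assumptions inferred from the URLSource/Base64Source naming and should be checked against the definitions in `agentscope.message`:

```python
# Illustrative ImageBlock payloads; field names are assumed from the
# URLSource / Base64Source descriptions above, not copied from the library.
url_image = {
    "type": "image",
    "source": {
        "type": "url",
        "url": "https://example.com/photo.png",  # publicly accessible URL
    },
}

base64_image = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",  # assumed field for the MIME type
        "data": "<base64-encoded bytes>",
    },
}
```

Either block can then be placed in a `Msg`'s content list alongside text blocks.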