Overview
The model layer is organized as a two-tier hierarchy: a Credential at the top, and the model families a provider exposes beneath it — Chat Model, TTS, Embedding, and Realtime Model.Credential
ChatModelBase
OpenAIChatModel
OpenAIResponseModel
AnthropicChatModel
DashScopeChatModel
DeepSeekChatModel
GeminiChatModel
MoonshotChatModel
XAIChatModel
OllamaChatModel
TTSModelBase
DashScopeTTSModel
DashScopeRealtimeTTSModel
DashScopeCosyVoiceRealtimeTTSModel
EmbeddingModelBase
DashScopeEmbeddingModel
OpenAIEmbeddingModel
GeminiEmbeddingModel
OllamaEmbeddingModel
RealtimeModelBase (coming soon)
api_key, base_url, …). From a credential, you can retrieve the list of available models for each model family that provider supports.
This layering mirrors the natural frontend flow — register a credential first, then pick a model from under it — letting the UI authenticate once and surface every model family the provider supports.
Chat Model
A chat model is the LLM that drives an agent’s conversation and tool calls, accepting and producing multimodal content beyond plain text. AgentScope currently ships the following chat model classes:| Provider | Model Class |
|---|---|
| OpenAI | OpenAIChatModel |
| OpenAI (Responses API) | OpenAIResponseModel |
| Anthropic | AnthropicChatModel |
| DashScope | DashScopeChatModel |
| DeepSeek | DeepSeekChatModel |
| Gemini | GeminiChatModel |
| Moonshot | MoonshotChatModel |
| xAI | XAIChatModel |
| Ollama | OllamaChatModel |
Create Chat Model
Every chat model takes a credential, a model name, and an optional provider-specificParameters object. The three tabs below show typical setups for streaming, tool calling, and reasoning:
| Argument | Type | Description |
|---|---|---|
credential | CredentialBase | Provider-specific credential |
model | str | Model identifier (e.g. "qwen-plus") |
parameters | Parameters | None | Provider-specific parameters such as temperature, thinking_enable, parallel_tool_calls |
stream | bool | Whether to stream output |
max_retries | int | Maximum API retries on failure |
context_size | int | Context window used for context compression |
formatter | FormatterBase | None | Override message formatter |
Call Chat Model
Invoke the model by calling it with a list ofMsg objects, plus optional tools and tool_choice:
stream setting:
stream=False— awaits a singleChatResponsecarrying the full output.stream=True— awaits anAsyncGenerator[ChatResponse, None]. Intermediate chunks (is_last=False) carry only the delta generated in that step. So that callers don’t have to accumulate deltas themselves, AgentScope appends one final chunk withis_last=Truethat carries the full accumulated content.
ChatResponse carries content blocks (TextBlock, ThinkingBlock, ToolCallBlock, DataBlock), an is_last flag, and a ChatUsage recording token counts and elapsed time.
Generate Structured Output
When you need a JSON object that conforms to a Pydantic model or JSON schema, callgenerate_structured_output instead of __call__. It returns a StructuredResponse whose content is a validated dict matching the schema:
generate_structured_output synthesizes a forced tool call from the schema, then validates and repairs the model’s response.Formatter
A formatter translates AgentScope’sMsg objects into the list[dict] payload that each provider’s API expects. It is configured via the optional formatter argument on the chat model constructor. Every provider ships two built-in variants:
| Variant | Use Case |
|---|---|
| ChatFormatter (default) | Standard single-agent dialog. Each Msg maps 1:1 to an API message, preserving native roles (user, assistant, system). |
| MultiAgentFormatter | Multi-agent scenarios such as debate or moderation. Consecutive agent messages are grouped and wrapped in <history> tags with the sender’s name, while tool call / result sequences keep their native API format. |
FormatterBase and pass an instance through the same formatter argument.
Custom Provider
You can extend AgentScope with your own model provider by implementing a credential and a chat model, then registering the credential.Step 1: Define the Credential
SubclassCredentialBase with a unique type discriminator and implement get_chat_model_class():
Step 2: Implement the Chat Model
SubclassChatModelBase, define a Parameters inner class, and implement _call_api:
Step 3: Add Model Cards (optional)
Drop YAML files into a_models/ directory next to your model implementation. Each file describes one model — its capabilities (input_types, output_types), limits (context_size, output_size), and any per-model parameter_overrides:
MyProviderChatModel.list_models() then loads every YAML in that directory. To pull cards from a different location — for example, a registry your application maintains separately — pass custom_yaml_dir:
Integrate with Frontend
What is ModelCard
ModelCard is a declarative description of a model’s capabilities and constraints, designed to drive the frontend — model selectors, parameter forms, and feature toggles can be rendered dynamically without hardcoding any provider-specific knowledge.
Each ModelCard contains:
| Field | Type | Description |
|---|---|---|
name | str | Model identifier (e.g. "claude-sonnet-4-6") |
label | str | Human-readable display name (e.g. "Claude Sonnet 4.6") |
status | "active" | "deprecated" | "sunset" | Model lifecycle status |
input_types | list[str] | Accepted input MIME types — used by the frontend to filter attachment uploads (e.g. only show an image button when image/* is supported) |
output_types | list[str] | Output MIME types the model can produce — advertises capabilities such as a thinking toggle when application/x-thinking is present |
context_size | int | Maximum context window in tokens |
output_size | int | Maximum output tokens |
parameter_schema | dict | Final JSON Schema for the parameter form — base schema merged with per-model overrides (see below) |
parameters_overrides | dict[str, dict] | The raw per-model overrides, before merging |
input_types and output_types use MIME types to describe modality. Common values:
| MIME Type | Meaning |
|---|---|
text/plain | Text |
application/x-thinking | Reasoning / chain-of-thought |
image/* (e.g. image/png, image/jpeg) | Image |
audio/* (e.g. audio/wav, audio/mp3) | Audio |
video/* (e.g. video/mp4) | Video |
claude-sonnet-4-6:
Parameter schema and overrides
Theparameter_schema exposed to the frontend is built in two layers:
- Base schema — auto-derived from the chat model’s
Parametersclass viamodel_json_schema(). This lists every adjustable parameter (temperature,max_tokens,thinking_enable, …) along with its type and the API-wide range. - Per-model overrides — the YAML’s
parameter_overridesblock is merged on top, field by field.
max_tokens, but each one has a different ceiling. Overrides let a card tighten a range, pin a default, or hide a parameter that doesn’t apply.
| Override syntax | Effect |
|---|---|
param: { ... } | Shallow-merge into the base field (e.g. max_tokens: {maximum: 16384}) |
param: { hidden: true } | Hide the parameter from the frontend |
param: null | Remove the parameter entirely |
Retrieve ModelCards
You retrieve model cards by callinglist_models() on either the credential class or the model class. Internally, CredentialBase.list_models() delegates to its linked ChatModelBase subclass (obtained via get_chat_model_class()), which loads YAML card definitions from its _models/ directory.
get_chat_model_class() returns the corresponding ChatModelBase subclass, which in turn knows where to find its model card YAML files:
TTS
A TTS Model converts text into synthesized speech audio, supporting both standard and realtime (streaming-input) synthesis modes. AgentScope currently ships the following TTS model classes:| Provider | Model Class | Highlights |
|---|---|---|
| DashScope | DashScopeTTSModel | Qwen3-TTS, multiple voices, streaming output |
| DashScope (Realtime) | DashScopeRealtimeTTSModel | Qwen3-TTS WebSocket streaming input, ideal for LLM output piping |
| DashScope (CosyVoice Realtime) | DashScopeCosyVoiceRealtimeTTSModel | CosyVoice-v3 streaming input, supports cosyvoice-v3-plus/flash/sambert |
Create TTS Model
Every TTS model takes a credential, a model name, and an optional provider-specificParameters object. The two tabs below show the standard and realtime setups:
| Argument | Type | Description |
|---|---|---|
credential | CredentialBase | Provider-specific credential |
model | str | Model identifier (e.g. "qwen3-tts-flash") |
parameters | Parameters | None | Provider-specific parameters such as voice |
stream | bool | Whether to stream audio output |
DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel:
| Argument | Type | Default | Description |
|---|---|---|---|
cold_start_length | int | None | None | Minimum character count before first text chunk is sent to the API |
cold_start_words | int | None | None | Minimum word count before first text chunk is sent |
max_retries | int | 3 | Maximum retry attempts on WebSocket failure |
retry_delay | float | 5.0 | Initial retry delay in seconds (exponential backoff) |
Call TTS Model
Invoke the model by callingsynthesize() with the text to speak:
stream setting:
stream=False— returns a singleTTSResponsewith the complete audio.stream=True— returns anAsyncGenerator[TTSResponse, None]. Each chunk carries an incremental audio delta; the final chunk hasis_last=True.
TTSResponse carries:
| Field | Type | Description |
|---|---|---|
content | DataBlock | None | Audio data. Format indicated by content.source.media_type (e.g. "audio/wav", "audio/pcm;rate=24000") |
is_last | bool | True on the final streaming chunk |
usage | TTSUsage | None | Token counts (input_tokens, output_tokens) and elapsed time in seconds |
id | str | Auto-generated unique identifier |
metadata | dict | None | Optional provider-specific metadata |
Realtime TTS (Streaming Input)
For realtime models (DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel), text can be pushed incrementally as it arrives from a streaming LLM. Both share the same push() / synthesize() interface. The lifecycle is managed via async with or manual connect() / close():
DashScopeRealtimeTTSModel (Qwen3) produces audio at token-level granularity — each push() call typically returns audio data. In contrast, DashScopeCosyVoiceRealtimeTTSModel relies on the CosyVoice server which automatically segments text into sentences before synthesizing. Audio is only returned after a complete sentence boundary is detected, so push() may return empty responses for partial sentences. Calling synthesize() forces synthesis of all remaining text including incomplete sentences.| Method | Description |
|---|---|
connect() | Open WebSocket connection |
push(text) | Append text incrementally (non-blocking), returns audio accumulated so far |
synthesize() | Finalize and return remaining audio |
close() | Tear down connection |
Integrate with Agent
In the agent layer, TTS is integrated viaTTSMiddleware — it intercepts the agent’s text output and synthesizes speech automatically:
| TTS Mode | Middleware Behavior |
|---|---|
| Non-realtime | Waits for full text, then synthesizes all at once |
| Realtime | Pushes text deltas as they arrive, streams audio back concurrently |
TTS Model Card
TTSModelCard describes a TTS model’s capabilities — available voices, streaming support, and parameter ranges — and is used to drive the frontend model picker. Each card is defined by a YAML file alongside the model implementation:
Qwen3 TTS
CosyVoice Realtime
voices list is automatically injected into the parameter_schema as an enum constraint on the voice field, so the frontend renders a dropdown selector.
| Field | Type | Description |
|---|---|---|
name | str | Model identifier (e.g. "qwen3-tts-flash") |
label | str | Display name (e.g. "Qwen3-TTS-Flash") |
status | str | "active", "deprecated", or "sunset" |
realtime | bool | Whether model supports streaming input |
input_types | list[str] | Accepted input MIME types (always ["text/plain"]) |
output_types | list[str] | Output MIME types (typically ["audio/wav"]) |
parameter_schema | dict | Merged JSON Schema for the parameter form — base schema from Parameters class, enriched with voices enum from YAML |
parameters_overrides | dict | Per-model overrides (same syntax as chat model cards) |
Custom TTS Provider
To add a new TTS provider, implement aTTSModelBase subclass and register it on the credential:
Embedding
An Embedding Model converts text — and, for multimodal models, images, videos, and other media — into dense vectors that power semantic search, RAG, and memory retrieval. AgentScope currently ships the following embedding model classes:| Provider | Model Class | Highlights |
|---|---|---|
| DashScope | DashScopeEmbeddingModel | Unified text + multimodal API (text-embedding-v4, qwen3-vl-embedding, …), content-aware batching |
| OpenAI | OpenAIEmbeddingModel | text-embedding-3-small/large, compatible with OpenAI-compatible endpoints |
| Gemini | GeminiEmbeddingModel | Text (gemini-embedding-001) and multimodal (gemini-embedding-2, image / video / audio / PDF) |
| Ollama | OllamaEmbeddingModel | Local embedding models (nomic-embed-text, …), credential carries the host URL |
Create Embedding Model
Every embedding model takes a credential, a model name, and an optionalParameters object — the same pattern as chat models. Parameters carries dimensions, the output vector size:
| Argument | Type | Description |
|---|---|---|
credential | CredentialBase | Provider-specific credential |
model | str | Model identifier (e.g. "text-embedding-v4") |
parameters | Parameters | None | dimensions — the output vector size (default 512) |
embedding_cache | EmbeddingCacheBase | None | Optional cache that skips repeated API calls (see below) |
context_size | int | Maximum input tokens per item |
max_retries | int | Maximum retries per batch on retryable failures |
retry_delay | float | Seconds between retry attempts |
Valid
dimensions values differ per model — each model card pins the supported enum and default via parameter_overrides (e.g. text-embedding-v4 accepts 2048 / 1536 / 1024 / … / 64). See EmbeddingModelCard.Call Embedding Model
Invoke the model by calling it with a list of inputs. Text-only models acceptlist[str]; multimodal models also accept DataBlock elements:
- Inputs are split into chunks of the model’s batch size (10 for DashScope text, 2048 for OpenAI, 100 for Gemini, 512 for Ollama).
- All chunks are dispatched concurrently via
asyncio.gather. - Each chunk is retried independently up to
max_retriestimes on provider-specific retryable errors. - Results are merged into a single
EmbeddingResponse, preserving input order.
EmbeddingResponse carries:
| Field | Type | Description |
|---|---|---|
embeddings | list[Embedding] | One vector per input, in input order |
usage | EmbeddingUsage | None | tokens consumed and time elapsed in seconds |
source | "api" | "cache" | Whether the result came from the API or the cache |
id / created_at / type | str | Response identity and timestamp; type is always "embedding" |
Multimodal Embedding
Multimodal models (DashScopeEmbeddingModel with qwen3-vl-embedding etc., GeminiEmbeddingModel with gemini-embedding-2) accept DataBlock inputs alongside strings — images as URL or base64, videos as URL:
Multimodal models replace the plain batch-size split with content-aware batching: inputs are greedily packed into batches that respect the model’s per-request limits on total elements, images, and videos (e.g.
qwen3-vl-embedding allows 20 elements / 5 images / 1 video per request, tongyi-embedding-vision-plus allows 20 / 64 / 8). You never need to split inputs yourself.Embedding Cache
Pass anEmbeddingCacheBase implementation through the embedding_cache argument to reuse previously computed vectors. The built-in FileEmbeddingCache stores each result as a .npy file keyed by the SHA-256 hash of the request:
max_file_number or max_cache_size is exceeded, the oldest files are evicted first. To use a different backend (Redis, SQLite, …), subclass EmbeddingCacheBase and implement its four methods: store, retrieve, remove, and clear.
Custom Embedding Provider
Adding an embedding provider follows the same steps as a chat provider.Step 1: Link the Credential
Overrideget_embedding_model_class() on your credential (the base implementation returns None, meaning “no embedding support”):
Step 2: Implement the Embedding Model
SubclassEmbeddingModelBase and implement _call_api for a single batch — batching, concurrency, and retry are inherited from the base class. Declare provider-specific transient errors via _get_retryable_exceptions:
EmbeddingModelBase[str] for text-only, EmbeddingModelBase[str | DataBlock] for multimodal — IDEs then surface the correct inputs type to callers.
Step 3: Add Model Cards (optional)
Drop YAML files into a_models/ directory next to your implementation; MyProviderEmbeddingModel.list_models() then picks them up — exactly like chat model cards.
EmbeddingModelCard
EmbeddingModelCard mirrors ModelCard for the frontend, with embedding-specific defaults — the output type application/x-embedding marks a model as producing dense vectors:
| Field | Difference from ModelCard |
|---|---|
type | Always "embedding_model" |
input_types | Defaults to ["text/plain"]; multimodal cards add image/*, video/*, … |
output_types | Defaults to ["application/x-embedding"] |
parameter_schema | Built from the embedding Parameters class (dimensions) merged with YAML parameter_overrides — same override semantics as chat cards |
output_size | Not present — embedding models have no output token limit |
get_embedding_model_class():