Model - AgentScope

Overview

The model layer is organized as a two-tier hierarchy: a Credential at the top, and the model families a provider exposes beneath it — Chat Model, TTS, Embedding, and Realtime Model.

Credential

ChatModelBase

OpenAIChatModel

OpenAIResponseModel

AnthropicChatModel

DashScopeChatModel

DeepSeekChatModel

GeminiChatModel

MoonshotChatModel

XAIChatModel

OllamaChatModel

TTSModelBase

DashScopeTTSModel

DashScopeRealtimeTTSModel

DashScopeCosyVoiceRealtimeTTSModel

EmbeddingModelBase

DashScopeEmbeddingModel

OpenAIEmbeddingModel

GeminiEmbeddingModel

OllamaEmbeddingModel

RealtimeModelBase (coming soon)

A Credential carries the API authentication fields a provider requires (api_key, base_url, …). From a credential, you can retrieve the list of available models for each model family that provider supports. This layering mirrors the natural frontend flow — register a credential first, then pick a model from under it — letting the UI authenticate once and surface every model family the provider supports.

Chat Model

A chat model is the LLM that drives an agent’s conversation and tool calls, accepting and producing multimodal content beyond plain text. AgentScope currently ships the following chat model classes:

Provider	Model Class
OpenAI	`OpenAIChatModel`
OpenAI (Responses API)	`OpenAIResponseModel`
Anthropic	`AnthropicChatModel`
DashScope	`DashScopeChatModel`
DeepSeek	`DeepSeekChatModel`
Gemini	`GeminiChatModel`
Moonshot	`MoonshotChatModel`
xAI	`XAIChatModel`
Ollama	`OllamaChatModel`

Create Chat Model

Every chat model takes a credential, a model name, and an optional provider-specific Parameters object. The three tabs below show typical setups for streaming, tool calling, and reasoning:

import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential

model = DashScopeChatModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="qwen-plus",
    stream=True,
)

Common constructor arguments shared by every chat model:

Argument	Type	Description
`credential`	`CredentialBase`	Provider-specific credential
`model`	`str`	Model identifier (e.g. `"qwen-plus"`)
`parameters`	`Parameters \| None`	Provider-specific parameters such as `temperature`, `thinking_enable`, `parallel_tool_calls`
`stream`	`bool`	Whether to stream output
`max_retries`	`int`	Maximum API retries on failure
`context_size`	`int`	Context window used for context compression
`formatter`	`FormatterBase \| None`	Override message formatter

Call Chat Model

Invoke the model by calling it with a list of Msg objects, plus optional tools and tool_choice:

async def __call__(
    self,
    messages: list[Msg],
    tools: list[dict] | None = None,
    tool_choice: ToolChoice | None = None,
    **kwargs: Any,
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:

The return type follows the model’s stream setting:

stream=False — awaits a single ChatResponse carrying the full output.
stream=True — awaits an AsyncGenerator[ChatResponse, None]. Intermediate chunks (is_last=False) carry only the delta generated in that step. So that callers don’t have to accumulate deltas themselves, AgentScope appends one final chunk with is_last=True that carries the full accumulated content.

import asyncio
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=True,
    )
    msgs = [UserMsg(name="user", content="Count from 1 to 5.")]

    async for chunk in await model(msgs):
        if chunk.is_last:
            print("Final:", chunk.content)   # full accumulated content
        else:
            print("Delta:", chunk.content)   # delta only

asyncio.run(main())

A representative streaming trace, illustrating the delta-then-accumulated pattern:

Delta: [TextBlock(text='1')]
Delta: [TextBlock(text=', 2,')]
Delta: [TextBlock(text=' 3, ')]
Delta: [TextBlock(text='4, 5')]
Final: [TextBlock(text='1, 2, 3, 4, 5')]

Each ChatResponse carries content blocks (TextBlock, ThinkingBlock, ToolCallBlock, DataBlock), an is_last flag, and a ChatUsage recording token counts and elapsed time.

Generate Structured Output

When you need a JSON object that conforms to a Pydantic model or JSON schema, call generate_structured_output instead of __call__. It returns a StructuredResponse whose content is a validated dict matching the schema:

import asyncio
import os
from pydantic import BaseModel
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

class WeatherInfo(BaseModel):
    city: str
    temperature: float
    unit: str

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=False,
    )
    response = await model.generate_structured_output(
        messages=[UserMsg(name="user", content="What's the weather in Shanghai?")],
        structured_model=WeatherInfo,
    )
    print(response.content)  # validated dict matching WeatherInfo

asyncio.run(main())

generate_structured_output synthesizes a forced tool call from the schema, then validates and repairs the model’s response.

Formatter

A formatter translates AgentScope’s Msg objects into the list[dict] payload that each provider’s API expects. It is configured via the optional formatter argument on the chat model constructor. Every provider ships two built-in variants:

Variant	Use Case
ChatFormatter (default)	Standard single-agent dialog. Each `Msg` maps 1:1 to an API message, preserving native roles (`user`, `assistant`, `system`).
MultiAgentFormatter	Multi-agent scenarios such as debate or moderation. Consecutive agent messages are grouped and wrapped in `<history>` tags with the sender’s name, while tool call / result sequences keep their native API format.

Switch to multi-agent mode by passing the MultiAgent variant — no agent code changes are required:

import os
from agentscope.model import OpenAIChatModel
from agentscope.credential import OpenAICredential
from agentscope.formatter import OpenAIMultiAgentFormatter

model = OpenAIChatModel(
    credential=OpenAICredential(api_key=os.environ["OPENAI_API_KEY"]),
    model="gpt-4.1",
    formatter=OpenAIMultiAgentFormatter(),
)

For non-standard payload shapes (e.g. a provider whose API doesn’t follow the OpenAI or Anthropic conventions), subclass FormatterBase and pass an instance through the same formatter argument.

Custom Provider

You can extend AgentScope with your own model provider by implementing a credential and a chat model, then registering the credential.

Step 1: Define the Credential

Subclass CredentialBase with a unique type discriminator and implement get_chat_model_class():

from typing import Literal, Type, TYPE_CHECKING
from pydantic import ConfigDict, Field, SecretStr
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.model import ChatModelBase

class MyProviderCredential(CredentialBase):
    model_config = ConfigDict(title="My Provider API")
    type: Literal["my_provider_credential"] = "my_provider_credential"

    api_key: SecretStr = Field(description="API key for My Provider.")
    base_url: str = Field(default="https://api.myprovider.com/v1")

    @classmethod
    def get_chat_model_class(cls) -> Type["ChatModelBase"]:
        from .my_model import MyProviderChatModel
        return MyProviderChatModel

Step 2: Implement the Chat Model

Subclass ChatModelBase, define a Parameters inner class, and implement _call_api:

from typing import Literal, Any, AsyncGenerator
from pydantic import BaseModel, Field
from agentscope.model import ChatModelBase, ChatResponse
from agentscope.message import Msg
from agentscope.tool import ToolChoice
from agentscope.formatter import FormatterBase, OpenAIChatFormatter

class MyProviderChatModel(ChatModelBase):
    class Parameters(BaseModel):
        max_tokens: int | None = Field(default=None, gt=0)
        temperature: float | None = Field(default=None, ge=0, le=2)

    type: Literal["my_provider_chat"] = "my_provider_chat"

    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: Parameters | None = None,
        stream: bool = True,
        max_retries: int = 3,
        context_size: int = 128000,
        formatter: FormatterBase | None = None,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters or self.Parameters(),
            stream=stream,
            max_retries=max_retries,
            context_size=context_size,
        )
        # If your API follows the OpenAI format, reuse OpenAIChatFormatter;
        # otherwise implement your own FormatterBase subclass.
        self.formatter = formatter or OpenAIChatFormatter()

    async def _call_api(
        self,
        model_name: str,
        messages: list[Msg],
        tools: list[dict] | None = None,
        tool_choice: ToolChoice | None = None,
        **kwargs: Any,
    ) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
        formatted_messages = await self.formatter.format(messages)
        # Call your provider's API using self.credential.api_key, etc.
        ...

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your model implementation. Each file describes one model — its capabilities (input_types, output_types), limits (context_size, output_size), and any per-model parameter_overrides:

name: my-model-v1
label: My Model V1
status: active
input_types:
  - text/plain
output_types:
  - text/plain
context_size: 128000
output_size: 4096
parameter_overrides:
  max_tokens: {"maximum": 4096}

MyProviderChatModel.list_models() then loads every YAML in that directory. To pull cards from a different location — for example, a registry your application maintains separately — pass custom_yaml_dir:

cards = MyProviderChatModel.list_models(custom_yaml_dir="/path/to/cards")

Integrate with Frontend

What is ModelCard

ModelCard is a declarative description of a model’s capabilities and constraints, designed to drive the frontend — model selectors, parameter forms, and feature toggles can be rendered dynamically without hardcoding any provider-specific knowledge. Each ModelCard contains:

Field	Type	Description
`name`	`str`	Model identifier (e.g. `"claude-sonnet-4-6"`)
`label`	`str`	Human-readable display name (e.g. `"Claude Sonnet 4.6"`)
`status`	`"active" \| "deprecated" \| "sunset"`	Model lifecycle status
`input_types`	`list[str]`	Accepted input MIME types — used by the frontend to filter attachment uploads (e.g. only show an image button when `image/*` is supported)
`output_types`	`list[str]`	Output MIME types the model can produce — advertises capabilities such as a thinking toggle when `application/x-thinking` is present
`context_size`	`int`	Maximum context window in tokens
`output_size`	`int`	Maximum output tokens
`parameter_schema`	`dict`	Final JSON Schema for the parameter form — base schema merged with per-model overrides (see below)
`parameters_overrides`	`dict[str, dict]`	The raw per-model overrides, before merging

input_types and output_types use MIME types to describe modality. Common values:

MIME Type	Meaning
`text/plain`	Text
`application/x-thinking`	Reasoning / chain-of-thought
`image/*` (e.g. `image/png`, `image/jpeg`)	Image
`audio/*` (e.g. `audio/wav`, `audio/mp3`)	Audio
`video/*` (e.g. `video/mp4`)	Video

A typical YAML card for claude-sonnet-4-6:

name: claude-sonnet-4-6
label: Claude Sonnet 4.6
status: active

input_types:
  - text/plain
  - image/jpeg
  - image/png
  - image/gif
  - image/webp

output_types:
  - text/plain
  - application/x-thinking

context_size: 1000000
output_size: 65536

parameter_overrides:
  max_tokens: {"maximum": 65536}

Parameter schema and overrides

The parameter_schema exposed to the frontend is built in two layers:

Base schema — auto-derived from the chat model’s Parameters class via model_json_schema(). This lists every adjustable parameter (temperature, max_tokens, thinking_enable, …) along with its type and the API-wide range.
Per-model overrides — the YAML’s parameter_overrides block is merged on top, field by field.

Overrides matter because adjustable ranges are not uniform across an API: every Qwen model accepts max_tokens, but each one has a different ceiling. Overrides let a card tighten a range, pin a default, or hide a parameter that doesn’t apply.

Override syntax	Effect
`param: { ... }`	Shallow-merge into the base field (e.g. `max_tokens: {maximum: 16384}`)
`param: { hidden: true }`	Hide the parameter from the frontend
`param: null`	Remove the parameter entirely

Retrieve ModelCards

You retrieve model cards by calling list_models() on either the credential class or the model class. Internally, CredentialBase.list_models() delegates to its linked ChatModelBase subclass (obtained via get_chat_model_class()), which loads YAML card definitions from its _models/ directory.

from agentscope.credential import DashScopeCredential
from agentscope.model import AnthropicChatModel

# Via credential class
cards = DashScopeCredential.list_models()

# Or directly on the model class
cards = AnthropicChatModel.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")

The credential’s get_chat_model_class() returns the corresponding ChatModelBase subclass, which in turn knows where to find its model card YAML files:

model_cls = DashScopeCredential.get_chat_model_class()  # -> DashScopeChatModel
cards = model_cls.list_models()                          # -> list[ModelCard]

This design allows the frontend to discover available models, their capabilities, and valid parameter ranges — all from a single credential, without any hardcoded provider logic.

TTS

A TTS Model converts text into synthesized speech audio, supporting both standard and realtime (streaming-input) synthesis modes. AgentScope currently ships the following TTS model classes:

Provider	Model Class	Highlights
DashScope	`DashScopeTTSModel`	Qwen3-TTS, multiple voices, streaming output
DashScope (Realtime)	`DashScopeRealtimeTTSModel`	Qwen3-TTS WebSocket streaming input, ideal for LLM output piping
DashScope (CosyVoice Realtime)	`DashScopeCosyVoiceRealtimeTTSModel`	CosyVoice-v3 streaming input, supports cosyvoice-v3-plus/flash/sambert

Create TTS Model

Every TTS model takes a credential, a model name, and an optional provider-specific Parameters object. The two tabs below show the standard and realtime setups:

import os
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

tts = DashScopeTTSModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="qwen3-tts-flash",
    parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
    stream=True,
)

Common constructor arguments shared by every TTS model:

Argument	Type	Description
`credential`	`CredentialBase`	Provider-specific credential
`model`	`str`	Model identifier (e.g. `"qwen3-tts-flash"`)
`parameters`	`Parameters \| None`	Provider-specific parameters such as `voice`
`stream`	`bool`	Whether to stream audio output

Additional arguments for DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel:

Argument	Type	Default	Description
`cold_start_length`	`int \| None`	`None`	Minimum character count before first text chunk is sent to the API
`cold_start_words`	`int \| None`	`None`	Minimum word count before first text chunk is sent
`max_retries`	`int`	`3`	Maximum retry attempts on WebSocket failure
`retry_delay`	`float`	`5.0`	Initial retry delay in seconds (exponential backoff)

Call TTS Model

Invoke the model by calling synthesize() with the text to speak:

async def synthesize(
    self,
    text: str | None = None,
    **kwargs: Any,
) -> TTSResponse | AsyncGenerator[TTSResponse, None]:

The return type follows the model’s stream setting:

stream=False — returns a single TTSResponse with the complete audio.
stream=True — returns an AsyncGenerator[TTSResponse, None]. Each chunk carries an incremental audio delta; the final chunk has is_last=True.

Each TTSResponse carries:

Field	Type	Description
`content`	`DataBlock \| None`	Audio data. Format indicated by `content.source.media_type` (e.g. `"audio/wav"`, `"audio/pcm;rate=24000"`)
`is_last`	`bool`	`True` on the final streaming chunk
`usage`	`TTSUsage \| None`	Token counts (`input_tokens`, `output_tokens`) and elapsed `time` in seconds
`id`	`str`	Auto-generated unique identifier
`metadata`	`dict \| None`	Optional provider-specific metadata

import asyncio
import os
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash",
        parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    # Streaming synthesis
    async for chunk in await tts.synthesize("Hello, world!"):
        if chunk.content:
            # chunk.content is a DataBlock with base64-encoded audio/wav
            print(f"Audio chunk: {len(chunk.content.source.data)} bytes")

asyncio.run(main())

Realtime TTS (Streaming Input)

For realtime models (DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel), text can be pushed incrementally as it arrives from a streaming LLM. Both share the same push() / synthesize() interface. The lifecycle is managed via async with or manual connect() / close():

DashScopeRealtimeTTSModel (Qwen3) produces audio at token-level granularity — each push() call typically returns audio data. In contrast, DashScopeCosyVoiceRealtimeTTSModel relies on the CosyVoice server which automatically segments text into sentences before synthesizing. Audio is only returned after a complete sentence boundary is detected, so push() may return empty responses for partial sentences. Calling synthesize() forces synthesis of all remaining text including incomplete sentences.

import asyncio
import os
from agentscope.tts import DashScopeRealtimeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeRealtimeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash-realtime",
        parameters=DashScopeRealtimeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    async with tts:
        # Push text incrementally as it arrives from a streaming LLM.
        # Each push() returns a TTSResponse with audio accumulated so far
        # (or content=None if not yet available).
        resp1 = await tts.push("Hello, ")
        if resp1.content:
            print("Audio available after first push")

        resp2 = await tts.push("how are you today?")
        if resp2.content:
            print("Audio available after second push")

        # Finalize: flush remaining buffered text and collect final audio.
        # text= is optional — pass extra text to append before finalizing,
        # or omit to finalize previously pushed text only.
        response = await tts.synthesize()

asyncio.run(main())

Method	Description
`connect()`	Open WebSocket connection
`push(text)`	Append text incrementally (non-blocking), returns audio accumulated so far
`synthesize()`	Finalize and return remaining audio
`close()`	Tear down connection

Integrate with Agent

In the agent layer, TTS is integrated via TTSMiddleware — it intercepts the agent’s text output and synthesizes speech automatically:

from agentscope.agent import Agent
from agentscope.middleware import TTSMiddleware
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

agent = Agent(
    name="assistant",
    model=chat_model,
    middlewares=[
        TTSMiddleware(
            DashScopeTTSModel(
                credential=DashScopeCredential(api_key="..."),
                model="qwen3-tts-flash",
                parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
                stream=True,
            ),
        ),
    ],
)

# The agent's reply stream now includes audio events
async for event in agent.reply_stream(user_msg):
    # TextBlockDeltaEvent — text content
    # DataBlockDeltaEvent — audio content (WAV)
    ...

The middleware automatically selects the optimal synthesis strategy:

TTS Mode	Middleware Behavior
Non-realtime	Waits for full text, then synthesizes all at once
Realtime	Pushes text deltas as they arrive, streams audio back concurrently

TTS Model Card

TTSModelCard describes a TTS model’s capabilities — available voices, streaming support, and parameter ranges — and is used to drive the frontend model picker. Each card is defined by a YAML file alongside the model implementation:

Qwen3 TTS

name: qwen3-tts-flash
label: Qwen3-TTS-Flash
status: active
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - Cherry
  - Serena
  - Ethan
  - Chelsie
parameter_overrides: {}

CosyVoice Realtime

name: cosyvoice-v3-plus
label: CosyVoice-v3-Plus
status: active
realtime: true
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - longanyang
  - longxiaochun
  - longshuo
parameter_overrides: {}

The voices list is automatically injected into the parameter_schema as an enum constraint on the voice field, so the frontend renders a dropdown selector.

Field	Type	Description
`name`	`str`	Model identifier (e.g. `"qwen3-tts-flash"`)
`label`	`str`	Display name (e.g. `"Qwen3-TTS-Flash"`)
`status`	`str`	`"active"`, `"deprecated"`, or `"sunset"`
`realtime`	`bool`	Whether model supports streaming input
`input_types`	`list[str]`	Accepted input MIME types (always `["text/plain"]`)
`output_types`	`list[str]`	Output MIME types (typically `["audio/wav"]`)
`parameter_schema`	`dict`	Merged JSON Schema for the parameter form — base schema from `Parameters` class, enriched with `voices` enum from YAML
`parameters_overrides`	`dict`	Per-model overrides (same syntax as chat model cards)

Retrieve TTS model cards via the credential:

from agentscope.credential import DashScopeCredential

cards = DashScopeCredential.list_tts_models()
for card in cards:
    print(f"{card.name} (realtime={card.realtime}): {card.label}")

Or directly on the model class:

from agentscope.tts import DashScopeTTSModel, DashScopeCosyVoiceRealtimeTTSModel

# Qwen3 TTS models
cards = DashScopeTTSModel.list_models()

# CosyVoice Realtime models
cosyvoice_cards = DashScopeCosyVoiceRealtimeTTSModel.list_models()

Custom TTS Provider

To add a new TTS provider, implement a TTSModelBase subclass and register it on the credential:

from typing import Literal, Type, TYPE_CHECKING, AsyncGenerator, Any
from pydantic import BaseModel, Field
from agentscope.tts import TTSModelBase, TTSResponse
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.tts import TTSModelBase as TTSBase

class MyTTSModel(TTSModelBase):
    class Parameters(BaseModel):
        voice: str = Field(default="default", title="Voice")

    type: Literal["my_tts"] = "my_tts"

    async def synthesize(
        self, text: str | None = None, **kwargs: Any
    ) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
        # Call your provider's API here
        ...

# Register on your credential
class MyCredential(CredentialBase):
    @classmethod
    def get_tts_model_classes(cls) -> list[Type["TTSBase"]]:
        return [MyTTSModel]

Embedding

An Embedding Model converts text — and, for multimodal models, images, videos, and other media — into dense vectors that power semantic search, RAG, and memory retrieval. AgentScope currently ships the following embedding model classes:

Provider	Model Class	Highlights
DashScope	`DashScopeEmbeddingModel`	Unified text + multimodal API (`text-embedding-v4`, `qwen3-vl-embedding`, …), content-aware batching
OpenAI	`OpenAIEmbeddingModel`	`text-embedding-3-small/large`, compatible with OpenAI-compatible endpoints
Gemini	`GeminiEmbeddingModel`	Text (`gemini-embedding-001`) and multimodal (`gemini-embedding-2`, image / video / audio / PDF)
Ollama	`OllamaEmbeddingModel`	Local embedding models (`nomic-embed-text`, …), credential carries the host URL

Create Embedding Model

Every embedding model takes a credential, a model name, and an optional Parameters object — the same pattern as chat models. Parameters carries dimensions, the output vector size:

import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

model = DashScopeEmbeddingModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="text-embedding-v4",
    parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
)

Common constructor arguments shared by every embedding model:

Argument	Type	Description
`credential`	`CredentialBase`	Provider-specific credential
`model`	`str`	Model identifier (e.g. `"text-embedding-v4"`)
`parameters`	`Parameters \| None`	`dimensions` — the output vector size (default `512`)
`embedding_cache`	`EmbeddingCacheBase \| None`	Optional cache that skips repeated API calls (see below)
`context_size`	`int`	Maximum input tokens per item
`max_retries`	`int`	Maximum retries per batch on retryable failures
`retry_delay`	`float`	Seconds between retry attempts

Valid dimensions values differ per model — each model card pins the supported enum and default via parameter_overrides (e.g. text-embedding-v4 accepts 2048 / 1536 / 1024 / … / 64). See EmbeddingModelCard.

Call Embedding Model

Invoke the model by calling it with a list of inputs. Text-only models accept list[str]; multimodal models also accept DataBlock elements:

async def __call__(
    self,
    inputs: list[str | DataBlock],
    **kwargs: Any,
) -> EmbeddingResponse:

Batching and retry are handled for you:

Inputs are split into chunks of the model’s batch size (10 for DashScope text, 2048 for OpenAI, 100 for Gemini, 512 for Ollama).
All chunks are dispatched concurrently via asyncio.gather.
Each chunk is retried independently up to max_retries times on provider-specific retryable errors.
Results are merged into a single EmbeddingResponse, preserving input order.

import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
    )
    response = await model(
        ["What is AgentScope?", "A multi-agent framework."],
    )
    print(len(response.embeddings))     # 2 — one vector per input
    print(len(response.embeddings[0]))  # 1024
    print(response.usage.tokens)        # total tokens consumed
    print(response.source)              # "api" or "cache"

asyncio.run(main())

Each EmbeddingResponse carries:

Field	Type	Description
`embeddings`	`list[Embedding]`	One vector per input, in input order
`usage`	`EmbeddingUsage \| None`	`tokens` consumed and `time` elapsed in seconds
`source`	`"api" \| "cache"`	Whether the result came from the API or the cache
`id` / `created_at` / `type`	`str`	Response identity and timestamp; `type` is always `"embedding"`

Multimodal Embedding

Multimodal models (DashScopeEmbeddingModel with qwen3-vl-embedding etc., GeminiEmbeddingModel with gemini-embedding-2) accept DataBlock inputs alongside strings — images as URL or base64, videos as URL:

import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential
from agentscope.message import DataBlock, URLSource

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-vl-embedding",
    )
    response = await model([
        "A cat sitting on a windowsill",
        DataBlock(
            source=URLSource(
                url="https://example.com/cat.png",
                media_type="image/png",
            ),
        ),
    ])
    print(len(response.embeddings))  # 2 — one vector per input

asyncio.run(main())

Multimodal models replace the plain batch-size split with content-aware batching: inputs are greedily packed into batches that respect the model’s per-request limits on total elements, images, and videos (e.g. qwen3-vl-embedding allows 20 elements / 5 images / 1 video per request, tongyi-embedding-vision-plus allows 20 / 64 / 8). You never need to split inputs yourself.

Embedding Cache

Pass an EmbeddingCacheBase implementation through the embedding_cache argument to reuse previously computed vectors. The built-in FileEmbeddingCache stores each result as a .npy file keyed by the SHA-256 hash of the request:

import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel, FileEmbeddingCache
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        embedding_cache=FileEmbeddingCache(
            cache_dir="./.cache/embeddings",
            max_file_number=1000,
            max_cache_size=100,  # MB
        ),
    )
    r1 = await model(["What is AgentScope?"])
    print(r1.source)  # "api" — first call hits the API

    r2 = await model(["What is AgentScope?"])
    print(r2.source)  # "cache" — identical request served locally

asyncio.run(main())

When max_file_number or max_cache_size is exceeded, the oldest files are evicted first. To use a different backend (Redis, SQLite, …), subclass EmbeddingCacheBase and implement its four methods: store, retrieve, remove, and clear.

Custom Embedding Provider

Adding an embedding provider follows the same steps as a chat provider.

Step 1: Link the Credential

Override get_embedding_model_class() on your credential (the base implementation returns None, meaning “no embedding support”):

from typing import Type, TYPE_CHECKING
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.embedding import EmbeddingModelBase

class MyProviderCredential(CredentialBase):
    # ... fields and get_chat_model_class() as before ...

    @classmethod
    def get_embedding_model_class(cls) -> Type["EmbeddingModelBase"]:
        from .my_embedding import MyProviderEmbeddingModel
        return MyProviderEmbeddingModel

Step 2: Implement the Embedding Model

Subclass EmbeddingModelBase and implement _call_api for a single batch — batching, concurrency, and retry are inherited from the base class. Declare provider-specific transient errors via _get_retryable_exceptions:

from typing import Any, Type
from agentscope.embedding import EmbeddingModelBase, EmbeddingResponse, EmbeddingUsage

class MyProviderEmbeddingModel(EmbeddingModelBase[str]):
    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: "MyProviderEmbeddingModel.Parameters | None" = None,
        context_size: int = 8192,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters,
            context_size=context_size,
            batch_size=100,          # max items per API call
            max_retries=max_retries,
            retry_delay=retry_delay,
        )

    @classmethod
    def _get_retryable_exceptions(cls) -> tuple[Type[Exception], ...]:
        return (TimeoutError,)       # retried up to max_retries times

    async def _call_api(
        self,
        inputs: list[str],
        **kwargs: Any,
    ) -> EmbeddingResponse:
        # len(inputs) <= self.batch_size is guaranteed.
        # Call your provider's API and return the vectors.
        ...

Bind the generic parameter to the input type your provider supports: EmbeddingModelBase[str] for text-only, EmbeddingModelBase[str | DataBlock] for multimodal — IDEs then surface the correct inputs type to callers.

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your implementation; MyProviderEmbeddingModel.list_models() then picks them up — exactly like chat model cards.

EmbeddingModelCard

EmbeddingModelCard mirrors ModelCard for the frontend, with embedding-specific defaults — the output type application/x-embedding marks a model as producing dense vectors:

Field	Difference from `ModelCard`
`type`	Always `"embedding_model"`
`input_types`	Defaults to `["text/plain"]`; multimodal cards add `image/`, `video/`, …
`output_types`	Defaults to `["application/x-embedding"]`
`parameter_schema`	Built from the embedding `Parameters` class (`dimensions`) merged with YAML `parameter_overrides` — same override semantics as chat cards
`output_size`	Not present — embedding models have no output token limit

A typical YAML card:

name: text-embedding-v4
label: Text Embedding v4
status: active

input_types:
  - text/plain

output_types:
  - application/x-embedding

context_size: 8192

parameter_overrides:
  dimensions:
    default: 1024
    enum: [2048, 1536, 1024, 768, 512, 256, 128, 64]

Retrieve cards from the model class directly, or discover the class from a credential via get_embedding_model_class():

from agentscope.credential import DashScopeCredential
from agentscope.embedding import OpenAIEmbeddingModel

# Directly on the model class
cards = OpenAIEmbeddingModel.list_models()

# Or discover the class from a credential
embed_cls = DashScopeCredential.get_embedding_model_class()
cards = embed_cls.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")

Realtime Model

Coming soon — we are migrating Realtime Model support from v1.0 to v2.0.

​Overview

​Chat Model

​Create Chat Model

​Call Chat Model

​Generate Structured Output

​Formatter

​Custom Provider

​Step 1: Define the Credential

​Step 2: Implement the Chat Model

​Step 3: Add Model Cards (optional)

​Integrate with Frontend

​What is ModelCard

​Parameter schema and overrides

​Retrieve ModelCards

​TTS

​Create TTS Model

​Call TTS Model

​Realtime TTS (Streaming Input)

​Integrate with Agent

​TTS Model Card

​Custom TTS Provider

​Embedding

​Create Embedding Model

​Call Embedding Model

​Multimodal Embedding

​Embedding Cache

​Custom Embedding Provider

​Step 1: Link the Credential

​Step 2: Implement the Embedding Model

​Step 3: Add Model Cards (optional)

​EmbeddingModelCard

​Realtime Model

Overview

Chat Model

Create Chat Model

Call Chat Model

Generate Structured Output

Formatter

Custom Provider

Step 1: Define the Credential

Step 2: Implement the Chat Model

Step 3: Add Model Cards (optional)

Integrate with Frontend

What is ModelCard

Parameter schema and overrides

Retrieve ModelCards

TTS

Create TTS Model

Call TTS Model

Realtime TTS (Streaming Input)

Integrate with Agent

TTS Model Card

Custom TTS Provider

Embedding

Create Embedding Model

Call Embedding Model

Multimodal Embedding

Embedding Cache

Custom Embedding Provider

Step 1: Link the Credential

Step 2: Implement the Embedding Model

Step 3: Add Model Cards (optional)

EmbeddingModelCard

Realtime Model