Skip to main content

Overview

The model layer is organized as a two-tier hierarchy: a Credential at the top, and the model families a provider exposes beneath it — Chat Model, TTS, Embedding, and Realtime Model.
Credential
ChatModelBase
OpenAIChatModel
OpenAIResponseModel
AnthropicChatModel
DashScopeChatModel
DeepSeekChatModel
GeminiChatModel
MoonshotChatModel
XAIChatModel
OllamaChatModel
TTSModelBase
DashScopeTTSModel
DashScopeRealtimeTTSModel
DashScopeCosyVoiceRealtimeTTSModel
EmbeddingModelBase
DashScopeEmbeddingModel
OpenAIEmbeddingModel
GeminiEmbeddingModel
OllamaEmbeddingModel
RealtimeModelBase (coming soon)
A Credential carries the API authentication fields a provider requires (api_key, base_url, …). From a credential, you can retrieve the list of available models for each model family that provider supports. This layering mirrors the natural frontend flow — register a credential first, then pick a model from under it — letting the UI authenticate once and surface every model family the provider supports.

Chat Model

A chat model is the LLM that drives an agent’s conversation and tool calls, accepting and producing multimodal content beyond plain text. AgentScope currently ships the following chat model classes:
ProviderModel Class
OpenAIOpenAIChatModel
OpenAI (Responses API)OpenAIResponseModel
AnthropicAnthropicChatModel
DashScopeDashScopeChatModel
DeepSeekDeepSeekChatModel
GeminiGeminiChatModel
MoonshotMoonshotChatModel
xAIXAIChatModel
OllamaOllamaChatModel

Create Chat Model

Every chat model takes a credential, a model name, and an optional provider-specific Parameters object. The three tabs below show typical setups for streaming, tool calling, and reasoning:
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential

model = DashScopeChatModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="qwen-plus",
    stream=True,
)
Common constructor arguments shared by every chat model:
ArgumentTypeDescription
credentialCredentialBaseProvider-specific credential
modelstrModel identifier (e.g. "qwen-plus")
parametersParameters | NoneProvider-specific parameters such as temperature, thinking_enable, parallel_tool_calls
streamboolWhether to stream output
max_retriesintMaximum API retries on failure
context_sizeintContext window used for context compression
formatterFormatterBase | NoneOverride message formatter

Call Chat Model

Invoke the model by calling it with a list of Msg objects, plus optional tools and tool_choice:
async def __call__(
    self,
    messages: list[Msg],
    tools: list[dict] | None = None,
    tool_choice: ToolChoice | None = None,
    **kwargs: Any,
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
The return type follows the model’s stream setting:
  • stream=False — awaits a single ChatResponse carrying the full output.
  • stream=True — awaits an AsyncGenerator[ChatResponse, None]. Intermediate chunks (is_last=False) carry only the delta generated in that step. So that callers don’t have to accumulate deltas themselves, AgentScope appends one final chunk with is_last=True that carries the full accumulated content.
import asyncio
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=True,
    )
    msgs = [UserMsg(name="user", content="Count from 1 to 5.")]

    async for chunk in await model(msgs):
        if chunk.is_last:
            print("Final:", chunk.content)   # full accumulated content
        else:
            print("Delta:", chunk.content)   # delta only

asyncio.run(main())
A representative streaming trace, illustrating the delta-then-accumulated pattern:
Delta: [TextBlock(text='1')]
Delta: [TextBlock(text=', 2,')]
Delta: [TextBlock(text=' 3, ')]
Delta: [TextBlock(text='4, 5')]
Final: [TextBlock(text='1, 2, 3, 4, 5')]
Each ChatResponse carries content blocks (TextBlock, ThinkingBlock, ToolCallBlock, DataBlock), an is_last flag, and a ChatUsage recording token counts and elapsed time.

Generate Structured Output

When you need a JSON object that conforms to a Pydantic model or JSON schema, call generate_structured_output instead of __call__. It returns a StructuredResponse whose content is a validated dict matching the schema:
import asyncio
import os
from pydantic import BaseModel
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

class WeatherInfo(BaseModel):
    city: str
    temperature: float
    unit: str

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=False,
    )
    response = await model.generate_structured_output(
        messages=[UserMsg(name="user", content="What's the weather in Shanghai?")],
        structured_model=WeatherInfo,
    )
    print(response.content)  # validated dict matching WeatherInfo

asyncio.run(main())
generate_structured_output synthesizes a forced tool call from the schema, then validates and repairs the model’s response.

Formatter

A formatter translates AgentScope’s Msg objects into the list[dict] payload that each provider’s API expects. It is configured via the optional formatter argument on the chat model constructor. Every provider ships two built-in variants:
VariantUse Case
ChatFormatter (default)Standard single-agent dialog. Each Msg maps 1:1 to an API message, preserving native roles (user, assistant, system).
MultiAgentFormatterMulti-agent scenarios such as debate or moderation. Consecutive agent messages are grouped and wrapped in <history> tags with the sender’s name, while tool call / result sequences keep their native API format.
Switch to multi-agent mode by passing the MultiAgent variant — no agent code changes are required:
import os
from agentscope.model import OpenAIChatModel
from agentscope.credential import OpenAICredential
from agentscope.formatter import OpenAIMultiAgentFormatter

model = OpenAIChatModel(
    credential=OpenAICredential(api_key=os.environ["OPENAI_API_KEY"]),
    model="gpt-4.1",
    formatter=OpenAIMultiAgentFormatter(),
)
For non-standard payload shapes (e.g. a provider whose API doesn’t follow the OpenAI or Anthropic conventions), subclass FormatterBase and pass an instance through the same formatter argument.

Custom Provider

You can extend AgentScope with your own model provider by implementing a credential and a chat model, then registering the credential.

Step 1: Define the Credential

Subclass CredentialBase with a unique type discriminator and implement get_chat_model_class():
from typing import Literal, Type, TYPE_CHECKING
from pydantic import ConfigDict, Field, SecretStr
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.model import ChatModelBase

class MyProviderCredential(CredentialBase):
    model_config = ConfigDict(title="My Provider API")
    type: Literal["my_provider_credential"] = "my_provider_credential"

    api_key: SecretStr = Field(description="API key for My Provider.")
    base_url: str = Field(default="https://api.myprovider.com/v1")

    @classmethod
    def get_chat_model_class(cls) -> Type["ChatModelBase"]:
        from .my_model import MyProviderChatModel
        return MyProviderChatModel

Step 2: Implement the Chat Model

Subclass ChatModelBase, define a Parameters inner class, and implement _call_api:
from typing import Literal, Any, AsyncGenerator
from pydantic import BaseModel, Field
from agentscope.model import ChatModelBase, ChatResponse
from agentscope.message import Msg
from agentscope.tool import ToolChoice
from agentscope.formatter import FormatterBase, OpenAIChatFormatter

class MyProviderChatModel(ChatModelBase):
    class Parameters(BaseModel):
        max_tokens: int | None = Field(default=None, gt=0)
        temperature: float | None = Field(default=None, ge=0, le=2)

    type: Literal["my_provider_chat"] = "my_provider_chat"

    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: Parameters | None = None,
        stream: bool = True,
        max_retries: int = 3,
        context_size: int = 128000,
        formatter: FormatterBase | None = None,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters or self.Parameters(),
            stream=stream,
            max_retries=max_retries,
            context_size=context_size,
        )
        # If your API follows the OpenAI format, reuse OpenAIChatFormatter;
        # otherwise implement your own FormatterBase subclass.
        self.formatter = formatter or OpenAIChatFormatter()

    async def _call_api(
        self,
        model_name: str,
        messages: list[Msg],
        tools: list[dict] | None = None,
        tool_choice: ToolChoice | None = None,
        **kwargs: Any,
    ) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
        formatted_messages = await self.formatter.format(messages)
        # Call your provider's API using self.credential.api_key, etc.
        ...

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your model implementation. Each file describes one model — its capabilities (input_types, output_types), limits (context_size, output_size), and any per-model parameter_overrides:
name: my-model-v1
label: My Model V1
status: active
input_types:
  - text/plain
output_types:
  - text/plain
context_size: 128000
output_size: 4096
parameter_overrides:
  max_tokens: {"maximum": 4096}
MyProviderChatModel.list_models() then loads every YAML in that directory. To pull cards from a different location — for example, a registry your application maintains separately — pass custom_yaml_dir:
cards = MyProviderChatModel.list_models(custom_yaml_dir="/path/to/cards")

Integrate with Frontend

What is ModelCard

ModelCard is a declarative description of a model’s capabilities and constraints, designed to drive the frontend — model selectors, parameter forms, and feature toggles can be rendered dynamically without hardcoding any provider-specific knowledge. Each ModelCard contains:
FieldTypeDescription
namestrModel identifier (e.g. "claude-sonnet-4-6")
labelstrHuman-readable display name (e.g. "Claude Sonnet 4.6")
status"active" | "deprecated" | "sunset"Model lifecycle status
input_typeslist[str]Accepted input MIME types — used by the frontend to filter attachment uploads (e.g. only show an image button when image/* is supported)
output_typeslist[str]Output MIME types the model can produce — advertises capabilities such as a thinking toggle when application/x-thinking is present
context_sizeintMaximum context window in tokens
output_sizeintMaximum output tokens
parameter_schemadictFinal JSON Schema for the parameter form — base schema merged with per-model overrides (see below)
parameters_overridesdict[str, dict]The raw per-model overrides, before merging
input_types and output_types use MIME types to describe modality. Common values:
MIME TypeMeaning
text/plainText
application/x-thinkingReasoning / chain-of-thought
image/* (e.g. image/png, image/jpeg)Image
audio/* (e.g. audio/wav, audio/mp3)Audio
video/* (e.g. video/mp4)Video
A typical YAML card for claude-sonnet-4-6:
name: claude-sonnet-4-6
label: Claude Sonnet 4.6
status: active

input_types:
  - text/plain
  - image/jpeg
  - image/png
  - image/gif
  - image/webp

output_types:
  - text/plain
  - application/x-thinking

context_size: 1000000
output_size: 65536

parameter_overrides:
  max_tokens: {"maximum": 65536}

Parameter schema and overrides

The parameter_schema exposed to the frontend is built in two layers:
  1. Base schema — auto-derived from the chat model’s Parameters class via model_json_schema(). This lists every adjustable parameter (temperature, max_tokens, thinking_enable, …) along with its type and the API-wide range.
  2. Per-model overrides — the YAML’s parameter_overrides block is merged on top, field by field.
Overrides matter because adjustable ranges are not uniform across an API: every Qwen model accepts max_tokens, but each one has a different ceiling. Overrides let a card tighten a range, pin a default, or hide a parameter that doesn’t apply.
Override syntaxEffect
param: { ... }Shallow-merge into the base field (e.g. max_tokens: {maximum: 16384})
param: { hidden: true }Hide the parameter from the frontend
param: nullRemove the parameter entirely

Retrieve ModelCards

You retrieve model cards by calling list_models() on either the credential class or the model class. Internally, CredentialBase.list_models() delegates to its linked ChatModelBase subclass (obtained via get_chat_model_class()), which loads YAML card definitions from its _models/ directory.
from agentscope.credential import DashScopeCredential
from agentscope.model import AnthropicChatModel

# Via credential class
cards = DashScopeCredential.list_models()

# Or directly on the model class
cards = AnthropicChatModel.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")
The credential’s get_chat_model_class() returns the corresponding ChatModelBase subclass, which in turn knows where to find its model card YAML files:
model_cls = DashScopeCredential.get_chat_model_class()  # -> DashScopeChatModel
cards = model_cls.list_models()                          # -> list[ModelCard]
This design allows the frontend to discover available models, their capabilities, and valid parameter ranges — all from a single credential, without any hardcoded provider logic.

TTS

A TTS Model converts text into synthesized speech audio, supporting both standard and realtime (streaming-input) synthesis modes. AgentScope currently ships the following TTS model classes:
ProviderModel ClassHighlights
DashScopeDashScopeTTSModelQwen3-TTS, multiple voices, streaming output
DashScope (Realtime)DashScopeRealtimeTTSModelQwen3-TTS WebSocket streaming input, ideal for LLM output piping
DashScope (CosyVoice Realtime)DashScopeCosyVoiceRealtimeTTSModelCosyVoice-v3 streaming input, supports cosyvoice-v3-plus/flash/sambert

Create TTS Model

Every TTS model takes a credential, a model name, and an optional provider-specific Parameters object. The two tabs below show the standard and realtime setups:
import os
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

tts = DashScopeTTSModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="qwen3-tts-flash",
    parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
    stream=True,
)
Common constructor arguments shared by every TTS model:
ArgumentTypeDescription
credentialCredentialBaseProvider-specific credential
modelstrModel identifier (e.g. "qwen3-tts-flash")
parametersParameters | NoneProvider-specific parameters such as voice
streamboolWhether to stream audio output
Additional arguments for DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel:
ArgumentTypeDefaultDescription
cold_start_lengthint | NoneNoneMinimum character count before first text chunk is sent to the API
cold_start_wordsint | NoneNoneMinimum word count before first text chunk is sent
max_retriesint3Maximum retry attempts on WebSocket failure
retry_delayfloat5.0Initial retry delay in seconds (exponential backoff)

Call TTS Model

Invoke the model by calling synthesize() with the text to speak:
async def synthesize(
    self,
    text: str | None = None,
    **kwargs: Any,
) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
The return type follows the model’s stream setting:
  • stream=False — returns a single TTSResponse with the complete audio.
  • stream=True — returns an AsyncGenerator[TTSResponse, None]. Each chunk carries an incremental audio delta; the final chunk has is_last=True.
Each TTSResponse carries:
FieldTypeDescription
contentDataBlock | NoneAudio data. Format indicated by content.source.media_type (e.g. "audio/wav", "audio/pcm;rate=24000")
is_lastboolTrue on the final streaming chunk
usageTTSUsage | NoneToken counts (input_tokens, output_tokens) and elapsed time in seconds
idstrAuto-generated unique identifier
metadatadict | NoneOptional provider-specific metadata
import asyncio
import os
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash",
        parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    # Streaming synthesis
    async for chunk in await tts.synthesize("Hello, world!"):
        if chunk.content:
            # chunk.content is a DataBlock with base64-encoded audio/wav
            print(f"Audio chunk: {len(chunk.content.source.data)} bytes")

asyncio.run(main())

Realtime TTS (Streaming Input)

For realtime models (DashScopeRealtimeTTSModel and DashScopeCosyVoiceRealtimeTTSModel), text can be pushed incrementally as it arrives from a streaming LLM. Both share the same push() / synthesize() interface. The lifecycle is managed via async with or manual connect() / close():
DashScopeRealtimeTTSModel (Qwen3) produces audio at token-level granularity — each push() call typically returns audio data. In contrast, DashScopeCosyVoiceRealtimeTTSModel relies on the CosyVoice server which automatically segments text into sentences before synthesizing. Audio is only returned after a complete sentence boundary is detected, so push() may return empty responses for partial sentences. Calling synthesize() forces synthesis of all remaining text including incomplete sentences.
import asyncio
import os
from agentscope.tts import DashScopeRealtimeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeRealtimeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash-realtime",
        parameters=DashScopeRealtimeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    async with tts:
        # Push text incrementally as it arrives from a streaming LLM.
        # Each push() returns a TTSResponse with audio accumulated so far
        # (or content=None if not yet available).
        resp1 = await tts.push("Hello, ")
        if resp1.content:
            print("Audio available after first push")

        resp2 = await tts.push("how are you today?")
        if resp2.content:
            print("Audio available after second push")

        # Finalize: flush remaining buffered text and collect final audio.
        # text= is optional — pass extra text to append before finalizing,
        # or omit to finalize previously pushed text only.
        response = await tts.synthesize()

asyncio.run(main())
MethodDescription
connect()Open WebSocket connection
push(text)Append text incrementally (non-blocking), returns audio accumulated so far
synthesize()Finalize and return remaining audio
close()Tear down connection

Integrate with Agent

In the agent layer, TTS is integrated via TTSMiddleware — it intercepts the agent’s text output and synthesizes speech automatically:
from agentscope.agent import Agent
from agentscope.middleware import TTSMiddleware
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

agent = Agent(
    name="assistant",
    model=chat_model,
    middlewares=[
        TTSMiddleware(
            DashScopeTTSModel(
                credential=DashScopeCredential(api_key="..."),
                model="qwen3-tts-flash",
                parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
                stream=True,
            ),
        ),
    ],
)

# The agent's reply stream now includes audio events
async for event in agent.reply_stream(user_msg):
    # TextBlockDeltaEvent — text content
    # DataBlockDeltaEvent — audio content (WAV)
    ...
The middleware automatically selects the optimal synthesis strategy:
TTS ModeMiddleware Behavior
Non-realtimeWaits for full text, then synthesizes all at once
RealtimePushes text deltas as they arrive, streams audio back concurrently

TTS Model Card

TTSModelCard describes a TTS model’s capabilities — available voices, streaming support, and parameter ranges — and is used to drive the frontend model picker. Each card is defined by a YAML file alongside the model implementation:
Qwen3 TTS
name: qwen3-tts-flash
label: Qwen3-TTS-Flash
status: active
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - Cherry
  - Serena
  - Ethan
  - Chelsie
parameter_overrides: {}
CosyVoice Realtime
name: cosyvoice-v3-plus
label: CosyVoice-v3-Plus
status: active
realtime: true
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - longanyang
  - longxiaochun
  - longshuo
parameter_overrides: {}
The voices list is automatically injected into the parameter_schema as an enum constraint on the voice field, so the frontend renders a dropdown selector.
FieldTypeDescription
namestrModel identifier (e.g. "qwen3-tts-flash")
labelstrDisplay name (e.g. "Qwen3-TTS-Flash")
statusstr"active", "deprecated", or "sunset"
realtimeboolWhether model supports streaming input
input_typeslist[str]Accepted input MIME types (always ["text/plain"])
output_typeslist[str]Output MIME types (typically ["audio/wav"])
parameter_schemadictMerged JSON Schema for the parameter form — base schema from Parameters class, enriched with voices enum from YAML
parameters_overridesdictPer-model overrides (same syntax as chat model cards)
Retrieve TTS model cards via the credential:
from agentscope.credential import DashScopeCredential

cards = DashScopeCredential.list_tts_models()
for card in cards:
    print(f"{card.name} (realtime={card.realtime}): {card.label}")
Or directly on the model class:
from agentscope.tts import DashScopeTTSModel, DashScopeCosyVoiceRealtimeTTSModel

# Qwen3 TTS models
cards = DashScopeTTSModel.list_models()

# CosyVoice Realtime models
cosyvoice_cards = DashScopeCosyVoiceRealtimeTTSModel.list_models()

Custom TTS Provider

To add a new TTS provider, implement a TTSModelBase subclass and register it on the credential:
from typing import Literal, Type, TYPE_CHECKING, AsyncGenerator, Any
from pydantic import BaseModel, Field
from agentscope.tts import TTSModelBase, TTSResponse
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.tts import TTSModelBase as TTSBase

class MyTTSModel(TTSModelBase):
    class Parameters(BaseModel):
        voice: str = Field(default="default", title="Voice")

    type: Literal["my_tts"] = "my_tts"

    async def synthesize(
        self, text: str | None = None, **kwargs: Any
    ) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
        # Call your provider's API here
        ...

# Register on your credential
class MyCredential(CredentialBase):
    @classmethod
    def get_tts_model_classes(cls) -> list[Type["TTSBase"]]:
        return [MyTTSModel]

Embedding

An Embedding Model converts text — and, for multimodal models, images, videos, and other media — into dense vectors that power semantic search, RAG, and memory retrieval. AgentScope currently ships the following embedding model classes:
ProviderModel ClassHighlights
DashScopeDashScopeEmbeddingModelUnified text + multimodal API (text-embedding-v4, qwen3-vl-embedding, …), content-aware batching
OpenAIOpenAIEmbeddingModeltext-embedding-3-small/large, compatible with OpenAI-compatible endpoints
GeminiGeminiEmbeddingModelText (gemini-embedding-001) and multimodal (gemini-embedding-2, image / video / audio / PDF)
OllamaOllamaEmbeddingModelLocal embedding models (nomic-embed-text, …), credential carries the host URL

Create Embedding Model

Every embedding model takes a credential, a model name, and an optional Parameters object — the same pattern as chat models. Parameters carries dimensions, the output vector size:
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

model = DashScopeEmbeddingModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="text-embedding-v4",
    parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
)
Common constructor arguments shared by every embedding model:
ArgumentTypeDescription
credentialCredentialBaseProvider-specific credential
modelstrModel identifier (e.g. "text-embedding-v4")
parametersParameters | Nonedimensions — the output vector size (default 512)
embedding_cacheEmbeddingCacheBase | NoneOptional cache that skips repeated API calls (see below)
context_sizeintMaximum input tokens per item
max_retriesintMaximum retries per batch on retryable failures
retry_delayfloatSeconds between retry attempts
Valid dimensions values differ per model — each model card pins the supported enum and default via parameter_overrides (e.g. text-embedding-v4 accepts 2048 / 1536 / 1024 / … / 64). See EmbeddingModelCard.

Call Embedding Model

Invoke the model by calling it with a list of inputs. Text-only models accept list[str]; multimodal models also accept DataBlock elements:
async def __call__(
    self,
    inputs: list[str | DataBlock],
    **kwargs: Any,
) -> EmbeddingResponse:
Batching and retry are handled for you:
  1. Inputs are split into chunks of the model’s batch size (10 for DashScope text, 2048 for OpenAI, 100 for Gemini, 512 for Ollama).
  2. All chunks are dispatched concurrently via asyncio.gather.
  3. Each chunk is retried independently up to max_retries times on provider-specific retryable errors.
  4. Results are merged into a single EmbeddingResponse, preserving input order.
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
    )
    response = await model(
        ["What is AgentScope?", "A multi-agent framework."],
    )
    print(len(response.embeddings))     # 2 — one vector per input
    print(len(response.embeddings[0]))  # 1024
    print(response.usage.tokens)        # total tokens consumed
    print(response.source)              # "api" or "cache"

asyncio.run(main())
Each EmbeddingResponse carries:
FieldTypeDescription
embeddingslist[Embedding]One vector per input, in input order
usageEmbeddingUsage | Nonetokens consumed and time elapsed in seconds
source"api" | "cache"Whether the result came from the API or the cache
id / created_at / typestrResponse identity and timestamp; type is always "embedding"

Multimodal Embedding

Multimodal models (DashScopeEmbeddingModel with qwen3-vl-embedding etc., GeminiEmbeddingModel with gemini-embedding-2) accept DataBlock inputs alongside strings — images as URL or base64, videos as URL:
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential
from agentscope.message import DataBlock, URLSource

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-vl-embedding",
    )
    response = await model([
        "A cat sitting on a windowsill",
        DataBlock(
            source=URLSource(
                url="https://example.com/cat.png",
                media_type="image/png",
            ),
        ),
    ])
    print(len(response.embeddings))  # 2 — one vector per input

asyncio.run(main())
Multimodal models replace the plain batch-size split with content-aware batching: inputs are greedily packed into batches that respect the model’s per-request limits on total elements, images, and videos (e.g. qwen3-vl-embedding allows 20 elements / 5 images / 1 video per request, tongyi-embedding-vision-plus allows 20 / 64 / 8). You never need to split inputs yourself.

Embedding Cache

Pass an EmbeddingCacheBase implementation through the embedding_cache argument to reuse previously computed vectors. The built-in FileEmbeddingCache stores each result as a .npy file keyed by the SHA-256 hash of the request:
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel, FileEmbeddingCache
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        embedding_cache=FileEmbeddingCache(
            cache_dir="./.cache/embeddings",
            max_file_number=1000,
            max_cache_size=100,  # MB
        ),
    )
    r1 = await model(["What is AgentScope?"])
    print(r1.source)  # "api" — first call hits the API

    r2 = await model(["What is AgentScope?"])
    print(r2.source)  # "cache" — identical request served locally

asyncio.run(main())
When max_file_number or max_cache_size is exceeded, the oldest files are evicted first. To use a different backend (Redis, SQLite, …), subclass EmbeddingCacheBase and implement its four methods: store, retrieve, remove, and clear.

Custom Embedding Provider

Adding an embedding provider follows the same steps as a chat provider. Override get_embedding_model_class() on your credential (the base implementation returns None, meaning “no embedding support”):
from typing import Type, TYPE_CHECKING
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.embedding import EmbeddingModelBase

class MyProviderCredential(CredentialBase):
    # ... fields and get_chat_model_class() as before ...

    @classmethod
    def get_embedding_model_class(cls) -> Type["EmbeddingModelBase"]:
        from .my_embedding import MyProviderEmbeddingModel
        return MyProviderEmbeddingModel

Step 2: Implement the Embedding Model

Subclass EmbeddingModelBase and implement _call_api for a single batch — batching, concurrency, and retry are inherited from the base class. Declare provider-specific transient errors via _get_retryable_exceptions:
from typing import Any, Type
from agentscope.embedding import EmbeddingModelBase, EmbeddingResponse, EmbeddingUsage

class MyProviderEmbeddingModel(EmbeddingModelBase[str]):
    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: "MyProviderEmbeddingModel.Parameters | None" = None,
        context_size: int = 8192,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters,
            context_size=context_size,
            batch_size=100,          # max items per API call
            max_retries=max_retries,
            retry_delay=retry_delay,
        )

    @classmethod
    def _get_retryable_exceptions(cls) -> tuple[Type[Exception], ...]:
        return (TimeoutError,)       # retried up to max_retries times

    async def _call_api(
        self,
        inputs: list[str],
        **kwargs: Any,
    ) -> EmbeddingResponse:
        # len(inputs) <= self.batch_size is guaranteed.
        # Call your provider's API and return the vectors.
        ...
Bind the generic parameter to the input type your provider supports: EmbeddingModelBase[str] for text-only, EmbeddingModelBase[str | DataBlock] for multimodal — IDEs then surface the correct inputs type to callers.

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your implementation; MyProviderEmbeddingModel.list_models() then picks them up — exactly like chat model cards.

EmbeddingModelCard

EmbeddingModelCard mirrors ModelCard for the frontend, with embedding-specific defaults — the output type application/x-embedding marks a model as producing dense vectors:
FieldDifference from ModelCard
typeAlways "embedding_model"
input_typesDefaults to ["text/plain"]; multimodal cards add image/*, video/*, …
output_typesDefaults to ["application/x-embedding"]
parameter_schemaBuilt from the embedding Parameters class (dimensions) merged with YAML parameter_overrides — same override semantics as chat cards
output_sizeNot present — embedding models have no output token limit
A typical YAML card:
name: text-embedding-v4
label: Text Embedding v4
status: active

input_types:
  - text/plain

output_types:
  - application/x-embedding

context_size: 8192

parameter_overrides:
  dimensions:
    default: 1024
    enum: [2048, 1536, 1024, 768, 512, 256, 128, 64]
Retrieve cards from the model class directly, or discover the class from a credential via get_embedding_model_class():
from agentscope.credential import DashScopeCredential
from agentscope.embedding import OpenAIEmbeddingModel

# Directly on the model class
cards = OpenAIEmbeddingModel.list_models()

# Or discover the class from a credential
embed_cls = DashScopeCredential.get_embedding_model_class()
cards = embed_cls.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")

Realtime Model

Coming soon — we are migrating Realtime Model support from v1.0 to v2.0.