> ## Documentation Index
> Fetch the complete documentation index at: https://docs.agentscope.io/llms.txt
> Use this file to discover all available pages before exploring further.

# Model

> Configure and connect LLM model providers in AgentScope

## Overview

The model layer is organized as a two-tier hierarchy: a **Credential** at the top, and the model families a provider exposes beneath it — **Chat Model**, **TTS**, **Embedding**, and **Realtime Model**.

<Tree>
  <Tree.Folder name="Credential" defaultOpen>
    <Tree.Folder name="ChatModelBase" defaultOpen>
      <Tree.File name="OpenAIChatModel" />

      <Tree.File name="OpenAIResponseModel" />

      <Tree.File name="AnthropicChatModel" />

      <Tree.File name="DashScopeChatModel" />

      <Tree.File name="DeepSeekChatModel" />

      <Tree.File name="GeminiChatModel" />

      <Tree.File name="MoonshotChatModel" />

      <Tree.File name="XAIChatModel" />

      <Tree.File name="OllamaChatModel" />
    </Tree.Folder>

    <Tree.Folder name="TTSModelBase" defaultOpen>
      <Tree.File name="DashScopeTTSModel" />

      <Tree.File name="DashScopeRealtimeTTSModel" />

      <Tree.File name="DashScopeCosyVoiceRealtimeTTSModel" />
    </Tree.Folder>

    <Tree.Folder name="EmbeddingModelBase" defaultOpen>
      <Tree.File name="DashScopeEmbeddingModel" />

      <Tree.File name="OpenAIEmbeddingModel" />

      <Tree.File name="GeminiEmbeddingModel" />

      <Tree.File name="OllamaEmbeddingModel" />
    </Tree.Folder>

    <Tree.Folder name="RealtimeModelBase (coming soon)" />
  </Tree.Folder>
</Tree>

A **Credential** carries the API authentication fields a provider requires (`api_key`, `base_url`, ...). From a credential, you can retrieve the list of available models for each model family that provider supports.

This layering mirrors the natural frontend flow — register a credential first, then pick a model from under it — letting the UI authenticate once and surface every model family the provider supports.

## Chat Model

A chat model is the LLM that drives an agent's conversation and tool calls, accepting and producing multimodal content beyond plain text. AgentScope currently ships the following chat model classes:

| Provider               | Model Class           |
| ---------------------- | --------------------- |
| OpenAI                 | `OpenAIChatModel`     |
| OpenAI (Responses API) | `OpenAIResponseModel` |
| Anthropic              | `AnthropicChatModel`  |
| DashScope              | `DashScopeChatModel`  |
| DeepSeek               | `DeepSeekChatModel`   |
| Gemini                 | `GeminiChatModel`     |
| Moonshot               | `MoonshotChatModel`   |
| xAI                    | `XAIChatModel`        |
| Ollama                 | `OllamaChatModel`     |

### Create Chat Model

Every chat model takes a credential, a model name, and an optional provider-specific `Parameters` object. The three tabs below show typical setups for streaming, tool calling, and reasoning:

<CodeGroup>
  ```python Streaming theme={null}
  import os
  from agentscope.model import DashScopeChatModel
  from agentscope.credential import DashScopeCredential

  model = DashScopeChatModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="qwen-plus",
      stream=True,
  )
  ```

  ```python Tools theme={null}
  import os
  from agentscope.model import DashScopeChatModel
  from agentscope.credential import DashScopeCredential

  model = DashScopeChatModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="qwen-plus",
      stream=False,
      parameters=DashScopeChatModel.Parameters(
          parallel_tool_calls=False,
      ),
  )
  ```

  ```python Reasoning theme={null}
  import os
  from agentscope.model import DashScopeChatModel
  from agentscope.credential import DashScopeCredential

  model = DashScopeChatModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="qwen3-235b-a22b-thinking-2507",
      parameters=DashScopeChatModel.Parameters(
          thinking_enable=True,
          thinking_budget=2048,
      ),
  )
  ```
</CodeGroup>

Common constructor arguments shared by every chat model:

| Argument       | Type                    | Description                                                                                  |
| -------------- | ----------------------- | -------------------------------------------------------------------------------------------- |
| `credential`   | `CredentialBase`        | Provider-specific credential                                                                 |
| `model`        | `str`                   | Model identifier (e.g. `"qwen-plus"`)                                                        |
| `parameters`   | `Parameters \| None`    | Provider-specific parameters such as `temperature`, `thinking_enable`, `parallel_tool_calls` |
| `stream`       | `bool`                  | Whether to stream output                                                                     |
| `max_retries`  | `int`                   | Maximum API retries on failure                                                               |
| `context_size` | `int`                   | Context window used for context compression                                                  |
| `formatter`    | `FormatterBase \| None` | Override message formatter                                                                   |

### Call Chat Model

Invoke the model by calling it with a list of `Msg` objects, plus optional `tools` and `tool_choice`:

```python theme={null}
async def __call__(
    self,
    messages: list[Msg],
    tools: list[dict] | None = None,
    tool_choice: ToolChoice | None = None,
    **kwargs: Any,
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
```

The return type follows the model's `stream` setting:

* **`stream=False`** — awaits a single `ChatResponse` carrying the full output.
* **`stream=True`** — awaits an `AsyncGenerator[ChatResponse, None]`. Intermediate chunks (`is_last=False`) carry only the **delta** generated in that step. So that callers don't have to accumulate deltas themselves, AgentScope appends one final chunk with `is_last=True` that carries the **full accumulated content**.

```python theme={null}
import asyncio
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=True,
    )
    msgs = [UserMsg(name="user", content="Count from 1 to 5.")]

    async for chunk in await model(msgs):
        if chunk.is_last:
            print("Final:", chunk.content)   # full accumulated content
        else:
            print("Delta:", chunk.content)   # delta only

asyncio.run(main())
```

A representative streaming trace, illustrating the delta-then-accumulated pattern:

```
Delta: [TextBlock(text='1')]
Delta: [TextBlock(text=', 2,')]
Delta: [TextBlock(text=' 3, ')]
Delta: [TextBlock(text='4, 5')]
Final: [TextBlock(text='1, 2, 3, 4, 5')]
```

Each `ChatResponse` carries content blocks (`TextBlock`, `ThinkingBlock`, `ToolCallBlock`, `DataBlock`), an `is_last` flag, and a `ChatUsage` recording token counts and elapsed time.

### Generate Structured Output

When you need a JSON object that conforms to a Pydantic model or JSON schema, call `generate_structured_output` instead of `__call__`. It returns a `StructuredResponse` whose `content` is a validated dict matching the schema:

```python theme={null}
import asyncio
import os
from pydantic import BaseModel
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

class WeatherInfo(BaseModel):
    city: str
    temperature: float
    unit: str

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=False,
    )
    response = await model.generate_structured_output(
        messages=[UserMsg(name="user", content="What's the weather in Shanghai?")],
        structured_model=WeatherInfo,
    )
    print(response.content)  # validated dict matching WeatherInfo

asyncio.run(main())
```

<Info>
  `generate_structured_output` synthesizes a forced tool call from the schema, then validates and repairs the model's response.
</Info>

### Formatter

A formatter translates AgentScope's `Msg` objects into the `list[dict]` payload that each provider's API expects. It is configured via the optional `formatter` argument on the chat model constructor. Every provider ships two built-in variants:

| Variant                     | Use Case                                                                                                                                                                                                            |
| --------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **ChatFormatter** (default) | Standard single-agent dialog. Each `Msg` maps 1:1 to an API message, preserving native roles (`user`, `assistant`, `system`).                                                                                       |
| **MultiAgentFormatter**     | Multi-agent scenarios such as debate or moderation. Consecutive agent messages are grouped and wrapped in `<history>` tags with the sender's name, while tool call / result sequences keep their native API format. |

Switch to multi-agent mode by passing the MultiAgent variant — no agent code changes are required:

```python theme={null}
import os
from agentscope.model import OpenAIChatModel
from agentscope.credential import OpenAICredential
from agentscope.formatter import OpenAIMultiAgentFormatter

model = OpenAIChatModel(
    credential=OpenAICredential(api_key=os.environ["OPENAI_API_KEY"]),
    model="gpt-4.1",
    formatter=OpenAIMultiAgentFormatter(),
)
```

For non-standard payload shapes (e.g. a provider whose API doesn't follow the OpenAI or Anthropic conventions), subclass `FormatterBase` and pass an instance through the same `formatter` argument.

### Custom Provider

You can extend AgentScope with your own model provider by implementing a credential and a chat model, then registering the credential.

#### Step 1: Define the Credential

Subclass `CredentialBase` with a unique `type` discriminator and implement `get_chat_model_class()`:

```python theme={null}
from typing import Literal, Type, TYPE_CHECKING
from pydantic import ConfigDict, Field, SecretStr
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.model import ChatModelBase

class MyProviderCredential(CredentialBase):
    model_config = ConfigDict(title="My Provider API")
    type: Literal["my_provider_credential"] = "my_provider_credential"

    api_key: SecretStr = Field(description="API key for My Provider.")
    base_url: str = Field(default="https://api.myprovider.com/v1")

    @classmethod
    def get_chat_model_class(cls) -> Type["ChatModelBase"]:
        from .my_model import MyProviderChatModel
        return MyProviderChatModel
```

#### Step 2: Implement the Chat Model

Subclass `ChatModelBase`, define a `Parameters` inner class, and implement `_call_api`:

```python theme={null}
from typing import Literal, Any, AsyncGenerator
from pydantic import BaseModel, Field
from agentscope.model import ChatModelBase, ChatResponse
from agentscope.message import Msg
from agentscope.tool import ToolChoice
from agentscope.formatter import FormatterBase, OpenAIChatFormatter

class MyProviderChatModel(ChatModelBase):
    class Parameters(BaseModel):
        max_tokens: int | None = Field(default=None, gt=0)
        temperature: float | None = Field(default=None, ge=0, le=2)

    type: Literal["my_provider_chat"] = "my_provider_chat"

    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: Parameters | None = None,
        stream: bool = True,
        max_retries: int = 3,
        context_size: int = 128000,
        formatter: FormatterBase | None = None,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters or self.Parameters(),
            stream=stream,
            max_retries=max_retries,
            context_size=context_size,
        )
        # If your API follows the OpenAI format, reuse OpenAIChatFormatter;
        # otherwise implement your own FormatterBase subclass.
        self.formatter = formatter or OpenAIChatFormatter()

    async def _call_api(
        self,
        model_name: str,
        messages: list[Msg],
        tools: list[dict] | None = None,
        tool_choice: ToolChoice | None = None,
        **kwargs: Any,
    ) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
        formatted_messages = await self.formatter.format(messages)
        # Call your provider's API using self.credential.api_key, etc.
        ...
```

#### Step 3: Add Model Cards (optional)

Drop YAML files into a `_models/` directory next to your model implementation. Each file describes one model — its capabilities (`input_types`, `output_types`), limits (`context_size`, `output_size`), and any per-model `parameter_overrides`:

```yaml theme={null}
name: my-model-v1
label: My Model V1
status: active
input_types:
  - text/plain
output_types:
  - text/plain
context_size: 128000
output_size: 4096
parameter_overrides:
  max_tokens: {"maximum": 4096}
```

`MyProviderChatModel.list_models()` then loads every YAML in that directory. To pull cards from a different location — for example, a registry your application maintains separately — pass `custom_yaml_dir`:

```python theme={null}
cards = MyProviderChatModel.list_models(custom_yaml_dir="/path/to/cards")
```

## Integrate with Frontend

### What is ModelCard

`ModelCard` is a declarative description of a model's capabilities and constraints, designed to drive the frontend — model selectors, parameter forms, and feature toggles can be rendered dynamically without hardcoding any provider-specific knowledge.

Each `ModelCard` contains:

| Field                  | Type                                   | Description                                                                                                                                |
| ---------------------- | -------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------ |
| `name`                 | `str`                                  | Model identifier (e.g. `"claude-sonnet-4-6"`)                                                                                              |
| `label`                | `str`                                  | Human-readable display name (e.g. `"Claude Sonnet 4.6"`)                                                                                   |
| `status`               | `"active" \| "deprecated" \| "sunset"` | Model lifecycle status                                                                                                                     |
| `input_types`          | `list[str]`                            | Accepted input MIME types — used by the frontend to filter attachment uploads (e.g. only show an image button when `image/*` is supported) |
| `output_types`         | `list[str]`                            | Output MIME types the model can produce — advertises capabilities such as a thinking toggle when `application/x-thinking` is present       |
| `context_size`         | `int`                                  | Maximum context window in tokens                                                                                                           |
| `output_size`          | `int`                                  | Maximum output tokens                                                                                                                      |
| `parameter_schema`     | `dict`                                 | Final JSON Schema for the parameter form — base schema merged with per-model overrides (see below)                                         |
| `parameters_overrides` | `dict[str, dict]`                      | The raw per-model overrides, before merging                                                                                                |

`input_types` and `output_types` use MIME types to describe modality. Common values:

| MIME Type                                  | Meaning                      |
| ------------------------------------------ | ---------------------------- |
| `text/plain`                               | Text                         |
| `application/x-thinking`                   | Reasoning / chain-of-thought |
| `image/*` (e.g. `image/png`, `image/jpeg`) | Image                        |
| `audio/*` (e.g. `audio/wav`, `audio/mp3`)  | Audio                        |
| `video/*` (e.g. `video/mp4`)               | Video                        |

A typical YAML card for `claude-sonnet-4-6`:

```yaml theme={null}
name: claude-sonnet-4-6
label: Claude Sonnet 4.6
status: active

input_types:
  - text/plain
  - image/jpeg
  - image/png
  - image/gif
  - image/webp

output_types:
  - text/plain
  - application/x-thinking

context_size: 1000000
output_size: 65536

parameter_overrides:
  max_tokens: {"maximum": 65536}
```

#### Parameter schema and overrides

The `parameter_schema` exposed to the frontend is built in two layers:

1. **Base schema** — auto-derived from the chat model's `Parameters` class via `model_json_schema()`. This lists every adjustable parameter (`temperature`, `max_tokens`, `thinking_enable`, ...) along with its type and the API-wide range.
2. **Per-model overrides** — the YAML's `parameter_overrides` block is merged on top, field by field.

Overrides matter because adjustable ranges are not uniform across an API: every Qwen model accepts `max_tokens`, but each one has a different ceiling. Overrides let a card tighten a range, pin a default, or hide a parameter that doesn't apply.

| Override syntax           | Effect                                                                  |
| ------------------------- | ----------------------------------------------------------------------- |
| `param: { ... }`          | Shallow-merge into the base field (e.g. `max_tokens: {maximum: 16384}`) |
| `param: { hidden: true }` | Hide the parameter from the frontend                                    |
| `param: null`             | Remove the parameter entirely                                           |

### Retrieve ModelCards

You retrieve model cards by calling `list_models()` on either the credential class or the model class. Internally, `CredentialBase.list_models()` delegates to its linked `ChatModelBase` subclass (obtained via `get_chat_model_class()`), which loads YAML card definitions from its `_models/` directory.

```python theme={null}
from agentscope.credential import DashScopeCredential
from agentscope.model import AnthropicChatModel

# Via credential class
cards = DashScopeCredential.list_models()

# Or directly on the model class
cards = AnthropicChatModel.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")
```

The credential's `get_chat_model_class()` returns the corresponding `ChatModelBase` subclass, which in turn knows where to find its model card YAML files:

```python theme={null}
model_cls = DashScopeCredential.get_chat_model_class()  # -> DashScopeChatModel
cards = model_cls.list_models()                          # -> list[ModelCard]
```

This design allows the frontend to discover available models, their capabilities, and valid parameter ranges — all from a single credential, without any hardcoded provider logic.

## TTS

A **TTS Model** converts text into synthesized speech audio, supporting both standard and realtime (streaming-input) synthesis modes. AgentScope currently ships the following TTS model classes:

| Provider                       | Model Class                          | Highlights                                                             |
| ------------------------------ | ------------------------------------ | ---------------------------------------------------------------------- |
| DashScope                      | `DashScopeTTSModel`                  | Qwen3-TTS, multiple voices, streaming output                           |
| DashScope (Realtime)           | `DashScopeRealtimeTTSModel`          | Qwen3-TTS WebSocket streaming input, ideal for LLM output piping       |
| DashScope (CosyVoice Realtime) | `DashScopeCosyVoiceRealtimeTTSModel` | CosyVoice-v3 streaming input, supports cosyvoice-v3-plus/flash/sambert |

### Create TTS Model

Every TTS model takes a credential, a model name, and an optional provider-specific `Parameters` object. The two tabs below show the standard and realtime setups:

<CodeGroup>
  ```python Non-Realtime (Standard) theme={null}
  import os
  from agentscope.tts import DashScopeTTSModel
  from agentscope.credential import DashScopeCredential

  tts = DashScopeTTSModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="qwen3-tts-flash",
      parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
      stream=True,
  )
  ```

  ```python Realtime (Qwen3 Streaming Input) theme={null}
  import os
  from agentscope.tts import DashScopeRealtimeTTSModel
  from agentscope.credential import DashScopeCredential

  tts = DashScopeRealtimeTTSModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="qwen3-tts-flash-realtime",
      parameters=DashScopeRealtimeTTSModel.Parameters(voice="Serena"),
      stream=True,
  )
  ```

  ```python Realtime (CosyVoice Streaming Input) theme={null}
  import os
  from agentscope.tts import DashScopeCosyVoiceRealtimeTTSModel
  from agentscope.credential import DashScopeCredential

  tts = DashScopeCosyVoiceRealtimeTTSModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="cosyvoice-v3-plus",
      parameters=DashScopeCosyVoiceRealtimeTTSModel.Parameters(voice="longanyang"),
      stream=True,
  )
  ```
</CodeGroup>

Common constructor arguments shared by every TTS model:

| Argument     | Type                 | Description                                  |
| ------------ | -------------------- | -------------------------------------------- |
| `credential` | `CredentialBase`     | Provider-specific credential                 |
| `model`      | `str`                | Model identifier (e.g. `"qwen3-tts-flash"`)  |
| `parameters` | `Parameters \| None` | Provider-specific parameters such as `voice` |
| `stream`     | `bool`               | Whether to stream audio output               |

Additional arguments for `DashScopeRealtimeTTSModel` and `DashScopeCosyVoiceRealtimeTTSModel`:

| Argument            | Type          | Default | Description                                                        |
| ------------------- | ------------- | ------- | ------------------------------------------------------------------ |
| `cold_start_length` | `int \| None` | `None`  | Minimum character count before first text chunk is sent to the API |
| `cold_start_words`  | `int \| None` | `None`  | Minimum word count before first text chunk is sent                 |
| `max_retries`       | `int`         | `3`     | Maximum retry attempts on WebSocket failure                        |
| `retry_delay`       | `float`       | `5.0`   | Initial retry delay in seconds (exponential backoff)               |

### Call TTS Model

Invoke the model by calling `synthesize()` with the text to speak:

```python theme={null}
async def synthesize(
    self,
    text: str | None = None,
    **kwargs: Any,
) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
```

The return type follows the model's `stream` setting:

* **`stream=False`** — returns a single `TTSResponse` with the complete audio.
* **`stream=True`** — returns an `AsyncGenerator[TTSResponse, None]`. Each chunk carries an incremental audio delta; the final chunk has `is_last=True`.

Each `TTSResponse` carries:

| Field      | Type                | Description                                                                                                |
| ---------- | ------------------- | ---------------------------------------------------------------------------------------------------------- |
| `content`  | `DataBlock \| None` | Audio data. Format indicated by `content.source.media_type` (e.g. `"audio/wav"`, `"audio/pcm;rate=24000"`) |
| `is_last`  | `bool`              | `True` on the final streaming chunk                                                                        |
| `usage`    | `TTSUsage \| None`  | Token counts (`input_tokens`, `output_tokens`) and elapsed `time` in seconds                               |
| `id`       | `str`               | Auto-generated unique identifier                                                                           |
| `metadata` | `dict \| None`      | Optional provider-specific metadata                                                                        |

```python theme={null}
import asyncio
import os
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash",
        parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    # Streaming synthesis
    async for chunk in await tts.synthesize("Hello, world!"):
        if chunk.content:
            # chunk.content is a DataBlock with base64-encoded audio/wav
            print(f"Audio chunk: {len(chunk.content.source.data)} bytes")

asyncio.run(main())
```

### Realtime TTS (Streaming Input)

For realtime models (`DashScopeRealtimeTTSModel` and `DashScopeCosyVoiceRealtimeTTSModel`), text can be pushed incrementally as it arrives from a streaming LLM. Both share the same `push()` / `synthesize()` interface. The lifecycle is managed via `async with` or manual `connect()` / `close()`:

<Note>
  `DashScopeRealtimeTTSModel` (Qwen3) produces audio at token-level granularity — each `push()` call typically returns audio data. In contrast, `DashScopeCosyVoiceRealtimeTTSModel` relies on the CosyVoice server which automatically segments text into sentences before synthesizing. Audio is only returned after a complete sentence boundary is detected, so `push()` may return empty responses for partial sentences. Calling `synthesize()` forces synthesis of all remaining text including incomplete sentences.
</Note>

```python theme={null}
import asyncio
import os
from agentscope.tts import DashScopeRealtimeTTSModel
from agentscope.credential import DashScopeCredential

async def main():
    tts = DashScopeRealtimeTTSModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-tts-flash-realtime",
        parameters=DashScopeRealtimeTTSModel.Parameters(voice="Cherry"),
        stream=True,
    )

    async with tts:
        # Push text incrementally as it arrives from a streaming LLM.
        # Each push() returns a TTSResponse with audio accumulated so far
        # (or content=None if not yet available).
        resp1 = await tts.push("Hello, ")
        if resp1.content:
            print("Audio available after first push")

        resp2 = await tts.push("how are you today?")
        if resp2.content:
            print("Audio available after second push")

        # Finalize: flush remaining buffered text and collect final audio.
        # text= is optional — pass extra text to append before finalizing,
        # or omit to finalize previously pushed text only.
        response = await tts.synthesize()

asyncio.run(main())
```

| Method         | Description                                                                |
| -------------- | -------------------------------------------------------------------------- |
| `connect()`    | Open WebSocket connection                                                  |
| `push(text)`   | Append text incrementally (non-blocking), returns audio accumulated so far |
| `synthesize()` | Finalize and return remaining audio                                        |
| `close()`      | Tear down connection                                                       |

### Integrate with Agent

In the agent layer, TTS is integrated via [`TTSMiddleware`](/versions/2.0.3/en/building-blocks/middleware#ttsmiddleware) — it intercepts the agent's text output and synthesizes speech automatically:

```python theme={null}
from agentscope.agent import Agent
from agentscope.middleware import TTSMiddleware
from agentscope.tts import DashScopeTTSModel
from agentscope.credential import DashScopeCredential

agent = Agent(
    name="assistant",
    model=chat_model,
    middlewares=[
        TTSMiddleware(
            DashScopeTTSModel(
                credential=DashScopeCredential(api_key="..."),
                model="qwen3-tts-flash",
                parameters=DashScopeTTSModel.Parameters(voice="Cherry"),
                stream=True,
            ),
        ),
    ],
)

# The agent's reply stream now includes audio events
async for event in agent.reply_stream(user_msg):
    # TextBlockDeltaEvent — text content
    # DataBlockDeltaEvent — audio content (WAV)
    ...
```

The middleware automatically selects the optimal synthesis strategy:

| TTS Mode     | Middleware Behavior                                                |
| ------------ | ------------------------------------------------------------------ |
| Non-realtime | Waits for full text, then synthesizes all at once                  |
| Realtime     | Pushes text deltas as they arrive, streams audio back concurrently |

### TTS Model Card

`TTSModelCard` describes a TTS model's capabilities — available voices, streaming support, and parameter ranges — and is used to drive the frontend model picker. Each card is defined by a YAML file alongside the model implementation:

```yaml Qwen3 TTS theme={null}
name: qwen3-tts-flash
label: Qwen3-TTS-Flash
status: active
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - Cherry
  - Serena
  - Ethan
  - Chelsie
parameter_overrides: {}
```

```yaml CosyVoice Realtime theme={null}
name: cosyvoice-v3-plus
label: CosyVoice-v3-Plus
status: active
realtime: true
input_types:
  - text/plain
output_types:
  - audio/wav
voices:
  - longanyang
  - longxiaochun
  - longshuo
parameter_overrides: {}
```

The `voices` list is automatically injected into the `parameter_schema` as an enum constraint on the `voice` field, so the frontend renders a dropdown selector.

| Field                  | Type        | Description                                                                                                            |
| ---------------------- | ----------- | ---------------------------------------------------------------------------------------------------------------------- |
| `name`                 | `str`       | Model identifier (e.g. `"qwen3-tts-flash"`)                                                                            |
| `label`                | `str`       | Display name (e.g. `"Qwen3-TTS-Flash"`)                                                                                |
| `status`               | `str`       | `"active"`, `"deprecated"`, or `"sunset"`                                                                              |
| `realtime`             | `bool`      | Whether model supports streaming input                                                                                 |
| `input_types`          | `list[str]` | Accepted input MIME types (always `["text/plain"]`)                                                                    |
| `output_types`         | `list[str]` | Output MIME types (typically `["audio/wav"]`)                                                                          |
| `parameter_schema`     | `dict`      | Merged JSON Schema for the parameter form — base schema from `Parameters` class, enriched with `voices` enum from YAML |
| `parameters_overrides` | `dict`      | Per-model overrides (same syntax as chat model cards)                                                                  |

Retrieve TTS model cards via the credential:

```python theme={null}
from agentscope.credential import DashScopeCredential

cards = DashScopeCredential.list_tts_models()
for card in cards:
    print(f"{card.name} (realtime={card.realtime}): {card.label}")
```

Or directly on the model class:

```python theme={null}
from agentscope.tts import DashScopeTTSModel, DashScopeCosyVoiceRealtimeTTSModel

# Qwen3 TTS models
cards = DashScopeTTSModel.list_models()

# CosyVoice Realtime models
cosyvoice_cards = DashScopeCosyVoiceRealtimeTTSModel.list_models()
```

### Custom TTS Provider

To add a new TTS provider, implement a `TTSModelBase` subclass and register it on the credential:

```python theme={null}
from typing import Literal, Type, TYPE_CHECKING, AsyncGenerator, Any
from pydantic import BaseModel, Field
from agentscope.tts import TTSModelBase, TTSResponse
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.tts import TTSModelBase as TTSBase

class MyTTSModel(TTSModelBase):
    class Parameters(BaseModel):
        voice: str = Field(default="default", title="Voice")

    type: Literal["my_tts"] = "my_tts"

    async def synthesize(
        self, text: str | None = None, **kwargs: Any
    ) -> TTSResponse | AsyncGenerator[TTSResponse, None]:
        # Call your provider's API here
        ...

# Register on your credential
class MyCredential(CredentialBase):
    @classmethod
    def get_tts_model_classes(cls) -> list[Type["TTSBase"]]:
        return [MyTTSModel]
```

## Embedding

An **Embedding Model** converts text — and, for multimodal models, images, videos, and other media — into dense vectors that power semantic search, RAG, and memory retrieval. AgentScope currently ships the following embedding model classes:

| Provider  | Model Class               | Highlights                                                                                             |
| --------- | ------------------------- | ------------------------------------------------------------------------------------------------------ |
| DashScope | `DashScopeEmbeddingModel` | Unified text + multimodal API (`text-embedding-v4`, `qwen3-vl-embedding`, ...), content-aware batching |
| OpenAI    | `OpenAIEmbeddingModel`    | `text-embedding-3-small/large`, compatible with OpenAI-compatible endpoints                            |
| Gemini    | `GeminiEmbeddingModel`    | Text (`gemini-embedding-001`) and multimodal (`gemini-embedding-2`, image / video / audio / PDF)       |
| Ollama    | `OllamaEmbeddingModel`    | Local embedding models (`nomic-embed-text`, ...), credential carries the host URL                      |

### Create Embedding Model

Every embedding model takes a credential, a model name, and an optional `Parameters` object — the same pattern as chat models. `Parameters` carries `dimensions`, the output vector size:

<CodeGroup>
  ```python DashScope theme={null}
  import os
  from agentscope.embedding import DashScopeEmbeddingModel
  from agentscope.credential import DashScopeCredential

  model = DashScopeEmbeddingModel(
      credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
      model="text-embedding-v4",
      parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
  )
  ```

  ```python OpenAI theme={null}
  import os
  from agentscope.embedding import OpenAIEmbeddingModel
  from agentscope.credential import OpenAICredential

  model = OpenAIEmbeddingModel(
      credential=OpenAICredential(api_key=os.environ["OPENAI_API_KEY"]),
      model="text-embedding-3-small",
      parameters=OpenAIEmbeddingModel.Parameters(dimensions=1536),
  )
  ```

  ```python Gemini theme={null}
  import os
  from agentscope.embedding import GeminiEmbeddingModel
  from agentscope.credential import GeminiCredential

  model = GeminiEmbeddingModel(
      credential=GeminiCredential(api_key=os.environ["GEMINI_API_KEY"]),
      model="gemini-embedding-001",
      parameters=GeminiEmbeddingModel.Parameters(dimensions=768),
  )
  ```

  ```python Ollama theme={null}
  from agentscope.embedding import OllamaEmbeddingModel
  from agentscope.credential import OllamaCredential

  model = OllamaEmbeddingModel(
      credential=OllamaCredential(host="http://localhost:11434"),
      model="nomic-embed-text",
  )
  ```
</CodeGroup>

Common constructor arguments shared by every embedding model:

| Argument          | Type                         | Description                                              |
| ----------------- | ---------------------------- | -------------------------------------------------------- |
| `credential`      | `CredentialBase`             | Provider-specific credential                             |
| `model`           | `str`                        | Model identifier (e.g. `"text-embedding-v4"`)            |
| `parameters`      | `Parameters \| None`         | `dimensions` — the output vector size (default `512`)    |
| `embedding_cache` | `EmbeddingCacheBase \| None` | Optional cache that skips repeated API calls (see below) |
| `context_size`    | `int`                        | Maximum input tokens per item                            |
| `max_retries`     | `int`                        | Maximum retries per batch on retryable failures          |
| `retry_delay`     | `float`                      | Seconds between retry attempts                           |

<Info>
  Valid `dimensions` values differ per model — each model card pins the supported `enum` and default via `parameter_overrides` (e.g. `text-embedding-v4` accepts 2048 / 1536 / 1024 / ... / 64). See [EmbeddingModelCard](#embeddingmodelcard).
</Info>

### Call Embedding Model

Invoke the model by calling it with a list of inputs. Text-only models accept `list[str]`; multimodal models also accept `DataBlock` elements:

```python theme={null}
async def __call__(
    self,
    inputs: list[str | DataBlock],
    **kwargs: Any,
) -> EmbeddingResponse:
```

Batching and retry are handled for you:

1. Inputs are split into chunks of the model's batch size (10 for DashScope text, 2048 for OpenAI, 100 for Gemini, 512 for Ollama).
2. All chunks are dispatched **concurrently** via `asyncio.gather`.
3. Each chunk is retried independently up to `max_retries` times on provider-specific retryable errors.
4. Results are merged into a single `EmbeddingResponse`, preserving input order.

```python theme={null}
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
    )
    response = await model(
        ["What is AgentScope?", "A multi-agent framework."],
    )
    print(len(response.embeddings))     # 2 — one vector per input
    print(len(response.embeddings[0]))  # 1024
    print(response.usage.tokens)        # total tokens consumed
    print(response.source)              # "api" or "cache"

asyncio.run(main())
```

Each `EmbeddingResponse` carries:

| Field                        | Type                     | Description                                                     |
| ---------------------------- | ------------------------ | --------------------------------------------------------------- |
| `embeddings`                 | `list[Embedding]`        | One vector per input, in input order                            |
| `usage`                      | `EmbeddingUsage \| None` | `tokens` consumed and `time` elapsed in seconds                 |
| `source`                     | `"api" \| "cache"`       | Whether the result came from the API or the cache               |
| `id` / `created_at` / `type` | `str`                    | Response identity and timestamp; `type` is always `"embedding"` |

#### Multimodal Embedding

Multimodal models (`DashScopeEmbeddingModel` with `qwen3-vl-embedding` etc., `GeminiEmbeddingModel` with `gemini-embedding-2`) accept `DataBlock` inputs alongside strings — images as URL or base64, videos as URL:

```python theme={null}
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential
from agentscope.message import DataBlock, URLSource

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-vl-embedding",
    )
    response = await model([
        "A cat sitting on a windowsill",
        DataBlock(
            source=URLSource(
                url="https://example.com/cat.png",
                media_type="image/png",
            ),
        ),
    ])
    print(len(response.embeddings))  # 2 — one vector per input

asyncio.run(main())
```

<Info>
  Multimodal models replace the plain batch-size split with **content-aware batching**: inputs are greedily packed into batches that respect the model's per-request limits on total elements, images, and videos (e.g. `qwen3-vl-embedding` allows 20 elements / 5 images / 1 video per request, `tongyi-embedding-vision-plus` allows 20 / 64 / 8). You never need to split inputs yourself.
</Info>

### Embedding Cache

Pass an `EmbeddingCacheBase` implementation through the `embedding_cache` argument to reuse previously computed vectors. The built-in `FileEmbeddingCache` stores each result as a `.npy` file keyed by the SHA-256 hash of the request:

```python theme={null}
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel, FileEmbeddingCache
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        embedding_cache=FileEmbeddingCache(
            cache_dir="./.cache/embeddings",
            max_file_number=1000,
            max_cache_size=100,  # MB
        ),
    )
    r1 = await model(["What is AgentScope?"])
    print(r1.source)  # "api" — first call hits the API

    r2 = await model(["What is AgentScope?"])
    print(r2.source)  # "cache" — identical request served locally

asyncio.run(main())
```

When `max_file_number` or `max_cache_size` is exceeded, the oldest files are evicted first. To use a different backend (Redis, SQLite, ...), subclass `EmbeddingCacheBase` and implement its four methods: `store`, `retrieve`, `remove`, and `clear`.

### Custom Embedding Provider

Adding an embedding provider follows the same steps as a chat provider.

#### Step 1: Link the Credential

Override `get_embedding_model_class()` on your credential (the base implementation returns `None`, meaning "no embedding support"):

```python theme={null}
from typing import Type, TYPE_CHECKING
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.embedding import EmbeddingModelBase

class MyProviderCredential(CredentialBase):
    # ... fields and get_chat_model_class() as before ...

    @classmethod
    def get_embedding_model_class(cls) -> Type["EmbeddingModelBase"]:
        from .my_embedding import MyProviderEmbeddingModel
        return MyProviderEmbeddingModel
```

#### Step 2: Implement the Embedding Model

Subclass `EmbeddingModelBase` and implement `_call_api` for a **single batch** — batching, concurrency, and retry are inherited from the base class. Declare provider-specific transient errors via `_get_retryable_exceptions`:

```python theme={null}
from typing import Any, Type
from agentscope.embedding import EmbeddingModelBase, EmbeddingResponse, EmbeddingUsage

class MyProviderEmbeddingModel(EmbeddingModelBase[str]):
    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: "MyProviderEmbeddingModel.Parameters | None" = None,
        context_size: int = 8192,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters,
            context_size=context_size,
            batch_size=100,          # max items per API call
            max_retries=max_retries,
            retry_delay=retry_delay,
        )

    @classmethod
    def _get_retryable_exceptions(cls) -> tuple[Type[Exception], ...]:
        return (TimeoutError,)       # retried up to max_retries times

    async def _call_api(
        self,
        inputs: list[str],
        **kwargs: Any,
    ) -> EmbeddingResponse:
        # len(inputs) <= self.batch_size is guaranteed.
        # Call your provider's API and return the vectors.
        ...
```

Bind the generic parameter to the input type your provider supports: `EmbeddingModelBase[str]` for text-only, `EmbeddingModelBase[str | DataBlock]` for multimodal — IDEs then surface the correct `inputs` type to callers.

#### Step 3: Add Model Cards (optional)

Drop YAML files into a `_models/` directory next to your implementation; `MyProviderEmbeddingModel.list_models()` then picks them up — exactly like chat model cards.

### EmbeddingModelCard

`EmbeddingModelCard` mirrors `ModelCard` for the frontend, with embedding-specific defaults — the output type `application/x-embedding` marks a model as producing dense vectors:

| Field              | Difference from `ModelCard`                                                                                                               |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------- |
| `type`             | Always `"embedding_model"`                                                                                                                |
| `input_types`      | Defaults to `["text/plain"]`; multimodal cards add `image/*`, `video/*`, ...                                                              |
| `output_types`     | Defaults to `["application/x-embedding"]`                                                                                                 |
| `parameter_schema` | Built from the embedding `Parameters` class (`dimensions`) merged with YAML `parameter_overrides` — same override semantics as chat cards |
| `output_size`      | Not present — embedding models have no output token limit                                                                                 |

A typical YAML card:

```yaml theme={null}
name: text-embedding-v4
label: Text Embedding v4
status: active

input_types:
  - text/plain

output_types:
  - application/x-embedding

context_size: 8192

parameter_overrides:
  dimensions:
    default: 1024
    enum: [2048, 1536, 1024, 768, 512, 256, 128, 64]
```

Retrieve cards from the model class directly, or discover the class from a credential via `get_embedding_model_class()`:

```python theme={null}
from agentscope.credential import DashScopeCredential
from agentscope.embedding import OpenAIEmbeddingModel

# Directly on the model class
cards = OpenAIEmbeddingModel.list_models()

# Or discover the class from a credential
embed_cls = DashScopeCredential.get_embedding_model_class()
cards = embed_cls.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")
```

## Realtime Model

<Tip>
  Coming soon — we are migrating Realtime Model support from v1.0 to v2.0.
</Tip>