Skip to main content

Overview

The model layer is organized as a two-tier hierarchy: a Credential at the top, and the model families a provider exposes beneath it — Chat Model, TTS, Embedding, and Realtime Model.
Credential
ChatModelBase
OpenAIChatModel
OpenAIResponseModel
AnthropicChatModel
DashScopeChatModel
DeepSeekChatModel
GeminiChatModel
MoonshotChatModel
XAIChatModel
OllamaChatModel
TTSModelBase (coming soon)
EmbeddingModelBase
DashScopeEmbeddingModel
OpenAIEmbeddingModel
GeminiEmbeddingModel
OllamaEmbeddingModel
RealtimeModelBase (coming soon)
A Credential carries the API authentication fields a provider requires (api_key, base_url, …). From a credential, you can retrieve the list of available models for each model family that provider supports. This layering mirrors the natural frontend flow — register a credential first, then pick a model from under it — letting the UI authenticate once and surface every model family the provider supports.

Chat Model

A Chat Model is the LLM that drives an agent’s conversation and tool calls, accepting and producing multimodal content beyond plain text. AgentScope currently ships the following chat model classes:
ProviderModel ClassHighlights
OpenAIOpenAIChatModelChat Completions API, compatible with vLLM and OpenAI-compatible endpoints
OpenAI (Responses)OpenAIResponseModelResponses API with native reasoning support (o3, o4-mini)
AnthropicAnthropicChatModelClaude models with extended thinking and prompt caching
DashScopeDashScopeChatModelQwen models, multimodal (vision/audio/video), reasoning
DeepSeekDeepSeekChatModelOpenAI-compatible with prompt cache hit tokens
GeminiGeminiChatModelGoogle Gemini models with multimodal support
MoonshotMoonshotChatModelKimi models (OpenAI-compatible)
xAIXAIChatModelGrok models with native reasoning effort
OllamaOllamaChatModelLocal LLM hosting, credential is optional

Create Chat Model

Every chat model takes a credential, a model name, and an optional provider-specific Parameters object. The three tabs below show typical setups for streaming, tool calling, and reasoning:
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential

model = DashScopeChatModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="qwen-plus",
    stream=True,
)
Common constructor arguments shared by every chat model:
ArgumentTypeDescription
credentialCredentialBaseProvider-specific credential
modelstrModel identifier (e.g. "qwen-plus")
parametersParameters | NoneProvider-specific parameters such as temperature, thinking_enable, parallel_tool_calls
streamboolWhether to stream output
max_retriesintMaximum API retries on failure
context_sizeintContext window used for context compression
formatterFormatterBase | NoneOverride message formatter

Call Chat Model

Invoke the model by calling it with a list of Msg objects, plus optional tools and tool_choice:
async def __call__(
    self,
    messages: list[Msg],
    tools: list[dict] | None = None,
    tool_choice: ToolChoice | None = None,
    **kwargs: Any,
) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
The return type follows the model’s stream setting:
  • stream=False — awaits a single ChatResponse carrying the full output.
  • stream=True — awaits an AsyncGenerator[ChatResponse, None]. Intermediate chunks (is_last=False) carry only the delta generated in that step. So that callers don’t have to accumulate deltas themselves, AgentScope appends one final chunk with is_last=True that carries the full accumulated content.
import asyncio
import os
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=True,
    )
    msgs = [UserMsg(name="user", content="Count from 1 to 5.")]

    async for chunk in await model(msgs):
        if chunk.is_last:
            print("Final:", chunk.content)   # full accumulated content
        else:
            print("Delta:", chunk.content)   # delta only

asyncio.run(main())
A representative streaming trace, illustrating the delta-then-accumulated pattern:
Delta: [TextBlock(text='1')]
Delta: [TextBlock(text=', 2,')]
Delta: [TextBlock(text=' 3, ')]
Delta: [TextBlock(text='4, 5')]
Final: [TextBlock(text='1, 2, 3, 4, 5')]
Each ChatResponse carries content blocks (TextBlock, ThinkingBlock, ToolCallBlock, DataBlock), an is_last flag, and a ChatUsage recording token counts and elapsed time.

Generate Structured Output

When you need a JSON object that conforms to a Pydantic model or JSON schema, call generate_structured_output instead of __call__. It returns a StructuredResponse whose content is a validated dict matching the schema:
import asyncio
import os
from pydantic import BaseModel
from agentscope.model import DashScopeChatModel
from agentscope.credential import DashScopeCredential
from agentscope.message import UserMsg

class WeatherInfo(BaseModel):
    city: str
    temperature: float
    unit: str

async def main():
    model = DashScopeChatModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen-plus",
        stream=False,
    )
    response = await model.generate_structured_output(
        messages=[UserMsg(name="user", content="What's the weather in Shanghai?")],
        structured_model=WeatherInfo,
    )
    print(response.content)  # validated dict matching WeatherInfo

asyncio.run(main())
generate_structured_output synthesizes a forced tool call from the schema, then validates and repairs the model’s response.

Formatter

A formatter translates AgentScope’s Msg objects into the list[dict] payload that each provider’s API expects. It is configured via the optional formatter argument on the chat model constructor. Every provider ships two built-in variants:
VariantUse Case
ChatFormatter (default)Standard single-agent dialog. Each Msg maps 1:1 to an API message, preserving native roles (user, assistant, system).
MultiAgentFormatterMulti-agent scenarios such as debate or moderation. Consecutive agent messages are grouped and wrapped in <history> tags with the sender’s name, while tool call / result sequences keep their native API format.
Switch to multi-agent mode by passing the MultiAgent variant — no agent code changes are required:
import os
from agentscope.model import OpenAIChatModel
from agentscope.credential import OpenAICredential
from agentscope.formatter import OpenAIMultiAgentFormatter

model = OpenAIChatModel(
    credential=OpenAICredential(api_key=os.environ["OPENAI_API_KEY"]),
    model="gpt-4.1",
    formatter=OpenAIMultiAgentFormatter(),
)
For non-standard payload shapes (e.g. a provider whose API doesn’t follow the OpenAI or Anthropic conventions), subclass FormatterBase and pass an instance through the same formatter argument.

Custom Provider

You can extend AgentScope with your own model provider by implementing a credential and a chat model, then registering the credential.

Step 1: Define the Credential

Subclass CredentialBase with a unique type discriminator and implement get_chat_model_class():
from typing import Literal, Type, TYPE_CHECKING
from pydantic import ConfigDict, Field, SecretStr
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.model import ChatModelBase

class MyProviderCredential(CredentialBase):
    model_config = ConfigDict(title="My Provider API")
    type: Literal["my_provider_credential"] = "my_provider_credential"

    api_key: SecretStr = Field(description="API key for My Provider.")
    base_url: str = Field(default="https://api.myprovider.com/v1")

    @classmethod
    def get_chat_model_class(cls) -> Type["ChatModelBase"]:
        from .my_model import MyProviderChatModel
        return MyProviderChatModel

Step 2: Implement the Chat Model

Subclass ChatModelBase, define a Parameters inner class, and implement _call_api:
from typing import Literal, Any, AsyncGenerator
from pydantic import BaseModel, Field
from agentscope.model import ChatModelBase, ChatResponse
from agentscope.message import Msg
from agentscope.tool import ToolChoice
from agentscope.formatter import FormatterBase, OpenAIChatFormatter

class MyProviderChatModel(ChatModelBase):
    class Parameters(BaseModel):
        max_tokens: int | None = Field(default=None, gt=0)
        temperature: float | None = Field(default=None, ge=0, le=2)

    type: Literal["my_provider_chat"] = "my_provider_chat"

    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: Parameters | None = None,
        stream: bool = True,
        max_retries: int = 3,
        context_size: int = 128000,
        formatter: FormatterBase | None = None,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters or self.Parameters(),
            stream=stream,
            max_retries=max_retries,
            context_size=context_size,
        )
        # If your API follows the OpenAI format, reuse OpenAIChatFormatter;
        # otherwise implement your own FormatterBase subclass.
        self.formatter = formatter or OpenAIChatFormatter()

    async def _call_api(
        self,
        model_name: str,
        messages: list[Msg],
        tools: list[dict] | None = None,
        tool_choice: ToolChoice | None = None,
        **kwargs: Any,
    ) -> ChatResponse | AsyncGenerator[ChatResponse, None]:
        formatted_messages = await self.formatter.format(messages)
        # Call your provider's API using self.credential.api_key, etc.
        ...

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your model implementation. Each file describes one model — its capabilities (input_types, output_types), limits (context_size, output_size), and any per-model parameter_overrides:
name: my-model-v1
label: My Model V1
status: active
input_types:
  - text/plain
output_types:
  - text/plain
context_size: 128000
output_size: 4096
parameter_overrides:
  max_tokens: {"maximum": 4096}
MyProviderChatModel.list_models() then loads every YAML in that directory. To pull cards from a different location — for example, a registry your application maintains separately — pass custom_yaml_dir:
cards = MyProviderChatModel.list_models(custom_yaml_dir="/path/to/cards")

Integrate with Frontend

What is ModelCard

ModelCard is a declarative description of a model’s capabilities and constraints, designed to drive the frontend — model selectors, parameter forms, and feature toggles can be rendered dynamically without hardcoding any provider-specific knowledge. Each ModelCard contains:
FieldTypeDescription
namestrModel identifier (e.g. "claude-sonnet-4-6")
labelstrHuman-readable display name (e.g. "Claude Sonnet 4.6")
status"active" | "deprecated" | "sunset"Model lifecycle status
input_typeslist[str]Accepted input MIME types — used by the frontend to filter attachment uploads (e.g. only show an image button when image/* is supported)
output_typeslist[str]Output MIME types the model can produce — advertises capabilities such as a thinking toggle when application/x-thinking is present
context_sizeintMaximum context window in tokens
output_sizeintMaximum output tokens
parameter_schemadictFinal JSON Schema for the parameter form — base schema merged with per-model overrides (see below)
parameters_overridesdict[str, dict]The raw per-model overrides, before merging
input_types and output_types use MIME types to describe modality. Common values:
MIME TypeMeaning
text/plainText
application/x-thinkingReasoning / chain-of-thought
image/* (e.g. image/png, image/jpeg)Image
audio/* (e.g. audio/wav, audio/mp3)Audio
video/* (e.g. video/mp4)Video
A typical YAML card for claude-sonnet-4-6:
name: claude-sonnet-4-6
label: Claude Sonnet 4.6
status: active

input_types:
  - text/plain
  - image/jpeg
  - image/png
  - image/gif
  - image/webp

output_types:
  - text/plain
  - application/x-thinking

context_size: 1000000
output_size: 65536

parameter_overrides:
  max_tokens: {"maximum": 65536}

Parameter schema and overrides

The parameter_schema exposed to the frontend is built in two layers:
  1. Base schema — auto-derived from the chat model’s Parameters class via model_json_schema(). This lists every adjustable parameter (temperature, max_tokens, thinking_enable, …) along with its type and the API-wide range.
  2. Per-model overrides — the YAML’s parameter_overrides block is merged on top, field by field.
Overrides matter because adjustable ranges are not uniform across an API: every Qwen model accepts max_tokens, but each one has a different ceiling. Overrides let a card tighten a range, pin a default, or hide a parameter that doesn’t apply.
Override syntaxEffect
param: { ... }Shallow-merge into the base field (e.g. max_tokens: {maximum: 16384})
param: { hidden: true }Hide the parameter from the frontend
param: nullRemove the parameter entirely

Retrieve ModelCards

You retrieve model cards by calling list_models() on either the credential class or the model class. Internally, CredentialBase.list_models() delegates to its linked ChatModelBase subclass (obtained via get_chat_model_class()), which loads YAML card definitions from its _models/ directory.
from agentscope.credential import DashScopeCredential
from agentscope.model import AnthropicChatModel

# Via credential class
cards = DashScopeCredential.list_models()

# Or directly on the model class
cards = AnthropicChatModel.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")
The credential’s get_chat_model_class() returns the corresponding ChatModelBase subclass, which in turn knows where to find its model card YAML files:
model_cls = DashScopeCredential.get_chat_model_class()  # -> DashScopeChatModel
cards = model_cls.list_models()                          # -> list[ModelCard]
This design allows the frontend to discover available models, their capabilities, and valid parameter ranges — all from a single credential, without any hardcoded provider logic.

TTS

Coming soon — we are migrating TTS support from v1.0 to v2.0.

Embedding

An Embedding Model converts text — and, for multimodal models, images, videos, and other media — into dense vectors that power semantic search, RAG, and memory retrieval. AgentScope currently ships the following embedding model classes:
ProviderModel ClassHighlights
DashScopeDashScopeEmbeddingModelUnified text + multimodal API (text-embedding-v4, qwen3-vl-embedding, …), content-aware batching
OpenAIOpenAIEmbeddingModeltext-embedding-3-small/large, compatible with OpenAI-compatible endpoints
GeminiGeminiEmbeddingModelText (gemini-embedding-001) and multimodal (gemini-embedding-2, image / video / audio / PDF)
OllamaOllamaEmbeddingModelLocal embedding models (nomic-embed-text, …), credential carries the host URL

Create Embedding Model

Every embedding model takes a credential, a model name, and an optional Parameters object — the same pattern as chat models. Parameters carries dimensions, the output vector size:
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

model = DashScopeEmbeddingModel(
    credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
    model="text-embedding-v4",
    parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
)
Common constructor arguments shared by every embedding model:
ArgumentTypeDescription
credentialCredentialBaseProvider-specific credential
modelstrModel identifier (e.g. "text-embedding-v4")
parametersParameters | Nonedimensions — the output vector size (default 512)
embedding_cacheEmbeddingCacheBase | NoneOptional cache that skips repeated API calls (see below)
context_sizeintMaximum input tokens per item
max_retriesintMaximum retries per batch on retryable failures
retry_delayfloatSeconds between retry attempts
Valid dimensions values differ per model — each model card pins the supported enum and default via parameter_overrides (e.g. text-embedding-v4 accepts 2048 / 1536 / 1024 / … / 64). See EmbeddingModelCard.

Call Embedding Model

Invoke the model by calling it with a list of inputs. Text-only models accept list[str]; multimodal models also accept DataBlock elements:
async def __call__(
    self,
    inputs: list[str | DataBlock],
    **kwargs: Any,
) -> EmbeddingResponse:
Batching and retry are handled for you:
  1. Inputs are split into chunks of the model’s batch size (10 for DashScope text, 2048 for OpenAI, 100 for Gemini, 512 for Ollama).
  2. All chunks are dispatched concurrently via asyncio.gather.
  3. Each chunk is retried independently up to max_retries times on provider-specific retryable errors.
  4. Results are merged into a single EmbeddingResponse, preserving input order.
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        parameters=DashScopeEmbeddingModel.Parameters(dimensions=1024),
    )
    response = await model(
        ["What is AgentScope?", "A multi-agent framework."],
    )
    print(len(response.embeddings))     # 2 — one vector per input
    print(len(response.embeddings[0]))  # 1024
    print(response.usage.tokens)        # total tokens consumed
    print(response.source)              # "api" or "cache"

asyncio.run(main())
Each EmbeddingResponse carries:
FieldTypeDescription
embeddingslist[Embedding]One vector per input, in input order
usageEmbeddingUsage | Nonetokens consumed and time elapsed in seconds
source"api" | "cache"Whether the result came from the API or the cache
id / created_at / typestrResponse identity and timestamp; type is always "embedding"

Multimodal Embedding

Multimodal models (DashScopeEmbeddingModel with qwen3-vl-embedding etc., GeminiEmbeddingModel with gemini-embedding-2) accept DataBlock inputs alongside strings — images as URL or base64, videos as URL:
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel
from agentscope.credential import DashScopeCredential
from agentscope.message import DataBlock, URLSource

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="qwen3-vl-embedding",
    )
    response = await model([
        "A cat sitting on a windowsill",
        DataBlock(
            source=URLSource(
                url="https://example.com/cat.png",
                media_type="image/png",
            ),
        ),
    ])
    print(len(response.embeddings))  # 2 — one vector per input

asyncio.run(main())
Multimodal models replace the plain batch-size split with content-aware batching: inputs are greedily packed into batches that respect the model’s per-request limits on total elements, images, and videos (e.g. qwen3-vl-embedding allows 20 elements / 5 images / 1 video per request, tongyi-embedding-vision-plus allows 20 / 64 / 8). You never need to split inputs yourself.

Embedding Cache

Pass an EmbeddingCacheBase implementation through the embedding_cache argument to reuse previously computed vectors. The built-in FileEmbeddingCache stores each result as a .npy file keyed by the SHA-256 hash of the request:
import asyncio
import os
from agentscope.embedding import DashScopeEmbeddingModel, FileEmbeddingCache
from agentscope.credential import DashScopeCredential

async def main():
    model = DashScopeEmbeddingModel(
        credential=DashScopeCredential(api_key=os.environ["DASHSCOPE_API_KEY"]),
        model="text-embedding-v4",
        embedding_cache=FileEmbeddingCache(
            cache_dir="./.cache/embeddings",
            max_file_number=1000,
            max_cache_size=100,  # MB
        ),
    )
    r1 = await model(["What is AgentScope?"])
    print(r1.source)  # "api" — first call hits the API

    r2 = await model(["What is AgentScope?"])
    print(r2.source)  # "cache" — identical request served locally

asyncio.run(main())
When max_file_number or max_cache_size is exceeded, the oldest files are evicted first. To use a different backend (Redis, SQLite, …), subclass EmbeddingCacheBase and implement its four methods: store, retrieve, remove, and clear.

Custom Embedding Provider

Adding an embedding provider follows the same steps as a chat provider. Override get_embedding_model_class() on your credential (the base implementation returns None, meaning “no embedding support”):
from typing import Type, TYPE_CHECKING
from agentscope.credential import CredentialBase

if TYPE_CHECKING:
    from agentscope.embedding import EmbeddingModelBase

class MyProviderCredential(CredentialBase):
    # ... fields and get_chat_model_class() as before ...

    @classmethod
    def get_embedding_model_class(cls) -> Type["EmbeddingModelBase"]:
        from .my_embedding import MyProviderEmbeddingModel
        return MyProviderEmbeddingModel

Step 2: Implement the Embedding Model

Subclass EmbeddingModelBase and implement _call_api for a single batch — batching, concurrency, and retry are inherited from the base class. Declare provider-specific transient errors via _get_retryable_exceptions:
from typing import Any, Type
from agentscope.embedding import EmbeddingModelBase, EmbeddingResponse, EmbeddingUsage

class MyProviderEmbeddingModel(EmbeddingModelBase[str]):
    def __init__(
        self,
        credential: "MyProviderCredential",
        model: str,
        parameters: "MyProviderEmbeddingModel.Parameters | None" = None,
        context_size: int = 8192,
        max_retries: int = 3,
        retry_delay: float = 1.0,
    ) -> None:
        super().__init__(
            credential=credential,
            model=model,
            parameters=parameters,
            context_size=context_size,
            batch_size=100,          # max items per API call
            max_retries=max_retries,
            retry_delay=retry_delay,
        )

    @classmethod
    def _get_retryable_exceptions(cls) -> tuple[Type[Exception], ...]:
        return (TimeoutError,)       # retried up to max_retries times

    async def _call_api(
        self,
        inputs: list[str],
        **kwargs: Any,
    ) -> EmbeddingResponse:
        # len(inputs) <= self.batch_size is guaranteed.
        # Call your provider's API and return the vectors.
        ...
Bind the generic parameter to the input type your provider supports: EmbeddingModelBase[str] for text-only, EmbeddingModelBase[str | DataBlock] for multimodal — IDEs then surface the correct inputs type to callers.

Step 3: Add Model Cards (optional)

Drop YAML files into a _models/ directory next to your implementation; MyProviderEmbeddingModel.list_models() then picks them up — exactly like chat model cards.

EmbeddingModelCard

EmbeddingModelCard mirrors ModelCard for the frontend, with embedding-specific defaults — the output type application/x-embedding marks a model as producing dense vectors:
FieldDifference from ModelCard
typeAlways "embedding_model"
input_typesDefaults to ["text/plain"]; multimodal cards add image/*, video/*, …
output_typesDefaults to ["application/x-embedding"]
parameter_schemaBuilt from the embedding Parameters class (dimensions) merged with YAML parameter_overrides — same override semantics as chat cards
output_sizeNot present — embedding models have no output token limit
A typical YAML card:
name: text-embedding-v4
label: Text Embedding v4
status: active

input_types:
  - text/plain

output_types:
  - application/x-embedding

context_size: 8192

parameter_overrides:
  dimensions:
    default: 1024
    enum: [2048, 1536, 1024, 768, 512, 256, 128, 64]
Retrieve cards from the model class directly, or discover the class from a credential via get_embedding_model_class():
from agentscope.credential import DashScopeCredential
from agentscope.embedding import OpenAIEmbeddingModel

# Directly on the model class
cards = OpenAIEmbeddingModel.list_models()

# Or discover the class from a credential
embed_cls = DashScopeCredential.get_embedding_model_class()
cards = embed_cls.list_models()

for card in cards:
    print(f"{card.name}: context={card.context_size}, inputs={card.input_types}")

Realtime Model

Coming soon — we are migrating Realtime Model support from v1.0 to v2.0.