| Capability | Description |
|---|---|
| Multi-tenant knowledge bases | Each user gets its own namespace, with knowledge bases fully isolated from each other — a natural fit for multi-user SaaS scenarios, no extra permission isolation needed. |
| Full knowledge base / document management | Full CRUD endpoints for knowledge bases and documents; deletion cascades through vectors, records, and original files — no stale data or orphan files build up over time. |
| Asynchronous uploads + live progress | Uploads return immediately, indexing runs in the background, and a batch status endpoint is exposed — the front-end can render second-level progress bars and failure hints without blocking the user. |
| Pluggable file object store | Local / S3 / custom object-store backends — large files do not need to stay in memory, multiple workers in a distributed deployment can share the same file source, and migrating to the cloud requires zero application changes. |
| Distributed indexing and horizontal scaling | Parsing / chunking / embedding can be deployed as independent worker processes — as document volume or parse cost grows, scale the workers without affecting API throughput. |
| Built-in fault tolerance and self-healing | Task leases, heartbeat renewal, and periodic re-dispatch are all built in, so tasks hit by worker crashes, network blips, or duplicate enqueues are recovered automatically instead of getting stuck, with little operator intervention. |
| Automatic embedding-model fit | At knowledge-base creation time, the service automatically filters out models incompatible with the vector store’s dimension policy — the user does not need to worry about dimension matching; the front-end options are the usable set, eliminating indexing failures from picking the wrong model. |
| Out-of-the-box REST API and front-end UI | Every capability is exposed as a complete REST endpoint, with an official front-end implementation — integrators get a ready-to-use upload / search / progress UI with no extra work. |
Quick Start
The steps below bring up the RAG service (the backend plus the official front-end) so that you can create knowledge bases, upload documents, and run searches from the UI.
Configure and start the RAG service
Pass a few RAG-related components to create_app to enable the full set of /knowledge_bases endpoints. The minimal example below shows the two configurations, local blob store and S3 blob store, assuming Redis and Qdrant are already running locally (or at a reachable address):
The RAG-related create_app parameters are listed below; without knowledge_base_manager, none of the knowledge-base endpoints are registered.
- Knowledge base manager (knowledge_base_manager): Owner of the knowledge base lifecycle; binds a vector store instance whose connection lifecycle the manager proxies. The built-in CollectionPerKbManager takes the “one collection per knowledge base” strategy, letting every knowledge base pick its own embedding dimension freely.
- Parsers: Parsers registered to the upload path, dispatched by each parser’s supported_media_types. In list form, later registrations override earlier ones for overlapping types (overrides log a warning); the dict form, media_type → parser, is explicit routing taken verbatim (useful for binding one parser to multiple types or to custom aliases).
- Chunker: The chunker shared across every knowledge base. In production, tune chunk_size and overlap to fit the embedding model’s context window.
- Blob store: Binary store for uploaded files. Local is fine for single-host setups; in a distributed deployment use S3 or a custom shared backend, because workers must share the same file source as the API.
- Embedded index worker (enable_index_worker): When True, the API process also runs parsing / chunking / embedding (single-process mode); when False, the API only accepts uploads and enqueues tasks, leaving indexing to a dedicated worker (distributed mode); see “Deployment topologies” below.
Start the official front-end
The examples/web_ui directory in the AgentScope repository ships a React front-end matching the backend above; just bring it up:
Open the URL the dev server prints (usually http://localhost:5173); the front-end auto-connects to the service on port 8000.
Deployment Topologies
The Quick Start above runs the API and the indexing pipeline in the same process, which is fine for local development and low-traffic scenarios. In production, however, parsing / chunking / embedding is a CPU- and IO-heavy pipeline; sharing the process with HTTP requests has two issues:
- Resource contention: a single large PDF blocks the event loop on parsing and slows every other API request in the same process.
- Coarse scaling unit: the only horizontal scaling unit is the API replica, but the real resource hog is the indexing pipeline — scaling the API as a whole is wasteful.
| Dimension | Single-process | Distributed |
|---|---|---|
| Process topology | API + indexing in the same process | API + N workers |
| Resource isolation | Heavy parsing competes for request threads | API is unaffected by parsing load |
| Scaling | Scale the API replicas as a whole | Scale API and workers independently |
| Deployment complexity | One configuration is enough | Need two images / services |
| Use case | Local, prototype, light traffic | Production, parsing / embedding is the bottleneck |
Single-process deployment
create_app’s enable_index_worker defaults to True; the API process automatically starts an embedded worker coroutine in its lifespan — no extra configuration needed. This is exactly the form shown in “Quick Start”. If you previously disabled it, set it back to True:
Distributed deployment
The API process disables the embedded worker and only accepts uploads, enqueues tasks, and runs the safety-net sweeper; one or more worker processes start independently, subscribe to the same message-bus channel, and pull tasks. API side:
Worker side, in either of two forms:
- CLI: python -m agentscope.app.rag.index_worker, combined with the environment variable AGENTSCOPE_WORKER_BOOTSTRAP=module:callable pointing at a factory that returns the backend dict. Operators can copy the same systemd / k8s unit to scale workers in bulk.
- Library: in your own entry script, call agentscope.app.rag.index_worker.run_worker(...) (or from agentscope.app.rag import run_worker), sharing the same backend instances with whatever you wired into create_app.
The bootstrap factory does the same work as the library-form run_worker(...) call, just split into “build the kwargs” and “invoke”, returning the kwargs dict to be forwarded to run_worker:
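
How a module:callable value can be turned into keyword arguments is sketched below. This is an illustration of the general pattern, not the actual index_worker implementation: resolve_bootstrap, build_worker_kwargs, and fake_run_worker are hypothetical names, and the key in the returned dict is a placeholder rather than a real run_worker parameter.

```python
import importlib
from typing import Any, Callable, Dict


def resolve_bootstrap(spec: str) -> Callable[[], Dict[str, Any]]:
    """Resolve a "module:callable" string to the callable it names."""
    module_name, sep, attr = spec.partition(":")
    if not sep or not module_name or not attr:
        raise ValueError(f"expected 'module:callable', got {spec!r}")
    module = importlib.import_module(module_name)
    factory = getattr(module, attr)
    if not callable(factory):
        raise TypeError(f"{spec!r} does not name a callable")
    return factory


def build_worker_kwargs() -> Dict[str, Any]:
    """Hypothetical factory: the "build the kwargs" half.

    A real factory would construct the backends here; the key below is
    a placeholder, not an actual run_worker parameter.
    """
    return {"worker_name": "worker-1"}


def fake_run_worker(**kwargs: Any) -> Dict[str, Any]:
    """Stand-in for run_worker (the "invoke" half); echoes its kwargs."""
    return kwargs


if __name__ == "__main__":
    # The CLI form would read the spec from AGENTSCOPE_WORKER_BOOTSTRAP;
    # here the factory is called directly for illustration.
    print(fake_run_worker(**build_worker_kwargs()))
```

Keeping the factory free of side effects beyond building backends means the same function can serve both launch forms: the CLI resolves it from the environment variable, and a library-mode entry script can call it directly.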
HA / replication of the vector store itself is the chosen backend’s responsibility; the service layer only holds a connection handle. Pointing Qdrant at a cluster or S3 at a cross-region bucket is enough to scale the storage side without touching application code.
How It Works
The core design of the service layer is to use the message bus (event bus) to fully decouple “upload” from “indexing”: the former is a synchronous path optimised for millisecond responses, the latter is asynchronous, retryable, and distribution-friendly. The two paths only talk through a single index_tasks channel on the bus, which is why the same code runs single-process in “Quick Start” and scales across hosts in “Distributed deployment” with no business logic changes.
The roles around the bus:
| Role | Process | Relationship with the bus | Responsibility |
|---|---|---|---|
| Knowledge base service (API) | API process | Publishes index tasks | Handles HTTP requests, streams blobs, persists pending records, and pushes tasks to the bus |
| Index consumer | Worker process (or embedded in the API process) | Subscribes to index signals | Listens to bus signals, batch-pulls tasks, hands them to the index worker |
| Index worker | Worker process (or embedded in the API process) | Runs the full pipeline once it gets a task | Lease → parse → chunk → embed → write vector store → mark ready |
| Index sweeper | API process | Re-publishes stuck tasks | Periodically scans for expired leases / long-lived pending records and re-enqueues them |
- Upload path (API process): the router forwards the request to the knowledge base service, which streams the file into the blob store, persists a pending record, and pushes one index task onto the bus; the HTTP response returns immediately, without running parsing / embedding inside the request.
- Indexing path (worker process; can be embedded in the API or deployed standalone): the index consumer subscribes to bus signals, batch-pulls tasks, and hands them to the index worker, which then runs the full “lease → parse → chunk → embed → insert → mark ready” pipeline. Internally the indexing path calls KnowledgeBaseManagerBase.get_knowledge(...) to obtain a runtime KnowledgeBase handle and then calls insert_document(...) for embedding + insertion, the same code path library-mode callers run.
- Self-healing path (always in the API process): the index sweeper periodically detects expired leases or long-stuck pending records and re-enqueues them on the bus; the CAS-based lease on the worker side guarantees no duplicate processing.
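
The split between the upload path and the indexing path can be sketched with a toy in-memory bus. Everything here is an assumption made for illustration (InMemoryBus, upload, index_one, and the dict-based blob and record stores are hypothetical stand-ins); only the index_tasks channel name and the status values come from the text.

```python
import asyncio
from typing import Dict, List

INDEX_CHANNEL = "index_tasks"  # channel name taken from the text above


class InMemoryBus:
    """Toy stand-in for the message bus: one asyncio.Queue per channel."""

    def __init__(self) -> None:
        self._queues: Dict[str, asyncio.Queue] = {}

    def _queue(self, channel: str) -> asyncio.Queue:
        return self._queues.setdefault(channel, asyncio.Queue())

    async def publish(self, channel: str, doc_id: str) -> None:
        await self._queue(channel).put(doc_id)

    async def pull(self, channel: str) -> str:
        return await self._queue(channel).get()


async def upload(bus: InMemoryBus, blobs: Dict[str, bytes],
                 records: Dict[str, str], doc_id: str, data: bytes) -> str:
    """Upload path: store bytes, persist a pending record, enqueue, return."""
    blobs[doc_id] = data
    records[doc_id] = "pending"
    await bus.publish(INDEX_CHANNEL, doc_id)
    return records[doc_id]  # returned before any parsing happens


async def index_one(bus: InMemoryBus, records: Dict[str, str]) -> str:
    """Indexing path: pull one task and walk it through the stages."""
    doc_id = await bus.pull(INDEX_CHANNEL)
    for status in ("parsing", "chunking", "indexing"):
        records[doc_id] = status
        await asyncio.sleep(0)  # placeholder for the real stage work
    records[doc_id] = "ready"
    return doc_id


async def demo() -> List[str]:
    bus = InMemoryBus()
    blobs: Dict[str, bytes] = {}
    records: Dict[str, str] = {}
    seen = [await upload(bus, blobs, records, "doc-1", b"hello")]
    await index_one(bus, records)
    seen.append(records["doc-1"])
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Because upload only touches the bus through publish and index_one only through pull, swapping the in-memory queue for a networked bus would not change either function, which is the property the design relies on.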
Document state machine
A document record’s status field flows strictly through the following states; the front-end uses it to render progress bars and failure hints:
| Status | Trigger | Meaning |
|---|---|---|
| pending | Upload complete | File written to the blob store, record persisted; waiting for a worker to pick it up |
| parsing | Worker acquires the lease | Streaming bytes from the blob store and handing them to the parser |
| chunking | Parser returns | Chunking the parsed sections |
| indexing | Chunker returns | Embedding and writing to the vector store |
| ready | Vector-store write succeeded | Document is retrievable; chunk_count is populated at this moment |
| error | Any stage raises | The error is reduced to one line and written into the error field; the blob and record are preserved so the front-end can investigate / the user can re-upload |
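
One way to read the table above is as a transition map. The sketch below is an interpretation, not the service's code: DocStatus, ALLOWED, advance, and is_terminal are hypothetical names, and it assumes that error is reachable only from the three active stages and that ready and error are terminal.

```python
from enum import Enum
from typing import Dict, FrozenSet


class DocStatus(Enum):
    PENDING = "pending"
    PARSING = "parsing"
    CHUNKING = "chunking"
    INDEXING = "indexing"
    READY = "ready"
    ERROR = "error"


# Assumed transition map derived from the Trigger column of the table.
ALLOWED: Dict[DocStatus, FrozenSet[DocStatus]] = {
    DocStatus.PENDING: frozenset({DocStatus.PARSING}),
    DocStatus.PARSING: frozenset({DocStatus.CHUNKING, DocStatus.ERROR}),
    DocStatus.CHUNKING: frozenset({DocStatus.INDEXING, DocStatus.ERROR}),
    DocStatus.INDEXING: frozenset({DocStatus.READY, DocStatus.ERROR}),
    DocStatus.READY: frozenset(),
    DocStatus.ERROR: frozenset(),
}


def advance(current: DocStatus, new: DocStatus) -> DocStatus:
    """Return the new status, rejecting transitions the map does not allow."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new


def is_terminal(status: DocStatus) -> bool:
    """A status with no outgoing transitions no longer needs polling."""
    return not ALLOWED[status]
```

A front-end could use a check like is_terminal to decide when to stop polling a document.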
Fault tolerance and self-healing
The service layer ships a few designs around bus + lease that make long-running deployments uneventful:
- Lease + CAS prevents reentry: the worker uses storage-layer CAS to acquire the lease; duplicate enqueues or multiple workers racing for the same document only execute once.
- Automatic lease renewal: leases live for 90 seconds by default and a built-in heartbeat renews every 45 seconds, so long-document parses never time out.
- Race detection: the worker runs the pipeline and the heartbeat in parallel; if the heartbeat detects a stolen lease (sweeper false positive / network blip), the pipeline is cancelled immediately to avoid double-writes against the vector store with the new owner.
- Safety-net re-dispatch: documents whose lease expired (worker crash) or whose pending exceeded the grace period (API publish failed) are periodically re-enqueued.
- Errors isolated to the record: an exception at any stage is recorded in the document’s error field, visible in the front-end. The blob and the record are not auto-cleaned, so the user can investigate and re-upload.
- Idempotent delete path: vector store → record → blob, in that order; a mid-way failure followed by a retry never leaves the state inconsistent.
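
The lease mechanics in the list above can be sketched with an in-memory store. This is a simplified model under stated assumptions, not the service's storage layer: LeaseStore and its methods are hypothetical, a lock stands in for storage-level compare-and-set, and time is passed in explicitly so the behaviour is deterministic. The 90-second lifetime and 45-second heartbeat are the defaults quoted in the text.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List

LEASE_TTL = 90.0           # seconds, default lease lifetime from the text
HEARTBEAT_INTERVAL = 45.0  # seconds, renewal period from the text


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseStore:
    """In-memory stand-in for storage-layer compare-and-set on a lease."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._leases: Dict[str, Lease] = {}

    def acquire(self, doc_id: str, owner: str, now: float) -> bool:
        """Take the lease only if it is free or expired (the CAS step)."""
        with self._lock:
            current = self._leases.get(doc_id)
            if current is not None and current.expires_at > now:
                return False  # someone else holds a live lease
            self._leases[doc_id] = Lease(owner, now + LEASE_TTL)
            return True

    def renew(self, doc_id: str, owner: str, now: float) -> bool:
        """Heartbeat: extend the lease only if this owner still holds it.

        A False result means the lease was lost, so the caller should
        cancel its pipeline instead of writing to the vector store.
        """
        with self._lock:
            current = self._leases.get(doc_id)
            if current is None or current.owner != owner:
                return False
            if current.expires_at <= now:
                return False
            current.expires_at = now + LEASE_TTL
            return True

    def expired(self, now: float) -> List[str]:
        """Sweeper view: documents whose lease has lapsed."""
        with self._lock:
            return [doc_id for doc_id, lease in self._leases.items()
                    if lease.expires_at <= now]
```

Renewing at half the lease lifetime leaves room for one missed heartbeat before the lease lapses, which is a common reason for choosing a 2:1 ratio between the two values.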
REST API Overview
The service exposes a full set of CRUD + upload + search endpoints under the /knowledge_bases prefix. Field-level request / response details live in the OpenAPI document; the table below groups endpoints by responsibility:
| Category | Endpoint | Description |
|---|---|---|
| Capability discovery | GET /knowledge_bases/embedding_models | Lists embedding models compatible with the vector store’s dimension policy under the current user’s credentials |
| Capability discovery | GET /knowledge_bases/supported_content_types | Lists IANA media types and file extensions supported by the parsers currently mounted on the API; the front-end uses this for <input accept> |
| Capability discovery | GET /knowledge_bases/middleware/parameters_schema | Returns the JSON Schema of RAGMiddleware.Parameters; the front-end uses this for a dynamic form |
| Knowledge base CRUD | POST/GET/PATCH/DELETE /knowledge_bases | Create / read / update / delete knowledge bases; deletion cascades through the collection, document records, and blobs |
| Document management | GET/POST/DELETE /knowledge_bases/{kb_id}/documents | List / upload / delete documents; uploads return immediately with pending |
| Status polling | GET /knowledge_bases/{kb_id}/documents/status?ids=a,b,c | Batch-query the current state of N in-flight documents; used by the front-end for progress rendering |
| Search | POST /knowledge_bases/{kb_id}/search | Natural-language query, returns the top-K retrieval results |
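
As a small client-side illustration of the status-polling row, the helper below builds the batch status URL in the ?ids=a,b,c form shown in the table. The function names are hypothetical, and the id-to-status mapping accepted by still_in_flight is an assumed shape; the actual response fields are defined in the OpenAPI document.

```python
from typing import Dict, Iterable, List
from urllib.parse import quote, urlencode

TERMINAL_STATUSES = {"ready", "error"}  # states that no longer change


def status_url(base_url: str, kb_id: str, doc_ids: Iterable[str]) -> str:
    """Build the batch status URL with a comma-separated ids parameter."""
    ids = ",".join(doc_ids)
    # safe="," keeps the commas literal, matching the ?ids=a,b,c form.
    query = urlencode({"ids": ids}, safe=",")
    path = f"/knowledge_bases/{quote(kb_id, safe='')}/documents/status"
    return f"{base_url.rstrip('/')}{path}?{query}"


def still_in_flight(statuses: Dict[str, str]) -> List[str]:
    """Return ids worth polling again, given an assumed id -> status map."""
    return [doc_id for doc_id, status in statuses.items()
            if status not in TERMINAL_STATUSES]


if __name__ == "__main__":
    print(status_url("http://localhost:8000", "kb1", ["a", "b", "c"]))
```

Polling only the ids returned by still_in_flight keeps each request small as documents reach ready or error.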
Further Reading
- RAG: Learn the atomic interfaces of parser / chunker / vector store / middleware and their library-mode usage.
- Architecture: create_app’s global parameters, lifespan, dependency injection, and ASGI middleware layer.
- Middleware: Which hooks RAGMiddleware uses to inject retrieval results.
- Embedding Model: Embedding-model cards and dimension constraints decide which models a knowledge base can pick.