DataJuicer Agent
A multi-agent data processing system built on AgentScope and Data-Juicer (DJ), enabling non-experts to harness Data-Juicer via natural language.
Why DataJuicer Agent?
Data processing in LLM R&D is often high-cost, low-efficiency, and hard to reproduce. Data quality, diversity, and task matching directly define the model's ceiling, so optimizing data is optimizing the model. DataJuicer Agent supports data-model co-optimization through agent technology, moving from “script assembly” to a “think and get” workflow.
What Does This Agent Do?
Data-Juicer provides the full lifecycle stack: DJ-OP (≈200 multimodal operators), DJ-Core (Ray-based, TB-scale), DJ-Sandbox (A/B test & scaling law), and DJ-Agents (conversational interface). DataJuicer Agent is an intelligent data collaborator that offers:
- Intelligent Query: Match operators from ~200 options via natural language
- Automated Pipeline: Generate and run Data-Juicer YAML from descriptions
- Custom Extension: Develop and integrate custom operators locally
Architecture
Multi-Agent Routing Architecture
A Router Agent triages user requests into standard data processing (→ DJ Agent) or custom development (→ DJ Dev Agent).
Two Integration Modes
- Tool Binding: Calls CLI (dj-analyze, dj-process); low migration cost.
- MCP Binding: Calls Data-Juicer MCP directly; no intermediate YAML, better performance.
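As a rough illustration of Tool Binding, the sketch below wraps the two CLI entry points in a small Python helper. The helper names (build_dj_command, run_dj_tool) and the allow-list are hypothetical and not taken from the project's code; the only thing assumed about the CLI is that both commands accept a --config path.

```python
import shutil
import subprocess

# Only the two CLI tools named above may be invoked (hypothetical allow-list).
ALLOWED_TOOLS = ("dj-analyze", "dj-process")


def build_dj_command(tool, config_path):
    """Return the argv list for a Data-Juicer CLI call, rejecting other tools."""
    if tool not in ALLOWED_TOOLS:
        raise ValueError(f"unsupported tool: {tool}")
    return [tool, "--config", str(config_path)]


def run_dj_tool(tool, config_path):
    """Run the CLI tool and return the CompletedProcess for the agent to inspect."""
    cmd = build_dj_command(tool, config_path)
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool} not found on PATH; is Data-Juicer installed?")
    # check=False: the caller reads returncode and stderr instead of catching an exception.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Passing an argv list rather than a shell string keeps user-supplied paths from being interpreted by a shell, which matters when an LLM chooses the arguments.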
Quick Start
System Requirements
Python 3.10+, DashScope API key; optionally Data-Juicer source for custom operators.
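A quick preflight check for these requirements might look like the sketch below. It assumes the API key is supplied through the DASHSCOPE_API_KEY environment variable; the function name check_requirements is made up for illustration.

```python
import os
import sys


def check_requirements(env=None, version=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    version = sys.version_info[:2] if version is None else tuple(version)
    problems = []
    if version < (3, 10):
        problems.append(f"Python 3.10+ required, found {version[0]}.{version[1]}")
    # Assumption: the key is read from the DASHSCOPE_API_KEY environment variable.
    if not env.get("DASHSCOPE_API_KEY"):
        problems.append("DASHSCOPE_API_KEY is not set")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("WARN:", problem)
```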
Configuration
Agent Introduction
Data Processing Agent
Handles Data-Juicer interaction: operator recommendation from natural language, config generation, and execution.
Workflow:
When a user says: “My data is saved in xxx, please clean entries with text length less than 5 and image size less than 10MB”, the Agent doesn’t blindly execute, but proceeds step by step:
- Data Preview: Preview the first 5–10 data samples to confirm field names and data format; this is a crucial step to avoid configuration errors
- Operator Retrieval: Call the query_dj_operators tool to semantically match suitable operators
- Parameter Decision: LLM autonomously decides global parameters (such as dataset_path, export_path) and specific operator configurations
- Configuration Generation: Generate standard YAML configuration files
- Execute Processing: Call the dj-process command to execute actual processing
Typical Use Cases:
- Data Cleaning: Deduplication, removal of low-quality samples, format standardization
- Multimodal Processing: Process text, image, and video data simultaneously
- Batch Conversion: Format conversion, data augmentation, feature extraction
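The preview and configuration steps of the workflow can be sketched in plain Python. This is a minimal illustration, not the agent's actual code: the function names are hypothetical, the dataset is assumed to be JSONL, and the operator parameters (min_len, max_size) are one reading of the example request. The recipe is serialized as JSON only to avoid a third-party dependency; YAML parsers generally accept JSON, and yaml.safe_dump from PyYAML would be the more usual choice.

```python
import json
from itertools import islice


def preview_fields(jsonl_path, n=5):
    """Read up to the first n lines of a JSONL file and return sorted field names."""
    fields = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in islice(f, n):
            if line.strip():
                fields.update(json.loads(line).keys())
    return sorted(fields)


def build_recipe(dataset_path, export_path, fields, required=("text", "image")):
    """Build a recipe dict, failing early if the previewed fields do not match."""
    missing = [name for name in required if name not in fields]
    if missing:
        raise ValueError(f"dataset is missing fields: {missing}")
    return {
        "dataset_path": dataset_path,
        "export_path": export_path,
        "process": [
            # Assumed parameters: keep text of length >= 5 and images up to 10MB.
            {"text_length_filter": {"min_len": 5}},
            {"image_size_filter": {"max_size": "10MB"}},
        ],
    }


def render_recipe(recipe):
    """Serialize the recipe as JSON text (also parseable as YAML)."""
    return json.dumps(recipe, indent=2)
```

Checking field names before generating the recipe is what makes the preview step worthwhile: a mismatch surfaces as a clear error instead of a failed processing run.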
View Complete Example Log (from AgentScope Studio)

- Call query_dj_operators, accurately returning two operators: text_length_filter and image_size_filter
- Use view_text_file tool to preview raw data, confirming fields are indeed ‘text’ and ‘image’
- Generate YAML configuration and save to temporary path via write_text_file
- Call execute_safe_command to execute dj-process, returning result path
Code Development Agent (DJ Dev Agent)
When built-in operators are insufficient, the DJ Dev Agent (default model: qwen3-coder-480b-a35b-instruct) compresses the “docs → copy → tweak → test” cycle of writing a custom operator from hours to minutes while ensuring code quality.
Workflow:
When a user requests: “Help me create an operator that reverses word order and generate unit test files”, the Router routes it to DJ Dev Agent.
The Agent’s execution process consists of four steps:
- Operator Retrieval: Find existing operators with similar functionality as references
- Get Templates: Pull base class files and typical examples to ensure consistent code style
- Generate Code: Based on the function prototype provided by the user, generate operator classes compliant with DataJuicer specifications
- Local Integration: Register the new operator to the user-specified local codebase path
The generated changes typically include:
- Implement Operator: Create operator class file, inherit from Mapper/Filter base class, register using the @OPERATORS.register_module decorator
- Update Registration: Modify __init__.py, add new class to __all__ list
- Write Tests: Generate unit tests covering multiple scenarios, including edge cases, ensuring robustness
Typical Use Cases:
- Develop domain-specific filter or transformation operators
- Integrate proprietary data processing logic
- Extend Data-Juicer capabilities for specific scenarios
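To make the register-and-inherit pattern concrete, here is a self-contained sketch of the word-reversal operator from the example request. The Registry and Mapper classes below are simplified stand-ins written for this illustration so that the block runs without Data-Juicer installed; in a real codebase they would be imported from Data-Juicer, and the real base class carries more configuration than shown. The operator name reverse_words_mapper is also hypothetical.

```python
class Registry:
    """Stand-in for a name -> class registry (not Data-Juicer's implementation)."""

    def __init__(self):
        self.modules = {}

    def register_module(self, name):
        def decorator(cls):
            self.modules[name] = cls
            return cls
        return decorator


OPERATORS = Registry()


class Mapper:
    """Stand-in base class: a mapper edits one sample and returns it."""

    text_key = "text"

    def process_single(self, sample):
        raise NotImplementedError


@OPERATORS.register_module("reverse_words_mapper")
class ReverseWordsMapper(Mapper):
    """Reverse the order of whitespace-separated words in the text field."""

    def process_single(self, sample):
        # split() with no argument also collapses repeated whitespace.
        words = sample[self.text_key].split()
        sample[self.text_key] = " ".join(reversed(words))
        return sample
```

A unit test for such an operator would cover at least a normal sentence, an empty string, and irregular spacing, which mirrors the "multiple scenarios, including edge cases" item above.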
View Complete Example Log (from AgentScope Studio)

Advanced Features
Operator Retrieval
The agent matches user intent to ~200 operators via a dedicated retrieval step. Choose mode with -r / --retrieve_mode:
- LLM (default): Qwen-Turbo semantic match; best accuracy, higher token usage.
- Vector (vector): DashScope embedding + FAISS; fast, lower cost.
- Auto (auto): LLM first, fallback to vector.
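The fallback behaviour of the auto mode can be pictured as a small dispatcher. This is a sketch under assumptions: the two retrievers are passed in as plain callables, the function name is invented, and the mode string "llm" is a guess at how the default mode is spelled (only vector and auto are named above).

```python
def retrieve_operators(query, mode, llm_retrieve, vector_retrieve):
    """Dispatch a query to a retriever; in auto mode fall back to vector on failure."""
    if mode == "llm":  # assumed spelling of the default mode
        return llm_retrieve(query)
    if mode == "vector":
        return vector_retrieve(query)
    if mode == "auto":
        try:
            return llm_retrieve(query)
        except Exception:
            # Any LLM-side failure (network, quota, parsing) triggers the fallback.
            return vector_retrieve(query)
    raise ValueError(f"unknown retrieve mode: {mode}")
```

Usage with stub retrievers:

```python
def failing_llm(query):
    raise TimeoutError("LLM endpoint unreachable")

def stub_vector(query):
    return ["text_length_filter"]

print(retrieve_operators("drop short texts", "auto", failing_llm, stub_vector))
```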
MCP Agent
In addition to command-line tools, DataJuicer also natively supports MCP services, which are an important means of improving performance. MCP services can directly obtain operator information and execute data processing through native interfaces, making it easy to migrate and integrate without separate LLM queries and command-line calls.
MCP Server Types
Data-Juicer provides two types of MCP:
Recipe-Flow MCP (Data Recipe)
- Provides two tools: get_data_processing_ops and run_data_recipe
- Retrieves by operator type, applicable modalities, and other tags, with no need to call an LLM or vector model
- Suitable for standardized, high-frequency scenarios with better performance
Granular-Operators MCP
- Wraps each built-in operator as an independent tool, runs on call
- Returns all operators by default, but can control visible scope through environment variables
- Suitable for fine-grained control, building fully customized data processing pipelines
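The tag-based lookup described for Recipe-Flow MCP needs no model call because it is ordinary filtering over operator metadata. The sketch below shows the idea with a three-entry toy catalog; the OpInfo structure, the tag values, and the find_ops signature are assumptions for illustration and do not reproduce the server's actual tool interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpInfo:
    name: str
    op_type: str
    tags: frozenset


# Toy catalog; the tags are illustrative, not the project's real metadata.
CATALOG = [
    OpInfo("text_length_filter", "filter", frozenset({"text"})),
    OpInfo("image_size_filter", "filter", frozenset({"image"})),
    OpInfo("document_deduplicator", "deduplicator", frozenset({"text"})),
]


def find_ops(catalog, op_type=None, tags=(), match_all=True):
    """Return operator names matching the type and tags, with no LLM involved."""
    wanted = set(tags)
    result = []
    for op in catalog:
        if op_type is not None and op.op_type != op_type:
            continue
        if wanted:
            # match_all: every wanted tag must be present; otherwise any one suffices.
            hit = wanted <= op.tags if match_all else bool(wanted & op.tags)
            if not hit:
                continue
        result.append(op.name)
    return result
```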
The Data-Juicer MCP server is currently in early development, and features and tools may change with ongoing development.
Configuration
Configure the service address in configs/mcp_config.json:
Usage Methods
Enable MCP Agent to replace DJ Agent:
Customization and Extension
Custom Prompts
All Agent system prompts are defined in the prompts.py file.
Model Replacement
You can specify different models for different Agents in main.py. For example:
- Main Agent uses qwen-max for complex reasoning
- Development Agent uses qwen3-coder-480b-a35b-instruct to optimize code generation quality
Formatter and Memory can also be replaced. This design allows the system to be both out-of-the-box and adaptable to enterprise-level requirements.
Extending New Agents
DataJuicer Agent is an open framework. The core is the agents2toolkit function, which can automatically wrap any Agent as a tool callable by the Router.
Simply add your Agent instance to the agents list, and the Router will dynamically generate corresponding tools at runtime and automatically route based on task semantics.
This means you can quickly build domain-specific data agents based on this framework.
Extensibility is an important design principle.
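The idea of wrapping agents as router-callable tools can be shown with a toy version. This is not the project's agents2toolkit function or AgentScope's API: ToyAgent, agents_to_tools, and the dict-based tool format are invented for this sketch, and the handlers are plain functions rather than LLM-backed agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyAgent:
    """Hypothetical minimal agent: a name, a description, and a reply handler."""
    name: str
    description: str
    handler: Callable[[str], str]

    def reply(self, task: str) -> str:
        return self.handler(task)


def agents_to_tools(agents: list[ToyAgent]) -> dict[str, dict]:
    """Expose each agent as a named tool: description for routing, call for execution."""
    tools = {}
    for agent in agents:
        if agent.name in tools:
            raise ValueError(f"duplicate agent name: {agent.name}")
        tools[agent.name] = {"description": agent.description, "call": agent.reply}
    return tools


if __name__ == "__main__":
    tools = agents_to_tools([
        ToyAgent("dj_agent", "Standard data processing", lambda t: f"processing: {t}"),
        ToyAgent("dj_dev_agent", "Custom operator development", lambda t: f"coding: {t}"),
    ])
    print(tools["dj_dev_agent"]["call"]("reverse word order"))
```

In this picture, adding a new agent means appending one more entry to the list; the router only ever sees tool names and descriptions.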
Roadmap
The Data-Juicer agent ecosystem is rapidly expanding. Here are the new agents currently in development or planned:
Data-Juicer Q&A Agent
Provides users with detailed answers about Data-Juicer operators, concepts, and best practices.
Interactive Data Analysis and Visualization Agent (In Development)
We are building a more advanced human-machine collaborative data optimization workflow that introduces human feedback:
- Users can view statistics, attribution analysis, and visualization results
- Dynamically edit recipes, approve or reject suggestions
- Underpinned by dj.analyzer (data analysis), dj.attributor (effect attribution), and dj.sandbox (experiment management)
- Supports closed-loop optimization based on validation tasks
Other Directions
- Data Processing Agent Benchmarking: Quantify the performance of different Agents in terms of accuracy, efficiency, and robustness
- Data “Health Check Report” & Data Intelligent Recommendation: Automatically diagnose data problems and recommend optimization solutions
- Router Agent Enhancement: More seamless, e.g., when operators are lacking → Code Development Agent → Data Processing Agent
- MCP Further Optimization: Embedded LLM, users can directly use MCP connected to their local environment (e.g., IDE) to get an experience similar to current data processing agents
- Knowledge Base and RAG-oriented Data Agents
- Better Automatic Processing Solution Generation: Less token usage, more efficient, higher quality processing results
- Data Workflow Template Reuse and Automatic Tuning: Based on DataJuicer community data recipes
- …
Common Issues
How do I get a DashScope API key?
Visit the DashScope official website to register an account and apply for an API key.
Why does operator retrieval fail?
Check your network connection and API key configuration, or try switching to vector retrieval mode with --retrieve_mode vector.
How do I debug custom operators?
Ensure the Data-Juicer path is configured correctly and review the example code generated by the code development agent.
What should I do if the MCP service connection fails?
Check whether the MCP server is running and confirm the URL address in configs/mcp_config.json is correct.
Error: "400 Client Error: Bad Request for url: http://localhost:3000/trpc/pushMessage"
Check if AgentScope Studio has been successfully started. Install it first with npm install -g @agentscope/studio, then start it with as_studio.
Optimization Recommendations
- For large-scale data processing, it is recommended to use DataJuicer’s distributed mode
- Set batch size appropriately to balance memory usage and processing speed
- For more advanced data processing features (synthesis, Data-Model Co-Development), please refer to DataJuicer documentation
Related Resources
AgentScope
The multi-agent framework powering DataJuicer Agent.
DataJuicer
The data processing engine with 200+ multimodal operators.
This documentation is based on the codebase at commit dba3b86, tested with agentscope==1.0.5 and py-data-juicer==1.4.2. For more features, including beta features (such as DJ-QA agents, interactive recipe), see https://datajuicer.github.io/data-juicer-agent.