agents.document_processing.agent

Comprehensive Document Processing Agent.

This agent provides end-to-end document processing capabilities, including:
  • Document fetching with ReactAgent + search tools
  • Auto-loading with bulk processing
  • Transform/split/annotate/embed pipeline
  • Advanced RAG features (refined queries, self-query, etc.)
  • State management and persistence

The agent combines the existing Haive document processing components into a single system for document-based AI workflows.

Examples

Basic document processing:

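import asyncio
from agents.document_processing.agent import DocumentProcessingAgent
agent = DocumentProcessingAgent()
result = asyncio.run(agent.process_query("Load and analyze reports from https://company.com/reports"))  # process_query is async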

Advanced RAG with custom retrieval:

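import asyncio
from agents.document_processing.agent import DocumentProcessingAgent, DocumentProcessingConfig

config = DocumentProcessingConfig(
    retrieval_strategy="self_query",
    query_refinement=True,
    annotation_enabled=True,
    embedding_model="text-embedding-3-large",
)
agent = DocumentProcessingAgent(config=config)
# process_query is async, so run it on an event loop
result = asyncio.run(agent.process_query("Find all financial projections from Q4 2024"))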

Multi-source document processing:

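import asyncio
from agents.document_processing.agent import DocumentProcessingAgent

# Sources may be plain strings or dicts carrying extra request options
sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
agent = DocumentProcessingAgent()
# process_sources is async, so run it on an event loop
result = asyncio.run(agent.process_sources(sources, query="Extract key insights"))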

Author: Claude (Haive AI Agent Framework)
Version: 1.0.0

Classes

DocumentProcessingAgent

Comprehensive document processing agent with full pipeline capabilities.

DocumentProcessingConfig

Configuration for comprehensive document processing.

DocumentProcessingResult

Result from document processing operation.

DocumentProcessingState

State for document processing operations.

Module Contents

class agents.document_processing.agent.DocumentProcessingAgent(config=None, engine=None, name='document_processor')

Comprehensive document processing agent with full pipeline capabilities.

This agent provides a complete document processing pipeline including:
  1. Document Discovery & Fetching (ReactAgent + search tools)
  2. Auto-loading with bulk processing
  3. Transform/split/annotate/embed pipeline
  4. Advanced RAG features
  5. State management and persistence

The agent combines the existing Haive document processing components into a single system for document-based AI workflows.

Initialize the document processing agent.

Parameters:
  • config (DocumentProcessingConfig | None) – Configuration for document processing

  • engine (haive.core.engine.aug_llm.AugLLMConfig | None) – LLM engine configuration

  • name (str) – Agent name for identification

get_capabilities()

Get agent capabilities and configuration.

Return type:

dict[str, Any]

async process_query(query, sources=None)

Process a query through the full document processing pipeline.

Parameters:
  • query (str) – The user query to process

  • sources (list[str | dict[str, Any]] | None) – Optional list of specific sources to use

Returns:

A DocumentProcessingResult containing the response, the sources used, and processing metadata

Return type:

DocumentProcessingResult
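
Because process_query is declared async, calling it returns a coroutine that has to be awaited or handed to an event loop. The sketch below illustrates only that calling pattern. StubAgent, its echoed reply, and the /tmp/q4.pdf path are hypothetical stand-ins so the snippet runs without Haive installed; they say nothing about what the real agent returns.

```python
import asyncio


class StubAgent:
    """Hypothetical stand-in that mimics only the async signature of process_query."""

    async def process_query(self, query, sources=None):
        # A real agent would fetch, load, and retrieve here; this stub just echoes.
        await asyncio.sleep(0)
        return {"response": f"echo: {query}", "sources": list(sources or [])}


async def main():
    agent = StubAgent()
    # Inside a coroutine, await the call directly.
    return await agent.process_query("Summarize the Q4 report", sources=["/tmp/q4.pdf"])


# From synchronous code, asyncio.run drives the coroutine to completion.
result = asyncio.run(main())
```

Inside an application that already runs an event loop (for example a web handler), use await rather than asyncio.run, since asyncio.run cannot be called from a running loop.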

async process_sources(sources, query)

Process specific sources with a query.

Parameters:
  • sources (list[str | dict[str, Any]]) – List of sources to process

  • query (str) – Query to process against the sources

Returns:

A DocumentProcessingResult containing the response, the sources used, and processing metadata

Return type:

DocumentProcessingResult
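
The sources parameter accepts a mix of plain strings and dicts. One way to think about such a list is to reduce every entry to a common shape before loading. The following is a minimal sketch of that idea, not the agent's actual logic: normalize_source and the "kind" labels are hypothetical, and the assumption that dict entries keep their location under "url" is taken from the multi-source example in the module description.

```python
from urllib.parse import urlparse


def normalize_source(entry):
    """Return a uniform {"kind", "location", "options"} dict for one source entry."""
    if isinstance(entry, dict):
        # Assumption: dict entries hold the location under "url"; the rest are options.
        options = {k: v for k, v in entry.items() if k != "url"}
        location = entry["url"]
    else:
        options = {}
        location = entry
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        kind = "web"
    elif scheme == "s3":
        kind = "s3"
    else:
        # No recognized scheme: treat the entry as a local path.
        kind = "local"
    return {"kind": kind, "location": location, "options": options}


sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
normalized = [normalize_source(s) for s in sources]
```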

class agents.document_processing.agent.DocumentProcessingConfig(/, **data)

Bases: pydantic.BaseModel

Configuration for comprehensive document processing.

Parameters:

data (Any)

# Core Processing
auto_loader_config

Configuration for document auto-loading

enable_bulk_processing

Enable concurrent bulk document processing

max_concurrent_loads

Maximum concurrent document loads
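
A limit such as max_concurrent_loads is commonly enforced with a semaphore around each load. The sketch below shows that general technique with asyncio; load_all, load_one, and the sleep that stands in for I/O are hypothetical, and the source does not state how the agent itself applies the limit.

```python
import asyncio


async def load_all(paths, max_concurrent_loads=3):
    """Load every path while keeping at most max_concurrent_loads in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_loads)
    in_flight = 0
    peak = 0  # highest number of simultaneous loads observed

    async def load_one(path):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # placeholder for real I/O
            in_flight -= 1
            return f"loaded:{path}"

    # gather preserves input order in its result list.
    results = await asyncio.gather(*(load_one(p) for p in paths))
    return results, peak


results, peak = asyncio.run(
    load_all([f"doc{i}.txt" for i in range(10)], max_concurrent_loads=3)
)
```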

# Search & Retrieval
search_enabled

Enable web search for document discovery

search_depth

Search depth for web queries (“basic” or “advanced”)

retrieval_strategy

Strategy for document retrieval

retrieval_config

Configuration for retrieval components

# Query Processing
query_refinement

Enable query refinement for better results

multi_query_enabled

Enable multiple query variations

query_expansion

Enable query expansion techniques

# Document Processing
annotation_enabled

Enable document annotation

summarization_enabled

Enable document summarization

kg_extraction_enabled

Enable knowledge graph extraction

# RAG Configuration
rag_strategy

RAG strategy to use

context_window_size

Context window size for RAG

chunk_size

Chunk size for document splitting

chunk_overlap

Overlap between chunks
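
To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal sliding-window splitter. It is an illustration only: split_text is a hypothetical helper, it measures sizes in characters, and the source does not say whether the agent counts characters or tokens or which splitter it uses.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

Each chunk starts chunk_size - chunk_overlap characters after the previous one, so the overlap must stay strictly smaller than the chunk size or the window would never advance.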

# Embedding & Vectorization
embedding_model

Embedding model to use

vector_store_config

Vector store configuration

# Performance
enable_caching

Enable document caching

cache_ttl

Cache time-to-live in seconds

enable_streaming

Enable streaming responses
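
The cache_ttl setting above expresses a time-to-live in seconds. As a rough sketch of what a TTL means in practice, the class below expires entries once their age reaches the limit. TTLCache and its injectable clock are hypothetical and are not part of the Haive API; the agent's real caching layer may behave differently.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries = {}  # key -> (expiry_time, value)

    def set(self, key, value):
        self._entries[key] = (self._clock() + self.ttl_seconds, value)

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return default
        return value


now = [0.0]  # fake clock, advanced by hand
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.set("report.pdf", "parsed text")
fresh = cache.get("report.pdf")
now[0] = 61.0
stale = cache.get("report.pdf")
```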

# Output
structured_output

Enable structured output generation

response_format

Format for agent responses

include_sources

Include source information in responses

include_metadata

Include processing metadata

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class agents.document_processing.agent.DocumentProcessingResult(/, **data)

Bases: pydantic.BaseModel

Result from document processing operation.

Parameters:

data (Any)

response

Main response content

sources

List of source documents used

metadata

Processing metadata

documents

Processed documents

query_info

Information about query processing

timing

Timing information

statistics

Processing statistics

class agents.document_processing.agent.DocumentProcessingState(/, **data)

Bases: haive.core.schema.prebuilt.messages_state.MessagesState

State for document processing operations.

Extends MessagesState with document-specific fields for tracking document processing workflows.

Parameters:

data (Any)