agents.document_processing.agent

Comprehensive Document Processing Agent.

This agent provides end-to-end document processing capabilities, including:

- Document fetching with ReactAgent + search tools
- Auto-loading with bulk processing
- Transform/split/annotate/embed pipeline
- Advanced RAG features (refined queries, self-query, etc.)
- State management and persistence

The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.

Examples

Basic document processing:

import asyncio
agent = DocumentProcessingAgent()
# process_query is declared async, so run it on an event loop
result = asyncio.run(agent.process_query("Load and analyze reports from https://company.com/reports"))

Advanced RAG with custom retrieval:

import asyncio

config = DocumentProcessingConfig(
    retrieval_strategy="self_query",
    query_refinement=True,
    annotation_enabled=True,
    embedding_model="text-embedding-3-large",
)
agent = DocumentProcessingAgent(config=config)
# process_query is declared async, so run it on an event loop
result = asyncio.run(agent.process_query("Find all financial projections from Q4 2024"))

Multi-source document processing:

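import asyncio

sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    # A dict entry carries the URL plus per-request options such as headers
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
agent = DocumentProcessingAgent()
# process_sources is declared async, so run it on an event loop
result = asyncio.run(agent.process_sources(sources, query="Extract key insights"))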
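
The source list in this example mixes plain strings with a dictionary that carries request headers. This reference does not describe how the agent tells these apart internally, so the following is only a sketch of one way such entries could be classified before loading. The helper `classify_source` and the labels "web", "s3", and "local" are hypothetical and are not part of Haive.

```python
from typing import Any, Union
from urllib.parse import urlparse

Source = Union[str, dict[str, Any]]


def classify_source(source: Source) -> tuple[str, str]:
    """Return (kind, location) for one source entry.

    Hypothetical helper: the kinds "web", "s3", and "local" are
    illustrative labels, not values defined by Haive.
    """
    # Dict entries carry the location under "url" plus extra options.
    location = source["url"] if isinstance(source, dict) else source
    scheme = urlparse(location).scheme
    if scheme in ("http", "https"):
        return "web", location
    if scheme == "s3":
        return "s3", location
    # Anything without a recognized scheme is treated as a filesystem path.
    return "local", location
```

Classifying up front keeps the loading step free of type checks, at the cost of one extra pass over the list.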

Author: Claude (Haive AI Agent Framework)
Version: 1.0.0

Classes

`DocumentProcessingAgent`	Comprehensive document processing agent with full pipeline capabilities.
`DocumentProcessingConfig`	Configuration for comprehensive document processing.
`DocumentProcessingResult`	Result from document processing operation.
`DocumentProcessingState`	State for document processing operations.

Module Contents

class agents.document_processing.agent.DocumentProcessingAgent(config=None, engine=None, name='document_processor')

Comprehensive document processing agent with full pipeline capabilities.

This agent provides a complete document processing pipeline, including:

1. Document Discovery & Fetching (ReactAgent + search tools)
2. Auto-loading with bulk processing
3. Transform/split/annotate/embed pipeline
4. Advanced RAG features
5. State management and persistence

The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.

Initialize the document processing agent.

Parameters:

config (DocumentProcessingConfig | None) – Configuration for document processing
engine (haive.core.engine.aug_llm.AugLLMConfig | None) – LLM engine configuration
name (str) – Agent name for identification

get_capabilities()

Get agent capabilities and configuration.

Return type:

dict[str, Any]

async process_query(query, sources=None)

Process a query through the full document processing pipeline.

Parameters:

query (str) – The user query to process
sources (list[str | dict[str, Any]] | None) – Optional list of specific sources to use

Returns:

A DocumentProcessingResult holding the response along with its sources, documents, metadata, timing, and statistics

Return type:

DocumentProcessingResult
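
Because process_query is declared async, a caller needs an event loop to get the result. The sketch below shows the calling pattern with a stand-in class that mirrors the documented signature. `StubAgent` is hypothetical, and it returns a plain dict rather than a real DocumentProcessingResult.

```python
import asyncio
from typing import Any, Optional, Union

Source = Union[str, dict[str, Any]]


class StubAgent:
    """Stand-in that mirrors the documented process_query signature."""

    async def process_query(
        self, query: str, sources: Optional[list[Source]] = None
    ) -> dict[str, Any]:
        await asyncio.sleep(0)  # placeholder for real I/O
        return {"response": f"processed: {query}", "sources": sources or []}


async def main() -> dict[str, Any]:
    agent = StubAgent()
    # Inside a coroutine, await the call directly.
    return await agent.process_query("Summarize the onboarding guide")


# From synchronous code, asyncio.run drives the coroutine to completion.
result = asyncio.run(main())
```

Calling the method without `await` or `asyncio.run` yields a coroutine object, not a result.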

async process_sources(sources, query)

Process specific sources with a query.

Parameters:

sources (list[str | dict[str, Any]]) – List of sources to process
query (str) – Query to process against the sources

Returns:

A DocumentProcessingResult for the given sources and query

Return type:

DocumentProcessingResult

class agents.document_processing.agent.DocumentProcessingConfig(/, **data)

Bases: pydantic.BaseModel

Configuration for comprehensive document processing.

Parameters:

data (Any)

# Core Processing

auto_loader_config: Configuration for document auto-loading

enable_bulk_processing: Enable concurrent bulk document processing

max_concurrent_loads: Maximum number of concurrent document loads
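
max_concurrent_loads caps how many documents are loaded at once. This reference does not show how the cap is enforced. A common asyncio approach is a semaphore, sketched below under that assumption. The names `load_all` and `load_one` are hypothetical, and the sleep stands in for real fetching and parsing.

```python
import asyncio


async def load_all(sources: list[str], max_concurrent_loads: int) -> tuple[list[str], int]:
    """Load every source, never running more than the cap at once.

    Returns the loaded items (in input order) and the peak concurrency seen.
    """
    semaphore = asyncio.Semaphore(max_concurrent_loads)
    active = 0
    peak = 0

    async def load_one(source: str) -> str:
        nonlocal active, peak
        async with semaphore:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for fetching/parsing
            active -= 1
            return f"loaded:{source}"

    results = await asyncio.gather(*(load_one(s) for s in sources))
    return list(results), peak


docs, peak = asyncio.run(load_all([f"doc{i}" for i in range(10)], max_concurrent_loads=3))
```

All ten tasks are scheduled immediately, but the semaphore lets only three of them into the loading section at a time.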

# Search & Retrieval

search_enabled: Enable web search for document discovery

search_depth: Search depth for web queries (“basic” or “advanced”)

retrieval_strategy: Strategy for document retrieval

retrieval_config: Configuration for retrieval components

# Query Processing

query_refinement: Enable query refinement for better results

multi_query_enabled: Enable multiple query variations

query_expansion: Enable query expansion techniques

# Document Processing

annotation_enabled: Enable document annotation

summarization_enabled: Enable document summarization

kg_extraction_enabled: Enable knowledge graph extraction

# RAG Configuration

rag_strategy: RAG strategy to use

context_window_size: Context window size for RAG

chunk_size: Chunk size for document splitting

chunk_overlap: Overlap between chunks
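
chunk_size and chunk_overlap interact: with a fixed-window splitter, each chunk starts chunk_size minus chunk_overlap units after the previous one. This reference does not say which splitter Haive uses or whether sizes are counted in characters or tokens. The sketch below assumes characters and uses a hypothetical function name.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
# chunks == ["abcd", "defg", "ghij"]
```

The overlap has to be smaller than the chunk size, otherwise the window would never advance.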

# Embedding & Vectorization

embedding_model: Embedding model to use

vector_store_config: Vector store configuration

# Performance

enable_caching: Enable document caching

cache_ttl: Cache time-to-live in seconds

enable_streaming: Enable streaming responses
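
cache_ttl is given in seconds. This reference does not describe how Haive's document cache is implemented, so the following only sketches the general time-to-live idea. The `TTLCache` class is hypothetical. Its injectable clock makes expiry visible without waiting.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl seconds."""

    def __init__(self, ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        # Store the absolute expiry time alongside the value.
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # drop the stale entry
            return None
        return value


# Fake clock so the example is deterministic.
now = [0.0]
cache = TTLCache(ttl=3600, clock=lambda: now[0])
cache.set("https://company.com/reports", "parsed report")
fresh = cache.get("https://company.com/reports")
now[0] = 3601.0
stale = cache.get("https://company.com/reports")
```

Here `fresh` holds the cached value, while `stale` is None because the fake clock has moved past the one-hour TTL.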

# Output

structured_output: Enable structured output generation

response_format: Format for agent responses

include_sources: Include source information in responses

include_metadata: Include processing metadata

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class agents.document_processing.agent.DocumentProcessingResult(/, **data)

Bases: pydantic.BaseModel

Result from document processing operation.

Parameters:

data (Any)

response: Main response content

sources: List of source documents used

metadata: Processing metadata

documents: Processed documents

query_info: Information about query processing

timing: Timing information

statistics: Processing statistics

class agents.document_processing.agent.DocumentProcessingState(/, **data)

Bases: haive.core.schema.prebuilt.messages_state.MessagesState

State for document processing operations.

Extends MessagesState with document-specific fields for tracking document processing workflows.

Parameters:

data (Any)