agents.document_processing.agent
Comprehensive Document Processing Agent.
This agent provides end-to-end document processing capabilities, including:
- Document fetching with ReactAgent and search tools
- Auto-loading with bulk processing
- Transform/split/annotate/embed pipeline
- Advanced RAG features (refined queries, self-query, etc.)
- State management and persistence
The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.
Examples
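Basic document processing (process_query is a coroutine, so it is run through asyncio here):
import asyncio
from agents.document_processing.agent import DocumentProcessingAgent
agent = DocumentProcessingAgent()
result = asyncio.run(agent.process_query("Load and analyze reports from https://company.com/reports"))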
Advanced RAG with custom retrieval:
import asyncio
from agents.document_processing.agent import DocumentProcessingAgent, DocumentProcessingConfig
config = DocumentProcessingConfig(
    retrieval_strategy="self_query",
    query_refinement=True,
    annotation_enabled=True,
    embedding_model="text-embedding-3-large",
)
agent = DocumentProcessingAgent(config=config)
result = asyncio.run(agent.process_query("Find all financial projections from Q4 2024"))
Multi-source document processing:
sources = [
    "/path/to/local/docs/",
    "https://wiki.company.com/procedures",
    "s3://bucket/documents/",
    {"url": "https://api.service.com/docs", "headers": {"Authorization": "Bearer token"}},
]
agent = DocumentProcessingAgent()
result = agent.process_sources(sources, query="Extract key insights")
Author: Claude (Haive AI Agent Framework)
Version: 1.0.0
Classes
DocumentProcessingAgent | Comprehensive document processing agent with full pipeline capabilities.
DocumentProcessingConfig | Configuration for comprehensive document processing.
DocumentProcessingResult | Result from document processing operation.
DocumentProcessingState | State for document processing operations.
Module Contents
- class agents.document_processing.agent.DocumentProcessingAgent(config=None, engine=None, name='document_processor')
Comprehensive document processing agent with full pipeline capabilities.
This agent provides a complete document processing pipeline including:
1. Document Discovery & Fetching (ReactAgent + search tools)
2. Auto-loading with bulk processing
3. Transform/split/annotate/embed pipeline
4. Advanced RAG features
5. State management and persistence
The agent integrates all existing Haive document processing components into a unified, powerful system for document-based AI workflows.
Initialize the document processing agent.
- Parameters:
config (DocumentProcessingConfig | None) – Configuration for document processing
engine (haive.core.engine.aug_llm.AugLLMConfig | None) – LLM engine configuration
name (str) – Agent name for identification
- async process_query(query, sources=None)
Process a query through the full document processing pipeline. Because this method is a coroutine, it must be awaited or run through an event loop.
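Since process_query is declared async, calling it returns a coroutine rather than a result. The following is a minimal sketch of awaiting it, using a hypothetical StubAgent in place of the real DocumentProcessingAgent so that it runs without Haive installed. The dictionary return value is an assumption made for illustration only; this page does not state what process_query returns.

```python
import asyncio


class StubAgent:
    """Hypothetical stand-in for DocumentProcessingAgent (not the Haive class)."""

    async def process_query(self, query, sources=None):
        # Simulate asynchronous work such as fetching and retrieval.
        await asyncio.sleep(0)
        # Assumed return shape, for illustration only.
        return {"response": f"answer to: {query}", "sources": list(sources or [])}


async def main():
    agent = StubAgent()
    # A single query: the coroutine must be awaited to obtain the result.
    single = await agent.process_query("Summarize the Q4 report")
    # Several queries can be awaited together; gather preserves input order.
    batch = await asyncio.gather(
        agent.process_query("First question", sources=["/path/to/local/docs/"]),
        agent.process_query("Second question"),
    )
    return single, batch


# asyncio.run starts an event loop, runs main() to completion, and closes the loop.
single, batch = asyncio.run(main())
print(single["response"])
```

In a synchronous script, asyncio.run is the usual entry point; inside code that is already async, a plain await is enough.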
- class agents.document_processing.agent.DocumentProcessingConfig(/, **data)
Bases: pydantic.BaseModel
Configuration for comprehensive document processing.
- Parameters:
data (Any)
- # Core Processing
- auto_loader_config
Configuration for document auto-loading
- enable_bulk_processing
Enable concurrent bulk document processing
- max_concurrent_loads
Maximum concurrent document loads
- # Search & Retrieval
- search_enabled
Enable web search for document discovery
- search_depth
Search depth for web queries (“basic” or “advanced”)
- retrieval_strategy
Strategy for document retrieval
- retrieval_config
Configuration for retrieval components
- # Query Processing
- query_refinement
Enable query refinement for better results
- multi_query_enabled
Enable multiple query variations
- query_expansion
Enable query expansion techniques
- # Document Processing
- annotation_enabled
Enable document annotation
- summarization_enabled
Enable document summarization
- kg_extraction_enabled
Enable knowledge graph extraction
- # RAG Configuration
- rag_strategy
RAG strategy to use
- context_window_size
Context window size for RAG
- chunk_size
Chunk size for document splitting
- chunk_overlap
Overlap between chunks
- # Embedding & Vectorization
- embedding_model
Embedding model to use
- vector_store_config
Vector store configuration
- # Performance
- enable_caching
Enable document caching
- cache_ttl
Cache time-to-live in seconds
- enable_streaming
Enable streaming responses
- # Output
- structured_output
Enable structured output generation
- response_format
Format for agent responses
- include_sources
Include source information in responses
- include_metadata
Include processing metadata
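To illustrate what a limit such as max_concurrent_loads typically means, here is a minimal sketch of bounded concurrent loading with asyncio.Semaphore. It is not the Haive implementation: load_all and load_one are hypothetical names, and the sleep stands in for real I/O. The peak counter exists only to show that the bound holds.

```python
import asyncio


async def load_all(sources, max_concurrent_loads=3):
    """Load every source, never running more than max_concurrent_loads at once."""
    semaphore = asyncio.Semaphore(max_concurrent_loads)
    active = 0  # loads currently in flight
    peak = 0    # highest number of simultaneous loads observed

    async def load_one(source):
        nonlocal active, peak
        async with semaphore:  # waits here while the limit is reached
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # placeholder for real I/O (hypothetical)
            active -= 1
            return f"loaded:{source}"

    # gather schedules all loads; the semaphore throttles how many run together.
    results = await asyncio.gather(*(load_one(s) for s in sources))
    return results, peak


results, peak = asyncio.run(
    load_all([f"doc{i}" for i in range(10)], max_concurrent_loads=3)
)
print(len(results), "documents loaded; peak concurrency", peak)
```

A semaphore keeps the code simple because every load is still scheduled up front and results come back in input order; only the number of loads running at the same time is capped.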
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
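The interaction between chunk_size and chunk_overlap can be sketched as a sliding window over the text. This is an illustrative stand-in, not the splitter Haive uses: split_text is a hypothetical helper, and measuring sizes in characters is an assumption, since this page does not state whether the fields count characters or tokens.

```python
def split_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last chunk_overlap characters of the previous one, so content that straddles a boundary still appears intact in at least one chunk. The overlap must stay below chunk_size, otherwise the window would never advance.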
- class agents.document_processing.agent.DocumentProcessingResult(/, **data)
Bases: pydantic.BaseModel
Result from document processing operation.
- Parameters:
data (Any)
- response
Main response content
- sources
List of source documents used
- metadata
Processing metadata
- documents
Processed documents
- query_info
Information about query processing
- timing
Timing information
- statistics
Processing statistics
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- class agents.document_processing.agent.DocumentProcessingState(/, **data)
Bases: haive.core.schema.prebuilt.messages_state.MessagesState
State for document processing operations.
Extends MessagesState with document-specific fields for tracking document processing workflows.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
data (Any)