agents.document.agent¶
Document Agent for comprehensive document processing pipeline.
This module provides the DocumentAgent class, which implements the full document processing pipeline: FETCH -> LOAD -> TRANSFORM -> SPLIT -> ANNOTATE -> EMBED -> STORE -> RETRIEVE.
The agent leverages the Document Engine from haive-core to handle 97+ document types and sources with advanced processing capabilities including chunking, metadata extraction, and parallel processing.
- Classes:
DocumentAgent: Main agent for document processing pipeline
DocumentProcessingResult: Structured result from document processing
Examples
Basic usage:
```python
from haive.agents.document import DocumentAgent
from haive.core.engine.document import DocumentEngineConfig

agent = DocumentAgent(name="doc_processor")
result = agent.process_sources(["document.pdf"])
print(f"Processed {result.total_documents} documents")
```
Advanced configuration:
```python
config = DocumentEngineConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    parallel_processing=True,
    max_workers=8,
)
agent = DocumentAgent(name="enterprise_processor", engine=config)
result = agent.process_directory("/path/to/documents")
```
See also
DocumentEngine: Core processing engine
Agent: Base agent class
Classes¶
| DocumentAgent | Comprehensive Document Processing Agent. |
| DocumentProcessingResult | Comprehensive result of document processing pipeline. |
Module Contents¶
- class agents.document.agent.DocumentAgent¶
Bases:
haive.agents.base.agent.Agent
Comprehensive Document Processing Agent.
This agent implements the complete document processing pipeline for enterprise-grade document ingestion and analysis. It provides a unified interface for the full document lifecycle from source discovery to retrievable storage.
## Processing Pipeline:
FETCH: Discover and validate document sources (files, URLs, databases, cloud)
LOAD: Load documents using 97+ specialized loaders with auto-detection
TRANSFORM: Normalize content, extract metadata, detect language/encoding
SPLIT: Apply intelligent chunking strategies (recursive, semantic, paragraph)
ANNOTATE: Extract and enrich metadata, apply content classification
EMBED: Generate vector embeddings for semantic search (optional)
STORE: Persist to vector databases and document stores (optional)
RETRIEVE: Enable search and retrieval capabilities (optional)
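The last three stages are opt-in. A minimal sketch of toggling them, assuming the enable_* flags documented under Configuration Options below are accepted as DocumentAgent keyword arguments:
```python
from haive.agents.document import DocumentAgent

# Flag names follow the enable_embedding / enable_storage /
# enable_retrieval parameters documented below; the source path
# is a placeholder.
agent = DocumentAgent(
    name="full_pipeline",
    enable_embedding=True,   # EMBED: generate vector embeddings
    enable_storage=True,     # STORE: persist to a vector store
    enable_retrieval=True,   # RETRIEVE: expose search over stored docs
)
result = agent.process_sources(["document.pdf"])
```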
## Key Features:
Universal Source Support: 97+ document sources including files, URLs, databases, cloud storage
Intelligent Processing: Auto-detection of source types and optimal processing strategies
Parallel Processing: Multi-threaded processing for high-throughput scenarios
Advanced Chunking: Multiple strategies including recursive, semantic, and paragraph-based
Metadata Enrichment: Comprehensive metadata extraction and annotation
Error Resilience: Graceful error handling with detailed reporting
Extensible Pipeline: Configurable stages for custom processing workflows
## Configuration Options:
- Parameters:
engine – Document engine configuration for processing pipeline
Source Configuration:
source_types – List of allowed source types (files, URLs, databases, etc.)
auto_detect_sources – Whether to auto-detect source types
max_sources – Maximum number of sources to process
Processing Configuration:
processing_strategy – Processing strategy (simple, enhanced, parallel)
parallel_processing – Enable parallel processing
max_workers – Maximum worker threads for parallel processing
Chunking Configuration:
chunking_strategy – Strategy for document chunking
chunk_size – Size of chunks in characters
chunk_overlap – Overlap between consecutive chunks
Content Processing:
normalize_content – Whether to normalize content (whitespace, encoding)
extract_metadata – Whether to extract document metadata
detect_language – Whether to detect document language
Pipeline Stages:
enable_embedding – Whether to generate embeddings
enable_storage – Whether to store in vector database
enable_retrieval – Whether to enable retrieval capabilities
Error Handling:
raise_on_error – Whether to raise exceptions on errors
skip_invalid – Whether to skip invalid documents
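For instance, a lenient configuration that skips bad inputs rather than raising, using the error-handling parameters above (a sketch; the source paths are placeholders):
```python
from haive.agents.document import DocumentAgent

# Lenient setup: skip unreadable documents and report them in the
# result instead of raising (parameter names as documented above).
agent = DocumentAgent(
    name="lenient_processor",
    raise_on_error=False,
    skip_invalid=True,
    max_sources=500,
)
result = agent.process_sources(["reports/", "notes.txt"])
```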
## Example Usage:
### Basic Document Processing:
```python
from haive.agents.document import DocumentAgent
from haive.core.engine.document import DocumentEngineConfig

# Create agent for PDF processing
agent = DocumentAgent(
    engine=DocumentEngineConfig(
        chunking_strategy=ChunkingStrategy.PARAGRAPH,
        chunk_size=1000,
        parallel_processing=True,
    )
)

# Process a single document
result = agent.invoke({
    "source": "document.pdf",
    "extract_metadata": True,
})
```
### Multi-Source Processing:
```python
# Process multiple sources with different types
agent = DocumentAgent.create_for_enterprise()
result = agent.process_sources([
    "documents/reports/",            # Directory
    "https://example.com/api/docs",  # Web API
    "s3://bucket/documents/",        # Cloud storage
    "postgresql://db/documents",     # Database
])
```
### Custom Pipeline Configuration:
```python
# Configure custom processing pipeline
agent = DocumentAgent(
    processing_strategy=ProcessingStrategy.ENHANCED,
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1500,
    enable_embedding=True,
    enable_storage=True,
    normalize_content=True,
    detect_language=True,
)
```
## Document Source Types Supported:
Files: PDF, DOCX, TXT, HTML, MD, JSON, CSV, XML, YAML, and 50+ more
Web: HTTP/HTTPS URLs, APIs, web scraping, RSS feeds
Cloud: AWS S3, Google Cloud Storage, Azure Blob, Dropbox, Box
Databases: PostgreSQL, MySQL, MongoDB, Elasticsearch, and 15+ more
Chat/Messaging: Slack, Discord, WhatsApp, Telegram exports
Knowledge Bases: Notion, Confluence, Obsidian, Roam Research
Version Control: Git repositories, GitHub, GitLab
Archives: ZIP, TAR, 7Z with recursive extraction
## Processing Strategies:
Simple: Basic loading and chunking for development/testing
Enhanced: Full metadata extraction, content normalization, language detection
Parallel: Multi-threaded processing for high-throughput production use
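For instance, combining the enhanced strategy with parallel execution (a sketch; the ProcessingStrategy import path is an assumption, as this page does not show it):
```python
from haive.agents.document import DocumentAgent
# Assumed import path; this page references ProcessingStrategy but
# does not show its module.
from haive.core.engine.document import ProcessingStrategy

agent = DocumentAgent(
    processing_strategy=ProcessingStrategy.ENHANCED,  # full metadata + normalization
    parallel_processing=True,                         # multi-threaded throughput
    max_workers=8,
)
```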
## Chunking Strategies:
None: No chunking, process documents as whole units
Fixed Size: Fixed character-based chunks with overlap
Recursive: Hierarchical splitting using multiple separators
Paragraph: Split on paragraph boundaries with size limits
Sentence: Split on sentence boundaries with size limits
Semantic: AI-powered semantic boundary detection (experimental)
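As an illustration, paragraph chunking with overlap (a sketch; the ChunkingStrategy import path is an assumption, and validate_chunk_overlap below enforces overlap < size):
```python
from haive.agents.document import DocumentAgent
# Assumed import path; this page references ChunkingStrategy but
# does not show its module.
from haive.core.engine.document import ChunkingStrategy

agent = DocumentAgent(
    chunking_strategy=ChunkingStrategy.PARAGRAPH,
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # must stay below chunk_size (see validate_chunk_overlap)
)
```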
Note
This agent integrates with the haive-core Document Engine which provides the underlying processing capabilities. The agent adds workflow orchestration, error handling, and result aggregation on top of the engine.
- analyze_source_structure(source, **kwargs)¶
Analyze the structure of a source without full processing.
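A usage sketch (the shape of the return value is not documented here, so the print is illustrative; the source URI is a placeholder):
```python
from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="inspector")
# Probe the source cheaply before committing to a full pipeline run.
structure = agent.analyze_source_structure("s3://bucket/documents/")
print(structure)
```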
- build_graph()¶
Build the document processing graph.
- Return type:
haive.core.graph.state_graph.base_graph2.BaseGraph
- classmethod create_for_databases(name='Database Document Agent', **kwargs)¶
Create DocumentAgent optimized for database document processing.
- Parameters:
name (str)
- Return type:
DocumentAgent
- classmethod create_for_enterprise(name='Enterprise Document Agent', **kwargs)¶
Create DocumentAgent optimized for enterprise-scale processing.
- Parameters:
name (str)
- Return type:
DocumentAgent
- classmethod create_for_pdfs(name='PDF Document Agent', chunk_size=1000, **kwargs)¶
Create DocumentAgent optimized for PDF processing.
- Parameters:
name (str)
chunk_size (int)
- Return type:
DocumentAgent
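A quick sketch using this constructor (file names are placeholders):
```python
from haive.agents.document import DocumentAgent

# Preconfigured for PDF-heavy workloads.
agent = DocumentAgent.create_for_pdfs(chunk_size=800)
result = agent.process_sources(["paper.pdf", "manual.pdf"])
print(f"Processed {result.total_documents} documents")
```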
- classmethod create_for_research(name='Research Document Agent', **kwargs)¶
Create DocumentAgent optimized for research document processing.
- Parameters:
name (str)
- Return type:
DocumentAgent
- classmethod create_for_web_scraping(name='Web Scraping Agent', **kwargs)¶
Create DocumentAgent optimized for web content processing.
- Parameters:
name (str)
- Return type:
DocumentAgent
- process_cloud_storage(cloud_paths, **kwargs)¶
Process documents from cloud storage.
- Parameters:
cloud_paths – Cloud storage paths to process
- Returns:
DocumentProcessingResult with cloud processing results
- Return type:
DocumentProcessingResult
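A sketch; the bucket URIs are placeholders, and the gs:// scheme for Google Cloud Storage is an assumption based on the supported sources listed above:
```python
from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_enterprise()
result = agent.process_cloud_storage([
    "s3://my-bucket/contracts/",  # AWS S3
    "gs://my-bucket/reports/",    # Google Cloud Storage (assumed scheme)
])
```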
- process_directory(directory_path, recursive=True, include_patterns=None, exclude_patterns=None, **kwargs)¶
Process all documents in a directory.
- Parameters:
directory_path – Path to the directory to process
recursive (bool) – Whether to descend into subdirectories
include_patterns – Filename patterns to include
exclude_patterns – Filename patterns to exclude
- Returns:
DocumentProcessingResult with directory processing results
- Return type:
DocumentProcessingResult
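A sketch; glob-style patterns are an assumption, since the pattern syntax is not documented on this page:
```python
from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="dir_processor")
result = agent.process_directory(
    "/path/to/documents",
    recursive=True,
    include_patterns=["*.pdf", "*.md"],  # assumed glob syntax
    exclude_patterns=["*draft*"],
)
```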
- process_sources(sources, **kwargs)¶
Process multiple document sources through the full pipeline.
- Parameters:
sources – Document sources to process (files, URLs, databases, etc.)
- Returns:
DocumentProcessingResult with comprehensive pipeline results
- Return type:
DocumentProcessingResult
- process_urls(urls, **kwargs)¶
Process documents from web URLs.
- Parameters:
urls – Web URLs to process
- Returns:
DocumentProcessingResult with web processing results
- Return type:
DocumentProcessingResult
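A sketch with placeholder URLs:
```python
from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_web_scraping()
result = agent.process_urls([
    "https://example.com/docs/",
    "https://example.com/feed.rss",
])
```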
- setup_agent()¶
Configure the document engine with agent settings.
- Return type:
None
- classmethod validate_chunk_overlap(v, info)¶
Ensure chunk overlap is less than chunk size.
- Return type:
Any
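A sketch of the constraint this validator enforces, assuming it runs as a Pydantic field validator during agent construction:
```python
from pydantic import ValidationError

from haive.agents.document import DocumentAgent

try:
    # chunk_overlap must be smaller than chunk_size, so this
    # construction should be rejected by validate_chunk_overlap.
    DocumentAgent(name="bad_chunks", chunk_size=500, chunk_overlap=500)
except ValidationError as exc:
    print(exc)
```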
- classmethod validate_engine_type(v)¶
Ensure engine is DocumentEngineConfig.
- Return type:
Any
- class agents.document.agent.DocumentProcessingResult(/, **data)¶
Bases:
pydantic.BaseModel
Comprehensive result of document processing pipeline.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- Parameters:
data (Any)
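A minimal access sketch; total_documents is the only field referenced elsewhere on this page, so other attributes are not shown:
```python
from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")
result = agent.process_sources(["document.pdf"])
# DocumentProcessingResult exposes aggregate pipeline statistics;
# total_documents is the documented example.
print(result.total_documents)
```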