agents.document.agent

Document Agent for comprehensive document processing pipeline.

This module provides the DocumentAgent class, which implements the full document processing pipeline: FETCH -> LOAD -> TRANSFORM -> SPLIT -> ANNOTATE -> EMBED -> STORE -> RETRIEVE.

The agent leverages the Document Engine from haive-core to handle 97+ document types and sources with advanced processing capabilities including chunking, metadata extraction, and parallel processing.

Classes:

DocumentAgent: Main agent for the document processing pipeline

DocumentProcessingResult: Structured result from document processing

Examples

Basic usage:

from haive.agents.document import DocumentAgent
from haive.core.engine.document import DocumentEngineConfig

agent = DocumentAgent(name="doc_processor")
result = agent.process_sources(["document.pdf"])
print(f"Processed {result.total_documents} documents")

Advanced configuration:

# Assumption: ChunkingStrategy is importable from the same module as DocumentEngineConfig.
from haive.core.engine.document import ChunkingStrategy

config = DocumentEngineConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    parallel_processing=True,
    max_workers=8
)
agent = DocumentAgent(name="enterprise_processor", engine=config)
result = agent.process_directory("/path/to/documents")

See also

  • DocumentEngine: Core processing engine

  • Agent: Base agent class

Classes

DocumentAgent

Comprehensive Document Processing Agent.

DocumentProcessingResult

Comprehensive result of document processing pipeline.

Module Contents

class agents.document.agent.DocumentAgent

Bases: haive.agents.base.agent.Agent

Comprehensive Document Processing Agent.

This agent implements the complete document processing pipeline for enterprise-grade document ingestion and analysis. It provides a unified interface for the full document lifecycle from source discovery to retrievable storage.

## Processing Pipeline:

  1. FETCH: Discover and validate document sources (files, URLs, databases, cloud)

  2. LOAD: Load documents using 97+ specialized loaders with auto-detection

  3. TRANSFORM: Normalize content, extract metadata, detect language/encoding

  4. SPLIT: Apply intelligent chunking strategies (recursive, semantic, paragraph)

  5. ANNOTATE: Extract and enrich metadata, apply content classification

  6. EMBED: Generate vector embeddings for semantic search (optional)

  7. STORE: Persist to vector databases and document stores (optional)

  8. RETRIEVE: Enable search and retrieval capabilities (optional)
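
The optional tail stages map directly onto agent configuration flags (see Configuration Options below). As a minimal sketch, using only parameter names listed in this document:

from haive.agents.document import DocumentAgent

# Sketch: stop after ANNOTATE by leaving the optional stages disabled.
light_agent = DocumentAgent(
    name="split_and_annotate_only",
    enable_embedding=False,
    enable_storage=False,
    enable_retrieval=False,
)

# Sketch: run the full pipeline including EMBED, STORE and RETRIEVE.
full_agent = DocumentAgent(
    name="full_pipeline",
    enable_embedding=True,
    enable_storage=True,
    enable_retrieval=True,
)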

## Key Features:

  • Universal Source Support: 97+ document sources including files, URLs, databases, cloud storage

  • Intelligent Processing: Auto-detection of source types and optimal processing strategies

  • Parallel Processing: Multi-threaded processing for high-throughput scenarios

  • Advanced Chunking: Multiple strategies including recursive, semantic, and paragraph-based

  • Metadata Enrichment: Comprehensive metadata extraction and annotation

  • Error Resilience: Graceful error handling with detailed reporting

  • Extensible Pipeline: Configurable stages for custom processing workflows

## Configuration Options:

Parameters:
  • engine – Document engine configuration for processing pipeline

Source configuration:

  • source_types – List of allowed source types (files, URLs, databases, etc.)

  • auto_detect_sources – Whether to auto-detect source types

  • max_sources – Maximum number of sources to process

Processing configuration:

  • processing_strategy – Processing strategy (simple, enhanced, parallel)

  • parallel_processing – Enable parallel processing

  • max_workers – Maximum worker threads for parallel processing

Chunking configuration:

  • chunking_strategy – Strategy for document chunking

  • chunk_size – Size of chunks in characters

  • chunk_overlap – Overlap between consecutive chunks

Content processing:

  • normalize_content – Whether to normalize content (whitespace, encoding)

  • extract_metadata – Whether to extract document metadata

  • detect_language – Whether to detect document language

Pipeline stages:

  • enable_embedding – Whether to generate embeddings

  • enable_storage – Whether to store in vector database

  • enable_retrieval – Whether to enable retrieval capabilities

Error handling:

  • raise_on_error – Whether to raise exceptions on errors

  • skip_invalid – Whether to skip invalid documents

## Example Usage:

### Basic Document Processing:

from haive.agents.document import DocumentAgent
# Assumption: ChunkingStrategy is importable alongside DocumentEngineConfig.
from haive.core.engine.document import ChunkingStrategy, DocumentEngineConfig

# Create agent for PDF processing
agent = DocumentAgent(
    engine=DocumentEngineConfig(
        chunking_strategy=ChunkingStrategy.PARAGRAPH,
        chunk_size=1000,
        parallel_processing=True
    )
)

# Process a single document
result = agent.invoke({
    "source": "document.pdf",
    "extract_metadata": True
})

### Multi-Source Processing:

# Process multiple sources with different types
agent = DocumentAgent.create_for_enterprise()

result = agent.process_sources([
    "documents/reports/",            # Directory
    "https://example.com/api/docs",  # Web API
    "s3://bucket/documents/",        # Cloud storage
    "postgresql://db/documents"      # Database
])

### Custom Pipeline Configuration:

# Configure custom processing pipeline
# Assumption: ProcessingStrategy shares the ChunkingStrategy import path.
from haive.core.engine.document import ChunkingStrategy, ProcessingStrategy

agent = DocumentAgent(
    processing_strategy=ProcessingStrategy.ENHANCED,
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1500,
    enable_embedding=True,
    enable_storage=True,
    normalize_content=True,
    detect_language=True
)

## Document Source Types Supported:

  • Files: PDF, DOCX, TXT, HTML, MD, JSON, CSV, XML, YAML, and 50+ more

  • Web: HTTP/HTTPS URLs, APIs, web scraping, RSS feeds

  • Cloud: AWS S3, Google Cloud Storage, Azure Blob, Dropbox, Box

  • Databases: PostgreSQL, MySQL, MongoDB, Elasticsearch, and 15+ more

  • Chat/Messaging: Slack, Discord, WhatsApp, Telegram exports

  • Knowledge Bases: Notion, Confluence, Obsidian, Roam Research

  • Version Control: Git repositories, GitHub, GitLab

  • Archives: ZIP, TAR, 7Z with recursive extraction

## Processing Strategies:

  • Simple: Basic loading and chunking for development/testing

  • Enhanced: Full metadata extraction, content normalization, language detection

  • Parallel: Multi-threaded processing for high-throughput production use
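
A hedged sketch of selecting the parallel strategy follows; the ProcessingStrategy import path and enum member name are assumptions based on the strategies listed above:

from haive.agents.document import DocumentAgent
# Assumption: ProcessingStrategy is importable from the document engine module.
from haive.core.engine.document import ProcessingStrategy

agent = DocumentAgent(
    name="high_throughput",
    processing_strategy=ProcessingStrategy.PARALLEL,
    parallel_processing=True,
    max_workers=8,
)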

## Chunking Strategies:

  • None: No chunking, process documents as whole units

  • Fixed Size: Fixed character-based chunks with overlap

  • Recursive: Hierarchical splitting using multiple separators

  • Paragraph: Split on paragraph boundaries with size limits

  • Sentence: Split on sentence boundaries with size limits

  • Semantic: AI-powered semantic boundary detection (experimental)
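
As a sketch of choosing a chunking strategy (ChunkingStrategy's import path is assumed to match DocumentEngineConfig; PARAGRAPH and SEMANTIC are the members shown elsewhere in this document):

from haive.agents.document import DocumentAgent
from haive.core.engine.document import ChunkingStrategy, DocumentEngineConfig

agent = DocumentAgent(
    name="paragraph_chunker",
    engine=DocumentEngineConfig(
        chunking_strategy=ChunkingStrategy.PARAGRAPH,
        chunk_size=1000,      # characters per chunk
        chunk_overlap=100,    # must stay below chunk_size (see validate_chunk_overlap)
    ),
)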

Note

This agent integrates with the haive-core Document Engine which provides the underlying processing capabilities. The agent adds workflow orchestration, error handling, and result aggregation on top of the engine.

analyze_source_structure(source, **kwargs)

Analyze the structure of a source without full processing.

Parameters:
  • source (str) – Source to analyze

  • **kwargs – Additional analysis options

Returns:

Dictionary with source structure analysis

Return type:

dict[str, Any]
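
For example (a minimal sketch; the keys of the returned dictionary are not enumerated here):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="inspector")
analysis = agent.analyze_source_structure("documents/reports/")
print(analysis)  # dict[str, Any] describing the source structure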

build_graph()

Build the document processing graph.

Return type:

haive.core.graph.state_graph.base_graph2.BaseGraph
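
A minimal sketch of obtaining the graph (the BaseGraph API itself is documented in haive-core):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")
graph = agent.build_graph()  # BaseGraph wiring the document processing stages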

classmethod create_for_databases(name='Database Document Agent', **kwargs)

Create DocumentAgent optimized for database document processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_enterprise(name='Enterprise Document Agent', **kwargs)

Create DocumentAgent optimized for enterprise-scale processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_pdfs(name='PDF Document Agent', chunk_size=1000, **kwargs)

Create DocumentAgent optimized for PDF processing.

Parameters:
  • name (str)

  • chunk_size (int)

Return type:

DocumentAgent

classmethod create_for_research(name='Research Document Agent', **kwargs)

Create DocumentAgent optimized for research document processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_web_scraping(name='Web Scraping Agent', **kwargs)

Create DocumentAgent optimized for web content processing.

Parameters:

name (str)

Return type:

DocumentAgent
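
These factory classmethods take a name (with sensible defaults) plus optional keyword overrides; for example, a sketch based on the signatures above:

from haive.agents.document import DocumentAgent

pdf_agent = DocumentAgent.create_for_pdfs(chunk_size=1500)
web_agent = DocumentAgent.create_for_web_scraping(name="Docs Crawler")

result = pdf_agent.process_sources(["manual.pdf", "report.pdf"])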

process_cloud_storage(cloud_paths, **kwargs)

Process documents from cloud storage.

Parameters:
  • cloud_paths (list[str]) – List of cloud storage paths (s3://, gs://, etc.)

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with cloud processing results

Return type:

DocumentProcessingResult
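
For example, a sketch with placeholder bucket names and paths:

from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_enterprise()
result = agent.process_cloud_storage([
    "s3://my-bucket/contracts/",
    "gs://my-bucket/reports/2024/",
])
print(f"Processed {result.total_documents} documents")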

process_directory(directory_path, recursive=True, include_patterns=None, exclude_patterns=None, **kwargs)

Process all documents in a directory.

Parameters:
  • directory_path (str) – Path to directory

  • recursive (bool) – Whether to process subdirectories

  • include_patterns (list[str] | None) – Glob patterns for files to include

  • exclude_patterns (list[str] | None) – Glob patterns for files to exclude

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with directory processing results

Return type:

DocumentProcessingResult
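
For example, a sketch that restricts processing to Markdown and PDF files while skipping a build directory (the patterns are illustrative):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")
result = agent.process_directory(
    "/path/to/documents",
    recursive=True,
    include_patterns=["*.md", "*.pdf"],
    exclude_patterns=["build/*"],
)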

process_sources(sources, **kwargs)

Process multiple document sources through the full pipeline.

Parameters:
  • sources (str | list[str]) – Single source or list of sources to process

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with comprehensive pipeline results

Return type:

DocumentProcessingResult
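
Both call forms are accepted, as sketched below:

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")

single = agent.process_sources("document.pdf")                 # single source string
batch = agent.process_sources(["document.pdf", "notes.md"])    # list of sources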

process_urls(urls, **kwargs)

Process documents from web URLs.

Parameters:
  • urls (list[str]) – List of URLs to process

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with web processing results

Return type:

DocumentProcessingResult
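
A brief sketch with placeholder URLs:

from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_web_scraping()
result = agent.process_urls([
    "https://example.com/docs/getting-started",
    "https://example.com/docs/api",
])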

setup_agent()

Configure the document engine with agent settings.

Return type:

None

classmethod validate_chunk_overlap(v, info)

Ensure chunk overlap is less than chunk size.

Return type:

Any
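
A hypothetical Pydantic v2 validator of the same shape is sketched below; it is not the haive implementation and the ChunkSettings model is invented purely to illustrate the check described above:

from pydantic import BaseModel, ValidationInfo, field_validator

# Hypothetical stand-in model, shown only to illustrate the kind of check
# validate_chunk_overlap performs.
class ChunkSettings(BaseModel):
    chunk_size: int = 1000
    chunk_overlap: int = 200

    @field_validator("chunk_overlap")
    @classmethod
    def validate_chunk_overlap(cls, v: int, info: ValidationInfo) -> int:
        # chunk_size is validated before chunk_overlap, so it is available here.
        chunk_size = info.data.get("chunk_size", 0)
        if v >= chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
        return v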

classmethod validate_engine_type(v)

Ensure engine is DocumentEngineConfig.

Return type:

Any

class agents.document.agent.DocumentProcessingResult(/, **data)

Bases: pydantic.BaseModel

Comprehensive result of document processing pipeline.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)