agents.document.agent

Document Agent for comprehensive document processing pipeline.

This module provides the DocumentAgent class, which implements the full document processing pipeline: FETCH -> LOAD -> TRANSFORM -> SPLIT -> ANNOTATE -> EMBED -> STORE -> RETRIEVE.

The agent leverages the Document Engine from haive-core to handle 97+ document types and sources with advanced processing capabilities including chunking, metadata extraction, and parallel processing.

Classes:

DocumentAgent: Main agent for the document processing pipeline

DocumentProcessingResult: Structured result from document processing

Examples

Basic usage:

from haive.agents.document import DocumentAgent
from haive.core.engine.document import DocumentEngineConfig

agent = DocumentAgent(name="doc_processor")
result = agent.process_sources(["document.pdf"])
print(f"Processed {result.total_documents} documents")

Advanced configuration:

# Assumption: ChunkingStrategy is importable from the same module as DocumentEngineConfig.
from haive.core.engine.document import ChunkingStrategy

config = DocumentEngineConfig(
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    parallel_processing=True,
    max_workers=8
)
agent = DocumentAgent(name="enterprise_processor", engine=config)
result = agent.process_directory("/path/to/documents")

See also

  • DocumentEngine: Core processing engine

  • Agent: Base agent class

Classes

DocumentAgent

Comprehensive Document Processing Agent.

DocumentProcessingResult

Comprehensive result of document processing pipeline.

Module Contents

class agents.document.agent.DocumentAgent

Bases: haive.agents.base.agent.Agent

Comprehensive Document Processing Agent.

This agent implements the complete document processing pipeline for enterprise-grade document ingestion and analysis. It provides a unified interface for the full document lifecycle from source discovery to retrievable storage.

## Processing Pipeline:

  1. FETCH: Discover and validate document sources (files, URLs, databases, cloud)

  2. LOAD: Load documents using 97+ specialized loaders with auto-detection

  3. TRANSFORM: Normalize content, extract metadata, detect language/encoding

  4. SPLIT: Apply intelligent chunking strategies (recursive, semantic, paragraph)

  5. ANNOTATE: Extract and enrich metadata, apply content classification

  6. EMBED: Generate vector embeddings for semantic search (optional)

  7. STORE: Persist to vector databases and document stores (optional)

  8. RETRIEVE: Enable search and retrieval capabilities (optional)
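
The optional tail stages map directly onto agent configuration flags (see Configuration Options below). As a minimal sketch, using only parameter names listed in this document:

from haive.agents.document import DocumentAgent

# Sketch: stop after ANNOTATE by leaving the optional stages disabled.
light_agent = DocumentAgent(
    name="split_and_annotate_only",
    enable_embedding=False,
    enable_storage=False,
    enable_retrieval=False,
)

# Sketch: run the full pipeline including EMBED, STORE and RETRIEVE.
full_agent = DocumentAgent(
    name="full_pipeline",
    enable_embedding=True,
    enable_storage=True,
    enable_retrieval=True,
)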

## Key Features:

  • Universal Source Support: 97+ document sources including files, URLs, databases, cloud storage

  • Intelligent Processing: Auto-detection of source types and optimal processing strategies

  • Parallel Processing: Multi-threaded processing for high-throughput scenarios

  • Advanced Chunking: Multiple strategies including recursive, semantic, and paragraph-based

  • Metadata Enrichment: Comprehensive metadata extraction and annotation

  • Error Resilience: Graceful error handling with detailed reporting

  • Extensible Pipeline: Configurable stages for custom processing workflows

## Configuration Options:

Parameters:
  • engine – Document engine configuration for processing pipeline

Source configuration:

  • source_types – List of allowed source types (files, URLs, databases, etc.)

  • auto_detect_sources – Whether to auto-detect source types

  • max_sources – Maximum number of sources to process

Processing configuration:

  • processing_strategy – Processing strategy (simple, enhanced, parallel)

  • parallel_processing – Enable parallel processing

  • max_workers – Maximum worker threads for parallel processing

Chunking configuration:

  • chunking_strategy – Strategy for document chunking

  • chunk_size – Size of chunks in characters

  • chunk_overlap – Overlap between consecutive chunks

Content processing:

  • normalize_content – Whether to normalize content (whitespace, encoding)

  • extract_metadata – Whether to extract document metadata

  • detect_language – Whether to detect document language

Pipeline stages:

  • enable_embedding – Whether to generate embeddings

  • enable_storage – Whether to store in vector database

  • enable_retrieval – Whether to enable retrieval capabilities

Error handling:

  • raise_on_error – Whether to raise exceptions on errors

  • skip_invalid – Whether to skip invalid documents

## Example Usage:

### Basic Document Processing:

from haive.agents.document import DocumentAgent
# Assumption: ChunkingStrategy is importable alongside DocumentEngineConfig.
from haive.core.engine.document import ChunkingStrategy, DocumentEngineConfig

# Create agent for PDF processing
agent = DocumentAgent(
    engine=DocumentEngineConfig(
        chunking_strategy=ChunkingStrategy.PARAGRAPH,
        chunk_size=1000,
        parallel_processing=True
    )
)

# Process a single document
result = agent.invoke({
    "source": "document.pdf",
    "extract_metadata": True
})

### Multi-Source Processing:

# Process multiple sources with different types
agent = DocumentAgent.create_for_enterprise()

result = agent.process_sources([
    "documents/reports/",            # Directory
    "https://example.com/api/docs",  # Web API
    "s3://bucket/documents/",        # Cloud storage
    "postgresql://db/documents"      # Database
])

### Custom Pipeline Configuration:

# Configure custom processing pipeline
# Assumption: ProcessingStrategy shares the ChunkingStrategy import path.
from haive.core.engine.document import ChunkingStrategy, ProcessingStrategy

agent = DocumentAgent(
    processing_strategy=ProcessingStrategy.ENHANCED,
    chunking_strategy=ChunkingStrategy.SEMANTIC,
    chunk_size=1500,
    enable_embedding=True,
    enable_storage=True,
    normalize_content=True,
    detect_language=True
)

## Document Source Types Supported:

  • Files: PDF, DOCX, TXT, HTML, MD, JSON, CSV, XML, YAML, and 50+ more

  • Web: HTTP/HTTPS URLs, APIs, web scraping, RSS feeds

  • Cloud: AWS S3, Google Cloud Storage, Azure Blob, Dropbox, Box

  • Databases: PostgreSQL, MySQL, MongoDB, Elasticsearch, and 15+ more

  • Chat/Messaging: Slack, Discord, WhatsApp, Telegram exports

  • Knowledge Bases: Notion, Confluence, Obsidian, Roam Research

  • Version Control: Git repositories, GitHub, GitLab

  • Archives: ZIP, TAR, 7Z with recursive extraction

## Processing Strategies:

  • Simple: Basic loading and chunking for development/testing

  • Enhanced: Full metadata extraction, content normalization, language detection

  • Parallel: Multi-threaded processing for high-throughput production use
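
A hedged sketch of selecting the parallel strategy follows; the ProcessingStrategy import path and enum member name are assumptions based on the strategies listed above:

from haive.agents.document import DocumentAgent
# Assumption: ProcessingStrategy is importable from the document engine module.
from haive.core.engine.document import ProcessingStrategy

agent = DocumentAgent(
    name="high_throughput",
    processing_strategy=ProcessingStrategy.PARALLEL,
    parallel_processing=True,
    max_workers=8,
)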

## Chunking Strategies:

  • None: No chunking, process documents as whole units

  • Fixed Size: Fixed character-based chunks with overlap

  • Recursive: Hierarchical splitting using multiple separators

  • Paragraph: Split on paragraph boundaries with size limits

  • Sentence: Split on sentence boundaries with size limits

  • Semantic: AI-powered semantic boundary detection (experimental)
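
As a sketch of choosing a chunking strategy (ChunkingStrategy's import path is assumed to match DocumentEngineConfig; PARAGRAPH and SEMANTIC are the members shown elsewhere in this document):

from haive.agents.document import DocumentAgent
from haive.core.engine.document import ChunkingStrategy, DocumentEngineConfig

agent = DocumentAgent(
    name="paragraph_chunker",
    engine=DocumentEngineConfig(
        chunking_strategy=ChunkingStrategy.PARAGRAPH,
        chunk_size=1000,      # characters per chunk
        chunk_overlap=100,    # must stay below chunk_size (see validate_chunk_overlap)
    ),
)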

Note

This agent integrates with the haive-core Document Engine which provides the underlying processing capabilities. The agent adds workflow orchestration, error handling, and result aggregation on top of the engine.

analyze_source_structure(source, **kwargs)

Analyze the structure of a source without full processing.

Parameters:
  • source (str) – Source to analyze

  • **kwargs – Additional analysis options

Returns:

Dictionary with source structure analysis

Return type:

dict[str, Any]
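
For example (a minimal sketch; the keys of the returned dictionary are not enumerated here):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="inspector")
analysis = agent.analyze_source_structure("documents/reports/")
print(analysis)  # dict[str, Any] describing the source structure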

build_graph()

Build the document processing graph.

Return type:

haive.core.graph.state_graph.base_graph2.BaseGraph
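
A minimal sketch of obtaining the graph (the BaseGraph API itself is documented in haive-core):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")
graph = agent.build_graph()  # BaseGraph wiring the document processing stages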

classmethod create_for_databases(name='Database Document Agent', **kwargs)

Create DocumentAgent optimized for database document processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_enterprise(name='Enterprise Document Agent', **kwargs)

Create DocumentAgent optimized for enterprise-scale processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_pdfs(name='PDF Document Agent', chunk_size=1000, **kwargs)

Create DocumentAgent optimized for PDF processing.

Parameters:
  • name (str)

  • chunk_size (int)

Return type:

DocumentAgent

classmethod create_for_research(name='Research Document Agent', **kwargs)

Create DocumentAgent optimized for research document processing.

Parameters:

name (str)

Return type:

DocumentAgent

classmethod create_for_web_scraping(name='Web Scraping Agent', **kwargs)

Create DocumentAgent optimized for web content processing.

Parameters:

name (str)

Return type:

DocumentAgent
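
These factory classmethods take a name (with sensible defaults) plus optional keyword overrides; for example, a sketch based on the signatures above:

from haive.agents.document import DocumentAgent

pdf_agent = DocumentAgent.create_for_pdfs(chunk_size=1500)
web_agent = DocumentAgent.create_for_web_scraping(name="Docs Crawler")

result = pdf_agent.process_sources(["manual.pdf", "report.pdf"])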

process_cloud_storage(cloud_paths, **kwargs)

Process documents from cloud storage.

Parameters:
  • cloud_paths (list[str]) – List of cloud storage paths (s3://, gs://, etc.)

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with cloud processing results

Return type:

DocumentProcessingResult
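
For example, a sketch with placeholder bucket names and paths:

from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_enterprise()
result = agent.process_cloud_storage([
    "s3://my-bucket/contracts/",
    "gs://my-bucket/reports/2024/",
])
print(f"Processed {result.total_documents} documents")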

process_directory(directory_path, recursive=True, include_patterns=None, exclude_patterns=None, **kwargs)

Process all documents in a directory.

Parameters:
  • directory_path (str) – Path to directory

  • recursive (bool) – Whether to process subdirectories

  • include_patterns (list[str] | None) – Glob patterns for files to include

  • exclude_patterns (list[str] | None) – Glob patterns for files to exclude

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with directory processing results

Return type:

DocumentProcessingResult
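
For example, a sketch that restricts processing to Markdown and PDF files while skipping a build directory (the patterns are illustrative):

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")
result = agent.process_directory(
    "/path/to/documents",
    recursive=True,
    include_patterns=["*.md", "*.pdf"],
    exclude_patterns=["build/*"],
)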

process_sources(sources, **kwargs)

Process multiple document sources through the full pipeline.

Parameters:
  • sources (str | list[str]) – Single source or list of sources to process

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with comprehensive pipeline results

Return type:

DocumentProcessingResult
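
Both call forms are accepted, as sketched below:

from haive.agents.document import DocumentAgent

agent = DocumentAgent(name="doc_processor")

single = agent.process_sources("document.pdf")                 # single source string
batch = agent.process_sources(["document.pdf", "notes.md"])    # list of sources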

process_urls(urls, **kwargs)

Process documents from web URLs.

Parameters:
  • urls (list[str]) – List of URLs to process

  • **kwargs – Additional processing options

Returns:

DocumentProcessingResult with web processing results

Return type:

DocumentProcessingResult
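
A brief sketch with placeholder URLs:

from haive.agents.document import DocumentAgent

agent = DocumentAgent.create_for_web_scraping()
result = agent.process_urls([
    "https://example.com/docs/getting-started",
    "https://example.com/docs/api",
])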

setup_agent()

Configure the document engine with agent settings.

Return type:

None

classmethod validate_chunk_overlap(v, info)

Ensure chunk overlap is less than chunk size.

Return type:

Any
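
A hypothetical Pydantic v2 validator of the same shape is sketched below; it is not the haive implementation and the ChunkSettings model is invented purely to illustrate the check described above:

from pydantic import BaseModel, ValidationInfo, field_validator

# Hypothetical stand-in model, shown only to illustrate the kind of check
# validate_chunk_overlap performs.
class ChunkSettings(BaseModel):
    chunk_size: int = 1000
    chunk_overlap: int = 200

    @field_validator("chunk_overlap")
    @classmethod
    def validate_chunk_overlap(cls, v: int, info: ValidationInfo) -> int:
        # chunk_size is validated before chunk_overlap, so it is available here.
        chunk_size = info.data.get("chunk_size", 0)
        if v >= chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
        return v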

classmethod validate_engine_type(v)

Ensure engine is DocumentEngineConfig.

Return type:

Any

class agents.document.agent.DocumentProcessingResult(/, **data)

Bases: pydantic.BaseModel

Comprehensive result of document processing pipeline.

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Parameters:

data (Any)