haive.core.schema.prebuilt.document_state

Document State Schema for the Haive Document Engine.

This module provides a comprehensive state schema for document processing workflows. It integrates with the document loader system and manages state for document loading, processing, and analysis operations.

Author: Claude (Haive AI Agent Framework)

Version: 1.0.0

Classes

DocumentEngineInputSchema

Defines the input state for document loading and processing.

DocumentEngineOutputSchema

Defines the output state from document loading and processing.

DocumentState

Represents the complete state of a document processing workflow.

DocumentWorkflowSchema

Manages the state of a document processing workflow.

Module Contents

class haive.core.schema.prebuilt.document_state.DocumentEngineInputSchema(/, **data)[source]

Bases: haive.core.schema.state_schema.StateSchema

Defines the input state for document loading and processing.

This schema supports various source types and configurations, providing a flexible interface for document ingestion workflows.

Parameters:

data (Any)

source

The primary source to process, which can be a file path, URL, or a configuration dictionary.

Type:

Optional[Union[str, Path, Dict[str, Any]]]

sources

A list of sources for bulk processing.

Type:

Optional[List[Union[str, Path, Dict[str, Any]]]]

source_type

The explicit type of the source (e.g., FILE, URL). If not provided, it will be auto-detected.

Type:

Optional[DocumentSourceType]

loader_name

The specific loader to use for processing. If not provided, a loader will be auto-selected.

Type:

Optional[str]

loader_preference

The preference for auto-selecting a loader, balancing speed and quality. Defaults to BALANCED.

Type:

LoaderPreference

processing_strategy

The strategy for document processing. Defaults to ENHANCED.

Type:

ProcessingStrategy

chunking_strategy

The strategy for chunking documents. Defaults to RECURSIVE.

Type:

ChunkingStrategy

chunk_size

The size of chunks in characters. Defaults to 1000.

Type:

int

chunk_overlap

The overlap between chunks in characters. Defaults to 200.

Type:

int

recursive

Whether to recursively process directories. Defaults to True.

Type:

bool

max_documents

The maximum number of documents to load.

Type:

Optional[int]

use_async

Whether to use asynchronous loading when available. Defaults to False.

Type:

bool

parallel_processing

Whether to enable parallel processing for supported operations. Defaults to True.

Type:

bool

max_workers

The maximum number of worker threads for parallel processing. Defaults to 4.

Type:

int

include_patterns

Glob patterns for files to include.

Type:

List[str]

exclude_patterns

Glob patterns for files to exclude.

Type:

List[str]

loader_options

Additional options specific to the loader.

Type:

Dict[str, Any]

processing_options

Additional options for processing.

Type:

Dict[str, Any]

enable_caching

Whether to enable document caching. Defaults to False.

Type:

bool

cache_ttl

The time-to-live for the cache in seconds. Defaults to 3600.

Type:

int

Examples

Loading a single PDF file with default settings:

from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentEngineInputSchema

state = DocumentEngineInputSchema(
    source="/path/to/document.pdf",
    source_type=DocumentSourceType.FILE
)

Scraping a website with custom loader and processing options:

state = DocumentEngineInputSchema(
    source="https://example.com",
    source_type=DocumentSourceType.URL,
    loader_options={"verify_ssl": True},
    processing_options={"extract_links": True}
)

Configuring a bulk loading operation with a preference for quality:

from haive.core.engine.document.config import LoaderPreference, ChunkingStrategy

state = DocumentEngineInputSchema(
    sources=["/path/to/doc1.pdf", "/path/to/doc2.docx"],
    loader_preference=LoaderPreference.QUALITY,
    chunking_strategy=ChunkingStrategy.SEMANTIC
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class haive.core.schema.prebuilt.document_state.DocumentEngineOutputSchema(/, **data)[source]

Bases: haive.core.schema.state_schema.StateSchema

Defines the output state from document loading and processing.

This schema contains the loaded documents, metadata, and processing statistics, providing a comprehensive overview of the operation’s results.

Parameters:

data (Any)

documents

A list of processed documents.

Type:

List[ProcessedDocument]

raw_documents

A list of raw LangChain Document objects.

Type:

List[Document]

total_documents

The total number of documents processed.

Type:

int

successful_documents

The number of documents successfully processed.

Type:

int

failed_documents

The number of documents that failed to process.

Type:

int

operation_time

The total time for the operation in seconds.

Type:

float

average_processing_time

The average processing time per document.

Type:

float

original_source

The original source path or URL.

Type:

Optional[str]

source_type

The detected or specified source type.

Type:

Optional[DocumentSourceType]

loader_names

The names of the loaders used.

Type:

List[str]

processing_strategy

The processing strategy used.

Type:

Optional[ProcessingStrategy]

chunking_strategy

The chunking strategy used.

Type:

Optional[ChunkingStrategy]

total_chunks

The total number of chunks created.

Type:

int

total_characters

The total character count across all documents.

Type:

int

total_words

The estimated total word count.

Type:

int

errors

A list of errors encountered during processing.

Type:

List[Dict[str, Any]]

warnings

A list of warnings generated.

Type:

List[Dict[str, Any]]

metadata

Additional metadata about the operation.

Type:

Dict[str, Any]

Examples

Inspecting the output of a successful loading operation:

# Assuming 'output' is an instance of DocumentEngineOutputSchema
if output.successful_documents > 0:
    print(f"Successfully loaded {output.successful_documents} documents.")
    print(f"Total chunks created: {output.total_chunks}")
    print(f"First document content: {output.documents[0].content[:100]}")

Handling partial success with errors:

if output.failed_documents > 0:
    print(f"Failed to load {output.failed_documents} documents.")
    for error in output.errors:
        print(f"Source: {error['source']}, Error: {error['error']}")

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

add_document(document)[source]

Adds a processed document to the output state.

This method appends a processed document to the output state and updates various statistics, such as total documents, successful documents, and chunk/character/word counts.

Parameters:

document (ProcessedDocument) – The processed document to add.

Return type:

None

add_error(source, error, details=None)[source]

Adds an error record to the output state.

This method is used to log errors encountered during document processing, providing a structured way to track failures.

Parameters:
  • source (str) – The source that caused the error (e.g., file path or URL).

  • error (str) – A description of the error.

  • details (Optional[Dict[str, Any]]) – Additional details about the error.

Return type:

None

calculate_statistics()[source]

Calculates aggregate statistics from the loaded documents.

This method updates the output state with summary statistics, such as average processing time and total counts for chunks, characters, and words.

Return type:

None
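Taken together, add_document, add_error, and calculate_statistics maintain the output schema's counters. The plain-Python stand-in below (OutputStats is hypothetical; the real schema is a Pydantic StateSchema with the fields documented above) sketches the presumed bookkeeping, assuming a failed source still increments total_documents:

```python
class OutputStats:
    """Hypothetical mimic of DocumentEngineOutputSchema's statistics fields."""

    def __init__(self):
        self.documents = []
        self.errors = []
        self.total_documents = 0
        self.successful_documents = 0
        self.failed_documents = 0
        self.total_chunks = 0
        self.total_characters = 0
        self.total_words = 0
        self.average_processing_time = 0.0

    def add_document(self, content, chunks, processing_time):
        # Append the document and update success/volume counters.
        self.documents.append({"content": content, "processing_time": processing_time})
        self.total_documents += 1
        self.successful_documents += 1
        self.total_chunks += chunks
        self.total_characters += len(content)
        self.total_words += len(content.split())

    def add_error(self, source, error, details=None):
        # Record a structured error and update failure counters.
        self.errors.append({"source": source, "error": error, "details": details or {}})
        self.failed_documents += 1
        self.total_documents += 1

    def calculate_statistics(self):
        # Average processing time over the successfully loaded documents.
        times = [d["processing_time"] for d in self.documents]
        self.average_processing_time = sum(times) / len(times) if times else 0.0

out = OutputStats()
out.add_document("hello world", chunks=1, processing_time=0.2)
out.add_error("/bad/path.pdf", "file not found")
out.calculate_statistics()
print(out.total_documents, out.successful_documents, out.failed_documents)  # 2 1 1
```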

class haive.core.schema.prebuilt.document_state.DocumentState(/, **data)[source]

Bases: DocumentEngineInputSchema, DocumentEngineOutputSchema

Represents the complete state of a document processing workflow.

This schema combines the input, output, and workflow states to provide a full picture of a document processing operation. It inherits all attributes from DocumentEngineInputSchema and DocumentEngineOutputSchema.

Parameters:

data (Any)

workflow

The state of the processing workflow.

Type:

DocumentWorkflowSchema

Examples

Initializing a complete workflow state and executing a step:

from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentState, DocumentWorkflowSchema

# Initial state for processing a directory
state = DocumentState(
    source="/path/to/documents/",
    source_type=DocumentSourceType.DIRECTORY,
    recursive=True,
    workflow=DocumentWorkflowSchema(processing_stage="loading")
)

# After processing, the state might look like this:
# state.total_documents = 50
# state.successful_documents = 48
# state.workflow.processing_stage = "completed"

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class Config[source]

Pydantic configuration for the DocumentState schema.

arbitrary_types_allowed

Allows Pydantic to handle arbitrary types, which is useful for complex data structures like langchain_core.documents.Document.

Type:

bool

class haive.core.schema.prebuilt.document_state.DocumentWorkflowSchema(/, **data)[source]

Bases: pydantic.BaseModel

Manages the state of a document processing workflow.

This schema tracks the progress and metadata of a multi-step document processing workflow.

Parameters:

data (Any)

processing_stage

The current stage of the processing workflow (e.g., “initialized”, “loading”, “chunking”, “completed”).

Type:

str

last_processed_index

The index of the last document processed, useful for resuming workflows.

Type:

int

workflow_metadata

A dictionary for storing any additional metadata related to the workflow.

Type:

Dict[str, Any]
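The resume behavior enabled by last_processed_index can be sketched with a minimal dataclass stand-in (WorkflowState and process_documents are hypothetical illustrations, not Haive APIs; the real schema's default index value may differ):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class WorkflowState:
    """Hypothetical mirror of DocumentWorkflowSchema's fields."""
    processing_stage: str = "initialized"
    last_processed_index: int = -1  # -1 means nothing processed yet (assumption)
    workflow_metadata: Dict[str, Any] = field(default_factory=dict)

def process_documents(docs: List[str], state: WorkflowState) -> None:
    for i, doc in enumerate(docs):
        if i <= state.last_processed_index:
            continue  # skip documents handled by an earlier, interrupted run
        # ... process doc here ...
        state.last_processed_index = i
    state.processing_stage = "completed"

# Resume a run that already finished documents 0 and 1.
state = WorkflowState(processing_stage="loading", last_processed_index=1)
process_documents(["a", "b", "c", "d"], state)
```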

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.