haive.core.schema.prebuilt.document_state

Document State Schema for the Haive Document Engine.

This module provides a comprehensive state schema for document processing workflows. It integrates with the document loader system and manages state for document loading, processing, and analysis operations.

Author: Claude (Haive AI Agent Framework)

Version: 1.0.0

Classes

DocumentEngineInputSchema

Defines the input state for document loading and processing.

DocumentEngineOutputSchema

Defines the output state from document loading and processing.

DocumentState

Represents the complete state of a document processing workflow.

DocumentWorkflowSchema

Manages the state of a document processing workflow.

Module Contents

class haive.core.schema.prebuilt.document_state.DocumentEngineInputSchema(/, **data)[source]

Bases: haive.core.schema.state_schema.StateSchema

Defines the input state for document loading and processing.

This schema supports various source types and configurations, providing a flexible interface for document ingestion workflows.

Parameters:

data (Any)

source

The primary source to process, which can be a file path, URL, or a configuration dictionary.

Type:

Optional[Union[str, Path, Dict[str, Any]]]

sources

A list of sources for bulk processing.

Type:

Optional[List[Union[str, Path, Dict[str, Any]]]]

source_type

The explicit type of the source (e.g., FILE, URL). If not provided, it will be auto-detected.

Type:

Optional[DocumentSourceType]

loader_name

The specific loader to use for processing. If not provided, a loader will be auto-selected.

Type:

Optional[str]

loader_preference

The preference for auto-selecting a loader, balancing speed and quality. Defaults to BALANCED.

Type:

LoaderPreference

processing_strategy

The strategy for document processing. Defaults to ENHANCED.

Type:

ProcessingStrategy

chunking_strategy

The strategy for chunking documents. Defaults to RECURSIVE.

Type:

ChunkingStrategy

chunk_size

The size of chunks in characters. Defaults to 1000.

Type:

int

chunk_overlap

The overlap between chunks in characters. Defaults to 200.

Type:

int

recursive

Whether to recursively process directories. Defaults to True.

Type:

bool

max_documents

The maximum number of documents to load.

Type:

Optional[int]

use_async

Whether to use asynchronous loading when available. Defaults to False.

Type:

bool

parallel_processing

Whether to enable parallel processing for supported operations. Defaults to True.

Type:

bool

max_workers

The maximum number of worker threads for parallel processing. Defaults to 4.

Type:

int

include_patterns

Glob patterns for files to include.

Type:

List[str]

exclude_patterns

Glob patterns for files to exclude.

Type:

List[str]

loader_options

Additional options specific to the loader.

Type:

Dict[str, Any]

processing_options

Additional options for processing.

Type:

Dict[str, Any]

enable_caching

Whether to enable document caching. Defaults to False.

Type:

bool

cache_ttl

The time-to-live for the cache in seconds. Defaults to 3600.

Type:

int

Examples

Loading a single PDF file with default settings:

from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentEngineInputSchema

state = DocumentEngineInputSchema(
    source="/path/to/document.pdf",
    source_type=DocumentSourceType.FILE
)

Scraping a website with custom loader and processing options:

state = DocumentEngineInputSchema(
    source="https://example.com",
    source_type=DocumentSourceType.URL,
    loader_options={"verify_ssl": True},
    processing_options={"extract_links": True}
)

Configuring a bulk loading operation with a preference for quality:

from haive.core.engine.document.config import LoaderPreference, ChunkingStrategy

state = DocumentEngineInputSchema(
    sources=["/path/to/doc1.pdf", "/path/to/doc2.docx"],
    loader_preference=LoaderPreference.QUALITY,
    chunking_strategy=ChunkingStrategy.SEMANTIC
)

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class haive.core.schema.prebuilt.document_state.DocumentEngineOutputSchema(/, **data)[source]

Bases: haive.core.schema.state_schema.StateSchema

Defines the output state from document loading and processing.

This schema contains the loaded documents, metadata, and processing statistics, providing a comprehensive overview of the operation’s results.

Parameters:

data (Any)

documents

A list of processed documents.

Type:

List[ProcessedDocument]

raw_documents

A list of raw LangChain Document objects.

Type:

List[Document]

total_documents

The total number of documents processed.

Type:

int

successful_documents

The number of documents successfully processed.

Type:

int

failed_documents

The number of documents that failed to process.

Type:

int

operation_time

The total time for the operation in seconds.

Type:

float

average_processing_time

The average processing time per document.

Type:

float

original_source

The original source path or URL.

Type:

Optional[str]

source_type

The detected or specified source type.

Type:

Optional[DocumentSourceType]

loader_names

The names of the loaders used.

Type:

List[str]

processing_strategy

The processing strategy used.

Type:

Optional[ProcessingStrategy]

chunking_strategy

The chunking strategy used.

Type:

Optional[ChunkingStrategy]

total_chunks

The total number of chunks created.

Type:

int

total_characters

The total character count across all documents.

Type:

int

total_words

The estimated total word count.

Type:

int

errors

A list of errors encountered during processing.

Type:

List[Dict[str, Any]]

warnings

A list of warnings generated.

Type:

List[Dict[str, Any]]

metadata

Additional metadata about the operation.

Type:

Dict[str, Any]

Examples

Inspecting the output of a successful loading operation:

# Assuming 'output' is an instance of DocumentEngineOutputSchema
if output.successful_documents > 0:
    print(f"Successfully loaded {output.successful_documents} documents.")
    print(f"Total chunks created: {output.total_chunks}")
    print(f"First document content: {output.documents[0].content[:100]}")

Handling partial success with errors:

if output.failed_documents > 0:
    print(f"Failed to load {output.failed_documents} documents.")
    for error in output.errors:
        print(f"Source: {error['source']}, Error: {error['error']}")

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

add_document(document)[source]

Adds a processed document to the output state.

This method appends a processed document to the output state and updates various statistics, such as total documents, successful documents, and chunk/character/word counts.

Parameters:

document (ProcessedDocument) – The processed document to add.

Return type:

None

add_error(source, error, details=None)[source]

Adds an error record to the output state.

This method is used to log errors encountered during document processing, providing a structured way to track failures.

Parameters:
  • source (str) – The source that caused the error (e.g., file path or URL).

  • error (str) – A description of the error.

  • details (Optional[Dict[str, Any]]) – Additional details about the error.

Return type:

None

calculate_statistics()[source]

Calculates aggregate statistics from the loaded documents.

This method updates the output state with summary statistics, such as average processing time and total counts for chunks, characters, and words.

Return type:

None
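Taken together, add_document, add_error, and calculate_statistics maintain the output schema's counters. The plain-Python stand-in below (OutputStats is hypothetical; the real schema is a Pydantic StateSchema with the fields documented above) sketches the presumed bookkeeping, assuming a failed source still increments total_documents:

```python
class OutputStats:
    """Hypothetical mimic of DocumentEngineOutputSchema's statistics fields."""

    def __init__(self):
        self.documents = []
        self.errors = []
        self.total_documents = 0
        self.successful_documents = 0
        self.failed_documents = 0
        self.total_chunks = 0
        self.total_characters = 0
        self.total_words = 0
        self.average_processing_time = 0.0

    def add_document(self, content, chunks, processing_time):
        # Append the document and update success/volume counters.
        self.documents.append({"content": content, "processing_time": processing_time})
        self.total_documents += 1
        self.successful_documents += 1
        self.total_chunks += chunks
        self.total_characters += len(content)
        self.total_words += len(content.split())

    def add_error(self, source, error, details=None):
        # Record a structured error and update failure counters.
        self.errors.append({"source": source, "error": error, "details": details or {}})
        self.failed_documents += 1
        self.total_documents += 1

    def calculate_statistics(self):
        # Average processing time over the successfully loaded documents.
        times = [d["processing_time"] for d in self.documents]
        self.average_processing_time = sum(times) / len(times) if times else 0.0

out = OutputStats()
out.add_document("hello world", chunks=1, processing_time=0.2)
out.add_error("/bad/path.pdf", "file not found")
out.calculate_statistics()
print(out.total_documents, out.successful_documents, out.failed_documents)  # 2 1 1
```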

class haive.core.schema.prebuilt.document_state.DocumentState(/, **data)[source]

Bases: DocumentEngineInputSchema, DocumentEngineOutputSchema

Represents the complete state of a document processing workflow.

This schema combines the input, output, and workflow states to provide a full picture of a document processing operation. It inherits all attributes from DocumentEngineInputSchema and DocumentEngineOutputSchema.

Parameters:

data (Any)

workflow

The state of the processing workflow.

Type:

DocumentWorkflowSchema

Examples

Initializing a complete workflow state and executing a step:

from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentState, DocumentWorkflowSchema

# Initial state for processing a directory
state = DocumentState(
    source="/path/to/documents/",
    source_type=DocumentSourceType.DIRECTORY,
    recursive=True,
    workflow=DocumentWorkflowSchema(processing_stage="loading")
)

# After processing, the state might look like this:
# state.total_documents = 50
# state.successful_documents = 48
# state.workflow.processing_stage = "completed"

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

class Config[source]

Pydantic configuration for the DocumentState schema.

arbitrary_types_allowed

Allows Pydantic to handle arbitrary types, which is useful for complex data structures like langchain_core.documents.Document.

Type:

bool

class haive.core.schema.prebuilt.document_state.DocumentWorkflowSchema(/, **data)[source]

Bases: pydantic.BaseModel

Manages the state of a document processing workflow.

This schema tracks the progress and metadata of a multi-step document processing workflow.

Parameters:

data (Any)

processing_stage

The current stage of the processing workflow (e.g., “initialized”, “loading”, “chunking”, “completed”).

Type:

str

last_processed_index

The index of the last document processed, useful for resuming workflows.

Type:

int

workflow_metadata

A dictionary for storing any additional metadata related to the workflow.

Type:

Dict[str, Any]
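The resume behavior enabled by last_processed_index can be sketched with a minimal dataclass stand-in (WorkflowState and process_documents are hypothetical illustrations, not Haive APIs; the real schema's default index value may differ):

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class WorkflowState:
    """Hypothetical mirror of DocumentWorkflowSchema's fields."""
    processing_stage: str = "initialized"
    last_processed_index: int = -1  # -1 means nothing processed yet (assumption)
    workflow_metadata: Dict[str, Any] = field(default_factory=dict)

def process_documents(docs: List[str], state: WorkflowState) -> None:
    for i, doc in enumerate(docs):
        if i <= state.last_processed_index:
            continue  # skip documents handled by an earlier, interrupted run
        # ... process doc here ...
        state.last_processed_index = i
    state.processing_stage = "completed"

# Resume a run that already finished documents 0 and 1.
state = WorkflowState(processing_stage="loading", last_processed_index=1)
process_documents(["a", "b", "c", "d"], state)
```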

Create a new model by parsing and validating input data from keyword arguments.

Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.