haive.core.schema.prebuilt.document_state¶
Document State Schema for the Haive Document Engine.
This module provides a comprehensive state schema for document processing workflows. It integrates with the document loader system and manages state for document loading, processing, and analysis operations.
Author: Claude (Haive AI Agent Framework)
Version: 1.0.0
Classes¶
DocumentEngineInputSchema: Defines the input state for document loading and processing.
DocumentEngineOutputSchema: Defines the output state from document loading and processing.
DocumentState: Represents the complete state of a document processing workflow.
DocumentWorkflowSchema: Manages the state of a document processing workflow.
Module Contents¶
- class haive.core.schema.prebuilt.document_state.DocumentEngineInputSchema(/, **data)[source]¶
Bases:
haive.core.schema.state_schema.StateSchema
Defines the input state for document loading and processing.
This schema supports various source types and configurations, providing a flexible interface for document ingestion workflows.
- Parameters:
data (Any)
- source¶
The primary source to process, which can be a file path, URL, or a configuration dictionary.
- sources¶
A list of sources for bulk processing.
- source_type¶
The explicit type of the source (e.g., FILE, URL). If not provided, it will be auto-detected.
- Type:
Optional[DocumentSourceType]
- loader_name¶
The specific loader to use for processing. If not provided, a loader will be auto-selected.
- Type:
Optional[str]
- loader_preference¶
The preference for auto-selecting a loader, balancing speed and quality. Defaults to BALANCED.
- Type:
LoaderPreference
- processing_strategy¶
The strategy for document processing. Defaults to ENHANCED.
- Type:
ProcessingStrategy
- chunking_strategy¶
The strategy for chunking documents. Defaults to RECURSIVE.
- Type:
ChunkingStrategy
- parallel_processing¶
Whether to enable parallel processing for supported operations. Defaults to True.
- Type:
bool
Examples
Loading a single PDF file with default settings:
from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentEngineInputSchema

state = DocumentEngineInputSchema(
    source="/path/to/document.pdf",
    source_type=DocumentSourceType.FILE
)
Scraping a website with custom loader and processing options:
state = DocumentEngineInputSchema(
    source="https://example.com",
    source_type=DocumentSourceType.URL,
    loader_options={"verify_ssl": True},
    processing_options={"extract_links": True}
)
Configuring a bulk loading operation with a preference for quality:
from haive.core.engine.document.config import LoaderPreference, ChunkingStrategy

state = DocumentEngineInputSchema(
    sources=["/path/to/doc1.pdf", "/path/to/doc2.docx"],
    loader_preference=LoaderPreference.QUALITY,
    chunking_strategy=ChunkingStrategy.SEMANTIC
)
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
- class haive.core.schema.prebuilt.document_state.DocumentEngineOutputSchema(/, **data)[source]¶
Bases:
haive.core.schema.state_schema.StateSchema
Defines the output state from document loading and processing.
This schema contains the loaded documents, metadata, and processing statistics, providing a comprehensive overview of the operation’s results.
- Parameters:
data (Any)
- documents¶
A list of processed documents.
- Type:
List[ProcessedDocument]
- raw_documents¶
A list of raw LangChain Document objects.
- Type:
List[Document]
- source_type¶
The detected or specified source type.
- Type:
Optional[DocumentSourceType]
- processing_strategy¶
The processing strategy used.
- Type:
Optional[ProcessingStrategy]
- chunking_strategy¶
The chunking strategy used.
- Type:
Optional[ChunkingStrategy]
Examples
Inspecting the output of a successful loading operation:
# Assuming 'output' is an instance of DocumentEngineOutputSchema
if output.successful_documents > 0:
    print(f"Successfully loaded {output.successful_documents} documents.")
    print(f"Total chunks created: {output.total_chunks}")
    print(f"First document content: {output.documents[0].content[:100]}")
Handling partial success with errors:
if output.failed_documents > 0:
    print(f"Failed to load {output.failed_documents} documents.")
    for error in output.errors:
        print(f"Source: {error['source']}, Error: {error['error']}")
- add_document(document)[source]¶
Adds a processed document to the output state.
This method appends a processed document to the output state and updates various statistics, such as total documents, successful documents, and chunk/character/word counts.
- Parameters:
document (ProcessedDocument) – The processed document to add.
- Return type:
None
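The accumulation behavior of add_document can be sketched with plain dataclasses. This is a minimal illustration only: the field names on ProcessedDocument (content, chunk_count) and the exact statistics updated are assumptions, not the real Haive models.

```python
from dataclasses import dataclass, field

@dataclass
class ProcessedDocument:
    # Hypothetical stand-in for the real ProcessedDocument model.
    content: str
    chunk_count: int = 1

@dataclass
class OutputState:
    # Mirrors the statistics fields described above.
    documents: list = field(default_factory=list)
    total_documents: int = 0
    successful_documents: int = 0
    total_chunks: int = 0
    total_characters: int = 0
    total_words: int = 0

    def add_document(self, document: ProcessedDocument) -> None:
        """Append a processed document and update running statistics."""
        self.documents.append(document)
        self.total_documents += 1
        self.successful_documents += 1
        self.total_chunks += document.chunk_count
        self.total_characters += len(document.content)
        self.total_words += len(document.content.split())

state = OutputState()
state.add_document(ProcessedDocument(content="hello world", chunk_count=2))
```

Each call mutates the output state in place and returns None, matching the signature documented above.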
- class haive.core.schema.prebuilt.document_state.DocumentState(/, **data)[source]¶
Bases:
DocumentEngineInputSchema, DocumentEngineOutputSchema
Represents the complete state of a document processing workflow.
This schema combines the input, output, and workflow states to provide a full picture of a document processing operation. It inherits all attributes from DocumentEngineInputSchema and DocumentEngineOutputSchema.
- Parameters:
data (Any)
- workflow¶
The state of the processing workflow.
- Type:
DocumentWorkflowSchema
Examples
Initializing a complete workflow state and executing a step:
from haive.core.engine.document.config import DocumentSourceType
from haive.core.schema.prebuilt.document_state import DocumentState, DocumentWorkflowSchema

# Initial state for processing a directory
state = DocumentState(
    source="/path/to/documents/",
    source_type=DocumentSourceType.DIRECTORY,
    recursive=True,
    workflow=DocumentWorkflowSchema(processing_stage="loading")
)

# After processing, the state might look like this:
# state.total_documents = 50
# state.successful_documents = 48
# state.workflow.processing_stage = "completed"
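The combined input/output inheritance pattern can be illustrated with plain dataclasses. The field names below are placeholders chosen for the sketch, not the exact Haive schema fields.

```python
from dataclasses import dataclass

@dataclass
class InputState:
    # Stand-in for DocumentEngineInputSchema fields.
    source: str = ""
    recursive: bool = False

@dataclass
class OutputState:
    # Stand-in for DocumentEngineOutputSchema fields.
    total_documents: int = 0
    successful_documents: int = 0

@dataclass
class FullState(InputState, OutputState):
    # Inherits both field sets, mirroring how DocumentState combines
    # DocumentEngineInputSchema and DocumentEngineOutputSchema.
    processing_stage: str = "initialized"

state = FullState(source="/path/to/documents/", recursive=True)
state.total_documents = 50
state.processing_stage = "completed"
```

One object therefore carries the configuration that started the run and the results that came out of it, which is what lets a workflow pass a single state value between steps.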
- class haive.core.schema.prebuilt.document_state.DocumentWorkflowSchema(/, **data)[source]¶
Bases:
pydantic.BaseModel
Manages the state of a document processing workflow.
This schema tracks the progress and metadata of a multi-step document processing workflow.
- Parameters:
data (Any)
- processing_stage¶
The current stage of the processing workflow (e.g., “initialized”, “loading”, “chunking”, “completed”).
- Type:
str
- last_processed_index¶
The index of the last document processed, useful for resuming workflows.
- Type:
int
- workflow_metadata¶
A dictionary for storing any additional metadata related to the workflow.
- Type:
Dict[str, Any]
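A resumable processing loop built on the stage and index fields described above might look like the following. This is a minimal sketch with a plain dataclass standing in for DocumentWorkflowSchema; the loop body and function name are illustrative, not part of the Haive API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class WorkflowState:
    # Stand-in for DocumentWorkflowSchema.
    processing_stage: str = "initialized"
    last_processed_index: int = -1
    workflow_metadata: Dict[str, Any] = field(default_factory=dict)

def process_documents(sources: List[str], workflow: WorkflowState) -> None:
    """Process sources, skipping anything already handled on a resume."""
    workflow.processing_stage = "loading"
    for index, source in enumerate(sources):
        if index <= workflow.last_processed_index:
            continue  # already processed in a previous run
        # ... load and chunk the document here ...
        workflow.last_processed_index = index
    workflow.processing_stage = "completed"

# Resume a run that had already finished document 0.
workflow = WorkflowState(last_processed_index=0)
process_documents(["a.pdf", "b.pdf", "c.pdf"], workflow)
```

Persisting last_processed_index between runs is what makes the workflow restartable after an interruption.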