haive.core.engine.document.processors
Document Processing Components.
This module provides document processing capabilities including chunking and content transformation that integrate with the DocumentEngine.
The processors handle:
- Content normalization
- Document chunking strategies
- Metadata extraction
- Format conversion
Classes
ChunkingProcessor | Processor for chunking documents into smaller pieces.
ContentNormalizer | Processor for normalizing document content.
DocumentProcessor | Base class for document processing operations.
FormatDetector | Processor for detecting document formats.
MetadataExtractor | Processor for extracting metadata from documents.
Module Contents
- class haive.core.engine.document.processors.ChunkingProcessor(chunking_strategy=ChunkingStrategy.RECURSIVE, chunk_size=1000, chunk_overlap=200, **kwargs)
Bases: DocumentProcessor
Processor for chunking documents into smaller pieces.
Initialize the chunking processor.
- Parameters:
chunking_strategy (haive.core.engine.document.config.ChunkingStrategy) – Strategy for chunking
chunk_size (int) – Size of chunks in characters
chunk_overlap (int) – Overlap between chunks
**kwargs – Additional configuration
- chunk_text(text, strategy, chunk_size, chunk_overlap, metadata)
Chunk text according to the specified strategy.
- Parameters:
- Returns:
List of document chunks
- Return type:
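To make the roles of chunk_size and chunk_overlap concrete, here is a minimal standalone sketch of fixed-size character chunking with overlap. It is not the haive implementation and does not reproduce the RECURSIVE strategy that ChunkingProcessor uses by default; the function name chunk_with_overlap is hypothetical, and only the two parameter names and their defaults are taken from the signature above.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into character windows of chunk_size that share chunk_overlap characters.

    Illustrative only: a plain sliding window, not a recursive or
    separator-aware splitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    step = chunk_size - chunk_overlap  # distance between consecutive window starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so no
        # redundant tail chunk is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the default values, each window starts 800 characters after the previous one, so the last 200 characters of one chunk reappear at the start of the next.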
- class haive.core.engine.document.processors.ContentNormalizer(normalize_whitespace=True, remove_extra_newlines=True, strip_content=True, **kwargs)
Bases: DocumentProcessor
Processor for normalizing document content.
Initialize the content normalizer.
- Parameters:
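The three constructor flags suggest a normalization pass along the following lines. This is a sketch under stated assumptions: the function normalize_content is hypothetical, and the exact behavior attached to each flag (collapsing spaces and tabs, limiting consecutive newlines to two, trimming both ends) is one plausible reading of the flag names, not something this page documents.

```python
import re


def normalize_content(
    text: str,
    normalize_whitespace: bool = True,
    remove_extra_newlines: bool = True,
    strip_content: bool = True,
) -> str:
    """Apply simple text clean-up steps controlled by three flags (assumed semantics)."""
    if normalize_whitespace:
        # Collapse runs of spaces and tabs into a single space; newlines are kept.
        text = re.sub(r"[ \t]+", " ", text)
    if remove_extra_newlines:
        # Allow at most one blank line between paragraphs.
        text = re.sub(r"\n{3,}", "\n\n", text)
    if strip_content:
        # Trim leading and trailing whitespace from the whole text.
        text = text.strip()
    return text
```

Keeping each step behind its own flag lets a caller disable, for example, newline collapsing for formats where blank lines are significant.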
- class haive.core.engine.document.processors.DocumentProcessor(**kwargs)
Base class for document processing operations.
Initialize the processor.
- abstractmethod process(document)
Process a document.
- Parameters:
document (langchain_core.documents.Document) – Document to process
- Returns:
Processed document
- Return type:
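Because process is abstract, concrete processors are expected to subclass the base class and implement it. The sketch below illustrates that pattern without importing haive or LangChain: the Document dataclass is a stand-in that mirrors only the page_content and metadata attributes of langchain_core.documents.Document, and BaseProcessor and UppercaseProcessor are hypothetical names used for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document (assumption: same two fields)."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class BaseProcessor(ABC):
    """Hypothetical analogue of DocumentProcessor: stores config, defines process()."""

    def __init__(self, **kwargs: Any) -> None:
        self.config = kwargs

    @abstractmethod
    def process(self, document: Document) -> Document:
        """Return a processed document."""


class UppercaseProcessor(BaseProcessor):
    """Toy subclass: upper-cases the content and copies the metadata."""

    def process(self, document: Document) -> Document:
        # Build a new Document instead of mutating the input.
        return Document(document.page_content.upper(), dict(document.metadata))
```

Instantiating the base class directly raises TypeError, which is how the abc module enforces that every subclass supplies its own process method.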
- class haive.core.engine.document.processors.FormatDetector(**kwargs)
Bases: DocumentProcessor
Processor for detecting document formats.
Initialize the processor.
- detect_format(content, metadata)
Detect document format from content and metadata.
- Parameters:
- Returns:
Detected document format
- Return type:
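Detection from both content and metadata can be approached by checking a file extension first and falling back to content heuristics. The following is a rough sketch of that idea, not the logic of detect_format: the function guess_format, the "source" metadata key, the extension table, and the string labels it returns are all assumptions, and the real method may return a dedicated format type instead.

```python
import json
from pathlib import Path

# Hypothetical mapping from file extension to a format label.
EXTENSION_FORMATS = {
    ".md": "markdown",
    ".html": "html",
    ".htm": "html",
    ".json": "json",
    ".txt": "text",
}


def guess_format(content: str, metadata: dict) -> str:
    """Guess a format label from metadata first, then from the content itself."""
    # 1. Metadata: assume the originating path is stored under "source".
    suffix = Path(str(metadata.get("source", ""))).suffix.lower()
    if suffix in EXTENSION_FORMATS:
        return EXTENSION_FORMATS[suffix]

    # 2. Content heuristics, applied to the text without leading whitespace.
    stripped = content.lstrip()
    if stripped.lower().startswith(("<!doctype html", "<html")):
        return "html"
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass  # looked like JSON but did not parse; keep checking
    if stripped.startswith("#"):
        return "markdown"
    return "text"
```

Checking metadata before content keeps the common case cheap and lets an explicit file extension override ambiguous content.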
- class haive.core.engine.document.processors.MetadataExtractor(**kwargs)
Bases: DocumentProcessor
Processor for extracting metadata from documents.
Initialize the processor.
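This page does not list which fields MetadataExtractor produces. As an illustration of the general idea only, the sketch below derives a few basic statistics from the text and merges them into any existing metadata; the function extract_basic_metadata and every key it writes are hypothetical.

```python
import hashlib
from typing import Any, Dict, Optional


def extract_basic_metadata(content: str, existing: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return a copy of the existing metadata extended with simple content statistics."""
    metadata = dict(existing or {})  # copy so the caller's dict is not mutated
    metadata.update(
        {
            "char_count": len(content),
            "word_count": len(content.split()),
            "line_count": len(content.splitlines()),
            # A content hash is handy for spotting duplicate documents.
            "sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        }
    )
    return metadata
```

Derived keys overwrite existing keys of the same name in this sketch; a real extractor might namespace its keys to avoid clobbering loader-provided metadata.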