haive.core.engine.document.processors

Document Processing Components.

This module provides document processing capabilities including chunking and content transformation that integrate with the DocumentEngine.

The processors handle:
  • Content normalization

  • Document chunking strategies

  • Metadata extraction

  • Format conversion

Classes

ChunkingProcessor

Processor for chunking documents into smaller pieces.

ContentNormalizer

Processor for normalizing document content.

DocumentProcessor

Base class for document processing operations.

FormatDetector

Processor for detecting document formats.

MetadataExtractor

Processor for extracting metadata from documents.

Module Contents

class haive.core.engine.document.processors.ChunkingProcessor(chunking_strategy=ChunkingStrategy.RECURSIVE, chunk_size=1000, chunk_overlap=200, **kwargs)

Bases: DocumentProcessor

Processor for chunking documents into smaller pieces.

Initialize the chunking processor.

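Parameters:
  • chunking_strategy – Chunking strategy to use (default: ChunkingStrategy.RECURSIVE)

  • chunk_size – Size of each chunk (default: 1000)

  • chunk_overlap – Overlap between consecutive chunks (default: 200)

  • **kwargs – Additional configuration

chunk_text(text, strategy, chunk_size, chunk_overlap, metadata)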

Chunk text according to the specified strategy.

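Parameters:
  • text – Text to chunk

  • strategy – Chunking strategy to apply

  • chunk_size – Size of each chunk

  • chunk_overlap – Overlap between consecutive chunks

  • metadata – Metadata passed along with the text

Returns: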

List of document chunks

Return type:

list[haive.core.engine.document.config.DocumentChunk]
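The interaction between chunk_size and chunk_overlap can be illustrated with a minimal standalone sketch. It does not use or reproduce ChunkingProcessor: the function name chunk_with_overlap is hypothetical, sizes are assumed to be measured in characters, and the splitting is a plain fixed-size sliding window rather than the recursive strategy named in the default.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; not part of haive.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= chunk_overlap < chunk_size")

    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text, so no
        # trailing chunk is made up purely of already-covered characters.
        if start + chunk_size >= len(text):
            break
    return chunks


pieces = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Under these assumptions, each chunk repeats the last chunk_overlap characters of the previous one, which is why the overlap must be strictly smaller than the chunk size.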

class haive.core.engine.document.processors.ContentNormalizer(normalize_whitespace=True, remove_extra_newlines=True, strip_content=True, **kwargs)

Bases: DocumentProcessor

Processor for normalizing document content.

Initialize the content normalizer.

Parameters:
  • normalize_whitespace (bool) – Whether to normalize whitespace

  • remove_extra_newlines (bool) – Whether to remove extra newlines

  • strip_content (bool) – Whether to strip leading/trailing whitespace

  • **kwargs – Additional configuration

normalize_content(content)

Normalize document content.

Parameters:

content (str) – Content to normalize

Returns:

Normalized content

Return type:

str
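The three flags suggest a pipeline of independent cleanup steps. The sketch below shows one plausible reading of them using only the standard library. The function normalize_text is hypothetical, and the exact rules (collapsing spaces and tabs, reducing three or more newlines to a single blank line) are assumptions, not the documented behavior of ContentNormalizer.

```python
import re


def normalize_text(
    content: str,
    normalize_whitespace: bool = True,
    remove_extra_newlines: bool = True,
    strip_content: bool = True,
) -> str:
    """Apply each cleanup step only if its flag is set (illustrative only)."""
    if normalize_whitespace:
        # Collapse runs of spaces and tabs into one space; newlines are kept.
        content = re.sub(r"[ \t]+", " ", content)
    if remove_extra_newlines:
        # Reduce three or more consecutive newlines to one blank line.
        content = re.sub(r"\n{3,}", "\n\n", content)
    if strip_content:
        # Remove leading and trailing whitespace.
        content = content.strip()
    return content


cleaned = normalize_text("  Hello \t world\n\n\n\nBye  ")
```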

class haive.core.engine.document.processors.DocumentProcessor(**kwargs)

Base class for document processing operations.

Initialize the processor.

abstractmethod process(document)

Process a document.

Parameters:

document (langchain_core.documents.Document) – Document to process

Returns:

Processed document

Return type:

haive.core.engine.document.config.ProcessedDocument
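The abstract process method implies that concrete processors subclass the base and supply their own implementation. The pattern can be sketched with stand-in classes so it runs without haive or LangChain installed: Doc mimics the page_content and metadata fields of langchain_core.documents.Document, BaseProcessor stands in for DocumentProcessor, and UppercaseProcessor is a made-up example. The real method returns a ProcessedDocument, whose fields are not shown here, so the sketch returns a Doc instead.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (assumption)."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


class BaseProcessor(ABC):
    """Stand-in for DocumentProcessor: stores kwargs, requires process()."""

    def __init__(self, **kwargs: Any) -> None:
        self.config = kwargs

    @abstractmethod
    def process(self, document: Doc) -> Doc:
        """Process a document and return the result."""


class UppercaseProcessor(BaseProcessor):
    """Hypothetical concrete processor used only for illustration."""

    def process(self, document: Doc) -> Doc:
        # Return a new document; the input is left unmodified.
        return Doc(document.page_content.upper(), dict(document.metadata))


result = UppercaseProcessor().process(Doc("hi", {"source": "a.txt"}))
```

Because process is abstract, instantiating the base class directly raises TypeError; only subclasses that implement it can be created.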

class haive.core.engine.document.processors.FormatDetector(**kwargs)

Bases: DocumentProcessor

Processor for detecting document formats.

Initialize the processor.

detect_format(content, metadata)

Detect document format from content and metadata.

Parameters:
  • content (str) – Document content

  • metadata (dict[str, Any]) – Document metadata

Returns:

Detected document format

Return type:

haive.core.engine.document.config.DocumentFormat
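Detecting a format from both content and metadata typically means trusting metadata hints first and falling back to inspecting the content. The sketch below illustrates that idea only. The function guess_format, the "source" metadata key, the extension table, and the string labels are all assumptions; the real method returns a DocumentFormat value whose members are not listed here.

```python
import json
from pathlib import PurePath
from typing import Any

# Hypothetical mapping from file extension to a format label.
EXTENSION_FORMATS = {".md": "markdown", ".json": "json", ".html": "html", ".txt": "text"}


def guess_format(content: str, metadata: dict[str, Any]) -> str:
    """Guess a format label from metadata first, then from content."""
    # 1. Prefer the file extension of a "source" path, if one is present.
    source = metadata.get("source")
    if isinstance(source, str):
        suffix = PurePath(source).suffix.lower()
        if suffix in EXTENSION_FORMATS:
            return EXTENSION_FORMATS[suffix]

    # 2. Fall back to simple content sniffing.
    stripped = content.lstrip()
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except ValueError:
            pass  # Looked like JSON but did not parse; keep checking.
    if stripped.lower().startswith(("<!doctype html", "<html")):
        return "html"
    if stripped.startswith("#"):
        return "markdown"
    return "text"


label = guess_format("# Title\nbody", {})
```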

class haive.core.engine.document.processors.MetadataExtractor(**kwargs)

Bases: DocumentProcessor

Processor for extracting metadata from documents.

Initialize the processor.

extract_metadata(content, existing_metadata)

Extract additional metadata from document content.

Parameters:
  • content (str) – Document content

  • existing_metadata (dict[str, Any]) – Existing metadata

Returns:

Enhanced metadata dictionary

Return type:

dict[str, Any]
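An "enhanced" dictionary suggests that derived fields are merged into the existing metadata rather than replacing it. A minimal sketch of that merge follows. The function add_basic_stats and the keys char_count, word_count, and line_count are hypothetical; which fields MetadataExtractor actually computes is not stated above.

```python
from typing import Any


def add_basic_stats(content: str, existing_metadata: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the metadata with simple content statistics added."""
    # Copy first so the caller's dictionary is not mutated.
    enhanced = dict(existing_metadata)
    # setdefault keeps any value the caller already supplied.
    enhanced.setdefault("char_count", len(content))
    enhanced.setdefault("word_count", len(content.split()))
    enhanced.setdefault("line_count", len(content.splitlines()))
    return enhanced


original = {"source": "a.txt"}
enhanced = add_basic_stats("one two\nthree", original)
```

Using setdefault here is a design choice for the sketch: existing keys win over computed ones, so upstream metadata is never silently overwritten.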