haive.core.engine.document.splitters
Document splitters module for the Haive framework.
This module provides various text splitting strategies for chunking documents into smaller, manageable segments suitable for processing by LLMs, vector databases, and other components that have size constraints.
The splitters subsystem supports multiple splitting algorithms optimized for different document types and use cases, from simple character-based splitting to sophisticated semantic and structural splitting.
- Key Components:
DocumentSplitterEngine: Main engine for splitting documents
DocSplitterType: Enumeration of available splitter types
Various text splitters from LangChain
- Splitter Types:
RecursiveCharacterTextSplitter: Most versatile, tries multiple separators
CharacterTextSplitter: Simple character-based splitting
TokenTextSplitter: Split based on token count
SentenceTransformersTokenTextSplitter: Token splitting with sentence transformers
MarkdownTextSplitter: Preserves Markdown structure
HTMLHeaderTextSplitter: Splits HTML by headers
LatexTextSplitter: Preserves LaTeX structure
PythonCodeTextSplitter: Code-aware splitting for Python
RecursiveJsonSplitter: JSON structure-aware splitting
SpacyTextSplitter: Uses spaCy for linguistic splitting
NLTKTextSplitter: Uses NLTK for sentence splitting
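To make the "tries multiple separators" idea concrete, here is a minimal pure-Python sketch of the recursive strategy: try the coarsest separator first and fall back to finer ones only for pieces that are still too long. It is an illustration only. The function name `recursive_split` and its default separator tuple are hypothetical, chosen for this example, and are not part of Haive or LangChain. The sketch also ignores overlap, which the real splitters handle.

```python
# Hypothetical helper for illustration; not the Haive or LangChain implementation.
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators and recursing into finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at a fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing parts into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(part, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
print(recursive_split(sample, chunk_size=12))
# ['alpha beta', 'gamma delta', 'epsilon', 'zeta']
```

The paragraph break is honoured where it fits, and only the over-long middle paragraph is split again on spaces.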
Examples
Basic document splitting:
from haive.core.engine.document.splitters import (
    DocumentSplitterEngine,
    DocSplitterType,
)

# Create splitter engine
splitter = DocumentSplitterEngine(
    splitter_type=DocSplitterType.RECURSIVE_CHARACTER,
    chunk_size=1000,
    chunk_overlap=200,
)

# Split documents (`documents` is assumed to have been loaded beforehand,
# for example with the document loader module)
chunks = splitter.invoke({"documents": documents})
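The relationship between `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window: each new chunk starts `chunk_size - chunk_overlap` characters after the previous one. The helper `sliding_chunks` below is hypothetical and measures length in characters. The actual splitters cut on separators and may measure length differently, so treat this as arithmetic intuition and not as the engine's behavior.

```python
# Hypothetical helper for illustration; not part of Haive or LangChain.
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Return fixed-width character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between chunk starts
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
print([len(c) for c in sliding_chunks("x" * 2600, 1000, 200)])
# [1000, 1000, 1000]
```

With the values from the example above (1000 and 200), a 2600-character text yields three chunks starting at offsets 0, 800, and 1600.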
Code-aware splitting:
from haive.core.engine.document.splitters import (
    DocumentSplitterEngine,
    DocSplitterType,
)

# Python code splitter
code_splitter = DocumentSplitterEngine(
    splitter_type=DocSplitterType.PYTHON_CODE,
    chunk_size=2000,
    chunk_overlap=100,
)
code_chunks = code_splitter.invoke({"documents": python_docs})
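As an illustration of what "code-aware" can mean, the sketch below cuts Python source on top-level statement boundaries using the standard-library `ast` module, so a function or class is never split in the middle. This is not how the wrapped splitter is implemented. `split_top_level` is a hypothetical helper, it ignores `chunk_size`, and it drops comments and blank lines that sit between top-level statements.

```python
import ast


# Hypothetical helper for illustration; not the Haive or LangChain implementation.
def split_top_level(source):
    """Return one chunk of source text per top-level statement."""
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    chunks = []
    for node in tree.body:
        start = node.lineno
        # Keep decorators together with the function or class they decorate.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators)
        chunks.append("".join(lines[start - 1:node.end_lineno]))
    return chunks


source = (
    "import os\n"
    "\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
for chunk in split_top_level(source):
    print(repr(chunk))
# 'import os\n'
# 'def a():\n    return 1\n'
# 'class B:\n    def m(self):\n        return 2\n'
```

Parsing guarantees syntactically meaningful boundaries but fails on invalid source, which is one reason a separator-based approach can be more forgiving in practice.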
Markdown-aware splitting:
from haive.core.engine.document.splitters import (
    DocumentSplitterEngine,
    DocSplitterType,
)

# Markdown splitter preserves structure
md_splitter = DocumentSplitterEngine(
    splitter_type=DocSplitterType.MARKDOWN,
    chunk_size=1500,
    strip_whitespace=True,
)
md_chunks = md_splitter.invoke({"documents": markdown_docs})
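To show what "preserving structure" can look like for Markdown, the following sketch cuts a document at ATX headings (`#` to `######`) so each section stays with its heading. It is a simplified illustration, not the splitter's actual algorithm. `split_markdown_sections` is a hypothetical helper, it does not enforce `chunk_size`, and it would wrongly split on `#` lines inside fenced code blocks.

```python
import re

# Matches the start of an ATX heading line such as "# Title" or "## Sub".
_HEADING = re.compile(r"^#{1,6}\s", re.MULTILINE)


# Hypothetical helper for illustration; not the Haive or LangChain implementation.
def split_markdown_sections(text):
    """Split Markdown text into sections that each begin at a heading."""
    starts = [m.start() for m in _HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first heading
    starts.append(len(text))
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:])]
    return [s for s in sections if s]  # drop empty sections


doc = "Intro text\n\n# Title\nBody\n\n## Sub\nMore\n"
print(split_markdown_sections(doc))
# ['Intro text', '# Title\nBody', '## Sub\nMore']
```

Stripping surrounding whitespace from each section here loosely mirrors the intent of `strip_whitespace=True` in the example above.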
See also
Document loader module for loading documents
Document transformer module for document transformation
LangChain text splitters documentation