haive.core.engine.document.loaders

Haive Document Loaders - Ultimate Auto-Loading System.

This module provides the world’s most comprehensive document loading system with support for 230+ langchain_community document loaders. It can automatically detect, configure, and load documents from ANY source type.

🚀 Features: - Auto-Detection: Automatically detects source type from paths/URLs - 230+ Loaders: Complete langchain_community loader support - Smart Registry: Intelligent loader selection based on preferences - Bulk Loading: Concurrent processing with progress tracking - Error Handling: Built-in retry logic and graceful error handling - Async Support: Full async/await support for high-performance scenarios

📁 Supported Sources: - Local Files: PDF, DOCX, CSV, JSON, code files, archives, etc. - Web Sources: Websites, APIs, documentation sites, social media - Databases: PostgreSQL, MySQL, MongoDB, Redis, Elasticsearch, etc. - Cloud Storage: S3, GCS, Azure Blob, Google Drive, Dropbox, etc. - Business Platforms: Salesforce, HubSpot, Zendesk, Jira, etc. - Communication: Slack, Discord, Teams, email, forums, etc. - Specialized: Government, healthcare, education, finance, etc.

💡 Quick Start:

from haive.core.engine.document.loaders import AutoLoader

# Ultimate auto-loader - works with ANY source loader = AutoLoader()

# Load from anywhere docs = loader.load(“document.pdf”) # Local file docs = loader.load(”https://docs.site.com”) # Website docs = loader.load(“s3://bucket/docs/”) # Cloud storage docs = loader.load(“postgres://db/table”) # Database

# Load documents from multiple sources (standard langchain method) docs = loader.load_documents([

“file1.pdf”, “file2.txt”, “https://site.com

])

# Bulk loading with detailed results sources = [“file1.pdf”, “https://site.com”, “s3://bucket/”] result = loader.load_bulk(sources)

# Load everything from a source docs = loader.load_all(“/documents/”) # Entire directory docs = loader.load_all(”https://wiki.com”) # Entire website

🔧 Advanced Usage:
from haive.core.engine.document.loaders import (

AutoLoader, AutoLoaderConfig, LoaderPreference

)

# Configure for quality vs speed config = AutoLoaderConfig(

preference=LoaderPreference.QUALITY, max_concurrency=20, enable_caching=True

) loader = AutoLoader(config)

# Async loading from single source docs = await loader.aload(”https://large-site.com”)

# Async loading from multiple sources docs = await loader.aload_documents([

“file1.pdf”, “https://site1.com”, “https://site2.com

])

# Get detailed loading information result = loader.load_detailed(“document.pdf”) print(f”Loaded {len(result.documents)} docs in {result.loading_time:.2f}s”)

📊 Registry Management:
from haive.core.engine.document.loaders import (

auto_register_all, get_registration_status, list_available_sources

)

# Auto-register all 230+ loaders stats = auto_register_all() print(f”Registered {stats.total_sources_registered} sources”)

# Check what’s available sources = list_available_sources() print(f”Available sources: {len(sources)}”)

# Get detailed status status = get_registration_status()

Convenience Functions:
from haive.core.engine.document.loaders import (

load_document, load_documents_bulk, aload_document

)

# Simple one-liner loading docs = load_document(“any-source-here”)

# Bulk loading multiple sources docs = load_documents_bulk([“file1.pdf”, “file2.docx”])

# Async loading docs = await aload_document(”https://example.com”)

This system represents the ultimate evolution of document loading - from the messy legacy system to a production-ready, scalable solution that handles any document source imaginable.

Author: Claude (Haive AI Agent Framework) Version: 2.0.0 - Complete Rewrite with 230+ Loaders

Submodules