haive.core.engine.retriever.providers.BM25RetrieverConfig¶

BM25 Retriever implementation for the Haive framework.

This module provides a configuration class for the BM25 (Best Matching 25) retriever, which ranks text with the BM25 function. BM25 is a probabilistic ranking function that search engines use to estimate the relevance of documents to a search query.

The BM25Retriever works by:

1. Tokenizing and preprocessing documents and queries
2. Computing BM25 scores for each document-query pair
3. Ranking documents by their BM25 scores
4. Returning the top-k most relevant documents
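The four steps above can be sketched in plain Python. This is a minimal illustration, not the code Haive or LangChain runs: the whitespace tokenizer, the smoothed IDF variant, and the helper names `tokenize`, `bm25_scores`, and `top_k` are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Step 1: naive lowercase/whitespace tokenization (an assumption of this sketch).
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        # Step 2: sum the per-term contributions for this document-query pair.
        for term in tokenize(query):
            if term not in tf:
                continue
            # Smoothed, always-positive IDF variant (hypothetical choice for this sketch).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores


def top_k(query: str, docs: list[str], k: int = 4, **params: float) -> list[str]:
    scores = bm25_scores(query, docs, **params)
    # Step 3: rank document indices by score, highest first.
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    # Step 4: return the k best documents.
    return [docs[i] for i in ranked[:k]]
```

Calling `top_k("machine learning algorithms", texts, k=2)` on a list of strings returns the two highest-scoring texts. The sketch assumes a non-empty document list, since the average length would otherwise divide by zero.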

This retriever is particularly useful when you are:

- Working with text-based document collections
- Relying on precise keyword matching and term frequency analysis
- Looking for interpretable ranking scores
- Building traditional information retrieval systems
- Combining keyword search with vector search in hybrid approaches
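For the hybrid case, one common way to merge a BM25 ranking with a vector-search ranking is reciprocal rank fusion (RRF). The sketch below is an illustration only: RRF is a technique chosen for this example, not something this module is documented to provide, and the function name, the constant `c=60`, and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], c: int = 60) -> list[str]:
    # Each ranking lists document IDs, best first. A document earns
    # 1 / (c + rank) from every ranking it appears in.
    fused: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (c + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


# Hypothetical rankings from a BM25 retriever and a vector retriever.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused_ranking = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Because RRF uses only rank positions, it sidesteps the fact that BM25 scores and vector similarities are on different scales.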

The implementation integrates with LangChain’s BM25Retriever while providing a consistent Haive configuration interface.

Classes¶

BM25RetrieverConfig

Configuration for BM25 retriever in the Haive framework.

Module Contents¶

class haive.core.engine.retriever.providers.BM25RetrieverConfig.BM25RetrieverConfig[source]¶

Bases: haive.core.engine.retriever.retriever.BaseRetrieverConfig

Configuration for BM25 retriever in the Haive framework.

This retriever uses the BM25 ranking function to score documents based on term frequency, inverse document frequency, and document length normalization.
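For reference, the commonly cited Okapi BM25 form combines these three ingredients as shown below. Here f(q_i, D) is the frequency of query term q_i in document D, |D| is the document length in tokens, and avgdl is the average document length in the collection. The exact IDF variant depends on the underlying library, so treat this as the textbook shape rather than a statement of what this class computes.

```latex
\operatorname{score}(D, Q) = \sum_{q_i \in Q} \operatorname{IDF}(q_i) \cdot
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b \,\frac{|D|}{\mathrm{avgdl}}\right)}
```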

retriever_type¶

The type of retriever (always BM25).

Type: RetrieverType

documents¶

Documents to index for retrieval.

Type: List[Document]

k¶

Number of documents to retrieve (default: 4).

Type: int

k1¶

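BM25 parameter controlling term frequency saturation (default: 1.2). Higher values let repeated occurrences of a term keep adding to the score for longer before the contribution levels off.

Type: float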

b¶

BM25 parameter controlling document length normalization (default: 0.75). A value of 0 disables length normalization; a value of 1 applies it fully.

Type: float
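The effect of k1 and b is easiest to see on the term-frequency part of the standard BM25 formula in isolation. The helper below is a hypothetical illustration, not part of the Haive API; it assumes the textbook Okapi form.

```python
def tf_component(tf: int, doc_len: float, avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    # Length normalization: b blends between "ignore length" (b=0)
    # and "scale fully by doc_len / avgdl" (b=1).
    norm = k1 * (1 - b + b * doc_len / avgdl)
    # Saturating term-frequency factor; it approaches k1 + 1 as tf grows.
    return tf * (k1 + 1) / (tf + norm)


# For an average-length document, each extra occurrence adds less than the last.
gains = [tf_component(tf, doc_len=10, avgdl=10) for tf in (1, 2, 5, 20)]

# With b > 0, the same term count is worth less in a longer document.
short_doc = tf_component(2, doc_len=5, avgdl=10)
long_doc = tf_component(2, doc_len=50, avgdl=10)
```

With the defaults, `gains` rises toward the ceiling of `k1 + 1 = 2.2` without reaching it, and `short_doc` exceeds `long_doc`.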

epsilon¶

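BM25 parameter for the IDF calculation (default: 0.25). In rank_bm25's BM25Okapi, which LangChain's BM25Retriever builds on, epsilon scales the average IDF to produce the value substituted for negative IDFs.

Type: float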
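As a rough sketch of what an epsilon floor does, the function below mirrors the behavior of rank_bm25's BM25Okapi as best understood here: very common terms can receive a negative IDF, and those values are replaced by `epsilon` times the average IDF. The function name and sample frequencies are hypothetical, and whether Haive forwards `epsilon` unchanged is an assumption.

```python
import math


def idf_with_floor(doc_freqs: dict[str, int], n_docs: int, epsilon: float = 0.25) -> dict[str, float]:
    # Classic IDF, which goes negative for terms found in more than half the documents.
    idf = {
        term: math.log(n_docs - df + 0.5) - math.log(df + 0.5)
        for term, df in doc_freqs.items()
    }
    average_idf = sum(idf.values()) / len(idf)
    floor = epsilon * average_idf
    # Replace negative IDFs with the floor value; leave the rest untouched.
    return {term: (floor if value < 0 else value) for term, value in idf.items()}


# Hypothetical document frequencies in a 10-document collection.
idf_values = idf_with_floor({"the": 9, "bm25": 1, "rank": 2}, n_docs=10)
```

Here "the" appears in 9 of 10 documents, so its raw IDF is negative and it receives the floor value instead, while the rarer terms keep their computed IDF.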

Examples

>>> from haive.core.engine.retriever import BM25RetrieverConfig
>>> from langchain_core.documents import Document
>>>
>>> # Create documents
>>> docs = [
...     Document(page_content="Machine learning is a subset of AI"),
...     Document(page_content="Deep learning uses neural networks"),
...     Document(page_content="Natural language processing handles text")
... ]
>>>
>>> # Create the BM25 retriever config
>>> config = BM25RetrieverConfig(
...     name="bm25_retriever",
...     documents=docs,
...     k=2,
...     k1=1.5,  # Higher k1: term frequency saturates more slowly
...     b=0.8    # More document length normalization
... )
>>>
>>> # Instantiate and use the retriever
>>> retriever = config.instantiate()
>>> results = retriever.invoke("machine learning algorithms")

get_input_fields()[source]¶

Return input field definitions for BM25 retriever.

Return type: dict[str, tuple[type, Any]]

get_output_fields()[source]¶

Return output field definitions for BM25 retriever.

Return type: dict[str, tuple[type, Any]]

instantiate()[source]¶

Create a BM25 retriever from this configuration.

Returns:

Instantiated retriever ready for text ranking.

Return type:

BM25Retriever

Raises:

ImportError – If required packages are not available.
ValueError – If the documents list is empty.