agents.document_modifiers.complex_extraction.agent

Complex Extraction Agent for structured data extraction from text.

This module provides the ComplexExtractionAgent class, which extracts structured data from text according to a specified schema. Each extraction is validated and retried on failure, and validation errors can optionally be corrected with JSONPatch.

The agent supports multiple retry strategies and can handle complex validation scenarios where initial extraction attempts may fail.

Classes:

ComplexExtractionAgent: Main agent for complex structured data extraction

Examples

Basic usage:

from haive.agents.document_modifiers.complex_extraction import ComplexExtractionAgent
from haive.agents.document_modifiers.complex_extraction.config import ComplexExtractionAgentConfig
from pydantic import BaseModel

class PersonInfo(BaseModel):
    name: str
    age: int
    occupation: str

config = ComplexExtractionAgentConfig(
    extraction_model=PersonInfo,
    max_retries=3
)
agent = ComplexExtractionAgent(config)

text = "John Smith is a 35-year-old software engineer."
result = agent.run(text)
person_data = result["extracted_data"]

With JSONPatch error correction:

config = ComplexExtractionAgentConfig(
    extraction_model=PersonInfo,
    use_jsonpatch=True,
    max_retries=5
)
agent = ComplexExtractionAgent(config)
# Reuses the `text` string defined in the basic example above
result = agent.run(text)

See also

  • ComplexExtractionAgentConfig: Configuration class

  • RetryStrategy: Retry strategy configuration

Classes

ComplexExtractionAgent

Agent that extracts complex structured information from text.

Module Contents

class agents.document_modifiers.complex_extraction.agent.ComplexExtractionAgent(config=ComplexExtractionAgentConfig())

Bases: haive.core.engine.agent.agent.Agent[haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig]

Agent that extracts complex structured information from text.

This agent extracts structured data according to a specified Pydantic schema. Each extraction is validated and retried on failure, and validation errors can optionally be corrected with JSONPatch.

The agent creates a validation workflow that can handle complex extraction scenarios where initial attempts may fail due to parsing errors, validation issues, or incomplete data. It supports multiple retry strategies and can automatically correct errors using JSONPatch operations.

Parameters:

config (haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig) – Configuration object containing extraction settings, model schema, retry parameters, and LLM configuration.

extraction_model

Pydantic model class defining the extraction schema

max_retries

Maximum number of retry attempts for failed extractions

force_tool_choice

Whether to force the LLM to use the extraction tool

use_jsonpatch

Whether to enable JSONPatch-based error correction

extraction_tool

Tool instance created from the extraction model

llm

Language model instance for performing extractions

Examples

Basic structured extraction:

from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: float
    category: str

config = ComplexExtractionAgentConfig(
    extraction_model=ProductInfo,
    max_retries=3
)
agent = ComplexExtractionAgent(config)

text = "The MacBook Pro costs $2499 and is a laptop computer."
result = agent.run(text)
product = result["extracted_data"]
# Expected shape (actual values depend on the LLM), e.g.:
# {"name": "MacBook Pro", "price": 2499.0, "category": "laptop"}

With advanced error correction:

config = ComplexExtractionAgentConfig(
    extraction_model=ProductInfo,
    use_jsonpatch=True,
    max_retries=5,
    force_tool_choice=True
)
agent = ComplexExtractionAgent(config)

Processing multiple documents:

documents = ["Product A costs $100", "Product B is $200 software"]
results = [agent.run(doc) for doc in documents]

Note

The agent requires a Pydantic model class to define the extraction schema. JSONPatch functionality requires the ‘jsonpatch’ library to be installed.

Raises:
  • ImportError – If JSONPatch is enabled but the jsonpatch library is not installed

  • ValueError – If extraction fails after maximum retry attempts

See also

  • ComplexExtractionAgentConfig: Configuration options

  • RetryStrategy: Retry strategy configuration

  • PatchFunctionParameters: JSONPatch parameter schema

Initialize the complex extraction agent.

Sets up the extraction model, validation tools, and retry mechanisms based on the provided configuration.

Parameters:

config (haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig) – Configuration object containing extraction model, retry settings, and LLM configuration. Defaults to a new instance with default values.

Raises:

ImportError – If JSONPatch is enabled in the config but the jsonpatch library is not installed.

bind_validator_with_jsonpatch_retries(llm, *, tools, tool_choice=None, max_attempts=3)

Bind a validator with JSONPatch-based retries.

Creates an advanced validation workflow that uses JSONPatch operations to automatically correct validation errors. When a tool call fails validation, the system generates patch instructions to fix the errors.

Parameters:
  • llm (langchain_core.language_models.BaseChatModel) – The base language model to use for extraction and error correction.

  • tools (list[langchain_core.tools.Tool]) – List of tools available for extraction. The validation will ensure tool calls conform to these tool schemas.

  • tool_choice (str | None) – Optional specific tool name to force the LLM to use. If specified, the LLM must use this tool.

  • max_attempts (int) – Maximum number of retry attempts before giving up. Defaults to 3.

Returns:

StateGraph builder instance (not compiled). Must be compiled before use.

Raises:

ImportError – If the jsonpatch library is not installed but JSONPatch functionality is requested.

Return type:

langgraph.graph.StateGraph

Note

This method creates a sophisticated retry mechanism where:

  1. Initial extraction attempts use the primary LLM

  2. Validation errors trigger JSONPatch correction attempts

  3. Patch operations are applied to fix specific validation issues

  4. Multiple correction iterations are supported up to max_attempts
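The correction loop described in this note can be illustrated with a minimal, standard-library-only sketch. This is not the agent's implementation: apply_patch handles only a small subset of JSON Patch (RFC 6902), and validate_person, extract_with_patches, and fake_llm_patch are hypothetical names introduced here, with fake_llm_patch standing in for the LLM that proposes patch operations.

```python
import copy


def apply_patch(doc, ops):
    """Apply a small subset of RFC 6902: add/replace/remove on object keys."""
    result = copy.deepcopy(doc)
    for op in ops:
        # A JSON Pointer such as "/address/city" becomes ["address", "city"].
        # Escapes ("~0", "~1") and array indices are not handled in this sketch.
        *parents, key = op["path"].lstrip("/").split("/")
        target = result
        for part in parents:
            target = target[part]
        if op["op"] in ("add", "replace"):
            target[key] = op["value"]
        elif op["op"] == "remove":
            del target[key]
        else:
            raise ValueError(f"unsupported op: {op['op']}")
    return result


def validate_person(data):
    """Hypothetical validator: return a list of error strings (empty if valid)."""
    errors = []
    if not isinstance(data.get("name"), str):
        errors.append("name must be a string")
    if not isinstance(data.get("age"), int):
        errors.append("age must be an integer")
    return errors


def extract_with_patches(first_attempt, propose_patch, max_attempts=3):
    """Validate; on failure apply a proposed patch, up to max_attempts corrections."""
    candidate = first_attempt
    for _ in range(max_attempts):
        errors = validate_person(candidate)
        if not errors:
            return candidate
        candidate = apply_patch(candidate, propose_patch(candidate, errors))
    if validate_person(candidate):
        raise ValueError("extraction failed after maximum retry attempts")
    return candidate


def fake_llm_patch(candidate, errors):
    """Stand-in for the LLM: sees the errors and returns patch operations."""
    return [{"op": "replace", "path": "/age", "value": 35}]


fixed = extract_with_patches({"name": "John Smith", "age": "35 years"}, fake_llm_patch)
print(fixed)  # {'name': 'John Smith', 'age': 35}
```

The point of patching is that only the failing field is rewritten, so fields that already passed validation are left untouched between attempts.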

bind_validator_with_retries(llm, *, tools, tool_choice=None, max_attempts=3)

Bind a validator with standard retries (no JSONPatch).

Creates a basic validation workflow with simple retry logic. When validation fails, the system will retry the extraction up to the maximum number of attempts without advanced error correction.

Parameters:
  • llm (langchain_core.language_models.BaseChatModel) – The base language model to use for extraction attempts.

  • tools (list[langchain_core.tools.Tool]) – List of tools available for extraction. Tool calls will be validated against these tool schemas.

  • tool_choice (str | None) – Optional specific tool name to force the LLM to use. If specified, the LLM must call this tool.

  • max_attempts (int) – Maximum number of retry attempts before failing. Defaults to 3.

Returns:

StateGraph builder instance (not compiled). Must be compiled before use.

Return type:

langgraph.graph.StateGraph

Note

This is the simpler alternative to JSONPatch-based retries. It will simply retry failed extractions without attempting to automatically correct validation errors.
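As a rough illustration of the standard strategy, the sketch below re-runs the extraction until validation passes or the attempt budget is spent. It is a plain-Python approximation under stated assumptions, not the StateGraph that this method builds: extract_with_retries, fake_extraction, and validate_product are hypothetical names, and fake_extraction stands in for the LLM call.

```python
def extract_with_retries(attempt_extraction, validate, max_attempts=3):
    """Re-run the extraction until it validates or attempts run out."""
    last_errors = []
    for attempt in range(1, max_attempts + 1):
        candidate = attempt_extraction(attempt)
        last_errors = validate(candidate)
        if not last_errors:
            return candidate
    raise ValueError(
        f"extraction failed after {max_attempts} attempts: {last_errors}"
    )


def fake_extraction(attempt):
    """Hypothetical model stand-in: the first two answers omit the price."""
    if attempt < 3:
        return {"name": "MacBook Pro"}
    return {"name": "MacBook Pro", "price": 2499.0}


def validate_product(data):
    """Return a list of error strings (empty if the data is valid)."""
    return [] if isinstance(data.get("price"), float) else ["price is missing"]


product = extract_with_retries(fake_extraction, validate_product, max_attempts=3)
print(product)  # {'name': 'MacBook Pro', 'price': 2499.0}
```

Unlike the JSONPatch variant, every retry here regenerates the whole result from scratch.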

extract_node(state)

Main extraction node function.

Processes the current state through the extraction pipeline, invoking the configured extraction tool and handling the results.

Parameters:

state (Any) – Current workflow state containing messages and other context. Can be either a dictionary with a ‘messages’ key or an object with a messages attribute.

Returns:

Updated state dictionary containing:

  • extracted_data: The structured data extracted by the tool

  • messages: Updated message list including extraction results

  • error: Error message if extraction failed

Note

This method handles various state formats and gracefully manages errors during extraction. If no extraction runnable is available, the state is passed through unchanged.
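The state handling described here (dictionary or object state, pass-through when no runnable is available, errors reported in the result) might look roughly like the sketch below. It assumes nothing about the real method beyond this description: get_messages and extract_node_sketch are hypothetical names, and the runnable is modelled as a plain callable.

```python
from types import SimpleNamespace


def get_messages(state):
    """Read messages from a dict state or from an object with a messages attribute."""
    if isinstance(state, dict):
        return list(state.get("messages", []))
    return list(getattr(state, "messages", []))


def extract_node_sketch(state, extraction_runnable=None):
    """Hypothetical node: run the extraction and report the result or the error."""
    if extraction_runnable is None:
        return state  # nothing to run: pass the state through unchanged
    messages = get_messages(state)
    try:
        extracted = extraction_runnable(messages)
    except Exception as exc:
        return {"messages": messages, "error": str(exc)}
    return {
        "extracted_data": extracted,
        "messages": messages + [f"extracted: {extracted}"],
    }


def fake_runnable(messages):
    """Stand-in for the extraction tool call."""
    return {"name": "John Smith", "age": 30}


def failing_runnable(messages):
    raise RuntimeError("model unavailable")


dict_state = {"messages": ["John Smith is 30 years old."]}
object_state = SimpleNamespace(messages=["John Smith is 30 years old."])

out_dict = extract_node_sketch(dict_state, fake_runnable)
out_obj = extract_node_sketch(object_state, fake_runnable)
out_err = extract_node_sketch(dict_state, failing_runnable)
print(out_dict["extracted_data"])  # {'name': 'John Smith', 'age': 30}
print(out_err["error"])            # model unavailable
```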

run(input_data, **kwargs)

Run the extraction agent on input data.

Processes the input through the extraction pipeline, handling various input formats and returning structured extraction results.

Parameters:
  • input_data (str | list[str] | dict[str, Any] | pydantic.BaseModel) – Input text or data to extract information from. Supports:

      • str: Single text document

      • list[str]: Multiple text documents to process together

      • dict[str, Any]: Dictionary with ‘text’, ‘content’, or ‘messages’ keys

      • BaseModel: Pydantic model with text content

  • **kwargs – Additional runtime configuration options passed to the underlying workflow execution.

Returns:

Dictionary containing extraction results:

  • extracted_data: Structured data conforming to the extraction model

  • messages: Full conversation history during extraction

  • Additional metadata from the extraction process

Examples

Basic text extraction:

agent = ComplexExtractionAgent(config)
result = agent.run("John Smith is 30 years old.")
person_data = result["extracted_data"]

Multiple documents:

docs = ["Person A info", "Person B info"]
result = agent.run(docs)

Note

If no extraction workflow has been set up, this method will automatically call setup_workflow() before processing.
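To make the accepted input formats concrete, here is one way such inputs could be reduced to a single text string. This is a sketch under assumptions, not the agent's code: normalize_input is a hypothetical helper, the choice to join list items with blank lines is a guess, and Pydantic models are detected by duck typing on model_dump (the Pydantic v2 method) so that the example needs only the standard library.

```python
def normalize_input(input_data):
    """Hypothetical helper: reduce the supported input types to one string."""
    if isinstance(input_data, str):
        return input_data
    if isinstance(input_data, list):
        # Assumption: documents are processed together as one joined text.
        return "\n\n".join(input_data)
    if hasattr(input_data, "model_dump"):
        # Duck-typed Pydantic v2 model: convert it to a dictionary first.
        input_data = input_data.model_dump()
    if isinstance(input_data, dict):
        for key in ("text", "content"):
            if key in input_data:
                return str(input_data[key])
        if "messages" in input_data:
            return "\n".join(str(message) for message in input_data["messages"])
    raise TypeError(f"unsupported input type: {type(input_data).__name__}")


print(normalize_input("John Smith is 30 years old."))
print(normalize_input(["Person A info", "Person B info"]))
print(normalize_input({"content": "Product A costs $100"}))
```

Normalizing up front keeps the rest of the pipeline independent of how the caller supplied the text.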

setup_workflow()

Set up the agent workflow.

Initializes the extraction workflow graph based on the agent configuration. This method creates the appropriate validation and retry mechanism (either JSONPatch-based or standard retries) and configures the processing pipeline.

The workflow includes encoding/decoding steps, validation nodes, and state management for tracking extraction progress.

Note

This method is called automatically when needed and does not need to be invoked manually. The workflow graph is not compiled here; compilation happens in the parent class.

Return type:

None