agents.document_modifiers.complex_extraction.agent
Complex Extraction Agent for structured data extraction from text.
This module provides the ComplexExtractionAgent class, which extracts structured data according to a specified schema. It validates each extraction, retries on failure, and can optionally correct errors with JSONPatch.
The agent supports multiple retry strategies and can handle complex validation scenarios where initial extraction attempts may fail.
- Classes:
ComplexExtractionAgent: Main agent for complex structured data extraction
Examples
Basic usage:
from haive.agents.document_modifiers.complex_extraction import ComplexExtractionAgent
from haive.agents.document_modifiers.complex_extraction.config import ComplexExtractionAgentConfig
from pydantic import BaseModel

class PersonInfo(BaseModel):
    name: str
    age: int
    occupation: str

config = ComplexExtractionAgentConfig(
    extraction_model=PersonInfo,
    max_retries=3
)
agent = ComplexExtractionAgent(config)

text = "John Smith is a 35-year-old software engineer."
result = agent.run(text)
person_data = result["extracted_data"]
With JSONPatch error correction:
config = ComplexExtractionAgentConfig(
    extraction_model=PersonInfo,
    use_jsonpatch=True,
    max_retries=5
)
agent = ComplexExtractionAgent(config)

# complex_text is any input string defined elsewhere
result = agent.run(complex_text)
See also
ComplexExtractionAgentConfig: Configuration class
RetryStrategy: Retry strategy configuration
Classes
ComplexExtractionAgent: Agent that extracts complex structured information from text.
Module Contents
- class agents.document_modifiers.complex_extraction.agent.ComplexExtractionAgent(config=ComplexExtractionAgentConfig())
Bases: haive.core.engine.agent.agent.Agent[haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig]
Agent that extracts complex structured information from text.
This agent implements sophisticated structured data extraction using validation with retries and optional JSONPatch-based error correction to reliably extract data according to specified Pydantic schemas.
The agent creates a validation workflow that can handle complex extraction scenarios where initial attempts may fail due to parsing errors, validation issues, or incomplete data. It supports multiple retry strategies and can automatically correct errors using JSONPatch operations.
- Parameters:
config (haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig) – Configuration object containing extraction settings, model schema, retry parameters, and LLM configuration.
- extraction_model
Pydantic model class defining the extraction schema
- max_retries
Maximum number of retry attempts for failed extractions
- force_tool_choice
Whether to force the LLM to use the extraction tool
- use_jsonpatch
Whether to enable JSONPatch-based error correction
- extraction_tool
Tool instance created from the extraction model
- llm
Language model instance for performing extractions
Examples
Basic structured extraction:
from pydantic import BaseModel

class ProductInfo(BaseModel):
    name: str
    price: float
    category: str

config = ComplexExtractionAgentConfig(
    extraction_model=ProductInfo,
    max_retries=3
)
agent = ComplexExtractionAgent(config)

text = "The MacBook Pro costs $2499 and is a laptop computer."
result = agent.run(text)
product = result["extracted_data"]
# product = {"name": "MacBook Pro", "price": 2499.0, "category": "laptop"}
With advanced error correction:
config = ComplexExtractionAgentConfig(
    extraction_model=ProductInfo,
    use_jsonpatch=True,
    max_retries=5,
    force_tool_choice=True
)
agent = ComplexExtractionAgent(config)
Processing multiple documents:
documents = ["Product A costs $100", "Product B is $200 software"]
results = [agent.run(doc) for doc in documents]
Note
The agent requires a Pydantic model class to define the extraction schema. JSONPatch functionality requires the ‘jsonpatch’ library to be installed.
- Raises:
ImportError – If JSONPatch is enabled but the jsonpatch library is not installed
ValueError – If extraction fails after maximum retry attempts
See also
ComplexExtractionAgentConfig: Configuration options
RetryStrategy: Retry strategy configuration
PatchFunctionParameters: JSONPatch parameter schema
Initialize the complex extraction agent.
Sets up the extraction model, validation tools, and retry mechanisms based on the provided configuration.
- Parameters:
config (haive.agents.document_modifiers.complex_extraction.config.ComplexExtractionAgentConfig) – Configuration object containing extraction model, retry settings, and LLM configuration. Defaults to a new instance with default values.
- Raises:
ImportError – If JSONPatch is enabled in config but jsonpatch library is not installed.
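The ImportError condition above can be pictured as a guard that runs only when JSONPatch mode is switched on. The following is a minimal sketch, not the agent's actual code: the helper names `require_optional_dependency` and `check_config` are hypothetical, and the only library call assumed is the standard-library `importlib.util.find_spec`.

```python
import importlib.util


def require_optional_dependency(module_name: str, feature: str) -> None:
    """Raise ImportError if the top-level module `module_name` is unavailable.

    Hypothetical helper for illustration; the agent's real check may differ.
    """
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{feature} requires the '{module_name}' package, "
            f"which is not installed."
        )


def check_config(use_jsonpatch: bool) -> None:
    """Demand jsonpatch only when the JSONPatch feature is enabled."""
    if use_jsonpatch:
        require_optional_dependency("jsonpatch", "JSONPatch error correction")
```

Checking at construction time, rather than on the first failed extraction, surfaces a missing dependency before any LLM call is made.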
- bind_validator_with_jsonpatch_retries(llm, *, tools, tool_choice=None, max_attempts=3)
Bind a validator with JSONPatch-based retries.
Creates an advanced validation workflow that uses JSONPatch operations to automatically correct validation errors. When a tool call fails validation, the system generates patch instructions to fix the errors.
- Parameters:
llm (langchain_core.language_models.BaseChatModel) – The base language model to use for extraction and error correction.
tools (list[langchain_core.tools.Tool]) – List of tools available for extraction. The validation will ensure tool calls conform to these tool schemas.
tool_choice (str | None) – Optional specific tool name to force the LLM to use. If specified, the LLM must use this tool.
max_attempts (int) – Maximum number of retry attempts before giving up. Defaults to 3.
- Returns:
StateGraph builder instance (not compiled). Must be compiled before use.
- Raises:
ImportError – If the jsonpatch library is not installed but JSONPatch functionality is requested.
- Return type:
langgraph.graph.StateGraph
Note
This method creates a sophisticated retry mechanism where:
1. Initial extraction attempts use the primary LLM
2. Validation errors trigger JSONPatch correction attempts
3. Patch operations are applied to fix specific validation issues
4. Multiple correction iterations are supported up to max_attempts
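To make steps 2 and 3 concrete: a JSON Patch (RFC 6902) is a list of operations such as `add`, `replace`, and `remove`, each addressed by a path into the document. The sketch below applies such a patch to a failed tool-call payload. It is an illustration under stated assumptions, not the agent's implementation: `apply_simple_patch` and the sample payload are hypothetical, the applier handles only top-level keys and ignores JSON Pointer escaping, and it uses the standard library only, whereas the agent requires the `jsonpatch` library.

```python
import copy
from typing import Any


def apply_simple_patch(
    doc: dict[str, Any], patch: list[dict[str, Any]]
) -> dict[str, Any]:
    """Apply add/replace/remove operations to top-level keys of `doc`.

    Illustrative subset of RFC 6902; nested paths and arrays are not handled.
    """
    result = copy.deepcopy(doc)  # leave the failed payload untouched
    for op in patch:
        path = op["path"]
        if not path.startswith("/") or "/" in path[1:]:
            raise ValueError(f"only top-level paths are supported: {path!r}")
        key = path[1:]
        if op["op"] == "add":
            result[key] = op["value"]
        elif op["op"] == "replace":
            if key not in result:
                raise KeyError(f"cannot replace missing key {key!r}")
            result[key] = op["value"]
        elif op["op"] == "remove":
            del result[key]
        else:
            raise ValueError(f"unsupported operation: {op['op']!r}")
    return result


# A tool call that failed validation: age is a string, occupation is missing.
failed_args = {"name": "John Smith", "age": "thirty-five"}

# Patch operations of the kind a correction step might produce.
patch = [
    {"op": "replace", "path": "/age", "value": 35},
    {"op": "add", "path": "/occupation", "value": "software engineer"},
]

fixed_args = apply_simple_patch(failed_args, patch)
```

Patching changes only the fields that failed validation, so the fields that were already correct do not have to be regenerated.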
- bind_validator_with_retries(llm, *, tools, tool_choice=None, max_attempts=3)
Bind a validator with standard retries (no JSONPatch).
Creates a basic validation workflow with simple retry logic. When validation fails, the system will retry the extraction up to the maximum number of attempts without advanced error correction.
- Parameters:
llm (langchain_core.language_models.BaseChatModel) – The base language model to use for extraction attempts.
tools (list[langchain_core.tools.Tool]) – List of tools available for extraction. Tool calls will be validated against these tool schemas.
tool_choice (str | None) – Optional specific tool name to force the LLM to use. If specified, the LLM must call this tool.
max_attempts (int) – Maximum number of retry attempts before failing. Defaults to 3.
- Returns:
StateGraph builder instance (not compiled). Must be compiled before use.
- Return type:
langgraph.graph.StateGraph
Note
This is the simpler alternative to JSONPatch-based retries. It will simply retry failed extractions without attempting to automatically correct validation errors.
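The plain retry strategy amounts to a validate-and-retry loop. The sketch below is an illustration rather than the agent's implementation: `extract_with_retries`, `validate_person`, and `make_flaky_model` are hypothetical stand-ins, and validation is written by hand with the standard library instead of Pydantic so the example has no dependencies.

```python
from typing import Any, Callable


def validate_person(data: dict[str, Any]) -> dict[str, Any]:
    """Hand-written stand-in for schema validation (name: str, age: int)."""
    name = data.get("name")
    age = data.get("age")
    if not isinstance(name, str):
        raise ValueError("name must be a string")
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(age, bool) or not isinstance(age, int):
        raise ValueError("age must be an integer")
    return data


def extract_with_retries(
    call_model: Callable[[str], dict[str, Any]],
    text: str,
    max_attempts: int = 3,
) -> dict[str, Any]:
    """Call the model and validate; repeat until success or attempts run out."""
    errors: list[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = call_model(text)
        try:
            return validate_person(candidate)
        except ValueError as exc:
            errors.append(f"attempt {attempt}: {exc}")
    raise ValueError(
        f"extraction failed after {max_attempts} attempts: {errors}"
    )


def make_flaky_model() -> Callable[[str], dict[str, Any]]:
    """Pretend model that returns a bad payload once, then a good one."""
    calls = {"count": 0}

    def model(text: str) -> dict[str, Any]:
        calls["count"] += 1
        if calls["count"] == 1:
            return {"name": "John Smith", "age": "30"}  # age has the wrong type
        return {"name": "John Smith", "age": 30}

    return model


result = extract_with_retries(make_flaky_model(), "John Smith is 30 years old.")
```

Each retry here is a fresh attempt that discards the failed payload. That is the difference from the JSONPatch variant, which repairs the failed payload instead.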
- extract_node(state)
Main extraction node function.
Processes the current state through the extraction pipeline, invoking the configured extraction tool and handling the results.
- Parameters:
state (Any) – Current workflow state containing messages and other context. Can be either a dictionary with ‘messages’ key or an object with messages attribute.
- Returns:
Updated state dictionary containing:
extracted_data: The structured data extracted by the tool
messages: Updated message list including extraction results
error: Error message if extraction failed
- Return type:
dict
Note
This method handles various state formats and gracefully manages errors during extraction. If no extraction runnable is available, the state is passed through unchanged.
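Accepting both state shapes comes down to a small accessor that checks for a dictionary first and falls back to attribute lookup. This is a hedged sketch: `get_messages` and `ObjectState` are hypothetical names, and returning an empty list when no messages exist is an assumption, not documented behavior.

```python
from dataclasses import dataclass, field
from typing import Any


def get_messages(state: Any) -> list[Any]:
    """Return the message list from a dict state or an attribute-style state."""
    if isinstance(state, dict):
        messages = state.get("messages")
    else:
        messages = getattr(state, "messages", None)
    # Treat a missing or None value as "no messages yet".
    return list(messages or [])


@dataclass
class ObjectState:
    """Hypothetical attribute-style state used only for this demonstration."""

    messages: list[str] = field(default_factory=list)


from_dict = get_messages({"messages": ["hello"]})
from_object = get_messages(ObjectState(messages=["hello"]))
empty = get_messages({})
```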
- run(input_data, **kwargs)
Run the extraction agent on input data.
Processes the input through the extraction pipeline, handling various input formats and returning structured extraction results.
- Parameters:
input_data (str | list[str] | dict[str, Any] | pydantic.BaseModel) – Input text or data to extract information from. Supports:
- str: Single text document
- list[str]: Multiple text documents to process together
- dict[str, Any]: Dictionary with ‘text’, ‘content’, or ‘messages’ keys
- BaseModel: Pydantic model with text content
**kwargs – Additional runtime configuration options passed to the underlying workflow execution.
- Returns:
Dictionary containing extraction results:
extracted_data: Structured data conforming to the extraction model
messages: Full conversation history during extraction
Additional metadata from the extraction process
- Return type:
dict
Examples
Basic text extraction:
agent = ComplexExtractionAgent(config)
result = agent.run("John Smith is 30 years old.")
person_data = result["extracted_data"]
Multiple documents:
docs = ["Person A info", "Person B info"]
result = agent.run(docs)
Note
If no extraction workflow has been set up, this method will automatically call setup_workflow() before processing.
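One way to picture how run() copes with its different input formats is a normalization step that reduces each shape to a single text string. The sketch below is an assumption, not the agent's code: `normalize_input` is a hypothetical helper, the key lookup order and the joining separators are guesses, and the Pydantic BaseModel case is left out to keep the example dependency-free.

```python
from typing import Any


def normalize_input(input_data: Any) -> str:
    """Reduce str, list, or dict input to one text string.

    Hypothetical helper; lookup order and separators are assumptions.
    """
    if isinstance(input_data, str):
        return input_data
    if isinstance(input_data, list):
        # Multiple documents are processed together, so join them.
        return "\n\n".join(str(item) for item in input_data)
    if isinstance(input_data, dict):
        for key in ("text", "content"):
            if key in input_data:
                return str(input_data[key])
        if "messages" in input_data:
            return "\n".join(str(m) for m in input_data["messages"])
        raise ValueError(
            "dict input needs a 'text', 'content', or 'messages' key"
        )
    raise TypeError(f"unsupported input type: {type(input_data).__name__}")


single = normalize_input("John Smith is 30 years old.")
joined = normalize_input(["Person A info", "Person B info"])
from_dict = normalize_input({"content": "Person C info"})
```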
- setup_workflow()
Set up the agent workflow.
Initializes the extraction workflow graph based on the agent configuration. This method creates the appropriate validation and retry mechanism (either JSONPatch-based or standard retries) and configures the processing pipeline.
The workflow includes encoding/decoding steps, validation nodes, and state management for tracking extraction progress.
Note
This method is called automatically when needed and does not need to be invoked manually. The workflow graph is not compiled here; compilation happens in the parent class.
- Return type:
None
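The interplay between setup_workflow() and run() described above (pick a retry strategy from the configuration, build the workflow once, and build it lazily on first use) can be sketched with a toy class. Everything here is hypothetical: `LazyWorkflowSketch` is not part of the library, the strategy labels are made up, and a plain closure stands in for the real workflow graph.

```python
from typing import Any, Callable, Optional


class LazyWorkflowSketch:
    """Toy stand-in showing strategy selection and lazy setup."""

    def __init__(self, use_jsonpatch: bool = False) -> None:
        self.use_jsonpatch = use_jsonpatch
        self.workflow: Optional[Callable[[str], dict[str, Any]]] = None
        self.setup_calls = 0  # counts how often the workflow was built

    def setup_workflow(self) -> None:
        """Choose the retry strategy from the config and build the workflow."""
        self.setup_calls += 1
        strategy = (
            "jsonpatch_retries" if self.use_jsonpatch else "standard_retries"
        )
        # A real builder would produce a graph; a closure suffices here.
        self.workflow = lambda text: {"strategy": strategy, "input": text}

    def run(self, text: str) -> dict[str, Any]:
        """Build the workflow on first use, then reuse it."""
        if self.workflow is None:
            self.setup_workflow()
        assert self.workflow is not None
        return self.workflow(text)


sketch = LazyWorkflowSketch(use_jsonpatch=True)
first = sketch.run("Product A costs $100")
second = sketch.run("Product B is $200 software")
```

Deferring the build to the first run() call keeps construction cheap and means callers do not have to call setup_workflow() themselves.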