agents.document_modifiers.tnt.utils
Utility functions for taxonomy generation and document processing.
This module provides utility functions for parsing, formatting, and managing taxonomy-related data structures. It includes functions for handling XML-formatted outputs, document summaries, and taxonomy clusters.
Note
All XML parsing functions assume well-formed XML input with specific expected tags. Malformed XML may raise parsing errors.
Example
Basic usage for taxonomy parsing:
xml_output = '''
<id>1</id>
<name>Category A</name>
<description>Description text</description>
'''
taxonomy = parse_taxonomy(xml_output)
Functions
| format_docs(docs) | Format documents as XML table for taxonomy generation. |
| format_taxonomy(clusters) | Convert taxonomy clusters to XML format. |
| format_taxonomy_md(clusters) | Format taxonomy clusters as a Markdown table. |
| get_content(state) | Extract document content from taxonomy generation state. |
| parse_labels(output_text) | Parse category labels from prediction output. |
| parse_summary(xml_string) | Parse summary and explanation from XML-formatted string. |
| parse_taxonomy(output_text) | Parse taxonomy information from LLM-generated output. |
| reduce_summaries(combined) | Merge summarized content with original documents. |
Module Contents
- agents.document_modifiers.tnt.utils.format_docs(docs)
Format documents as XML table for taxonomy generation.
- Parameters:
docs (list[langchain_core.documents.Document]) – List of Document objects, each of which must have:
id: Document identifier
summary: Document summary text
- Returns:
XML-formatted string containing conversation summaries
- Return type:
str
Example
>>> docs = [Document(id="1", summary="text")]
>>> xml = format_docs(docs)
>>> print(xml)
<conversations>
<conv_summ id=1>text</conv_summ>
</conversations>
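The output shape shown in the example can be produced with plain string formatting. The following is a minimal sketch, not the module's actual implementation: `format_docs_sketch` is a hypothetical name, the documents are assumed to be plain dictionaries with `id` and `summary` keys, and the one-tag-per-line layout is a guess.

```python
def format_docs_sketch(docs):
    """Hypothetical stand-in for format_docs; assumes dict-like documents."""
    lines = ["<conversations>"]
    for doc in docs:
        # The id is written unquoted to mirror the documented output.
        lines.append(f"<conv_summ id={doc['id']}>{doc['summary']}</conv_summ>")
    lines.append("</conversations>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_docs_sketch([{"id": "1", "summary": "text"}]))
```

Because the `id` attribute is unquoted, the result is a prompt-friendly table rather than strictly valid XML.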
- agents.document_modifiers.tnt.utils.format_taxonomy(clusters)
Convert taxonomy clusters to XML format.
- Parameters:
clusters (list[dict]) – List of cluster dictionaries, each containing:
id (str): Cluster identifier
name (str): Cluster name
description (str): Cluster description
- Returns:
XML-formatted taxonomy string
- Return type:
str
Example
>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> xml = format_taxonomy(clusters)
>>> print(xml)
<cluster_table>
<cluster>
<id>1</id>
<name>Tech</name>
<description>Technology</description>
</cluster>
</cluster_table>
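One way to build the `<cluster_table>` structure shown in the example is sketched below. This is an illustration under assumptions rather than the module's code: `format_taxonomy_sketch` is a hypothetical name, the line layout is a guess, and escaping special characters with `xml.sax.saxutils.escape` is an added precaution that the original function may not perform.

```python
from xml.sax.saxutils import escape


def format_taxonomy_sketch(clusters):
    """Hypothetical stand-in for format_taxonomy."""
    lines = ["<cluster_table>"]
    for cluster in clusters:
        lines.append("<cluster>")
        for key in ("id", "name", "description"):
            # escape() converts &, < and > so the values cannot break the markup.
            lines.append(f"<{key}>{escape(str(cluster[key]))}</{key}>")
        lines.append("</cluster>")
    lines.append("</cluster_table>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_taxonomy_sketch(
        [{"id": "1", "name": "Tech", "description": "Technology"}]
    ))
```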
- agents.document_modifiers.tnt.utils.format_taxonomy_md(clusters)
Format taxonomy clusters as a Markdown table.
- Parameters:
clusters (list[dict]) – List of cluster dictionaries, each containing:
id (str): Cluster identifier
name (str): Cluster name
description (str): Cluster description
- Returns:
Markdown-formatted table string
- Return type:
str
Example
>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> md_table = format_taxonomy_md(clusters)
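The example does not show the resulting table, so the sketch below only illustrates what a Markdown rendering of the clusters could look like. The column headings, the separator row, and the escaping of pipe characters are all assumptions, and `format_taxonomy_md_sketch` is a hypothetical name.

```python
def format_taxonomy_md_sketch(clusters):
    """Hypothetical stand-in for format_taxonomy_md; headings are assumed."""

    def cell(value):
        # Pipes would end the cell early and newlines would end the row.
        return str(value).replace("|", "\\|").replace("\n", " ")

    rows = ["| ID | Name | Description |", "| --- | --- | --- |"]
    for cluster in clusters:
        rows.append(
            f"| {cell(cluster['id'])} | {cell(cluster['name'])} "
            f"| {cell(cluster['description'])} |"
        )
    return "\n".join(rows)


if __name__ == "__main__":
    print(format_taxonomy_md_sketch(
        [{"id": "1", "name": "Tech", "description": "Technology"}]
    ))
```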
- agents.document_modifiers.tnt.utils.get_content(state)
Extract document content from taxonomy generation state.
- Parameters:
state (haive.agents.document_modifiers.tnt.state.TaxonomyGenerationState) – Current state of the taxonomy generation process. Must contain a ‘documents’ key with list of document dictionaries.
- Returns:
- List of dictionaries, each containing:
content (str): The content of a document
- Return type:
list[dict]
Example
>>> state = {"documents": [{"content": "doc1"}, {"content": "doc2"}]}
>>> contents = get_content(state)
>>> print(contents)
[{'content': 'doc1'}, {'content': 'doc2'}]
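Going by the documented return value, the extraction amounts to projecting each document onto its `content` key. A minimal sketch follows, assuming the state behaves like a dictionary and that any other document keys are dropped; `get_content_sketch` is a hypothetical name.

```python
def get_content_sketch(state):
    """Hypothetical stand-in for get_content; keeps only the 'content' key."""
    return [{"content": doc["content"]} for doc in state["documents"]]


if __name__ == "__main__":
    state = {"documents": [{"id": "a", "content": "doc1"}, {"content": "doc2"}]}
    print(get_content_sketch(state))
```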
- agents.document_modifiers.tnt.utils.parse_labels(output_text)
Parse category labels from prediction output.
Extracts category information from XML-formatted prediction text. Handles multiple categories but returns only the first one.
- Parameters:
output_text (str) –
XML-formatted string containing category predictions. Expected format:
<category>Label Name</category>
- Returns:
- Dictionary containing:
category (str): The first category label found
- Return type:
dict
Note
If multiple categories are found, a warning is logged and only the first category is returned.
Example
>>> text = "<category>Technology</category>"
>>> result = parse_labels(text)
>>> print(result)
{'category': 'Technology'}
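The first-match behavior described in the note can be sketched with a regular expression. This is an assumption about the approach, not the module's code: `parse_labels_sketch` is a hypothetical name, whitespace stripping is added for convenience, and raising `ValueError` when no tag is present is a choice made here because the documentation does not specify that case.

```python
import logging
import re

logger = logging.getLogger(__name__)

_CATEGORY = re.compile(r"<category>(.*?)</category>", re.DOTALL)


def parse_labels_sketch(output_text):
    """Hypothetical stand-in for parse_labels."""
    matches = _CATEGORY.findall(output_text)
    if not matches:
        # Undocumented case; failing loudly is an assumption of this sketch.
        raise ValueError("no <category> tag found")
    if len(matches) > 1:
        logger.warning("Found %d categories; using the first.", len(matches))
    return {"category": matches[0].strip()}


if __name__ == "__main__":
    print(parse_labels_sketch("<category>Technology</category>"))
```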
- agents.document_modifiers.tnt.utils.parse_summary(xml_string)
Parse summary and explanation from XML-formatted string.
Extracts the content within <summary> and <explanation> tags from the input XML string. If tags are not found, returns empty strings for the missing elements.
- Parameters:
xml_string (str) –
XML-formatted string containing <summary> and <explanation> tags. Example:
<summary>Main points...</summary> <explanation>Detailed analysis...</explanation>
- Returns:
- Dictionary containing:
summary (str): Content within <summary> tags
explanation (str): Content within <explanation> tags
- Return type:
dict
Example
>>> xml = "<summary>Key points</summary><explanation>Details</explanation>"
>>> result = parse_summary(xml)
>>> print(result)
{'summary': 'Key points', 'explanation': 'Details'}
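The empty-string fallback for missing tags can be illustrated as follows. This is a sketch that assumes regex-based extraction, which may differ from the module's implementation; `parse_summary_sketch` is a hypothetical name, and stripping surrounding whitespace is an added convenience.

```python
import re


def parse_summary_sketch(xml_string):
    """Hypothetical stand-in for parse_summary."""

    def extract(tag):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", xml_string, re.DOTALL)
        # A missing tag yields an empty string, as the description states.
        return match.group(1).strip() if match else ""

    return {"summary": extract("summary"), "explanation": extract("explanation")}


if __name__ == "__main__":
    print(parse_summary_sketch("<summary>Key points</summary>"))
```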
- agents.document_modifiers.tnt.utils.parse_taxonomy(output_text)
Parse taxonomy information from LLM-generated output.
Extracts cluster information including IDs, names, and descriptions from XML-formatted output text.
- Parameters:
output_text (str) –
XML-formatted string containing taxonomy clusters. Expected format:
<id>1</id> <name>Category Name</name> <description>Category Description</description>
- Returns:
- Dictionary containing:
- clusters (list): List of dictionaries, each with:
id (str): Cluster identifier
name (str): Cluster name
description (str): Cluster description
- Return type:
dict
Example
>>> text = "<id>1</id><name>Tech</name><description>Technology</description>"
>>> taxonomy = parse_taxonomy(text)
>>> print(taxonomy)
{'clusters': [{'id': '1', 'name': 'Tech', 'description': 'Technology'}]}
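Because the expected input is a flat sequence of `<id>`, `<name>`, and `<description>` tags with no single root element, a pattern match over consecutive triples is one plausible way to recover the clusters. The sketch below assumes that approach and that the tags always appear in this order; `parse_taxonomy_sketch` is a hypothetical name and the module's actual parser may work differently.

```python
import re

# One cluster is an id, name, description triple, optionally separated by whitespace.
_CLUSTER = re.compile(
    r"<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>",
    re.DOTALL,
)


def parse_taxonomy_sketch(output_text):
    """Hypothetical stand-in for parse_taxonomy."""
    clusters = [
        {"id": cid.strip(), "name": name.strip(), "description": desc.strip()}
        for cid, name, desc in _CLUSTER.findall(output_text)
    ]
    return {"clusters": clusters}


if __name__ == "__main__":
    text = "<id>1</id><name>Tech</name><description>Technology</description>"
    print(parse_taxonomy_sketch(text))
```

Text without a complete triple yields an empty `clusters` list in this sketch instead of raising an error.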
- agents.document_modifiers.tnt.utils.reduce_summaries(combined)
Merge summarized content with original documents.
Takes a dictionary containing both original documents and their summaries, and combines them into a single state object.
- Parameters:
combined (dict) – Dictionary containing:
documents (list): Original document list
summaries (list): Corresponding summaries
- Returns:
- Updated state containing:
documents (list): List of documents with added summaries
- Return type:
Example
>>> combined = {
...     "documents": [{"id": 1, "content": "text"}],
...     "summaries": [{"summary": "sum", "explanation": "exp"}]
... }
>>> state = reduce_summaries(combined)
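The merge can be pictured as pairing each document with the summary at the same position. The sketch below assumes positional pairing and a plain dictionary as the returned state; `reduce_summaries_sketch` is a hypothetical name, and the real function may copy specific keys instead of merging whole dictionaries.

```python
def reduce_summaries_sketch(combined):
    """Hypothetical stand-in for reduce_summaries."""
    documents = combined["documents"]
    summaries = combined["summaries"]
    # zip pairs items by position; any unmatched trailing items are ignored.
    merged = [{**doc, **summary} for doc, summary in zip(documents, summaries)]
    return {"documents": merged}


if __name__ == "__main__":
    combined = {
        "documents": [{"id": 1, "content": "text"}],
        "summaries": [{"summary": "sum", "explanation": "exp"}],
    }
    print(reduce_summaries_sketch(combined))
```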