agents.document_modifiers.tnt.utils

Utility functions for taxonomy generation and document processing.

This module provides utility functions for parsing, formatting, and managing taxonomy-related data structures. It includes functions for handling XML-formatted outputs, document summaries, and taxonomy clusters.

Note

All XML parsing functions assume well-formed XML input containing the tags each function expects. Malformed XML may raise parsing errors.

Example

Basic usage for taxonomy parsing:

xml_output = '''
    <id>1</id>
    <name>Category A</name>
    <description>Description text</description>
'''
taxonomy = parse_taxonomy(xml_output)

Functions

format_docs(docs)

Format documents as XML table for taxonomy generation.

format_taxonomy(clusters)

Convert taxonomy clusters to XML format.

format_taxonomy_md(clusters)

Format taxonomy clusters as a Markdown table.

get_content(state)

Extract document content from taxonomy generation state.

parse_labels(output_text)

Parse category labels from prediction output.

parse_summary(xml_string)

Parse summary and explanation from XML-formatted string.

parse_taxonomy(output_text)

Parse taxonomy information from LLM-generated output.

reduce_summaries(combined)

Merge summarized content with original documents.

Module Contents

agents.document_modifiers.tnt.utils.format_docs(docs)

Format documents as XML table for taxonomy generation.

Parameters:

docs (list[langchain_core.documents.Document]) – List of Document objects. Each must have:
  • id: Document identifier
  • summary: Document summary text

Returns:

XML-formatted string containing conversation summaries

Return type:

str

Example

>>> docs = [Document(id="1", summary="text")]
>>> xml = format_docs(docs)
>>> print(xml)
<conversations>
<conv_summ id=1>text</conv_summ>
</conversations>
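The output shape above can be reproduced with a few lines of string building. The following is a minimal sketch, not the module's actual implementation: format_docs_sketch and DocStub are hypothetical names, and DocStub is a stand-in that assumes only the two documented fields (id and summary) are read as attributes.

```python
from dataclasses import dataclass


@dataclass
class DocStub:
    """Hypothetical stand-in exposing only the two documented fields."""

    id: str
    summary: str


def format_docs_sketch(docs):
    """Build the <conversations> table shown in the documented example."""
    lines = ["<conversations>"]
    for doc in docs:
        # The id attribute is left unquoted to match the documented output.
        lines.append(f"<conv_summ id={doc.id}>{doc.summary}</conv_summ>")
    lines.append("</conversations>")
    return "\n".join(lines)


print(format_docs_sketch([DocStub(id="1", summary="text")]))
```

Because the id attribute is unquoted, the result is not strictly well-formed XML, so it is presumably meant as prompt text rather than as input to an XML parser.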

agents.document_modifiers.tnt.utils.format_taxonomy(clusters)

Convert taxonomy clusters to XML format.

Parameters:

clusters (list[dict]) – List of cluster dictionaries, each containing:
  • id (str): Cluster identifier
  • name (str): Cluster name
  • description (str): Cluster description

Returns:

XML-formatted taxonomy string

Return type:

str

Example

>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> xml = format_taxonomy(clusters)
>>> print(xml)
<cluster_table>
  <cluster>
    <id>1</id>
    <name>Tech</name>
    <description>Technology</description>
  </cluster>
</cluster_table>
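A minimal sketch of how the nested layout above could be produced is given here. It assumes two-space indentation per level, as in the documented output; format_taxonomy_sketch is a hypothetical name and the real function may differ.

```python
def format_taxonomy_sketch(clusters):
    """Render cluster dictionaries as an indented <cluster_table> block."""
    lines = ["<cluster_table>"]
    for cluster in clusters:
        lines.append("  <cluster>")
        # Emit the three documented fields in a fixed order.
        for key in ("id", "name", "description"):
            lines.append(f"    <{key}>{cluster[key]}</{key}>")
        lines.append("  </cluster>")
    lines.append("</cluster_table>")
    return "\n".join(lines)


clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
print(format_taxonomy_sketch(clusters))
```

Values are inserted verbatim in this sketch. If names or descriptions may contain < or &, xml.sax.saxutils.escape from the standard library could be applied to each value first.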

agents.document_modifiers.tnt.utils.format_taxonomy_md(clusters)

Format taxonomy clusters as a Markdown table.

Parameters:

clusters (list[dict]) – List of cluster dictionaries, each containing:
  • id (str): Cluster identifier
  • name (str): Cluster name
  • description (str): Cluster description

Returns:

Markdown-formatted table string

Return type:

str

Example

>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> md_table = format_taxonomy_md(clusters)
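The example above does not show the resulting table, so the following sketch only illustrates one plausible layout. The column headers (ID, Name, Description) and the pipe escaping are assumptions, and format_taxonomy_md_sketch is a hypothetical name.

```python
def format_taxonomy_md_sketch(clusters):
    """Render cluster dictionaries as a three-column Markdown table."""
    rows = ["| ID | Name | Description |", "|----|------|-------------|"]
    for cluster in clusters:
        # Escape pipes so cell text cannot break the table layout.
        name = str(cluster["name"]).replace("|", "\\|")
        description = str(cluster["description"]).replace("|", "\\|")
        rows.append(f"| {cluster['id']} | {name} | {description} |")
    return "\n".join(rows)


clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
print(format_taxonomy_md_sketch(clusters))
```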

agents.document_modifiers.tnt.utils.get_content(state)

Extract document content from taxonomy generation state.

Parameters:

state (haive.agents.document_modifiers.tnt.state.TaxonomyGenerationState) – Current state of the taxonomy generation process. Must contain a ‘documents’ key with a list of document dictionaries.

Returns:

List of dictionaries, each containing:
  • content (str): The content of a document

Return type:

list

Example

>>> state = {"documents": [{"content": "doc1"}, {"content": "doc2"}]}
>>> contents = get_content(state)
>>> print(contents)
[{'content': 'doc1'}, {'content': 'doc2'}]
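As a rough illustration of the extraction described above, the sketch below assumes the state behaves like a plain dictionary and that only the content key of each document is kept. get_content_sketch is a hypothetical name.

```python
def get_content_sketch(state):
    """Return one {'content': ...} dictionary per document in the state."""
    # Any other keys on a document (for example an id) are dropped.
    return [{"content": doc["content"]} for doc in state["documents"]]


state = {"documents": [{"id": 1, "content": "doc1"}, {"id": 2, "content": "doc2"}]}
print(get_content_sketch(state))
```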

agents.document_modifiers.tnt.utils.parse_labels(output_text)

Parse category labels from prediction output.

Extracts category information from XML-formatted prediction text. Handles multiple categories but returns only the first one.

Parameters:

output_text (str) –

XML-formatted string containing category predictions. Expected format:

<category>Label Name</category>

Returns:

Dictionary containing:
  • category (str): The first category label found

Return type:

dict

Note

If multiple categories are found, a warning is logged and only the first category is returned.

Example

>>> text = "<category>Technology</category>"
>>> result = parse_labels(text)
>>> print(result)
{'category': 'Technology'}
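The first-match behaviour described in the note can be sketched with a regular expression. This is an illustration under assumptions, not the module's code: parse_labels_sketch is a hypothetical name, regex extraction is assumed, and the empty-string result for input with no category tag is a guess, since that case is not documented.

```python
import logging
import re

logger = logging.getLogger(__name__)


def parse_labels_sketch(output_text):
    """Return the first <category> label found in the text."""
    matches = re.findall(r"<category>(.*?)</category>", output_text, re.DOTALL)
    categories = [match.strip() for match in matches]
    if len(categories) > 1:
        logger.warning("Multiple categories found, using the first: %s", categories)
    # Assumption: an empty string stands in when no tag is present.
    category = categories[0] if categories else ""
    return {"category": category}


print(parse_labels_sketch("<category>Technology</category>"))
```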

agents.document_modifiers.tnt.utils.parse_summary(xml_string)

Parse summary and explanation from XML-formatted string.

Extracts the content within the <summary> and <explanation> tags of the input XML string. If a tag is not found, an empty string is returned for the missing element.

Parameters:

xml_string (str) –

XML-formatted string containing <summary> and <explanation> tags. Example:

<summary>Main points...</summary>
<explanation>Detailed analysis...</explanation>

Returns:

Dictionary containing:
  • summary (str): Content within <summary> tags
  • explanation (str): Content within <explanation> tags

Return type:

dict

Example

>>> xml = "<summary>Key points</summary><explanation>Details</explanation>"
>>> result = parse_summary(xml)
>>> print(result)
{'summary': 'Key points', 'explanation': 'Details'}
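The empty-string fallback for missing tags can be illustrated as follows. The sketch assumes regex-based extraction and whitespace stripping, neither of which is confirmed by the documentation above, and parse_summary_sketch is a hypothetical name.

```python
import re


def parse_summary_sketch(xml_string):
    """Extract <summary> and <explanation> text, defaulting to ''."""

    def first_tag(tag):
        # DOTALL lets the tag content span multiple lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", xml_string, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {"summary": first_tag("summary"), "explanation": first_tag("explanation")}


print(parse_summary_sketch("<summary>Key points</summary>"))
```

In this sketch, calling it with only a <summary> tag yields an empty explanation value rather than raising an error.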

agents.document_modifiers.tnt.utils.parse_taxonomy(output_text)

Parse taxonomy information from LLM-generated output.

Extracts cluster information including IDs, names, and descriptions from XML-formatted output text.

Parameters:

output_text (str) –

XML-formatted string containing taxonomy clusters. Expected format:

<id>1</id>
<name>Category Name</name>
<description>Category Description</description>

Returns:

Dictionary containing:
  • clusters (list): List of dictionaries, each with:
    • id (str): Cluster identifier
    • name (str): Cluster name
    • description (str): Cluster description

Return type:

dict

Example

>>> text = "<id>1</id><name>Tech</name><description>Technology</description>"
>>> taxonomy = parse_taxonomy(text)
>>> print(taxonomy)
{'clusters': [{'id': '1', 'name': 'Tech', 'description': 'Technology'}]}
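One way to obtain this structure is to match each id, name, description triple with a single regular expression. The sketch below assumes the three tags always appear in that order, separated only by whitespace. parse_taxonomy_sketch is a hypothetical name and the actual parser may work differently.

```python
import re

# One match per cluster: id, name, description in that order.
_CLUSTER_RE = re.compile(
    r"<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>",
    re.DOTALL,
)


def parse_taxonomy_sketch(output_text):
    """Collect every id/name/description triple into a cluster list."""
    clusters = [
        {"id": cid.strip(), "name": name.strip(), "description": desc.strip()}
        for cid, name, desc in _CLUSTER_RE.findall(output_text)
    ]
    return {"clusters": clusters}


text = "<id>1</id><name>Tech</name><description>Technology</description>"
print(parse_taxonomy_sketch(text))
```

Because the pattern searches for triples anywhere in the text, surrounding prose or wrapper tags in LLM output are ignored in this sketch.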

agents.document_modifiers.tnt.utils.reduce_summaries(combined)

Merge summarized content with original documents.

Takes a dictionary containing both original documents and their summaries, and combines them into a single state object.

Parameters:

combined (dict) – Dictionary containing:
  • documents (list): Original document list
  • summaries (list): Corresponding summaries

Returns:

Updated state containing:
  • documents (list): List of documents with added summaries

Return type:

TaxonomyGenerationState

Example

>>> combined = {
...     "documents": [{"id": 1, "content": "text"}],
...     "summaries": [{"summary": "sum", "explanation": "exp"}]
... }
>>> state = reduce_summaries(combined)
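The example above does not show the returned state, so the following is only a sketch of one plausible merge. It assumes documents and summaries are paired by position, that each summary carries summary and explanation keys as in the example input, and that the state can be represented as a plain dictionary. reduce_summaries_sketch is a hypothetical name.

```python
def reduce_summaries_sketch(combined):
    """Attach each summary to the document at the same position."""
    documents = combined["documents"]
    summaries = combined["summaries"]
    merged = [
        # Keep the original document fields and add the two summary fields.
        {**doc, "summary": info["summary"], "explanation": info["explanation"]}
        for doc, info in zip(documents, summaries)
    ]
    return {"documents": merged}


combined = {
    "documents": [{"id": 1, "content": "text"}],
    "summaries": [{"summary": "sum", "explanation": "exp"}],
}
print(reduce_summaries_sketch(combined))
```

Note that zip stops at the shorter list, so in this sketch a document without a matching summary would be dropped silently. A length check before merging would make such mismatches visible.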