agents.document_modifiers.tnt.utils

Utility functions for taxonomy generation and document processing.

This module provides utility functions for parsing, formatting, and managing taxonomy-related data structures. It includes functions for handling XML-formatted outputs, document summaries, and taxonomy clusters.

Note

All XML parsing functions assume well-formed XML input with specific expected tags. Malformed XML may raise parsing errors.

Example

Basic usage for taxonomy parsing:

xml_output = '''
    <id>1</id>
    <name>Category A</name>
    <description>Description text</description>
'''
taxonomy = parse_taxonomy(xml_output)

Functions

`format_docs`(docs)	Format documents as an XML table for taxonomy generation.
`format_taxonomy`(clusters)	Convert taxonomy clusters to XML format.
`format_taxonomy_md`(clusters)	Format taxonomy clusters as a Markdown table.
`get_content`(state)	Extract document content from taxonomy generation state.
`parse_labels`(output_text)	Parse category labels from prediction output.
`parse_summary`(xml_string)	Parse summary and explanation from XML-formatted string.
`parse_taxonomy`(output_text)	Parse taxonomy information from LLM-generated output.
`reduce_summaries`(combined)	Merge summarized content with original documents.

Module Contents

agents.document_modifiers.tnt.utils.format_docs(docs)

Format documents as an XML table for taxonomy generation.

Parameters:

docs (list[langchain_core.documents.Document]) – List of Document objects. Each must have:

id: Document identifier
summary: Document summary text

Returns:

XML-formatted string containing conversation summaries

Return type:

str

Example

>>> docs = [Document(id="1", summary="text")]
>>> xml = format_docs(docs)
>>> print(xml)
<conversations>
<conv_summ id=1>text</conv_summ>
</conversations>
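
The output shown above can be reproduced with a few lines of string formatting. The following is a minimal sketch, not the module's actual implementation: `Doc` and `format_docs_sketch` are hypothetical names introduced for illustration, and the sketch assumes each document exposes `id` and `summary` as attributes.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Hypothetical stand-in for a document with an id and a summary."""
    id: str
    summary: str


def format_docs_sketch(docs):
    """Wrap each summary in a <conv_summ> element inside <conversations>.

    Assumption: mirrors the output format shown in the example above.
    """
    lines = ["<conversations>"]
    for doc in docs:
        # The id is emitted unquoted, matching the documented output.
        lines.append(f"<conv_summ id={doc.id}>{doc.summary}</conv_summ>")
    lines.append("</conversations>")
    return "\n".join(lines)


xml = format_docs_sketch([Doc(id="1", summary="text")])
print(xml)
```

Note that an unquoted attribute value such as `id=1` is not well-formed XML, so output in this shape is suited to prompt text rather than to a strict XML parser.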

agents.document_modifiers.tnt.utils.format_taxonomy(clusters)

Convert taxonomy clusters to XML format.

Parameters:

clusters (list[dict]) – List of cluster dictionaries, each containing:

id (str): Cluster identifier
name (str): Cluster name
description (str): Cluster description

Returns:

XML-formatted taxonomy string

Return type:

str

Example

>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> xml = format_taxonomy(clusters)
>>> print(xml)
<cluster_table>
  <cluster>
    <id>1</id>
    <name>Tech</name>
    <description>Technology</description>
  </cluster>
</cluster_table>
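
As an illustration of how such a table could be assembled, here is a minimal sketch that produces the layout shown above. `format_taxonomy_sketch` is a hypothetical name, and the escaping of special characters is an assumption added for safety; the source does not say whether the real function escapes its values.

```python
from xml.sax.saxutils import escape


def format_taxonomy_sketch(clusters):
    """Render cluster dicts as a <cluster_table> block (illustrative only)."""
    lines = ["<cluster_table>"]
    for cluster in clusters:
        lines.append("  <cluster>")
        for key in ("id", "name", "description"):
            # escape() replaces &, < and > so values cannot break the markup.
            lines.append(f"    <{key}>{escape(str(cluster[key]))}</{key}>")
        lines.append("  </cluster>")
    lines.append("</cluster_table>")
    return "\n".join(lines)


clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
xml = format_taxonomy_sketch(clusters)
print(xml)
```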

agents.document_modifiers.tnt.utils.format_taxonomy_md(clusters)

Format taxonomy clusters as a Markdown table.

Parameters:

clusters (list[dict]) – List of cluster dictionaries, each containing:

id (str): Cluster identifier
name (str): Cluster name
description (str): Cluster description

Returns:

Markdown-formatted table string

Return type:

str

Example

>>> clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
>>> md_table = format_taxonomy_md(clusters)
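
The example above does not show the resulting table. The sketch below illustrates one plausible shape for it. The column headers (`ID`, `Name`, `Description`) and the name `format_taxonomy_md_sketch` are assumptions, so the real function's output may differ.

```python
def format_taxonomy_md_sketch(clusters):
    """Render cluster dicts as a Markdown table (column names are assumed)."""
    rows = [
        "| ID | Name | Description |",
        "| --- | --- | --- |",
    ]
    for cluster in clusters:
        rows.append(
            f"| {cluster['id']} | {cluster['name']} | {cluster['description']} |"
        )
    return "\n".join(rows)


clusters = [{"id": "1", "name": "Tech", "description": "Technology"}]
md_table = format_taxonomy_md_sketch(clusters)
print(md_table)
```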

agents.document_modifiers.tnt.utils.get_content(state)

Extract document content from taxonomy generation state.

Parameters:

state (haive.agents.document_modifiers.tnt.state.TaxonomyGenerationState) – Current state of the taxonomy generation process. Must contain a ‘documents’ key with a list of document dictionaries.

Returns:

List of dictionaries, each containing:

content (str): The content of a document

Return type:

list

Example

>>> state = {"documents": [{"content": "doc1"}, {"content": "doc2"}]}
>>> contents = get_content(state)
>>> print(contents)
[{'content': 'doc1'}, {'content': 'doc2'}]
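
The behavior described above amounts to projecting each document onto its `content` field. A minimal sketch, assuming the state behaves like a plain dictionary (`get_content_sketch` is a hypothetical name):

```python
def get_content_sketch(state):
    """Return one {'content': ...} dict per document in the state."""
    # Any other keys on the document dictionaries are dropped.
    return [{"content": doc["content"]} for doc in state["documents"]]


state = {"documents": [{"id": 1, "content": "doc1"}, {"id": 2, "content": "doc2"}]}
contents = get_content_sketch(state)
print(contents)
```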

agents.document_modifiers.tnt.utils.parse_labels(output_text)

Parse category labels from prediction output.

Extracts category information from XML-formatted prediction text. Handles multiple categories but returns only the first one.

Parameters:

output_text (str) –

XML-formatted string containing category predictions. Expected format:

<category>Label Name</category>

Returns:

Dictionary containing:

category (str): The first category label found

Return type:

dict

Note

If multiple categories are found, a warning is logged and only the first category is returned.

Example

>>> text = "<category>Technology</category>"
>>> result = parse_labels(text)
>>> print(result)
{'category': 'Technology'}
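
The first-match behavior and the warning described in the note can be sketched with a regular expression. This is an illustration under stated assumptions, not the module's code: `parse_labels_sketch` is a hypothetical name, and returning `None` when no `<category>` tag is present is an assumption, since the source does not document that case.

```python
import logging
import re

logger = logging.getLogger(__name__)


def parse_labels_sketch(output_text):
    """Return the first <category> label found in the text."""
    matches = re.findall(r"<category>(.*?)</category>", output_text, re.DOTALL)
    categories = [match.strip() for match in matches]
    if len(categories) > 1:
        # Mirrors the documented note: warn, then keep only the first label.
        logger.warning("Multiple categories found: %s; using the first", categories)
    # Assumption: None signals that no category tag was present.
    category = categories[0] if categories else None
    return {"category": category}


result = parse_labels_sketch("<category>Tech</category><category>Food</category>")
print(result)
```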

agents.document_modifiers.tnt.utils.parse_summary(xml_string)

Parse summary and explanation from XML-formatted string.

Extracts the content within <summary> and <explanation> tags from the input XML string. If tags are not found, returns empty strings for the missing elements.

Parameters:

xml_string (str) –

XML-formatted string containing <summary> and <explanation> tags. Example:

<summary>Main points...</summary>
<explanation>Detailed analysis...</explanation>

Returns:

Dictionary containing:

summary (str): Content within <summary> tags
explanation (str): Content within <explanation> tags

Return type:

dict

Example

>>> xml = "<summary>Key points</summary><explanation>Details</explanation>"
>>> result = parse_summary(xml)
>>> print(result)
{'summary': 'Key points', 'explanation': 'Details'}
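
The fallback to empty strings for missing tags can be illustrated as follows. This is a sketch that assumes regex-based extraction; `parse_summary_sketch` is a hypothetical name and the real function may parse its input differently.

```python
import re


def parse_summary_sketch(xml_string):
    """Extract <summary> and <explanation> text, defaulting to ''."""

    def extract(tag):
        # Non-greedy match; DOTALL lets the content span several lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", xml_string, re.DOTALL)
        return match.group(1).strip() if match else ""

    return {
        "summary": extract("summary"),
        "explanation": extract("explanation"),
    }


result = parse_summary_sketch("<summary>Key points</summary>")
print(result)
```

In this sketch, input with only a `<summary>` tag yields an empty string for `explanation` and does not raise.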

agents.document_modifiers.tnt.utils.parse_taxonomy(output_text)

Parse taxonomy information from LLM-generated output.

Extracts cluster information including IDs, names, and descriptions from XML-formatted output text.

Parameters:

output_text (str) –

XML-formatted string containing taxonomy clusters. Expected format:

<id>1</id>
<name>Category Name</name>
<description>Category Description</description>

Returns:

Dictionary containing:

clusters (list): List of dictionaries, each with:
- id (str): Cluster identifier
- name (str): Cluster name
- description (str): Cluster description

Return type:

dict

Example

>>> text = "<id>1</id><name>Tech</name><description>Technology</description>"
>>> taxonomy = parse_taxonomy(text)
>>> print(taxonomy)
{'clusters': [{'id': '1', 'name': 'Tech', 'description': 'Technology'}]}
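
To show how several clusters might be extracted from one string, here is a minimal regex-based sketch. `parse_taxonomy_sketch` is a hypothetical name. The sketch assumes each cluster appears as an `<id>`, `<name>`, `<description>` triple in that order, as in the expected format above.

```python
import re

# One match per id/name/description triple; whitespace between tags is allowed.
CLUSTER_PATTERN = re.compile(
    r"<id>(.*?)</id>\s*<name>(.*?)</name>\s*<description>(.*?)</description>",
    re.DOTALL,
)


def parse_taxonomy_sketch(output_text):
    """Collect every id/name/description triple into a clusters list."""
    clusters = [
        {
            "id": cluster_id.strip(),
            "name": name.strip(),
            "description": description.strip(),
        }
        for cluster_id, name, description in CLUSTER_PATTERN.findall(output_text)
    ]
    return {"clusters": clusters}


text = (
    "<id>1</id><name>Tech</name><description>Technology</description>\n"
    "<id>2</id><name>Food</name><description>Cooking</description>"
)
taxonomy = parse_taxonomy_sketch(text)
print(taxonomy)
```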

agents.document_modifiers.tnt.utils.reduce_summaries(combined)

Merge summarized content with original documents.

Takes a dictionary containing both original documents and their summaries, and combines them into a single state object.

Parameters:

combined (dict) – Dictionary containing:

documents (list): Original document list
summaries (list): Corresponding summaries

Returns:

Updated state containing:

documents (list): List of documents with added summaries

Return type:

TaxonomyGenerationState

Example

>>> combined = {
...     "documents": [{"id": 1, "content": "text"}],
...     "summaries": [{"summary": "sum", "explanation": "exp"}]
... }
>>> state = reduce_summaries(combined)
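
The example above does not show the returned state. One plausible reading is that each summary is merged into the document at the same position. The sketch below illustrates that reading only: `reduce_summaries_sketch` is a hypothetical name, and both the positional pairing and the plain-dictionary return value are assumptions.

```python
def reduce_summaries_sketch(combined):
    """Merge each summary dict into the document at the same index."""
    documents = [
        # Keys from the summary are added alongside the original document keys.
        {**doc, **summary}
        for doc, summary in zip(combined["documents"], combined["summaries"])
    ]
    return {"documents": documents}


combined = {
    "documents": [{"id": 1, "content": "text"}],
    "summaries": [{"summary": "sum", "explanation": "exp"}],
}
state = reduce_summaries_sketch(combined)
print(state)
```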