[Question]: HierarchicalNodeParser sets each leafs ref_doc_id to the doc_id from the second level node. Is this intended? #8430

mphipps2 · 2023-10-10T22:44:43Z

mphipps2
Oct 10, 2023

Question Validation

I have searched both the documentation and discord for an answer.

Question

I'm using the HierarchicalNodeParser as shown in this documentation: https://gpt-index.readthedocs.io/en/stable/examples/retrievers/auto_merging_retriever.html

But I noticed that each leaf node ref_doc_id is being set to the Node ID of its parent node (the second level node). So then when the retriever is called the response synthesizer acts across all the leaf chunks for a given second level node. So for example, if a given second level node has 4 children, those children would be compacted or refined based off the response_mode being used for the response_synthesizer, meaning we end up with one response back from our llm instead of 4. As I understand, this effectively turns the second level nodes into the new leaves. Was this the intended effect?

To prevent this from happening I have to hack together something like this:
node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048,512,512])

@dosu-bot · 2023-10-10T22:49:48Z

dosubot[bot]
bot Oct 10, 2023

🤖

Hello,

Thank you for your detailed question. Based on the information provided, it seems like you have a good understanding of how the HierarchicalNodeParser works.

The HierarchicalNodeParser is designed to split a document into a recursive hierarchy of nodes. The ref_doc_id of each leaf node is indeed set to the Node ID of its parent node (the second level node). This is because the HierarchicalNodeParser treats each document as a hierarchy, and the ref_doc_id is used to maintain this hierarchical relationship.

When the retriever is called, it retrieves all the leaf nodes associated with a given second level node. The response synthesizer then acts across all these leaf nodes, compacting or refining them based on the response_mode being used. This is the intended behavior, as it allows the LLM to generate a single, coherent response from multiple related leaf nodes.

If you want to change this behavior, you can modify the chunk_sizes parameter when creating the HierarchicalNodeParser. The chunk_sizes parameter determines the size of each chunk at each level of the hierarchy. By setting it to [2048,512,512], you're creating a three-level hierarchy, where the second level nodes are split into smaller chunks, effectively turning them into new leaf nodes.

I hope this answers your question. If you have any further questions or need clarification, please don't hesitate to ask.

Sources

llama_index/node_parser/hierarchical.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

logan-markewich · 2023-10-10T22:54:39Z

logan-markewich
Oct 10, 2023
Maintainer

@mphipps2 I'm not 100% sure I understand the issue.

The ref_doc_id is set to the parent, so that the AutoMergingRetriever can retriever the parent node when a majority of it's children are retrieved

Then, these nodes are given to the response synthesizer. The default response synthesis mode is compact, which means all retrieved nodes will be compacted into as few LLM calls as possible.

If you want to make one LLM call per node, you can set response_mode="refine" in your response synthesizer/query engine

0 replies

mphipps2 · 2023-10-11T00:16:15Z

mphipps2
Oct 11, 2023
Author

@logan-markewich Hmm ok I see what's going on. When I use the HierarchicalNodeParser with the VectorStoreIndex and AutoMergingRetriever as done in the documentation I do in fact get the expected behavior.

My problem comes in trying to combine the DocumentSummaryIndex with the HierarchicalNodeParser/AutoMergingRetriever. The idea here was to use an LLM to generate summaries and questions from my leaf nodes, embed the summaries, return the top k summaries using DocumentSummaryIndexEmbeddingRetriever, and pass that to the AutoMergingRetriever, which passes its final merged nodes to the query engine.

This seems to work well except for the fact that my summaries are being done on the second level nodes, rather than the leaf nodes. This is the issue I was describing above where creating my leaf nodes through the HierarchicalNodeParser causes the ref_doc_id to be set as the parent node, which causes the DocumentSummaryIndex to perform summaries on those parent second level nodes, rather than the leaf nodes.

I can hack together a solution to this, but is there anything I'm missing here that would let DocumentSummaryIndex and HierarchicalNodeParser/AutoMergingRetriever play together better?

0 replies

logan-markewich · 2023-10-11T00:22:27Z

logan-markewich
Oct 11, 2023
Maintainer

@mphipps2 can you give some code to replicate the issue? From node parsing to retrieval? I can try reproducing and seeing what's up

0 replies

mphipps2 · 2023-10-11T00:45:22Z

mphipps2
Oct 11, 2023
Author

Sure and thanks for looking into this ...

node_parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048,512,128])    
nodes = node_parser.get_nodes_from_documents(document_list)
leaf_nodes = get_leaf_nodes(nodes)
second_level_nodes = get_second_level_nodes(nodes)
leaf_text = [node.text for node in leaf_nodes]

def get_second_level_nodes(nodes: List[BaseNode]) -> List[BaseNode]:
    """Get second-level nodes."""
    second_level_nodes = []
    for node in nodes:
        # Check if the node has a parent (making it at least a second-level node)
        if NodeRelationship.PARENT in node.relationships:
            # Check if the node has children (making it not a leaf/third-level node)
            if NodeRelationship.CHILD in node.relationships:
                second_level_nodes.append(node)
    return second_level_nodes

for i, leaf in enumerate(leaf_nodes):
    print('i: ' , i , ' original leaf ref doc id: ' , leaf.ref_doc_id, ' leaf_nodes: ' , leaf)
for i, second_level in enumerate(second_level_nodes):
    print('i: ' , i, ' ref doc id: ' , second_level.ref_doc_id , ' second_level: ' , second_level)

summary_response_synthesizer = get_response_synthesizer(
    response_mode=ResponseMode.SIMPLE_SUMMARIZE, 
    text_qa_template=summary_template, 
    service_context=service_context,
    use_async=True
)

doc_store = SimpleDocumentStore()
doc_store.add_documents(nodes)

# get vector embedding dimensionality
d = len(service_context.embed_model.get_text_embedding("This is just a test!"))
# Create a Faiss index
faiss_index = faiss.IndexFlatL2(d)
vector_store = FaissVectorStore(faiss_index=faiss_index)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

doc_summary_index = DocumentSummaryIndex(
    nodes=nodes,
    service_context=service_context,
    storage_context=storage_context,
    response_synthesizer=summary_response_synthesizer,
    show_progress=True,
)

# return top k nodes from summary index
summary_retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=5,
)
auto_merging_retriever = AutoMergingRetriever(summary_retriever, storage_context, verbose=True)

0 replies

mphipps2 · 2023-10-11T01:31:02Z

mphipps2
Oct 11, 2023
Author

Just to add to my last post ... I guess the issue is that ref_doc_id serves different purposes in DocumentSummaryIndex and AutoMergingRetriever. In DocumentSummaryIndex, it's used by the ResponseSynthesizer to deal with the context window constraint, while in AutoMergingRetriever, it's used to link/merge child and parent nodes. This causes tension then when the two are chained together.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Question]: HierarchicalNodeParser sets each leafs ref_doc_id to the doc_id from the second level node. Is this intended? #8430

{{title}}

Replies: 6 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

[Question]: HierarchicalNodeParser sets each leafs ref_doc_id to the doc_id from the second level node. Is this intended? #8430

mphipps2 Oct 10, 2023

Question Validation

Question

Replies: 6 comments

dosubot[bot] bot Oct 10, 2023

Sources

logan-markewich Oct 10, 2023 Maintainer

mphipps2 Oct 11, 2023 Author

logan-markewich Oct 11, 2023 Maintainer

mphipps2 Oct 11, 2023 Author

mphipps2 Oct 11, 2023 Author

mphipps2
Oct 10, 2023

dosubot[bot]
bot Oct 10, 2023

logan-markewich
Oct 10, 2023
Maintainer

mphipps2
Oct 11, 2023
Author

logan-markewich
Oct 11, 2023
Maintainer

mphipps2
Oct 11, 2023
Author

mphipps2
Oct 11, 2023
Author