RagDatasetGenerator seems ti generate more Q/A pairs than requested #11704

vecorro · 2024-03-06T17:57:34Z

vecorro
Mar 6, 2024

Hi,

I'm using RagDatasetGenerator to generate Q/A pairs that hopefully can help me evaluate a RAG system. I started to retrieve all the nodes from a vector store. Once I have all the nodes in a list, I create a sample on it which produces a smaller list of only 63 nodes. I'm generating the Q/A pairs the following way:

dataset_generator = RagDatasetGenerator(
    nodes=eval_nodes_sample,
    show_progress=True,
    num_questions_per_chunk=2,
    workers=NUM_WORKERS
)
rag_eval_dataset =  dataset_generator.generate_dataset_from_nodes()

Even when I ask the generator to produce 2 questions per chunk (I'd assume 1 node is equivalent to 1 chunk) and it has 63 nodes to work, the rag_eval_dataset object ends up containing 501 questions when I'd expect only 126. Why is that?

Thanks

Answered by dosubot[bot]

Mar 6, 2024

Hey @vecorro! Nice to see you back around here. Let's take a look at your issue with the RagDatasetGenerator.

The behavior you're observing is due to the way the RagDatasetGenerator generates and processes questions. When you set num_questions_per_chunk to 2, this parameter influences the question_gen_query that instructs the LLM to generate a certain number of questions per chunk of text. However, the actual number of questions generated can exceed this parameter due to several reasons:

Question Generation Process: The LLM generates questions based on the question_gen_query, which might not strictly adhere to the requested number of questions (num_questions_per_chunk). The language mo…

View full answer

@vecorro · 2024-03-06T17:57:47Z

dosubot[bot]
bot Mar 6, 2024

Hey @vecorro! Nice to see you back around here. Let's take a look at your issue with the RagDatasetGenerator.

The behavior you're observing is due to the way the RagDatasetGenerator generates and processes questions. When you set num_questions_per_chunk to 2, this parameter influences the question_gen_query that instructs the LLM to generate a certain number of questions per chunk of text. However, the actual number of questions generated can exceed this parameter due to several reasons:

Question Generation Process: The LLM generates questions based on the question_gen_query, which might not strictly adhere to the requested number of questions (num_questions_per_chunk). The language model's output depends on its interpretation of the prompt and the context provided. If the model finds the content rich or interpretable in multiple ways, it might generate more questions than specified.
Post-Processing of Questions: After questions are generated, there's a post-processing step where the responses are split and cleaned. This step involves:
- Splitting the response into individual questions based on newline characters.
- Cleaning each question by removing numeric prefixes and trimming whitespace.
This process does not filter out excess questions if more than num_questions_per_chunk are generated. Instead, it processes all questions generated by the LLM, leading to potentially more questions in the dataset than expected.
No Explicit Limit Enforcement: The code does not explicitly enforce a limit on the number of questions per chunk after generation. Once the questions are generated and cleaned, all of them are used to create LabelledRagDataExample instances, regardless of whether their number exceeds num_questions_per_chunk.
Ambiguity in LLM Interpretation: The instruction given to the LLM through question_gen_query might be interpreted in various ways by the model, especially since it's encouraged to generate diverse questions. The LLM might generate more questions to cover different aspects of the document, interpreting the instruction to mean a minimum rather than an exact number.

You can find more details about this in the llama_index/core/llama_dataset/generator.py file.

There's also a similar open issue [Bug]: Unexpected Number of Questions Generated When Requesting FAQ Generation which suggests modifying the _agenerate_dataset method to strictly limit the number of questions generated. However, please note that this might result in less diverse questions being generated if the limit is set to a low value.

I hope this helps clarify the behavior you're observing. If you have any more questions, feel free to ask.

Sources

About Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RagDatasetGenerator seems ti generate more Q/A pairs than requested #11704

{{title}}

Replies: 1 comment

{{title}}

{{editor}}'s edit

{{editor}}'s edit

About Dosu

Select a reply

RagDatasetGenerator seems ti generate more Q/A pairs than requested #11704

vecorro Mar 6, 2024

Replies: 1 comment

dosubot[bot] bot Mar 6, 2024

Sources

About Dosu

vecorro
Mar 6, 2024

dosubot[bot]
bot Mar 6, 2024