Bad performance on PrOntoQA benchmark #28

Open
huskydoge opened this issue Apr 10, 2024 · 1 comment

Comments

PrOntoQA is a question-answering dataset that generates examples with chains-of-thought that describe the reasoning required to answer the questions correctly. The sentences in the examples are syntactically simple and amenable to semantic parsing. It can be used to formally analyze the predicted chain-of-thought from large language models.

I have tested the performance of DBRX-Base on the GSM8K, AQuA, and StrategyQA datasets using 4-shot CoT, and its performance is satisfactory compared to other models (GPT-4, Claude Opus, Llama 70B, etc.).

Nevertheless, when I test the models on PrOntoQA, the performance is not as satisfying: dbrx-instruct achieves 24.2% accuracy and dbrx-base is worse. Although there might be some output-processing errors when using dbrx-base, dbrx-instruct has no problem with endless generation but still fails to achieve good performance.
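
For reference, here is a minimal sketch of the kind of 4-shot CoT evaluation harness described above. This is an assumed setup, not the actual test code; `build_cot_prompt`, `extract_answer`, `accuracy`, and the `generate` callable are all hypothetical placeholders:

```python
import re

def build_cot_prompt(shots, question):
    """Assemble an n-shot chain-of-thought prompt (4 shots in the runs above)."""
    parts = [f"Q: {q}\nA: {cot} The answer is {ans}." for q, cot, ans in shots]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

def extract_answer(completion):
    """Pull the final true/false verdict out of a free-form CoT completion.

    PrOntoQA questions are binary, so any bug in this step directly
    distorts accuracy -- one possible source of the dbrx-base
    output-processing errors mentioned above.
    """
    m = re.search(r"answer is\s+(true|false)", completion, re.IGNORECASE)
    return m.group(1).lower() if m else None

def accuracy(examples, generate):
    """`generate(prompt) -> str` is a stand-in for the actual model call."""
    correct = 0
    for shots, question, gold in examples:
        pred = extract_answer(generate(build_cot_prompt(shots, question)))
        correct += int(pred == gold.lower())
    return correct / len(examples)
```

In particular, if the extraction step misses the model's phrasing (e.g. a bare "True." rather than "the answer is true"), the measured score collapses even when the reasoning is correct, which could account for part of the dbrx-base gap.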

Therefore, I would like to know whether there is an official test result on PrOntoQA that others can take as a reference.

Thanks!

hanlint (Collaborator) commented Apr 15, 2024

Hello @huskydoge, we have not tried PrOntoQA yet, but will let you know if we do!
