[Bug/Assistance] - Reproducing Results on Alfworld (HH) (vs. ReAct paper) #127

ai-nikolai · 2024-03-09T14:14:16Z

Bug / Assistance Description
The results reported in the HH column are very different from those in the ReAct paper. In particular, ReAct reports a much higher success rate (details below).

To Reproduce
See screenshots below. Your results in the HH column indicate 16% success for text-davinci-002 or gpt-3.5-turbo. However, the ReAct results using text-davinci-002 indicate 78% (second screenshot). This is a significant difference.

Screenshots or Terminal Copy&Paste

Concrete Questions / Actions:
Please tell us:

How your evaluation for Alfworld (HH) differs from ReAct?
Which exact model you used?
Which prompts you used (1-shot, 2-shot), and are they the same as from the ReAct paper?
Why are the results so different?

ai-nikolai · 2024-03-09T14:14:39Z

@cenyk1230 @Btlmd @1049451037 @zfjsail

zhc7 · 2024-03-11T12:59:26Z

Please read the paper carefully. You can find all the prompts in the appendix or in the code. The results are different because: 1. we are not using the same prompt; 2. we are not using exactly the same environment.

ai-nikolai · 2024-03-13T10:44:53Z

Thanks for coming back @zhc7.

Thanks for clarifying. Yes, Appendix G.2 shows a prompt example, which I guess corresponds to either:
a. https://github.com/THUDM/AgentBench/blob/main/src/server/tasks/alfworld/prompts/alfworld_multiturn_react.json
b. https://github.com/THUDM/AgentBench/blob/main/src/server/tasks/alfworld/prompts/alfworld_multiturn_plan_first.json
Can you elaborate how the environment is not exactly the same? [Do you use a different version of alfworld, etc.?]

The reason for asking is to understand whether you were able to get close to the results reported in ReAct and what the exact difference might be, as the ReAct results seem quite impossible to reproduce.

zhc7 · 2024-03-26T09:16:43Z

Hi @ai-nikolai, sorry for the late reply; we've been quite busy lately. To answer your question, I believe the main difference is the prompting technique. We weren't aiming to reproduce ReAct's results, but to design a prompt and an evaluation process that are relatively fair to all models. The prompt we used is listed in Appendix G of the paper. The evaluation process is located at

AgentBench/src/server/tasks/alfworld/task.py, line 105 in 2f3c343:

    async def start_sample(self, index, session: Session) -> TaskSampleExecutionResult:

> Can you elaborate how the environment is not exactly the same? [Do you use a different version of alfworld, etc.?]

The main differences come from adapting ALFWorld to the framework and setting some limits and rules to avoid prolonged evaluation.
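
For anyone trying to gauge how much an evaluation limit alone could move the score, here is a minimal sketch of the idea. It is not AgentBench's or ReAct's actual evaluation code: the function name `success_rate`, the `max_steps` parameter, and the per-episode step counts are all hypothetical and made up for illustration. It only shows that the same agent can score differently under a tighter step budget.

```python
from typing import List


def success_rate(steps_needed: List[int], max_steps: int) -> float:
    """Fraction of episodes solved within a step budget.

    steps_needed[i] is the number of steps a hypothetical agent would
    need to solve episode i if it were allowed to run without a limit.
    """
    if not steps_needed:
        raise ValueError("steps_needed must not be empty")
    # An episode counts as a success only if it finishes within the budget.
    solved = sum(1 for n in steps_needed if n <= max_steps)
    return solved / len(steps_needed)


# Made-up step counts, NOT measured on ALFWorld.
episodes = [8, 12, 19, 25, 33, 41, 47, 52, 60, 75]

print(success_rate(episodes, max_steps=50))  # 0.7
print(success_rate(episodes, max_steps=20))  # 0.3
```

When comparing numbers across papers, it may be worth checking which step or round limit each setup applies before attributing the whole gap to the prompt.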

To sum up, you may need to investigate this problem further yourself.

ai-nikolai added the bug (Something isn't working) and help wanted (Extra attention is needed) labels on Mar 9, 2024
