
Major bug: Chat template is not actually applied in run_sft.py and run_dpo.py #125

Open · AlexiaJM opened this issue Feb 23, 2024 · 7 comments

AlexiaJM commented Feb 23, 2024

run_sft.py and run_dpo.py claim to apply the chat template, but this is not actually done.

In the code below, column_names contains the names of all the columns, and it is passed as the remove_columns argument to the map function. This means that all the columns are removed and the map is not applied at all. I tested this by adding prints inside apply_chat_template and observed no output at all; once I removed remove_columns=column_names, the prints appeared. Thus, the map is not applied at all, and I wonder how it can even work.

    column_names = list(raw_datasets["train"].features)

    #####################
    # Apply chat template
    #####################
    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
    )
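
For context, here is a minimal, self-contained sketch of the same pattern (toy data, not from the repo) that can be used to check whether the mapped function runs when every column is listed in remove_columns:

    from datasets import Dataset

    # Toy stand-in for raw_datasets["train"] (hypothetical content).
    ds = Dataset.from_dict({"messages": [["hi there"], ["hello world"]]})
    column_names = list(ds.features)  # ["messages"]

    def fmt(example):
        print("map function called")  # with the behavior described above, this never prints
        return {"text": " ".join(example["messages"])}

    ds = ds.map(fmt, remove_columns=column_names)
    print(ds.column_names)  # expected: ["text"]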

AlexiaJM (Author)

See the corresponding "Apply chat template" blocks in run_sft.py and run_dpo.py.
AlexiaJM (Author) commented Feb 23, 2024

The problem is fixed by replacing:

    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        desc="Formatting comparisons with prompt template",
    )

with

    raw_datasets = raw_datasets.map(
        apply_chat_template,
        fn_kwargs={"tokenizer": tokenizer, "task": "dpo"},
        num_proc=data_args.preprocessing_num_workers,
        desc="Formatting comparisons with prompt template",
    )
    raw_datasets = raw_datasets.map(lambda x: x, remove_columns=column_names)

I don't know why, but the datasets library (v2.14.6) is not properly removing the columns after processing them, so the old columns need to be dropped in a separate map after the first one.
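
As a side note (not verified in this thread), the identity map in the workaround could also be written with the built-in remove_columns method:

    # Equivalent to the identity map above: explicitly drop the original
    # columns once apply_chat_template has added the formatted ones.
    raw_datasets = raw_datasets.remove_columns(column_names)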

AlexiaJM (Author) commented Feb 23, 2024

Update: the problem is fixed by using the latest version of datasets (v2.17.1), so my fix above is not needed.

So all you need to do is bump the datasets requirement in setup.py to the latest version. I suspect that a lot of the weird issues people have reported (like in #45) might be due to this.

When using the current code with the versions pinned in setup.py ("datasets==2.14.6"), the chat template is not applied!

BramVanroy (Contributor)

Are you sure about this? As far as I know, remove_columns is only applied after the dataset has been updated: the chat template is applied, the new columns are added, and then all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq
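
A minimal illustration of those semantics (toy data, my own sketch): the mapped function still receives the old columns, which are only dropped from the output.

    from datasets import Dataset

    ds = Dataset.from_dict({"a": [1, 2]})

    def add_b(example):
        # The function still sees column "a" even though it is listed in
        # remove_columns; the column is only dropped from the output.
        return {"b": example["a"] + 1}

    out = ds.map(add_b, remove_columns=["a"])
    print(out.column_names)  # ["b"]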

AlexiaJM (Author)

That's how it should work, but it must be buggy in this old version of datasets.

Yes, I tested it with prints inside apply_chat_template. Nothing prints with datasets==2.14.6.

lhoestq (Member) commented Feb 26, 2024

> As far as I know the "remove columns" is only called after updating the dataset. So the chat template is applied, new columns are added and all the previous (old) columns are removed. I think that's how it has always worked. Cc @lhoestq

Actually, remove_columns is taken into account while the dataset itself is being updated, to avoid writing unnecessary data.

EriChen0615

Hi @AlexiaJM

Thanks for opening the thread. I was unable to replicate the error you reported using datasets==2.14.6. I inserted a breakpoint after the map call and inspected the processed dataset; the prompt is in the correct format.

Here's what the first item of the processed UltraFeedback Binarized dataset for Zephyr training looks like:

'prompt': '<|system|>\n</s>\n<|user|>\nDo you know something about crystallography and structure factor?</s>\n'

It seems the effect of map should be checked after the whole map call has completed.
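
For example, something along these lines (the "prompt" field name is taken from the example above):

    # Inspect the first processed example only after map has finished.
    print(raw_datasets["train"][0]["prompt"])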

Regards,
Eric.
