An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
Note
The script used to identify this issue is not an official transformers script but one from trl; however, since the SFTTrainer, DPOTrainer, and similar classes from trl just subclass the Trainer, I've decided to open the issue here following Philipp's recommendation.
The script that I'm using can be found at trl/examples/scripts/sft.py, but any fine-tuning script in trl will not push the model to the Hugging Face Hub after every epoch, pushing only the tokenizer and configuration files instead, when using save_strategy="epoch", push_to_hub=True, hub_strategy="every_save".
To run the script mentioned above under the same settings:
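The exact command isn't preserved here, but an invocation along these lines would exercise the same flags (the model and dataset names are placeholders, not from the original report):

```shell
# Hypothetical invocation of trl's sft.py; <MODEL_ID> and <DATASET_NAME> are placeholders.
python trl/examples/scripts/sft.py \
    --model_name_or_path <MODEL_ID> \
    --dataset_name <DATASET_NAME> \
    --output_dir outputs \
    --save_strategy epoch \
    --push_to_hub \
    --hub_strategy every_save
```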
I've seen this happening for SFTTrainer, DPOTrainer, and ORPOTrainer in both single- and multi-GPU setups. What's pushed to the Hub after every epoch is the following:
The model is indeed properly pushed when calling trainer.push_to_hub explicitly once the training has finished.
Expected behavior
Ideally, when setting save_strategy="epoch", push_to_hub=True, hub_strategy="every_save", and assuming that Hugging Face authentication is properly configured, the model weights available under the checkpoint-<STEP_NUM> directory within the output_dir should be pushed along with the rest of the files (tokenizer and configuration).
But apparently only the latter are uploaded, while the model is not. So that combination of flags should also upload the model to the Hub after every epoch.
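For reference, the flag combination in question can be sketched as follows (a minimal configuration fragment, not a full training script; the output directory is a placeholder):

```python
from transformers import TrainingArguments

# Minimal sketch of the reported flag combination.
args = TrainingArguments(
    output_dir="outputs",       # placeholder path
    save_strategy="epoch",      # checkpoint at the end of every epoch
    push_to_hub=True,
    hub_strategy="every_save",  # expected: each saved checkpoint is uploaded to the Hub
)
```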
We've reproduced this with smaller models and it works as expected, but whenever the model is either over 5 GB or requires sharding, it doesn't get pushed.
Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗
System Info
transformers version: 4.40.2
Additionally:
Who can help?
@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow-up on a recent conversation about this issue