
hub_strategy="every_save" won't push the model to the Hub if large #30724

Open
alvarobartt opened this issue May 9, 2024 · 0 comments

alvarobartt (Contributor) commented May 9, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (A100 80GB SXM)
  • Using distributed or parallel set-up in script?: NA

Additionally:

  • trl version: 0.8.7.dev0

Who can help?

@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow-up on a recent conversation about this issue.

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

Note

The script used to identify this issue is not an official transformers script but one from trl; however, since trl's SFTTrainer, DPOTrainer, and similar trainers just subclass Trainer, I've opened the issue here, following Philipp's recommendation.

The script I'm using is trl/examples/scripts/sft.py, but any trl fine-tuning script behaves the same way: when using save_strategy="epoch", push_to_hub=True, and hub_strategy="every_save", the model weights are not pushed to the Hugging Face Hub after every epoch; only the tokenizer and configuration files are (see the sketch below).
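Concretely, the failing flag combination corresponds to roughly the following `TrainingArguments` (a minimal sketch; the trl script actually builds its config from the CLI via `HfArgumentParser`, and the `hub_model_id` below is just the repository name used in this reproduction):

```python
from transformers import TrainingArguments

# Minimal sketch of the flag combination that triggers the issue
# (the trl script parses these same values from the CLI).
training_args = TrainingArguments(
    output_dir="sft_openassistant-guanaco",
    num_train_epochs=3,
    save_strategy="epoch",      # checkpoint at the end of every epoch
    push_to_hub=True,           # create / reuse a Hub repository
    hub_strategy="every_save",  # push on every save, i.e. every epoch
    hub_model_id="hub-strategy-every-save-mistral-sft",
    hub_private_repo=True,
)
```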

To run the script mentioned above under the same settings:

```shell
python sft.py \
    --model_name_or_path="mistralai/Mistral-7B-v0.1" \
    --report_to="tensorboard" \
    --learning_rate=5e-5 \
    --dataset_name="timdettmers/openassistant-guanaco" \
    --dataset_train_split="train" \
    --dataset_test_split="test" \
    --torch_dtype="bfloat16" \
    --per_device_train_batch_size=16 \
    --gradient_accumulation_steps=2 \
    --output_dir="sft_openassistant-guanaco" \
    --logging_steps=1 \
    --num_train_epochs=3 \
    --push_to_hub \
    --gradient_checkpointing \
    --hub_strategy="every_save" \
    --hub_private_repo \
    --save_strategy="epoch" \
    --hub_model_id="hub-strategy-every-save-mistral-sft" \
    --optim=adamw_bnb_8bit
```

I've seen that happening for SFTTrainer, DPOTrainer, and ORPOTrainer in both single and multi-GPU setups. What's pushed to the Hub after every epoch is the following:

[Screenshot: the Hub repository after an epoch, containing only the tokenizer and configuration files, with no model weights]

The model is indeed pushed properly when calling trainer.push_to_hub() explicitly once training has finished.
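In other words, a final explicit push works as a stopgap (a minimal sketch, assuming `trainer` is the SFTTrainer instance created by the script):

```python
trainer.train()

# The per-epoch pushes only upload the tokenizer and configuration files,
# but this explicit call at the end does upload the model weights.
trainer.push_to_hub()
```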

Expected behavior

Ideally, when setting save_strategy="epoch", push_to_hub=True, and hub_strategy="every_save" (assuming the Hugging Face authentication is set up properly), the model weights available under the checkpoint-<STEP_NUM> directory within output_dir should be pushed along with the rest of the files (tokenizer and configuration).

Apparently only the latter are uploaded and the model weights are not, so that combination of flags should also upload the model to the Hub after every epoch.
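This is easy to verify from the Hub side (a minimal sketch using huggingface_hub; the repository id assumes the repo created by the command above lives under my namespace):

```python
from huggingface_hub import HfApi

api = HfApi()
# After an epoch finishes, no model weight shards are present in the repo,
# only tokenizer and configuration files.
files = api.list_repo_files("alvarobartt/hub-strategy-every-save-mistral-sft")
print([f for f in files if f.endswith(".safetensors")])  # -> [] with the bug
```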

We've reproduced this with smaller models and it works as expected there, but whenever the model is over 5 GB or requires sharding, the weights are not pushed.
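That 5 GB threshold matches the default max_shard_size="5GB" that save_pretrained uses when serializing checkpoints, so the bug seems specific to sharded weight files (a minimal sketch of how such a sharded checkpoint is produced; the local path is hypothetical):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Any checkpoint larger than max_shard_size (default "5GB") is split into
# model-00001-of-0000N.safetensors shards plus model.safetensors.index.json;
# these shard files are exactly what never reaches the Hub.
model.save_pretrained("manual-save", max_shard_size="5GB")
```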

Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗
