
hub_strategy="every_save" won't push the model to the Hub if large #30724

Open
alvarobartt opened this issue May 9, 2024 · 0 comments

alvarobartt (Contributor) commented May 9, 2024

System Info

  • transformers version: 4.40.2
  • Platform: Linux-5.15.0-87-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.1+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes (A100 80GB SXM)
  • Using distributed or parallel set-up in script?: NA

Additionally:

  • trl version: 0.8.7.dev0

Who can help?

@muellerzr and @pacman100, also cc @philschmid and @lewtun as a follow-up on a recent conversation about this issue.

Information

  • [ ] The official example scripts
  • [x] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [x] My own task or dataset (give details below)

Reproduction

Note

The script used to identify this issue is not an official transformers script but one from trl; however, since trl's SFTTrainer, DPOTrainer, and similar trainers just subclass Trainer, I've opened the issue here, following Philipp's recommendation.

The script I'm using is trl/examples/scripts/sft.py, but any trl fine-tuning script behaves the same way: when using save_strategy="epoch", push_to_hub=True, and hub_strategy="every_save", the model weights are not pushed to the Hugging Face Hub after every epoch; only the tokenizer and configuration files are (see the sketch below).
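Concretely, the failing flag combination corresponds to roughly the following `TrainingArguments` (a minimal sketch; the trl script actually builds its config from the CLI via `HfArgumentParser`, and the `hub_model_id` below is just the repository name used in this reproduction):

```python
from transformers import TrainingArguments

# Minimal sketch of the flag combination that triggers the issue
# (the trl script parses these same values from the CLI).
training_args = TrainingArguments(
    output_dir="sft_openassistant-guanaco",
    num_train_epochs=3,
    save_strategy="epoch",      # checkpoint at the end of every epoch
    push_to_hub=True,           # create / reuse a Hub repository
    hub_strategy="every_save",  # push on every save, i.e. every epoch
    hub_model_id="hub-strategy-every-save-mistral-sft",
    hub_private_repo=True,
)
```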

To run the script mentioned above under the same settings:

```shell
python sft.py \
    --model_name_or_path="mistralai/Mistral-7B-v0.1" \
    --report_to="tensorboard" \
    --learning_rate=5e-5 \
    --dataset_name="timdettmers/openassistant-guanaco" \
    --dataset_train_split="train" \
    --dataset_test_split="test" \
    --torch_dtype="bfloat16" \
    --per_device_train_batch_size=16 \
    --gradient_accumulation_steps=2 \
    --output_dir="sft_openassistant-guanaco" \
    --logging_steps=1 \
    --num_train_epochs=3 \
    --push_to_hub \
    --gradient_checkpointing \
    --hub_strategy="every_save" \
    --hub_private_repo \
    --save_strategy="epoch" \
    --hub_model_id="hub-strategy-every-save-mistral-sft" \
    --optim=adamw_bnb_8bit
```

I've seen that happening for SFTTrainer, DPOTrainer, and ORPOTrainer in both single and multi-GPU setups. What's pushed to the Hub after every epoch is the following:

[Screenshot: the Hub repository after an epoch, containing only the tokenizer and configuration files, with no model weights]

The model is indeed pushed properly when calling trainer.push_to_hub() explicitly once training has finished.
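In other words, a final explicit push works as a stopgap (a minimal sketch, assuming `trainer` is the SFTTrainer instance created by the script):

```python
trainer.train()

# The per-epoch pushes only upload the tokenizer and configuration files,
# but this explicit call at the end does upload the model weights.
trainer.push_to_hub()
```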

Expected behavior

Ideally, when setting save_strategy="epoch", push_to_hub=True, and hub_strategy="every_save" (assuming the Hugging Face authentication is set up properly), the model weights available under the checkpoint-<STEP_NUM> directory within output_dir should be pushed along with the rest of the files (tokenizer and configuration).

Apparently only the latter are uploaded and the model weights are not, so that combination of flags should also upload the model to the Hub after every epoch.
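This is easy to verify from the Hub side (a minimal sketch using huggingface_hub; the repository id assumes the repo created by the command above lives under my namespace):

```python
from huggingface_hub import HfApi

api = HfApi()
# After an epoch finishes, no model weight shards are present in the repo,
# only tokenizer and configuration files.
files = api.list_repo_files("alvarobartt/hub-strategy-every-save-mistral-sft")
print([f for f in files if f.endswith(".safetensors")])  # -> [] with the bug
```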

We've reproduced this with smaller models and it works as expected there, but whenever the model is over 5 GB or requires sharding, the weights are not pushed.
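That 5 GB threshold matches the default max_shard_size="5GB" that save_pretrained uses when serializing checkpoints, so the bug seems specific to sharded weight files (a minimal sketch of how such a sharded checkpoint is produced; the local path is hypothetical):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
# Any checkpoint larger than max_shard_size (default "5GB") is split into
# model-00001-of-0000N.safetensors shards plus model.safetensors.index.json;
# these shard files are exactly what never reaches the Hub.
model.save_pretrained("manual-save", max_shard_size="5GB")
```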

Feel free to let me know if there's anything else you'd like me to do to help debug this issue further! Thanks in advance 🤗
