Could you please share some tips with your rich experience? #3452

Open
xiaochengsky opened this issue Apr 25, 2024 · 1 comment
Labels
pending This problem is yet to be addressed.

Comments

@xiaochengsky

Reminder

  • I have read the README and searched the existing issues.

Reproduction

It's an awesome project! Thank you for the wonderful contributions!

Here is an example of SFT in this repo using DeepSpeed:

```bash
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "xxx" \
    --do_train \
    --dataset alpaca_en \
    --dataset_dir ./data \
    --finetuning_type lora \
    --output_dir "xxx" \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --fp16 \
    --template default \
    --deepspeed "scripts/ds_z3_config_lora.json"
```
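
For context on the questions below, here is my understanding of the effective batch size implied by these flags (a minimal sketch, assuming the usual HF Trainer / DeepSpeed data-parallel semantics; please correct me if this is wrong):

```python
# Minimal sketch: under data parallelism, the effective global batch size is
#   num_gpus * per_device_train_batch_size * gradient_accumulation_steps
num_gpus = 8
per_device_train_batch_size = 16
gradient_accumulation_steps = 4

effective_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(effective_batch_size)  # 512 samples per optimizer step for the command above
```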

Here are some questions about multi-GPU fine-tuning:

  1. Does the learning rate need to be linearly scaled according to the number of GPUs and per_device_train_batch_size?
    e.g. now gpus=8, per_device_train_batch_size=16, lr=5e-5. So if gpus=4 and per_device_train_batch_size=4, lr should be ~6.25e-6, right? (A small arithmetic sketch follows the list.)

  2. Based on your rich experience, for general NLP tasks (e.g. ARC-c/ARC-e/BoolQ/HellaSwag/MMLU/OBQA/RTE/WinoGrande, and so on), how much loss reduction is considered good (e.g. lower than 1 for alpaca_en)?

  3. If the training loss decreases, does that mean the model will perform well on general NLP tasks?

  4. For base models (like Mixtral-8x7B, not Mixtral-8x7B-Instruct), does using a different template (default/alpaca/vicuna) affect their zero-shot performance on general NLP tasks?
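
For question 1, this is the arithmetic I have in mind, assuming the linear scaling rule (learning rate proportional to effective batch size); whether that rule is appropriate for LoRA SFT is exactly what I am asking:

```python
# Hypothetical sketch of the linear scaling rule: scale the learning rate by the
# ratio of effective batch sizes (gradient_accumulation_steps kept at 4 in both runs).
ref_lr = 5e-5
ref_batch = 8 * 16 * 4   # 512, the 8-GPU run above
new_batch = 4 * 4 * 4    # 64, with gpus=4 and per_device_train_batch_size=4

new_lr = ref_lr * new_batch / ref_batch
print(new_lr)  # 6.25e-06, matching the value in question 1
```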

I know you are very busy, but I am still looking forward to your reply. Thanks!

Expected behavior

No response

System Info

No response

Others

No response

@hiyouga added the pending label (This problem is yet to be addressed.) on Apr 25, 2024
@xiaochengsky
Author

Maybe I should update the first question.

  1. Does the learning rate need to be linearly scaled according to the number of GPUs and gradient_accumulation_steps (maybe per_device_train_batch_size isn't so critical, right)? See the sketch below.
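
To make the updated question concrete: my assumption is that per_device_train_batch_size and gradient_accumulation_steps enter the effective batch size symmetrically, so they should matter equally for any batch-size-based learning-rate rule. A minimal sketch under that assumption:

```python
# Both configurations below give the same effective batch size, so under a
# batch-size-based scaling rule the learning rate would be the same for both.
def effective_batch(num_gpus, per_device_bs, grad_accum):
    return num_gpus * per_device_bs * grad_accum

print(effective_batch(8, 16, 4))  # 512
print(effective_batch(8, 4, 16))  # 512, grad accum and per-device batch are interchangeable here
```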
