Reminder
Reproduction
It's an awesome project! Thank you for your wonderful contributions!
Here is an example SFT run using DeepSpeed:
deepspeed --num_gpus=8 src/train_bash.py \
    --stage sft \
    --model_name_or_path "xxx" \
    --do_train \
    --dataset alpaca_en \
    --dataset_dir ./data \
    --finetuning_type lora \
    --output_dir "xxx" \
    --overwrite_cache \
    --per_device_train_batch_size 16 \
    --gradient_accumulation_steps 4 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --plot_loss \
    --fp16 \
    --template default \
    --deepspeed "scripts/ds_z3_config_lora.json"
Here are some questions about multi-GPU fine-tuning:
Does the learning rate need to be scaled linearly with the number of GPUs and per_device_train_batch_size? For example, right now gpus=8, per_device_train_batch_size=16, lr=5e-5. So with gpus=4 and per_device_train_batch_size=4 the effective batch size is 8x smaller, and the learning rate should be about 5e-5 / 8 = 6.25e-6, right?
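
To make the arithmetic behind this question concrete, here is a small sketch assuming the linear-scaling heuristic (keep lr / effective_batch_size constant); whether that heuristic is actually the right rule for LoRA fine-tuning is exactly what I am asking, so this is only the computation, not a recommendation:

# Linear-scaling heuristic (my assumption): lr proportional to effective batch size.
def scaled_lr(base_lr, base_gpus, base_bs, new_gpus, new_bs, grad_accum=4):
    base_effective = base_gpus * base_bs * grad_accum
    new_effective = new_gpus * new_bs * grad_accum
    return base_lr * new_effective / base_effective

print(scaled_lr(5e-5, base_gpus=8, base_bs=16, new_gpus=4, new_bs=4))  # 6.25e-06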
Based on your experience, for general NLP tasks (e.g. ARC-c/ARC-e/BoolQ/HellaSwag/MMLU/OBQA/RTE/WinoGrande, and so on), how much of a reduction in training loss is considered good (e.g. lower than 1 for alpaca_en)?
If the training loss goes down, does that mean the model will perform well on general NLP tasks?
For base models (like Mixtral-8x7B, not Mixtral-8x7B-Instruct), does using a different template (default/alpaca/vicuna) affect their zero-shot performance on general NLP tasks?
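
To make this question concrete, here is roughly what I mean by different templates wrapping the same instruction; these strings are my approximations for illustration only, not the exact templates defined in this repo:

# Approximate prompt wrappers, for illustration only.
instruction = "Summarize the following paragraph."
alpaca_prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n" + instruction + "\n\n### Response:\n"
)
vicuna_prompt = (
    "A chat between a curious user and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "USER: " + instruction + " ASSISTANT:"
)
# The "default" template is this repo's own generic chat format; I have not reproduced it here to avoid guessing.

Since a base model was never trained to expect any particular wrapper, the surrounding text becomes part of the zero-shot prompt, which is why I wonder whether the choice matters.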
I know you are very busy, but I am still looking forward to your reply. Thanks!
Expected behavior
No response
System Info
No response
Others
No response
The text was updated successfully, but these errors were encountered:

Does the learning rate need to be scaled linearly with the number of GPUs and gradient_accumulation_steps (maybe per_device_train_batch_size isn't so critical, right)?