
Fine-Tuning Crashes for no reason when Eight GPU cards are used. #816

OscarC9912 opened this issue May 8, 2024 · 4 comments

@OscarC9912

Dear Developers at LMFlow:

I have been using LMFlow for a long time and the experience has been great!

But recently, after cloning the latest LMFlow and using it to fine-tune my model, I ran into an unexpected issue.

Specifically, when I use all 8 of my NVIDIA A100 cards, the fine-tuning program crashes without reporting any error. However, when I use only 6 or 7 cards, everything goes well.

Below is the output of the program:

[2024-05-08 14:29:20,246] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,246] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,398] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,399] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,458] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,458] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,499] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,500] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,537] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,537] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,538] [INFO] [comm.py:643:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-05-08 14:29:20,593] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,593] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,634] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,634] [INFO] [comm.py:616:init_distributed] cdb=None
[2024-05-08 14:29:20,650] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented
[2024-05-08 14:29:20,650] [INFO] [comm.py:616:init_distributed] cdb=None
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 5, device: cuda:5, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 6, device: cuda:6, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 4, device: cuda:4, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 0, device: cuda:0, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:21 - WARNING - lmflow.pipeline.finetuner - Process rank: 1, device: cuda:1, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 3, device: cuda:3, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 7, device: cuda:7, n_gpu: 1,distributed training: True, 16-bits training: True
05/08/2024 14:29:22 - WARNING - lmflow.pipeline.finetuner - Process rank: 2, device: cuda:2, n_gpu: 1,distributed training: True, 16-bits training: True
[2024-05-08 14:32:25,933] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906770
[2024-05-08 14:32:25,972] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906771
[2024-05-08 14:32:31,375] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906772
[2024-05-08 14:32:35,169] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906773
[2024-05-08 14:32:38,199] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906774
[2024-05-08 14:32:41,567] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906775
[2024-05-08 14:32:45,223] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906776
[2024-05-08 14:32:48,678] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 3906777

[2024-05-08 14:32:53,189] [ERROR] [launch.py:321:sigkill_handler] ['miniconda3/envs/lmflow/bin/python', '-u', 'examples/finetune.py', '--local_rank=7', '--model_name_or_path', 'meta-llama/Meta-Llama-3-8B', '--dataset_path', '/data', '--output_dir', '/model', '--overwrite_output_dir', '--num_train_epochs', '1', '--learning_rate', '1e-5', '--block_size', '512', '--per_device_train_batch_size', '32', '--deepspeed', 'configs/ds_config_zero3.json', '--fp16', '--run_name', 'chinese-llama3', '--validation_split_percentage', '20', '--logging_steps', '20', '--do_train', '--ddp_timeout', '72000', '--save_steps', '5000', '--use_flash_attention', 'True', '--dataloader_num_workers', '8'] exits with return code = -9

I am fairly sure I am specifying the GPU cards correctly via the DeepSpeed arguments:
deepspeed_args="--master_port=11012 --include localhost:0,1,2,3,4,5,6,7"
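
Roughly, the full launch looks like the following (just a sketch reconstructed from the command in the log above, with most flags omitted; dropping cards from --include is how I get the working 6- or 7-card runs):

deepspeed --master_port=11012 --include localhost:0,1,2,3,4,5,6,7 \
    examples/finetune.py \
    --model_name_or_path meta-llama/Meta-Llama-3-8B \
    --dataset_path /data \
    --output_dir /model \
    --deepspeed configs/ds_config_zero3.json \
    --fp16 --do_train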

Since I never encountered this problem with the older version, and after several failed experiments, I am coming here to ask for help.
I am not sure whether the problem is on the LMFlow side or on my side.

Thanks for your help ~

@research4pan
Contributor

research4pan commented May 8, 2024

Thanks for your interest in and recognition of LMFlow! Some of our collaborators have run into a similar issue. We were using CUDA 12.0 with a PyTorch build for CUDA 12.1, and similar problems occurred. It was resolved by switching to a PyTorch build that matches an older CUDA version (such as 11.8).

We suspect this problem is caused by a mismatch between the latest PyTorch build and the installed CUDA version. You may try adjusting the PyTorch version to see whether the problem still occurs. Hope this information is helpful 😄
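
As a minimal sketch of that workaround (the cu118 wheel index is PyTorch's official one; the exact torch version to install is up to your environment):

# check the driver's CUDA version and the CUDA version PyTorch was built against
nvidia-smi
python -c "import torch; print(torch.__version__, torch.version.cuda)"

# reinstall a PyTorch build targeting CUDA 11.8
pip install --force-reinstall torch --index-url https://download.pytorch.org/whl/cu118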

@OscarC9912
Author

Thanks for your reply!
I will try that!

Another issue is with run_all_benchmark.sh; specifically, when I run the script, it fails with the following error:

Traceback (most recent call last):
Selected Tasks: ['hellaswag', 'winogrande', 'arc_challenge', 'boolq', 'openbookqa', 'arc_easy', 'piqa']
  File "/ssddata/zchenhj/LMFlow/utils/lm_evaluator.py", line 108, in <module>
    main()
  File "/ssddata/zchenhj/LMFlow/utils/lm_evaluator.py", line 79, in main
    results = evaluator.simple_evaluate(
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/utils.py", line 161, in _wrapper
    return fn(*args, **kwargs)
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/evaluator.py", line 64, in simple_evaluate
    lm = lm_eval.models.get_model(model).create_from_arg_string(
  File "/ssddata/zchenhj/miniconda3/envs/lmflow/lib/python3.9/site-packages/lm_eval/models/__init__.py", line 16, in get_model
    return MODEL_REGISTRY[model_name]
KeyError: 'hf-causal-experimental'
[2024-05-08 16:53:38,340] [INFO] [launch.py:347:main] Process 4024922 exits successfully.

Looking into the code, I suspect that some parts of it have not been fully implemented yet, which then leads to this error?

I run the script with: bash run_all_benchmark.sh --model_name_or_path model_name
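
(In case it helps with diagnosis, here is a quick check I would run in the lmflow conda env, just a sketch: it prints the installed lm-eval version and the model names its registry actually contains, since the KeyError comes from that registry.)

pip show lm_eval
python -c "import lm_eval.models as m; print(sorted(m.MODEL_REGISTRY))"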

Thanks again for your help !

@research4pan
Contributor

@2003pro I am wondering if you can take a look at this?

@2003pro
Contributor

2003pro commented May 11, 2024

I suggest switching the lm-eval package's version back to 0.4.0, like this:

git clone -b v0.4.0 https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
pip install -e .

Also, if there are any further issues, you may check whether your transformers version is compatible. My environment's version is 4.33.3.
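
A minimal sketch of that check (pinning to 4.33.3 only because that is the version reported above; treat it as an example, not a hard requirement):

python -c "import transformers; print(transformers.__version__)"
pip install transformers==4.33.3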
