[BUG]: Running ColossalAI on H800 with torch 2.0 #5594
Comments
Hi,
I know. But I want to deploy ColossalAI on an NVIDIA H800 GPU, which only supports CUDA 12. With CUDA 12 I can only install PyTorch 2.0+, not 1.12. Could you give me some further suggestions?
Sorry, I think the current auto parallel feature is less performant and less popular, so we haven't adapted it to the newest version. Do you have a compelling reason to use it?
I don't necessarily have to use the Auto Parallel strategy. What I mean is that the official demos are all based on the Torch 1.12 API, but on H800 only Torch 2.0+ can be used, which means I can't deploy training plans on H800.
Other demos should work on torch 2.0.
Could you give me some examples? I have tried many of the training demos, but they all failed on torch 2.0 while succeeding on torch 1.12.
Could you try examples/language/gpt/gemini and examples/language/gpt/hybridparallelism? |
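For reference, those examples follow the same Booster-plus-plugin shape. A minimal sketch, assuming the GeminiPlugin/Booster API from around this time (the launch signature and plugin arguments vary between releases, and this is not the actual example script):

```python
# Minimal sketch of the structure used by the examples under
# examples/language/gpt/gemini; the real script's arguments may differ.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# Older releases take a config dict; newer ones drop this argument.
colossalai.launch_from_torch(config={})

plugin = GeminiPlugin()
booster = Booster(plugin=plugin)
# The model, optimizer, and dataloader are then wrapped before the usual loop:
# model, optimizer, _, dataloader, _ = booster.boost(model, optimizer, dataloader=dataloader)
```

Launched with colossalai run --nproc_per_node=<N> <script>.py, this is the path the maintainers say works on torch 2.0.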
I have fixed this, so pulling from the newest main branch should work.
Could you either install apex from source or set enable_all_optimization=False? Thanks. |
I have re-compiled and re-installed apex from source and re-run the programs, and got the following:
You'll need to either set enable_all_optimization=False or pip install flash-attn.
Should I set enable_all_optimization for ColossalAI or for apex?
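For context, the flag belongs to ColossalAI's plugin, not apex. A sketch of where it would be passed, assuming the HybridParallelPlugin constructor used by these examples (the tp_size/pp_size values are placeholders):

```python
from colossalai.booster.plugin import HybridParallelPlugin

# enable_all_optimization is the umbrella switch for fused kernels (apex fused
# norms, flash attention, etc.); setting it to False avoids those dependencies.
plugin = HybridParallelPlugin(
    tp_size=2,  # placeholder
    pp_size=1,  # placeholder
    enable_all_optimization=False,
)
```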
Fixed, thanks.
When I run the examples in ColossalAI/examples/language/gpt/hybridparallelism/ using the command colossalai run --nproc_per_node=2 finetune.py, I always get the following error:
Thanks for your issue. This is probably due to a recent transformers upgrade, so I've fixed it. |
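(As a quick sanity check, not from the thread: since the breakage came from a transformers upgrade, comparing the installed version against the pin in the example's requirements file is the fastest way to confirm the mismatch; no specific breaking version is quoted here, so none is hard-coded below.)

```python
# Compare the installed transformers version against the pin from the example's
# requirements.txt; the pin itself isn't quoted in this thread.
from packaging import version
import transformers

def matches_pin(pinned: str) -> bool:
    """Return True if the installed transformers version equals the given pin."""
    return version.parse(transformers.__version__) == version.parse(pinned)

print(f"transformers {transformers.__version__} installed")
```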
Thanks for your reply. Actually, I have launched two Docker containers on two separate machines. How can I configure the Docker addresses in the hostfile?
Please refer to similar examples in the PyTorch forum. You can either run Docker in host network mode or map a port from the container to the host.
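To make that concrete, here is a minimal rendezvous sketch with hypothetical addresses. With docker run --network host, the containers share each host's network, so the master's IP and port are directly reachable; otherwise the chosen port has to be published (e.g. -p 29500:29500):

```python
# Connectivity sketch for two containers on two machines (addresses hypothetical).
# torchrun / colossalai run set RANK and WORLD_SIZE; MASTER_ADDR must be the IP
# of node 0 as seen from the other machine, and MASTER_PORT must be reachable.
import os
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "192.168.1.10")  # hypothetical node-0 IP
os.environ.setdefault("MASTER_PORT", "29500")         # hypothetical open port

dist.init_process_group(
    backend="nccl",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
```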
When I run the examples in ColossalAI/examples/language/gpt/hybridparallelism/ using the command bash run.sh, I always get the following error: torch 2.1 fails to run on a Tesla V100 GPU...
This is not a bug on our end, as flash attention doesn't support V100, which is why it's throwing 'no kernel'. You should uninstall flash_attn.
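To confirm this on a given node, a quick capability check (FlashAttention 2's kernels require Ampere, sm_80 or newer, while V100 is sm_70):

```python
# Print the GPU's compute capability; FlashAttention has no kernels for V100
# (sm_70), so its ops raise "no kernel" errors there.
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: sm_{major}{minor}")
if (major, minor) < (8, 0):
    print("flash-attn kernels unavailable on this GPU; uninstall flash-attn "
          "or keep enable_all_optimization=False")
```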
When I uninstall flash-attn and re-run this example, I get a similar error. How can I run this example successfully?
🐛 Describe the bug
I am running the example code shown in https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel with PyTorch 2.0 (because I need to deploy ColossalAI on H800, which needs CUDA of at least 12.0, matched with PyTorch of at least 2.0).
However, I get the following error:
Then I replaced _checkpoint_without_reentrant_generator with _checkpoint_without_reentrant, re-ran colossalai run --nproc_per_node 4 auto_parallel_with_gpt.py, and got the following errors:
It seems that the current code is not compatible with the new version of the API (torch 2.0+).
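For illustration, one version-tolerant way around this kind of rename is an import shim over the two private names mentioned above (a sketch; these are private torch APIs, and which name exists depends on the exact torch release):

```python
# Sketch: tolerate the rename of torch's private non-reentrant checkpoint helper.
# Which of the two names is importable depends on the installed torch version.
try:
    from torch.utils.checkpoint import (
        _checkpoint_without_reentrant_generator as _ckpt_no_reentrant,
    )
except ImportError:
    from torch.utils.checkpoint import (
        _checkpoint_without_reentrant as _ckpt_no_reentrant,
    )
```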
Environment