
[BUG]: OOM when saving 70B model #5585

Open · jiejie1993 opened this issue Apr 11, 2024 · 2 comments
Labels: bug (Something isn't working)

@jiejie1993
πŸ› Describe the bug

I am training the llama2-70b model on 4 nodes × 8 H100 GPUs (80 GB each) with the "gemini" plugin. Training runs normally, but an OOM error occurs when saving the model checkpoint.
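For reference, here is a minimal sketch of the setup described above, matching the call chain in the log below (a Booster with the Gemini plugin, then a sharded optimizer save). The launch/boost boilerplate, the tiny stand-in model, and the default plugin arguments are assumptions based on the public ColossalAI Booster API, not the exact Colossal-LLaMA-2 training script.

```python
# Minimal sketch only: tiny stand-in model, default GeminiPlugin arguments.
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin
from colossalai.nn.optimizer import HybridAdam

colossalai.launch_from_torch(config={})   # distributed init under torchrun
booster = Booster(plugin=GeminiPlugin())  # the "gemini" plugin used in this report

model = torch.nn.Linear(16, 16)           # stand-in for the 70B model
optimizer = HybridAdam(model.parameters(), lr=1e-4)
model, optimizer, *_ = booster.boost(model, optimizer)

# The OOM happens inside this call: Gemini's checkpoint IO gathers the
# sharded optimizer states across the ZeRO group before writing each shard.
booster.save_optimizer(optimizer, "checkpoint/optimizer", shard=True)
```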

Here is the log:

```
Epoch 0:   0%| | 5/1119698 [01:42<6209:10:15, 19.96s/it, Loss=13.5324]
Start saving model checkpoint with running states
Traceback (most recent call last):
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 430, in <module>
    main()
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/train.py", line 390, in main
    save_checkpoint(
  File "/mnt/disk1/gyj/ColossalAI/applications/Colossal-LLaMA-2/colossal_llama2/utils/ckpt_io.py", line 56, in save_checkpoint
    booster.save_optimizer(optimizer, os.path.join(save_dir, "optimizer"), shard=True)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/booster.py", line 307, in save_optimizer
    self.checkpoint_io.save_optimizer(optimizer, checkpoint, shard, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/checkpoint_io_base.py", line 197, in save_optimizer
    self.save_sharded_optimizer(optimizer, checkpoint, gather_dtensor, prefix, size_per_shard)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/booster/plugin/gemini_plugin.py", line 191, in save_sharded_optimizer
    total_size = save_state_dict_shards(
  File "/mnt/disk1/gyj/ColossalAI/colossalai/checkpoint_io/utils.py", line 234, in save_state_dict_shards
    for idx, shard_pair in enumerate(sharded_state_dict):
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 799, in state_shard
    state = self.collect_states(param_id=param_id, only_rank_0=only_rank_0)
  File "/mnt/disk1/gyj/ColossalAI/colossalai/zero/gemini/gemini_optimizer.py", line 519, in collect_states
    dist.all_gather_object(gathered_state_shards, [compacted_states, shard_offset, shard_size], group=zero_group)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 2057, in all_gather_object
    object_list[i] = _tensor_to_object(tensor, tensor_size)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1955, in _tensor_to_object
    return _unpickler(io.BytesIO(buf)).load()
  File "/usr/local/lib/python3.10/dist-packages/torch/storage.py", line 241, in _load_from_bytes
    return torch.load(io.BytesIO(b))
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 815, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 1043, in _legacy_load
    result = unpickler.load()
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 980, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 217, in default_restore_location
    result = fn(storage, location)
  File "/usr/local/lib/python3.10/dist-packages/torch/serialization.py", line 185, in _cuda_deserialize
    return torch.UntypedStorage(obj.nbytes(), device=torch.device(location))
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 79.11 GiB total capacity; 1.40 GiB already allocated; 29.69 MiB free; 1.41 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
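The traceback ends inside `collect_states`: `dist.all_gather_object` pickles each rank's `[compacted_states, shard_offset, shard_size]` payload, and because the payload holds CUDA tensors, unpickling re-allocates them on the GPU (`_cuda_deserialize` in the last frames). That re-allocation is what runs out of memory. Below is a minimal sketch of that pattern together with a hypothetical mitigation (moving the shard to CPU before gathering); it is not ColossalAI's actual code.

```python
# Sketch only, not ColossalAI's collect_states. Illustrates why gathering
# CUDA tensors with all_gather_object costs extra GPU memory: the pickled
# payload is unpickled back onto the GPU on the receiving side.
import torch
import torch.distributed as dist

def gather_state_shard(compacted_states: dict, shard_offset: int,
                       shard_size: int, zero_group):
    gathered = [None] * dist.get_world_size(group=zero_group)
    # Hypothetical mitigation: move the shard to CPU first, so the unpickled
    # copies land in host memory instead of competing with training state on GPU.
    cpu_states = {k: v.detach().cpu() if torch.is_tensor(v) else v
                  for k, v in compacted_states.items()}
    dist.all_gather_object(gathered, [cpu_states, shard_offset, shard_size],
                           group=zero_group)
    return gathered
```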

Environment

4 nodes with 8 H100 GPUs each (80 GB per GPU)

CUDA: 12.1
CUDNN: 2.18.1
Python: 3.10
PyTorch: 2.0.0
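As a first thing to try, the hint at the end of the error message (PYTORCH_CUDA_ALLOC_CONF / max_split_size_mb) can be applied directly. A sketch follows; the 128 MB value is an arbitrary example, not a tested setting, and the variable has to be set before the first CUDA allocation (e.g. exported in the torchrun launch script).

```python
# Sketch of the allocator hint from the error message itself; the value is an
# arbitrary example. Must run before the first CUDA allocation in the process.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch
if torch.cuda.is_available():
    _ = torch.empty(1, device="cuda")  # caching allocator initializes with the setting above
```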

jiejie1993 added the bug (Something isn't working) label on Apr 11, 2024
@Issues-translate-bot

Bot detected the issue body's language is not English, translated it automatically.


Title: [BUG]: OOM when saving 70B model

@Edenzzzz
Contributor

@ver217 any insights?
