[BUG] Jamba (Mamba+MoE) + ZeRO3 + LoRA training hangs #5502

Open
hijkzzz opened this issue May 6, 2024 · 0 comments
Labels: bug, training

Comments


hijkzzz commented May 6, 2024

Model Link: https://huggingface.co/ai21labs/Jamba-v0.1

Reproduce script in OpenRLHF: https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_jamba_lora.sh

To reproduce, please `pip install mamba-ssm causal-conv1d>=1.2.0`
and set `--micro_train_batch_size 1` (IMPORTANT: the hang only occurs with a micro batch size of 1;
`--micro_train_batch_size 4` works well).
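
For convenience, a minimal reproduction sketch (assuming a local checkout of OpenRLHF and that you edit the example script so its micro batch size is 1):

```bash
# Install the Mamba kernels required by Jamba
pip install mamba-ssm "causal-conv1d>=1.2.0"

# Get the OpenRLHF examples
git clone https://github.com/OpenLLMAI/OpenRLHF.git
cd OpenRLHF

# Edit examples/scripts/train_sft_jamba_lora.sh so that
# --micro_train_batch_size is 1 (the hang only reproduces at 1; 4 works),
# then launch it:
bash examples/scripts/train_sft_jamba_lora.sh
```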

Mixtral + ZeRO3 + LoRA works well with the same hyperparameters (`--micro_train_batch_size 1`):
see https://github.com/OpenLLMAI/OpenRLHF/blob/main/examples/scripts/train_sft_mixtral_lora.sh

hijkzzz added the bug and training labels on May 6, 2024