[Hotfix/randomizer]Fix Randomizer error on CPU #5265

Open
wants to merge 1 commit into
base: main

Orion-Zheng
Contributor

📌 Checklist before creating the PR

  • I have created an issue for this PR for traceability
  • The title follows the standard format: [doc/gemini/tensor/...]: A concise description
  • I have added relevant tags if possible for us to better distinguish different PRs

🚨 Issue number

Link this PR to your issue with words like fixed to automatically close the linked issue upon merge
fixed #5211

📝 What does this PR do?

Summarize your work here.

Previously, Randomizer did not support CPU because of _dispatch_device_func, which left OpenMoE unable to initialize parameters on CPU. I ran into this while converting a 34B OpenMoE checkpoint: the checkpoint is 166GB, so I can only convert it in CPU RAM (I don't have that much GPU memory).

# colossalai/utils/device.py
# Old version
def _dispatch_device_func(fn_name: str, *args, **kwargs):
    if torch.cuda.is_available():
        return getattr(torch.cuda, fn_name)(*args, **kwargs)
    elif IS_NPU_AVAILABLE:
        return getattr(torch.npu, fn_name)(*args, **kwargs)
    else:  # Running on CPU will cause an error
        raise RuntimeError("No device available")

The error is raised from colossalai/moe/experts.py#L88, where set_rng_state and get_rng_state are called during parameter initialization. On CPU, the RNG state can be manipulated with torch.get_rng_state and torch.set_rng_state. I therefore modified _dispatch_device_func so that it dispatches supported functions such as get/set_rng_state to the CPU, and still raises an error for functions that are not supported on CPU, such as get/set_rng_state_all.
In addition to fixing the bug, I added some conditional checks to make the code more robust. The modified code is compatible with the original behavior. After the change, you can run the code below to verify the fix:

from colossalai.shardformer.layer.utils import Randomizer
Randomizer(42).fork_rng(enable_cpu=True)  # won't raise error if run on CPU
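
For readers who want the shape of the dispatch change without opening the diff, here is a minimal sketch of the fallback pattern described above. It is not the code in this PR: `dispatch_device_func`, `CPU_SUPPORTED`, `cpu_backend`, and the `accelerator` parameter are hypothetical names, and the CPU backend is a stand-in built on Python's `random` module so the example runs without PyTorch or an accelerator. In the real function, the accelerator would be `torch.cuda` or `torch.npu`, and the CPU fallback would be the top-level `torch` module.

```python
import random
from types import SimpleNamespace

# Stand-in for the CPU backend (plays the role of the top-level `torch` module).
_cpu_rng = random.Random(42)
cpu_backend = SimpleNamespace(
    get_rng_state=_cpu_rng.getstate,
    set_rng_state=_cpu_rng.setstate,
)

# Hypothetical allow-list of functions that have a CPU equivalent.
CPU_SUPPORTED = frozenset({"get_rng_state", "set_rng_state"})


def dispatch_device_func(fn_name, *args, accelerator=None, **kwargs):
    """Call fn_name on the accelerator if one is given, else fall back to CPU."""
    if accelerator is not None:
        # Accelerator present: same behavior as before the fix.
        return getattr(accelerator, fn_name)(*args, **kwargs)
    if fn_name in CPU_SUPPORTED:
        # No accelerator, but the function has a CPU equivalent.
        return getattr(cpu_backend, fn_name)(*args, **kwargs)
    # No accelerator and no CPU equivalent (e.g. get_rng_state_all).
    raise RuntimeError(f"{fn_name} is not supported on CPU")


# Save the state, draw a number, restore the state, and draw again.
state = dispatch_device_func("get_rng_state")
first = _cpu_rng.random()
dispatch_device_func("set_rng_state", state)
second = _cpu_rng.random()
```

Because the state is restored before the second draw, `first` and `second` are equal, which is the property parameter initialization relies on. Calling `dispatch_device_func("get_rng_state_all")` without an accelerator still raises `RuntimeError`, matching the behavior kept for unsupported functions.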

💥 Checklist before requesting a review

  • I have linked my PR to an issue (instruction)
  • My issue clearly describes the problem/feature/proposal, with diagrams/charts/table/code if possible
  • I have performed a self-review of my code
  • I have added thorough tests.
  • I have added docstrings for all the functions/methods I implemented

⭐️ Do you enjoy contributing to Colossal-AI?

  • 🌝 Yes, I do.
  • 🌚 No, I don't.

Tell us more if you don't enjoy contributing to Colossal-AI.

@Orion-Zheng Orion-Zheng requested a review from a team as a code owner January 13, 2024 04:48
@Orion-Zheng Orion-Zheng changed the title Fix Randomizer error on CPU [Hotfix/randomizer]Fix Randomizer error on CPU Jan 13, 2024
Development

Successfully merging this pull request may close these issues.

[BUG]: Randomizer Raise Error When running on the CPU