
Issue #37: WIP - M1 NCCL Error - Utilizing Llama2 M1 Bug Fix #44

Open · wants to merge 1 commit into main

Conversation

JamesHighsmith

Description

This is an initial attempt to fix the M1 NCCL error from issue #37 by applying the Llama 2 M1 bug fix. The changes introduced in this PR are not yet fully functional, and further work is required to resolve the issue.

The main changes include:

  • Detecting the device (MPS, CUDA, or CPU) and setting the appropriate device for tensor operations (see the first sketch after this list).
  • Updating the build function in llama/generation.py to handle different device types and set the default tensor type accordingly.
  • Modifying the apply_rotary_emb and repeat_kv functions in llama/model.py to move tensors to the appropriate device.
  • Updating the forward method in llama/model.py to handle device-specific operations for the attention mask.
  • Modifying the decode method in llama/tokenizer.py to filter out invalid token IDs (-1) from the decoded output (see the second sketch below).
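
For context, a minimal sketch of the device-selection logic described in the first two items. The helper name get_device is illustrative, not the exact code in this PR, and it uses the non-deprecated dtype/device setters rather than torch.set_default_tensor_type(), which triggers the deprecation warning visible in the log further down:

import torch

def get_device() -> torch.device:
    # Prefer Apple's Metal backend (MPS) on M1, then CUDA, then CPU.
    if torch.backends.mps.is_available():
        return torch.device("mps")
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")

device = get_device()
# Half precision off-CPU is an assumption here, mirroring the CUDA path.
torch.set_default_dtype(torch.float16 if device.type != "cpu" else torch.float32)
torch.set_default_device(device)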
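
And a sketch of the llama/tokenizer.py change that filters the -1 padding IDs before detokenizing; the self.model attribute (a tiktoken Encoding in Llama 3) is an assumption:

from typing import List

def decode(self, t: List[int]) -> str:
    # Drop the -1 sentinel used for unfilled generation slots before
    # handing the token IDs to the underlying encoder.
    valid_ids = [tok for tok in t if tok != -1]
    return self.model.decode(valid_ids)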

Resources:

Please note that this is a work in progress: the changes introduced in this PR may not completely resolve the issue and may introduce new bugs. Further testing and debugging are required.

Per best practices at Meta/Facebook, a draft pull request should be opened first to solicit feedback from the community before the changes are finalized. This allows for collaborative problem-solving and ensures that the proposed solution aligns with the project's goals and coding standards.

Attempt to fix: #37

@facebook-github-bot added the CLA Signed label (managed by the Meta Open Source bot) on Apr 18, 2024

IFFranciscoME commented Apr 26, 2024

Hey, thanks for the work so far, @JamesHighsmith. I downloaded the llama-3 files, cloned the repo, and tried to use the models in the simplest way possible.

Running this in the terminal

system_profiler SPSoftwareDataType SPHardwareDataType

resulted in:

Software:

    System Software Overview:

      System Version: macOS 14.4.1 (23E224)
      Kernel Version: Darwin 23.4.0
      Boot Volume: Macintosh HD
      Boot Mode: Normal
      Computer Name: XXXXX
      User Name: XXXXX
      Secure Virtual Memory: Enabled
      System Integrity Protection: Enabled
      Time since boot: 3 days, 2 hours, 44 minutes

Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: MacBookPro18,2
      Model Number: 5K1A3LL/A
      Chip: Apple M1 Max
      Total Number of Cores: 10 (8 performance and 2 efficiency)
      Memory: 32 GB
      System Firmware Version: 10151.101.3
      OS Loader Version: 10151.101.3
      Serial Number (system): M4Q4035QXF
      Hardware UUID: E81B4AED-6FCB-57C3-AB0A-159F4D5333CA
      Provisioning UDID: 00006001-001248C12E23401E
      Activation Lock Status: Enabled

After cloning your branch, creating a virtualenv, and installing everything from requirements.txt, I ran the following in the terminal:

torchrun --nproc_per_node 1 example_text_completion.py \
    --ckpt_dir Meta-Llama-3-8B/ \
    --tokenizer_path Meta-Llama-3-8B/tokenizer.model \
    --max_seq_len 128 --max_batch_size 4

Note: setting PYTORCH_ENABLE_MPS_FALLBACK='1' made no difference; in both cases it produces the same output.

The run resulted in the following:

[2024-04-26 16:19:59,268] torch.distributed.elastic.multiprocessing.redirects: [WARNING] NOTE: Redirects are currently not supported in Windows or MacOs.
> initializing model parallel with size 1
> initializing ddp with size 1
> initializing pipeline with size 1
/opt/homebrew/lib/python3.11/site-packages/torch/__init__.py:696: UserWarning: torch.set_default_tensor_type() is deprecated as of PyTorch 2.1, please use torch.set_default_dtype() and torch.set_default_device() as alternatives. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/torch/csrc/tensor/python_tensor.cpp:453.)
  _C._set_default_tensor_type(t)
Loaded in 153.30 seconds
Traceback (most recent call last):
  File "/Users/franciscome/git/iteralabs/llama_models/llama_3/example_text_completion.py", line 64, in <module>
    fire.Fire(main)
  File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/franciscome/git/iteralabs/llama_models/llama_3/example_text_completion.py", line 51, in main
    results = generator.text_completion(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/franciscome/git/iteralabs/llama_models/llama_3/llama/generation.py", line 274, in text_completion
    generation_tokens, generation_logprobs = self.generate(
                                             ^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/franciscome/git/iteralabs/llama_models/llama_3/llama/generation.py", line 215, in generate
    torch.isin(next_token, stop_tokens)
NotImplementedError: The operator 'aten::isin.Tensor_Tensor_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
[2024-04-26 16:24:19,585] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 80227) of binary: /opt/homebrew/opt/python@3.11/bin/python3.11
Traceback (most recent call last):
  File "/opt/homebrew/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
example_text_completion.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-04-26_16:24:19
  host      : 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 80227)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
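
For reference, since aten::isin has no MPS kernel yet, one possible workaround, given that the fallback environment variable reportedly made no difference here, is to express the membership test with broadcasted comparisons, which MPS does implement. A hedged sketch, not tested against this branch:

import torch

def isin_mps_friendly(elements: torch.Tensor, test_elements: torch.Tensor) -> torch.Tensor:
    # Equivalent to torch.isin(elements, test_elements) for 1-D inputs:
    # broadcast an elementwise equality check, then reduce with any().
    return (elements.unsqueeze(-1) == test_elements).any(dim=-1)

# In llama/generation.py, the failing call
#     torch.isin(next_token, stop_tokens)
# could become
#     isin_mps_friendly(next_token, stop_tokens)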
