Is it possible to use AirLLM with a quantized input model? #117

Open
Verdagon opened this issue Mar 10, 2024 · 3 comments

@Verdagon

Hi there! Thanks for this amazing library. I was able to run a 70B model on my M2 MacBook Pro!

I'm getting about one token every 100 seconds, which is almost good enough for my overnight tasks, but I'm hoping to get it down to 20 seconds per token.

Is it possible to quantize the input model to make it faster?

I've tried quantizing with llama.cpp, but I think its output format isn't something AirLLM can load. I see that PyTorch has its own quantization support, but I can't figure out how to use it with AutoModel.

Any pointers in the right direction would help. Thanks!
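
For reference, this is roughly what my main.py looks like, pieced together from the README example (the model ID and the tokenizer/generate arguments are placeholders from memory, so details may be off):

```python
from airllm import AutoModel

MAX_LENGTH = 128

# Placeholder repo ID; any Llama-style 70B model on HuggingFace should behave the same way.
model = AutoModel.from_pretrained("meta-llama/Llama-2-70b-hf")

input_text = ["What should I cook for dinner tonight?"]

# Tokenizer/generate arguments follow the README example; the Mac/MLX path
# may want the ids wrapped differently, so treat this as approximate.
input_tokens = model.tokenizer(input_text,
                               return_tensors="np",
                               return_attention_mask=False,
                               truncation=True,
                               max_length=MAX_LENGTH,
                               padding=False)

# Layers are streamed from disk one at a time, which is why generation
# is so slow on my machine (~100 s/token).
generation_output = model.generate(
    input_tokens["input_ids"],
    max_new_tokens=20,
    use_cache=True,
    return_dict_in_generate=True)

print(generation_output)
```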

@Verdagon (Author) commented Mar 10, 2024

I just re-read the README and learned about the compression option!

However, it doesn't quite work; I get this error:

Traceback (most recent call last):
  File "/Users/verdagon/AirLLM/air_llm/main.py", line 12, in <module>
    model = AutoModel.from_pretrained(
  File "/Users/verdagon/AirLLM/air_llm/airllm/auto_model.py", line 49, in from_pretrained
    return AirLLMLlamaMlx(pretrained_model_name_or_path, *inputs, ** kwargs)
  File "/Users/verdagon/AirLLM/air_llm/airllm/airllm_llama_mlx.py", line 224, in __init__
    self.model_local_path, self.checkpoint_path = find_or_create_local_splitted_path(model_local_path_or_repo_id,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 351, in find_or_create_local_splitted_path
    return Path(model_local_path_or_repo_id), split_and_save_layers(model_local_path_or_repo_id, layer_shards_saving_path,
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 303, in split_and_save_layers
    layer_state_dict = compress_layer_state_dict(layer_state_dict, compression)
  File "/Users/verdagon/AirLLM/air_llm/airllm/utils.py", line 169, in compress_layer_state_dict
    v_quant, quant_state = bnb.functional.quantize_blockwise(v.cuda(), blocksize=2048)
  File "/Users/verdagon/Library/Python/3.9/lib/python/site-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

I tried changing that v.cuda() to v.cpu(), but it didn't help; instead I get an error deeper inside bitsandbytes.

Reading the bitsandbytes docs, it sounds like bitsandbytes is a CUDA library, so I'm guessing this compression feature is only meant for CUDA machines. They're working on Mac support, but it's not done yet. Unfortunate!

Hopefully there's a way to quantize the input instead.
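
For completeness, this is the kind of call that produces the traceback above, using the compression option from the README (I tried '4bit'; the model ID is again a placeholder):

```python
from airllm import AutoModel

# compression triggers block-wise quantization of each layer shard via
# bitsandbytes (utils.py line 169 in the traceback), which is where the
# hard CUDA requirement comes from.
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-70b-hf",  # placeholder model ID
    compression='4bit')
```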

@Verdagon (Author)

Looking at the code more, it looks like AirLLM only supports the PyTorch and safetensors file formats. This might work if I can get a quantized model into one of those.
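
As a rough, untested illustration of what I mean, PyTorch's dynamic quantization can at least produce a state dict in the pytorch format, though I suspect AirLLM's layer splitter wouldn't understand the packed int8 tensors:

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder model ID; using a small model just to illustrate the
# file-format question, since a 70B model won't fit in memory here.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Dynamic quantization converts nn.Linear weights to int8 and runs on CPU.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# This saves in the pytorch format AirLLM reads...
torch.save(qmodel.state_dict(), "pytorch_model.bin")
# ...but the packed int8 tensor names/shapes probably don't match what
# AirLLM's per-layer splitting expects, which is the open question here.
```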

lyogavin self-assigned this Apr 21, 2024
lyogavin added the enhancement (New feature or request) label Apr 21, 2024
@lyogavin (Owner)

will add.
