Replies: 4 comments 5 replies
-
There is zero prior progress to my knowledge. AutoGPTQ relies on three types of C/C++ kernels in order to do quantisation, none of which currently build or run on macOS.
In order to get macOS support, which I know would be appreciated by a lot of people, you would need to write an equivalent kernel, which I assume would make use of PyTorch's MPS backend, which provides GPU acceleration on macOS systems.

I have a lot of experience with using and quantising GPTQ models, and I have both an Intel and an Apple Silicon macOS system (MBP 2023 M2 Ultra), so I would be glad to help with testing. But I can't help with the coding, I'm afraid.

I don't know if you have any experience with llama.cpp, but they have full GPU-accelerated Apple Silicon support. They use a different quantisation method and they don't use PyTorch at all, but it could perhaps be useful for reference, as a source of info on GPU-accelerated quantised LLM inference on Apple Silicon from C++.

Let me know if you decide to take on the project!
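To make the "equivalent kernel" idea concrete, here is a minimal sketch (not part of AutoGPTQ) of the pure-PyTorch fallback path such a port might start from: unpack 4-bit GPTQ-style weights and do the matmul on the `mps` device. The packing layout (8 nibbles per int32, per-group scales/zeros) is an assumption for illustration; AutoGPTQ's real CUDA/Triton kernels fuse dequantisation into the matmul rather than materialising fp16 weights, and MPS op coverage for bitwise shifts varies by PyTorch version.

```python
# Hypothetical sketch: dequantize 4-bit GPTQ-style packed weights in pure
# PyTorch so the matmul can run on the MPS device. Layout is assumed:
# 8 x 4-bit values packed per int32 along the input dimension.
import torch

def dequant_4bit(qweight: torch.Tensor,   # int32 [in_features // 8, out_features]
                 scales: torch.Tensor,    # fp16  [n_groups, out_features]
                 zeros: torch.Tensor,     # fp16  [n_groups, out_features]
                 group_size: int = 128) -> torch.Tensor:
    """Unpack 8 nibbles from each int32 and apply per-group scale/zero-point."""
    shifts = torch.arange(0, 32, 4, device=qweight.device, dtype=torch.int32)
    # [in//8, 1, out] >> [1, 8, 1] -> [in//8, 8, out] -> [in, out]
    w = (qweight.unsqueeze(1) >> shifts.view(1, -1, 1)) & 0xF
    w = w.reshape(-1, qweight.shape[1]).to(scales.dtype)
    groups = torch.arange(w.shape[0], device=w.device) // group_size
    return (w - zeros[groups]) * scales[groups]

device = "mps" if torch.backends.mps.is_available() else "cpu"
info = torch.iinfo(torch.int32)
qweight = torch.randint(info.min, info.max, (512 // 8, 256),
                        dtype=torch.int32, device=device)
scales = torch.rand(512 // 128, 256, device=device, dtype=torch.float16)
zeros = torch.full((512 // 128, 256), 8.0, device=device, dtype=torch.float16)

x = torch.randn(1, 512, device=device, dtype=torch.float16)
y = x @ dequant_4bit(qweight, scales, zeros)   # matmul runs on the GPU via MPS
print(y.shape)  # torch.Size([1, 256])
```

A real port would want the fused dequant-matmul written as a Metal kernel for performance; this only demonstrates correctness of the unpack-then-matmul path.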
-
Hi. Any updates on this?
-
Hi @ozak, unfortunately not. Maybe some folks from Apple have had a look at mixed-precision quantization in https://github.com/ml-explore/mlx?
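For anyone exploring that route, here is a minimal sketch of MLX's built-in group quantization API (signatures may differ across MLX versions; check the current docs): round-trip a weight matrix through 4-bit group quantization and run the fused quantized matmul on Apple Silicon.

```python
# Sketch of MLX group quantization on Apple Silicon (API subject to change).
import mlx.core as mx

w = mx.random.normal((256, 512))                    # fp32 weights [out, in]
wq, scales, biases = mx.quantize(w, group_size=64, bits=4)

x = mx.random.normal((1, 512))
# Fused kernel: computes x @ w.T directly from the packed 4-bit weights.
y = mx.quantized_matmul(x, wq, scales, biases,
                        transpose=True, group_size=64, bits=4)
print(y.shape)  # (1, 256)

# Recover an approximate fp32 copy to measure quantization error.
w_hat = mx.dequantize(wq, scales, biases, group_size=64, bits=4)
print(mx.abs(w - w_hat).max())
```

Note this is MLX's own round-to-nearest group quantization, not GPTQ; porting GPTQ's error-compensating quantisation procedure on top of it would be a separate effort.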
-
The issue is that many models on Hugging Face use
-
Hi folks, I'm looking to install the auto-gptq packages in a conda environment on a new Apple MacBook Pro M3 Max (128 GB). PyTorch is working fine, but it looks like auto-gptq needs dev work to be compatible with Apple Silicon?

Any suggestions on where to begin? I'm willing to help implement this, but I need details on prior progress, if any, before I spend countless days diving into the code.
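As a first sanity check (standard PyTorch APIs, nothing auto-gptq-specific), it's worth confirming the MPS backend actually works in the conda environment before digging into the kernel code:

```python
# Verify the PyTorch build in this env has a working MPS backend.
import torch

print(torch.__version__)
print("MPS built:    ", torch.backends.mps.is_built())
print("MPS available:", torch.backends.mps.is_available())

if torch.backends.mps.is_available():
    x = torch.randn(1024, 1024, device="mps")
    print((x @ x).sum().item())   # forces execution on the GPU
```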
Respectfully,