where can I download the predictor of Relu-Falcon-40B (float16)? #120

Open
chenglimin opened this issue Jan 15, 2024 · 7 comments
Labels
question Further information is requested

Comments

@chenglimin

Are the predictors at https://huggingface.co/PowerInfer/ReluFalcon-40B-Predictor for Relu-Falcon-40B (float16) or for Relu-Falcon-40B (int4)? If they are for int4, where can I download the predictors for Relu-Falcon-40B (float16)?

chenglimin added the question label on Jan 15, 2024
@hodlen
Collaborator

hodlen commented Jan 22, 2024

Yes. All predictors we published are in FP16. To use them with an FP16 model, you can convert the model and predictor into PowerInfer GGUF as described in our README.

If you want to run an INT4-quantized model + predictor, you can quantize the generated FP16 model, and the predictor will be quantized at the same time.
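
If you want to verify the dtype yourself, a quick sanity check like the sketch below works (a sketch only; the file name is a placeholder and it assumes the predictor checkpoints are regular PyTorch .pt files):

import torch

# Placeholder path: point this at one of the downloaded predictor checkpoint files.
ckpt = torch.load("ReluFalcon-40B-Predictor/predictor_checkpoint.pt", map_location="cpu")

# A checkpoint is usually either a single tensor or a state dict of name -> tensor.
if torch.is_tensor(ckpt):
    print(ckpt.dtype)
else:
    for name, tensor in ckpt.items():
        print(name, tensor.dtype)  # expect torch.float16 for FP16 predictors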

@chenglimin
Author

I converted the model and predictor of Falcon-40B into PowerInfer GGUF as described in your README, and kept the directory layout as shown there. However, it fails with the following error and nothing is offloaded to the GPU:

ggml_cuda_set_main_device: using device 0 (NVIDIA A100 80GB PCIe) as main device
llm_load_sparse_model_tensors: mem required = 62456.22 MB
llm_load_sparse_model_tensors: VRAM used: 23903.56 MB
...................................................................................................
invoking powerinfer Python module to generate gpu split for 55896.81 MiB of VRAM
solver args: Namespace(activation='./ReluFalcon-40B-PowerInfer-GGUF/activation', neuron=32768, capacity=1788696, layer=60, vram_capacity=58612056064, batch=256, threshold=0, output='./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf.generated.gpuidx')
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/main.py", line 25, in
solved = solve_gpu_split(
^^^^^^^^^^^^^^^^
File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/solver.py", line 23, in solve_gpu_split
freq, _ = torch.sort(freq, descending=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: sort() received an invalid combination of arguments - got (collections.OrderedDict, descending=bool), but expected one of:

  • (Tensor input, *, bool stable, int dim, bool descending, tuple of Tensors out)
  • (Tensor input, int dim, bool descending, *, tuple of Tensors out)
  • (Tensor input, *, bool stable, name dim, bool descending, tuple of Tensors out)
  • (Tensor input, name dim, bool descending, *, tuple of Tensors out)

llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
............................................................ done (3.64 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU


@hodlen
Collaborator

hodlen commented Jan 25, 2024

Can you confirm that your PyTorch version aligns with our requirements.txt? It seems like an incompatibility in the PyTorch API.
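
For reference, you can print the exact installed version like this (plain PyTorch, nothing PowerInfer-specific):

import torch

# Report the installed PyTorch version so we can compare environments.
print(torch.__version__)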

@chenglimin
Author

Here are the contents of your requirements.txt:

numpy>=1.24.4
sentencepiece>=0.1.98
transformers>=4.33.2
-e ./gguf-py
-e ./powerinfer-py

Here are my package versions:

numpy 1.26.2
sentencepiece 0.1.99
transformers 4.36.2


@hodlen
Collaborator

hodlen commented Jan 27, 2024

I tested the code around the error, shown below, and I believe it is some kind of PyTorch incompatibility.

# Load and sort activation data for each layer
freq = torch.load(f"{activation_path}/activation_{i}.pt")
freq, _ = torch.sort(freq, descending=True)

We assume freq is a tensor, and it is in our environment with PyTorch 2.1.2. But if PyTorch loads freq as an OrderedDict, things can break. So, can you try with the same PyTorch version and see if the bug still exists?
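
If it helps to narrow this down, here is a minimal diagnostic sketch (not part of the repo; the activation path and layer count are just taken from the solver args in your log) that reports what torch.load actually returns for each activation file:

import torch

# Hypothetical diagnostic, not from the PowerInfer codebase.
# Path and layer count mirror the solver args printed in the log above.
activation_path = "./ReluFalcon-40B-PowerInfer-GGUF/activation"
num_layers = 60

for i in range(num_layers):
    obj = torch.load(f"{activation_path}/activation_{i}.pt", map_location="cpu")
    if torch.is_tensor(obj):
        print(f"activation_{i}.pt: Tensor, shape={tuple(obj.shape)}, dtype={obj.dtype}")
    else:
        # An OrderedDict here would reproduce the TypeError raised in solver.py
        print(f"activation_{i}.pt: unexpected type {type(obj).__name__}")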

@chenglimin
Author

My PyTorch version is also 2.1.2, as shown in the attached screenshot. And when I run with the LLaMA-13B model, this problem never appears.
[screenshot showing PyTorch version 2.1.2]


@hodlen
Collaborator

hodlen commented Jan 29, 2024

Hmm. Were the activation files corrupted or manually renamed at some point? They should be in the same format for all model architectures. I would suggest purging and re-downloading all of these files to make sure everything is clean and as expected.
