where can I download the predictor of Relu-Falcon-40B (float16)? #120

Open
chenglimin opened this issue Jan 15, 2024 · 7 comments
Labels
question Further information is requested

Comments

@chenglimin

Are the predictors at https://huggingface.co/PowerInfer/ReluFalcon-40B-Predictor for Relu-Falcon-40B (float16) or for Relu-Falcon-40B (int4)? If they are for int4, where can I download the predictors for Relu-Falcon-40B (float16)?

chenglimin added the question label on Jan 15, 2024
@hodlen
Collaborator

hodlen commented Jan 22, 2024

Yes. All predictors we published are in FP16. To use them with an FP16 model, you can convert the model and predictor into PowerInfer GGUF as described in our README.

If you want to run an INT4-quantized model + predictor, you can quantize the generated FP16 model, and the predictor will be quantized at the same time.
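
If you want to verify the dtype yourself, a quick sanity check like the sketch below works (a sketch only; the file name is a placeholder and it assumes the predictor checkpoints are regular PyTorch .pt files):

import torch

# Placeholder path: point this at one of the downloaded predictor checkpoint files.
ckpt = torch.load("ReluFalcon-40B-Predictor/predictor_checkpoint.pt", map_location="cpu")

# A checkpoint is usually either a single tensor or a state dict of name -> tensor.
if torch.is_tensor(ckpt):
    print(ckpt.dtype)
else:
    for name, tensor in ckpt.items():
        print(name, tensor.dtype)  # expect torch.float16 for FP16 predictors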

@chenglimin
Author

I converted the model and predictor of Falcon-40B into PowerInfer GGUF as described in your README, and kept the directory layout as shown there. However, it fails with the following error and nothing is offloaded to the GPU:

ggml_cuda_set_main_device: using device 0 (NVIDIA A100 80GB PCIe) as main device
llm_load_sparse_model_tensors: mem required = 62456.22 MB
llm_load_sparse_model_tensors: VRAM used: 23903.56 MB
...................................................................................................
invoking powerinfer Python module to generate gpu split for 55896.81 MiB of VRAM
solver args: Namespace(activation='./ReluFalcon-40B-PowerInfer-GGUF/activation', neuron=32768, capacity=1788696, layer=60, vram_capacity=58612056064, batch=256, threshold=0, output='./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.powerinfer.gguf.generated.gpuidx')
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/main.py", line 25, in
solved = solve_gpu_split(
^^^^^^^^^^^^^^^^
File "/home/chenglimin/speedup/PowerInfer/powerinfer-py/powerinfer/solver.py", line 23, in solve_gpu_split
freq, _ = torch.sort(freq, descending=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: sort() received an invalid combination of arguments - got (collections.OrderedDict, descending=bool), but expected one of:

  • (Tensor input, *, bool stable, int dim, bool descending, tuple of Tensors out)
  • (Tensor input, int dim, bool descending, *, tuple of Tensors out)
  • (Tensor input, *, bool stable, name dim, bool descending, tuple of Tensors out)
  • (Tensor input, name dim, bool descending, *, tuple of Tensors out)

llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
............................................................ done (3.64 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU


@hodlen
Collaborator

hodlen commented Jan 25, 2024

Can you confirm that your PyTorch version aligns with our requirements.txt? It seems like an incompatibility in the PyTorch API.
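
For reference, you can print the exact installed version like this (plain PyTorch, nothing PowerInfer-specific):

import torch

# Report the installed PyTorch version so we can compare environments.
print(torch.__version__)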

@chenglimin
Author

Here are the contents of your requirements.txt:

numpy>=1.24.4
sentencepiece>=0.1.98
transformers>=4.33.2
-e ./gguf-py
-e ./powerinfer-py

Here are my package versions:

numpy 1.26.2
sentencepiece 0.1.99
transformers 4.36.2


@hodlen
Collaborator

hodlen commented Jan 27, 2024

I tested the code around the error, shown below, and I believe it is some kind of PyTorch incompatibility.

# Load and sort activation data for each layer
freq = torch.load(f"{activation_path}/activation_{i}.pt")
freq, _ = torch.sort(freq, descending=True)

We assume freq is a tensor, and it is in our environment with PyTorch 2.1.2. But if PyTorch loads freq as an OrderedDict, things can break. So, can you try with the same PyTorch version and see if the bug still exists?
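
If it helps to narrow this down, here is a minimal diagnostic sketch (not part of the repo; the activation path and layer count are just taken from the solver args in your log) that reports what torch.load actually returns for each activation file:

import torch

# Hypothetical diagnostic, not from the PowerInfer codebase.
# Path and layer count mirror the solver args printed in the log above.
activation_path = "./ReluFalcon-40B-PowerInfer-GGUF/activation"
num_layers = 60

for i in range(num_layers):
    obj = torch.load(f"{activation_path}/activation_{i}.pt", map_location="cpu")
    if torch.is_tensor(obj):
        print(f"activation_{i}.pt: Tensor, shape={tuple(obj.shape)}, dtype={obj.dtype}")
    else:
        # An OrderedDict here would reproduce the TypeError raised in solver.py
        print(f"activation_{i}.pt: unexpected type {type(obj).__name__}")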

@chenglimin
Author

My PyTorch version is also 2.1.2, as shown in the attached screenshot. And when I run with the LLaMA-13B model, this problem never appears.
[screenshot showing PyTorch version 2.1.2]


@hodlen
Collaborator

hodlen commented Jan 29, 2024

Hmm. Were the activation files corrupted or manually renamed at some point? They should be in the same format for all model architectures. I would suggest purging and re-downloading all of these files to make sure everything is clean and as expected.
