
[ROCm] Is AMD ROCm support available in the near future? #158

Open
Orion-zhen opened this issue Feb 29, 2024 · 3 comments
Labels: enhancement (New feature or request)

Comments


Orion-zhen commented Feb 29, 2024

Prerequisites

Before submitting your issue, please ensure the following:

  • I am running the latest version of PowerInfer. Development is rapid, and as of now, there are no tagged versions.
  • I have carefully read and followed the instructions in the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).

Feature Description

To compile and run a model on the AMD ROCm/HIP backend, just as llama.cpp does.

Motivation

Given that a large number of users own AMD cards, it would be quite beneficial to release a ROCm version of PowerInfer.

Meanwhile, llama.cpp can already be compiled with hipBLAS, so PowerInfer, which is inspired by llama.cpp, should be able to achieve this easily.

Also, ZLUDA is a work in progress; it may eventually let us reuse the CUDA libraries on an AMD card :)

Possible Implementation

  1. hipBLAS: that is how llama.cpp runs on AMD cards (see the sketch below)
  2. ZLUDA: could let AMD cards run CUDA binaries
  3. other methods that I don't know of so far
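
For reference, llama.cpp's hipBLAS build looks roughly like the following (a sketch adapted from its README; the ROCm install path and the gfx target are assumptions that vary by system and GPU):

# assumes ROCm is installed under /opt/rocm; pick the gfx target of your own card
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ \
  cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1030
cmake --build build --config Release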
Orion-zhen added the enhancement (New feature or request) label on Feb 29, 2024
hodlen (Collaborator) commented Apr 6, 2024

Supporting ROCm would no doubt help us reach broader communities, but it is currently not a prioritized item for us. #139 seems to fix the compilation trouble on ROCm by adding hipBLAS headers, though we haven't tested its functionality.

We would appreciate any feedback on compiling and running under ROCm.

Orion-zhen (Author) commented Apr 8, 2024

Trying to deploy

Following the README, I got the error messages below:

$ cmake -S . -B build -DLLAMA_HIPBLAS=ON
-- The C compiler identification is GNU 13.2.1
-- The CXX compiler identification is GNU 13.2.1
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Warning at CMakeLists.txt:353 (message):
  Only LLVM is supported for HIP, hint: CC=/opt/rocm/llvm/bin/clang


CMake Warning at CMakeLists.txt:356 (message):
  Only LLVM is supported for HIP, hint: CXX=/opt/rocm/llvm/bin/clang++


-- HIP and hipBLAS found
GNU ld (GNU Binutils) 2.42.0
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done (0.4s)
-- Generating done (0.0s)
-- Build files have been written to: /home/orion/ai/powerinfer/build
$ cmake --build build --config Release
[  1%] Building CXX object CMakeFiles/ggml-rocm.dir/ggml-cuda.cu.o
c++: error: language hip not recognized
c++: error: language hip not recognized
make[2]: *** [CMakeFiles/ggml-rocm.dir/build.make:76:CMakeFiles/ggml-rocm.dir/ggml-cuda.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:626:CMakeFiles/ggml-rocm.dir/all] Error 2
make: *** [Makefile:146:all] Error 2

It seems that hipBLAS can be configured but not built.

My system setup:

  • OS: Arch Linux
  • python: 3.11.8
  • GPU: AMD RX 7900XTX
  • ROCm: 6.0.2
  • PyTorch: 2.2.2+rocm5.7
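
The two CMake warnings above hint at a likely fix: configure with ROCm's LLVM compilers instead of the system GCC, for example (assuming a default /opt/rocm installation):

rm -rf build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON
cmake --build build --config Release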

freelulul (Contributor) commented May 11, 2024

PowerInfer can now run on AMD devices with ROCm; the following is the debug log on an AMD RX 7900 XTX for your reference. The correct compilation instructions have been updated in the README.

Environment:

~$ python3 -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.3.0+rocm5.7
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 5.7.31921-d1770ee1b

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 17.0.0 (https://github.com/RadeonOpenCompute/llvm-project roc-6.0.2 24012 af27734ed982b52a9f1be0f035ac91726fc697e4)
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.2.0-26-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 5.7.31921
MIOpen runtime version: 2.20.0
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   48 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen 7 7800X3D 8-Core Processor
CPU family:                      25
Model:                           97
Thread(s) per core:              2
Core(s) per socket:              8
Socket(s):                       1
Stepping:                        2
Frequency boost:                 enabled
CPU max MHz:                     5049.0229
CPU min MHz:                     3000.0000
BogoMIPS:                        8399.69
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                  AMD-V
L1d cache:                       256 KiB (8 instances)
L1i cache:                       256 KiB (8 instances)
L2 cache:                        8 MiB (8 instances)
L3 cache:                        96 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Not affected
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-triton-rocm==2.3.0
[pip3] torch==2.3.0+rocm5.7
[pip3] torchaudio==2.3.0+rocm5.7
[pip3] torchvision==0.18.0+rocm5.7

Debug Log

1. Failure to generate GPU split

.............................................
invoking powerinfer Python module to generate gpu split for 59036.48 MiB of VRAM
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy-py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xxx/.local/lib/python3.8/site-packages/powerinfer/__main__.py", line 5, in <module>
    from .export_split import export_split
  File "/home/xxx/.local/lib/python3.8/site-packages/powerinfer/export_split.py", line 50, in <module>
    def export_split(activations_path: str, output_path: str, solved_list: list[int], vram_capacity: int):
TypeError: 'type' object is not subscriptable
llm_load_gpu_split_with_budget: error: failed to generate gpu split
llm_load_gpu_split: error: failed to generate gpu split, an empty one will be used
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (6.43 ms)
llm_load_gpu_split: offloaded 0.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  = 256.00 MB
llama_build_graph: non-view tensors processed: 548/836

The initial Python version was 3.8.x, which does not satisfy the prerequisites.

Solution: upgrade Python (3.8 cannot subscript list[int] in annotations; use 3.9 or newer) or run pip install -r requirements.txt
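
For instance, a quick sanity check before regenerating the GPU split (illustrative commands; make sure pip installs into the same interpreter you actually run):

python3 --version                           # the traceback above comes from 3.8, which cannot subscript list[int]
python3 -m pip install -r requirements.txt  # reinstall the powerinfer solver dependencies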

2. Segmentation fault (core dumped)

llama_model_loader: - tensor   58:                   blk.29.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   59:                blk.29.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [ 11008,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 18367365120
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (0.75 ms)
offload_ffn_split: applying augmentation to model - please wait ...
Segmentation fault (core dumped)

This bug has now been fixed; please refer to #139.

Attempted solution: change cudaMemcpyToSymbol(dev_sparse_threshold, &sparse_pred_threshold, sizeof(float)) to cudaMemcpyToSymbol(&dev_sparse_threshold, &sparse_pred_threshold, sizeof(float)), but then bug 3 occurs.

3. CUDA error 13: invalid device symbol

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.91 ms)
offload_ffn_split: applying augmentation to model - please wait ...

CUDA error 13 at /var/lib/jenkins/PowerInfer/ggml-cuda.cu:9440: invalid device symbol
current device: 0

This bug has now been fixed; please refer to #139.

4. CUDA error 303: shared object initialization failed

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.81 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (1764.28 ms)
llm_load_gpu_split: offloaded 1980.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)

CUDA error 303 at /var/lib/jenkins/PowerInfer/ggml-cuda.cu:7877: shared object initialization failed
current device: 0

None of the kernel functions can be launched correctly; they all fail with CUDA error 303.

Solution: add the additional compilation option -DAMDGPU_TARGETS=gfx1100 (replace gfx1100 with your card's architecture, which you can get from rocminfo; see the example below).
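
For example, on my setup (gfx1100 is specific to the RX 7900 XTX; query your own target with rocminfo):

rocminfo | grep -o "gfx[0-9a-f]*" | sort -u
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100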

5. Segmentation fault (core dumped)

llama_model_loader: - tensor   60:                   blk.30.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   61:                blk.30.gpu_bucket i32      [  1792,     1,     1,     1 ]
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  2048,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 2093465600
load_gpu_idx_for_model: applying gpu_idx adapter from './ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (1.76 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (1744.70 ms)
llm_load_gpu_split: offloaded 1980.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 256.00 MB
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)
Segmentation fault (core dumped)

I added some markers to the code and can confirm that the program finishes llama_new_context_with_model but never reaches const char * llama_print_system_info(void). Apart from that, all CUDA functions execute correctly.

llama_new_context_with_model: compute buffer total size = 36.25 MB
llama_new_context_with_model: VRAM scratch buffer: 34.69 MB
llama_new_context_with_model: total VRAM used: 8210.20 MB (model: 5939.52 MB, context: 290.69 MB)
111
222
Operation: ggml_cuda_op_rms_norm
Operation: ggml_cuda_op_mul
Operation: ggml_cuda_op_rope
Operation: ggml_cuda_op_rope
Operation: ggml_cuda_op_scale
Operation: ggml_cuda_op_add
add_finish
Operation: ggml_cuda_op_soft_max
Operation: ggml_cuda_op_add
add_finish
Operation: ggml_cuda_op_rms_norm
Operation: ggml_cuda_op_mul
Operation: ggml_cuda_op_relu
Operation: ggml_cuda_op_add
add_finish
Segmentation fault (core dumped)

Solution: add the additional command-line parameter --reset-gpu-index when running (to avoid any stale GPU-index cache).

6. Finish

# How to run it correctly
rm -rf build
CC=/opt/rocm/llvm/bin/clang CXX=/opt/rocm/llvm/bin/clang++ cmake -S . -B build -DLLAMA_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100
cmake --build build --config Release -j 24
./build/bin/main -m ./ReluLLaMA-7B/llama-7b-relu.powerinfer.gguf -n 128 -p "Once upon a time" --ignore-eos --seed 0 --top-k 1 --reset-gpu-index
