CUDA cannot be found on an A100-80G #182

Open
bulaikexiansheng opened this issue Apr 24, 2024 · 2 comments
Labels
question Further information is requested

Comments

@bulaikexiansheng

Hi, I'm trying to reproduce PowerInfer on an A100-80G machine, but model loading fails with `error loading model: CUDA is not loaded`. It looks as if the GPU on the machine is not being detected? The full run log and build log are in my comment below.

CUDA version on the machine: 12.4
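
For reference, these are the standard commands to check the toolkit and driver versions:

```bash
nvcc --version   # CUDA compiler / toolkit version
nvidia-smi       # driver version and visible GPUs
```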


@bulaikexiansheng bulaikexiansheng added the bug-unconfirmed Unconfirmed bugs label Apr 24, 2024
@bulaikexiansheng
Author

Apologies, the logs I provided were messy; the version below should be clearer:

```
(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ ./build/bin/main -m /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --ignore-eos

Log start
main: build = 1578 (906830b)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1713951199
llama_model_loader: loaded meta data with 23 key-value pairs and 883 tensors from /home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight q4_0 [ 8192, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 2: blk.0.attn_k.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 3: blk.0.attn_v.weight q4_0 [ 8192, 1024, 1, 1 ]
llama_model_loader: - tensor 4: blk.0.attn_output.weight q4_0 [ 8192, 8192, 1, 1 ]
llama_model_loader: - tensor 5: blk.0.ffn_gate.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 6: blk.0.ffn_up.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 7: blk.0.ffn_down_t.weight q4_0 [ 8192, 28672, 1, 1 ]
llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 8192, 1, 1, 1 ]
llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 8192, 1, 1, 1 ]
......
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv 22: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 74.98 B
llm_load_print_meta: model size = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
error loading model: CUDA is not loaded
llama_load_model_from_file_with_context: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/turbo/models/ReluLLaMA-70B/llama-70b-relu.q4.powerinfer.gguf'
main: error: unable to load model
```

```
(base) turbo@sma100-02:/home/turbo/projects/PowerInfer$ cmake -S . -B build -DLLAMA_CUBLAS=ON

-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Looking for pthread.h
-- Looking for pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
-- Found CUDAToolkit: /usr/local/cuda/include (found version "12.4.99")
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.5.119
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Using CUDA architectures: 52;61;70
GNU ld (GNU Binutils for Ubuntu) 2.38
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Configuring done
-- Generating done
-- Build files have been written to: /home/turbo/projects/PowerInfer/build
```

@hodlen
Collaborator

hodlen commented Apr 25, 2024

The CMake output shows that a CUDA build toolchain is present in your environment, but the runtime error "CUDA is not loaded" suggests the driver may not have loaded properly. Try running nvidia-smi in the same environment to see whether the GPU is detected; if that also fails, you can be sure it is a driver problem.
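
For example:

```bash
# If the driver is loaded, this lists the A100 along with driver/CUDA versions;
# if it fails to initialize, the driver is the likely culprit.
nvidia-smi

# Optionally, confirm that the NVIDIA kernel modules are present:
lsmod | grep nvidia
```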

If the GPU hardware is healthy and the driver is installed correctly, this kind of transient issue can usually be resolved by rebooting the machine or restarting the container.
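
If the driver turns out to be fine, the configure log above may also be worth a second look: CMake picked up the CUDA 12.4 toolkit headers from /usr/local/cuda but compiled with /usr/bin/nvcc, which identifies as CUDA 11.5.119, and the selected architectures (52;61;70) do not include the A100's compute capability 8.0. A hedged sketch of a clean reconfigure that pins the matching nvcc and the A100 architecture (assuming CUDA 12.4 is installed under /usr/local/cuda):

```bash
rm -rf build
cmake -S . -B build -DLLAMA_CUBLAS=ON \
    -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
    -DCMAKE_CUDA_ARCHITECTURES=80
cmake --build build --config Release
```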

@hodlen hodlen added question Further information is requested and removed bug-unconfirmed Unconfirmed bugs labels Apr 25, 2024