Significantly different (and WRONG) inference results when GPU is enabled #7048

Closed
phishmaster opened this issue May 2, 2024 · 40 comments
Labels: bug (Something isn't working), Nvidia GPU (Issues specific to Nvidia GPUs)

@phishmaster

I am running llama_cpp version 0.2.68 on Ubuntu 22.04 LTS in a conda environment. Attached are two Jupyter notebooks with ONLY one line changed (use CPU vs. GPU). As you can see, under the exact same environmental conditions, switching between CPU and GPU gives vastly different answers, and the GPU output is completely wrong. I would appreciate some pointers on how to debug this.

The only significant difference between the two files is this one-liner:
#n_gpu_layers=-1, # Uncomment to use GPU acceleration

The model used was openhermes-2.5-mistral-7b.Q5_K_M.gguf

mistral_llama_large-gpu.pdf
mistral_llama_large-cpu.pdf

@JohannesGaessler
Collaborator

Are you getting correct results when you use the llama.cpp binaries directly without any Python bindings? If not, are you getting correct results when you compile with LLAMA_CUDA_FORCE_MMQ?
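For reference, a minimal sketch of such a test (the prompt and token count are placeholders; the build flags assume the Makefile build used elsewhere in this thread):

# rebuild the CUDA backend with MMQ forced on
make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make -j

# run the binary directly, bypassing the Python bindings
./main -m /path/to/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ngl 99 -n 64 \
    -p "What is the capital city of France?"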

@phishmaster
Author

phishmaster commented May 6, 2024 via email

@JohannesGaessler
Collaborator

Sorry, I wanted to check this issue but forgot. Did you download a ready-made GGUF file from Huggingface or did you convert it yourself? If it's the former, can you provide a link to the exact file you downloaded?

@phishmaster
Author

phishmaster commented May 8, 2024 via email

@JohannesGaessler
Collaborator

I cannot reproduce the issue on master. Can you re-download the model and check that this issue isn't due to a corrupted file?
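One quick way to rule out a corrupted download is to compare a checksum against the hash shown on the model's Hugging Face file page (a sketch; the expected hash has to be read off that page):

sha256sum openhermes-2.5-mistral-7b.Q5_K_M.gguf
# compare the printed hash with the one listed for the file on Hugging Face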

@phishmaster
Author

phishmaster commented May 8, 2024

Here is my git master

* 83330d8c - (HEAD -> master, origin/master, origin/HEAD) main : add --conversation / -cnv flag (#7108) (2 hours ago) [Dawid Potocki]
* 465263d0 - sgemm : AVX Q4_0 and Q8_0 (#6891) (2 hours ago) [Eve]
* 911b3900 - server : add_special option for tokenize endpoint (#7059) (4 hours ago) [Johan]

Clean and rebuild with:


LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 make

Re-downloaded the model (it also matches my previously downloaded file):


(base) hvu@Kaui:/data/DemoV1/Model4Demo$ md5sum openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf

f7faa7e315e81c3778aae55fcb5fc02c openhermes-2.5-mistral-7b.Q5_K_M_v1.gguf

(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
    -c 16192 -b 1024 -n 256 --keep 48 \
    --repeat_penalty 5.0 --color -i \
    -r "User:" -f prompts/chat-with-bob.txt
...
llm_load_print_meta: EOT token        = 32000 '<|im_end|>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.15 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4893.00 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1147.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 356

system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
...
ob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
Bob: The Capital City Of france Is Paris.<|im_end|>

===========================================
I tried various values of -ngl and they all seem to return garbage.

(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -f prompts/chat-with-bob.txt -ngl 40
Log start
main: build = 2817 (83330d8c)
...
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.44 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:        CPU buffer size =    85.94 MiB
llm_load_tensors:      CUDA0 buffer size =  2352.25 MiB
llm_load_tensors:      CUDA1 buffer size =  2454.81 MiB
...
ob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User:what is the capital city of France?
##################################################################################################################
##################################################################################################################
###################
------------------------------------------------------

Other values, e.g. -ngl 16:

...
llm_load_tensors: ggml ctx size =    0.44 MiB                                                                     
llm_load_tensors: offloading 16 repeating layers to GPU                                                           
llm_load_tensors: offloaded 16/33 layers to GPU                                                                   
llm_load_tensors:        CPU buffer size =  4893.00 MiB                                                           
llm_load_tensors:      CUDA0 buffer size =  1160.19 MiB                                                           
llm_load_tensors:      CUDA1 buffer size =  1192.06 MiB                                                           
.....
User:what is the capital city of France?
#<s>▅

$<s>#"
      "!<s> 
</s>

        $
!<s>!!"

       "
$#
                 
""<s>

Pretty much the same for -ngl 8

@JohannesGaessler
Collaborator

If I remember correctly, the output

#<s>▅

$<s>#"
      "!<s> 
</s>

        $
!<s>!!"

       "
$#
                 
""<s>

is effectively what you get when a NO_DEVICE_CODE isn't being correctly triggered. My intuition is that this issue is specific to a V100 GPU (and maybe also the CUDA version). If possible, please check the following (a shell sketch for the first and last items follows the list):

  • Results after running export CUDA_VISIBLE_DEVICES=0 (makes it so that only the first GPU is used).
  • Results on a non-V100 GPU.
  • Results using another model, ideally with n_vocab of 32000 (check the console log, Mistral base model should have that vocab size).
  • Results using CUDA 12.
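A minimal sketch of the first and last checks (the model path is the one used above; the prompt, token count, and -ngl value are placeholders):

# restrict llama.cpp to the first GPU only
export CUDA_VISIBLE_DEVICES=0
./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ngl 40 -n 64 \
    -p "What is the capital city of France?"

# confirm which CUDA toolkit the build picks up
nvcc --version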

@JohannesGaessler
Collaborator

Also: with SXM your V100s are effectively NVLinked, right? Can you check results when compiling with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999?

@phishmaster
Author

Remake with suggested flags

LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make

I think I already have CUDA 12

(llama_cpp_py39) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0

Now for the run

(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ export CUDA_VISIBLE_DEVICES=0                              
(pytorch_py39_cu11.8) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf     -c 16192 -b 1024 -n 256 --keep 48     --repeat_penalty 5.0 --color -i     -r "User:" -f prompts/chat-with-bob.txt -ngl 8      
...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 8 repeating layers to GPU
llm_load_tensors: offloaded 8/33 layers to GPU
llm_load_tensors:        CPU buffer size =  4893.00 MiB
llm_load_tensors:      CUDA0 buffer size =  1192.06 MiB
...................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 1024
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =  1536.00 MiB
llama_kv_cache_init:      CUDA0 KV buffer size =   512.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.12 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1147.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    40.01 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 268
...
User:what is the capital city of France?
 ▅</s>"!"!

          " 

</s></s▅"$#<s># "<s>


Same results

@JohannesGaessler
Collaborator

When checking the NVCC version your shell prefix is (llama_cpp_py39). When you actually run the model the prefix is (pytorch_py39_cu11.8). Are you sure that in both cases CUDA 12 is being used?

@JohannesGaessler
Collaborator

Also, I didn't mean to compile with both LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 and LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 at the same time; I meant to test each option individually. But if you still get incorrect results with CUDA_VISIBLE_DEVICES=0, that's not going to be the problem anyway.
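In other words, two separate builds, something like this (a sketch; note that CUDA_VISIBLE_DEVICES only matters at run time, not at build time):

# build 1: disable peer-access copies entirely
make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make -j

# build 2: effectively always allow peer-access copies
make clean
LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=999999 make -j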

@phishmaster
Author

Remake in base conda environment (default nvcc)

(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ LLAMA_CUDA=1 LLAMA_CUDA_FORCE_MMQ=1 CUDA_VISIBLE_DEVICES=0 LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 make
....
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes                                                                        
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no                                                                         
ggml_cuda_init: found 2 CUDA devices:                                                                             
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes                                                
  Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes                                                
llm_load_tensors: ggml ctx size =    0.44 MiB                                                                     
llm_load_tensors: offloading 8 repeating layers to GPU                                                            
llm_load_tensors: offloaded 8/33 layers to GPU                                                                    
llm_load_tensors:        CPU buffer size =  4893.00 MiB                                                           
llm_load_tensors:      CUDA0 buffer size =   588.06 MiB                                                           
llm_load_tensors:      CUDA1 buffer size =   604.00 MiB            
...

Same results

</s>What is the capital city of France?

 #"<s> </s>
<s>$
 #▅"#
!<s><s>/s><s$<s>" <s>"

llama_print_timings:        load time =    2860.74 ms
llama_print_timings:      sample time =      70.35 ms /   161 runs   (    0.44 ms per token,  2288.46 tokens per second)
llama_print_timings: prompt eval time =     360.26 ms /     9 tokens (   40.03 ms per token,    24.98 tokens per second)
llama_print_timings:        eval time =   12501.69 ms /   160 runs   (   78.14 ms per token,    12.80 tokens per second)
llama_print_timings:       total time =   13594.98 ms /   169 tokens
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Mar_28_02:18:24_PDT_2024
Cuda compilation tools, release 12.4, V12.4.131
Build cuda_12.4.r12.4/compiler.34097967_0
(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ 

@slaren
Collaborator

slaren commented May 8, 2024

Do you get any errors with compute-sanitizer? Run compute-sanitizer ./main -m ..

@phishmaster
Author

No errors, and compute-sanitizer didn't seem to help; however, it seems to work better if I use -ngl -1 instead of any specific value. Does that help?

@slaren
Collaborator

slaren commented May 8, 2024

-ngl -1 is effectively the same as -ngl 0.

@slaren
Collaborator

slaren commented May 8, 2024

What driver version are you using? Run nvidia-smi.

@phishmaster
Author

(base) hvu@Kaui:~/Demo_v1/llama.cpp_v3$ nvidia-smi 
Wed May  8 13:37:21 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:1D:00.0 Off |                  Off |
| N/A   40C    P0             54W /  300W |    1673MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off |   00000000:1E:00.0 Off |                  Off |
| N/A   46C    P0             54W /  300W |     623MiB /  16384MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1557      G   /usr/lib/xorg/Xorg                              4MiB |
|    0   N/A  N/A   2365880      C   ...envs/pytorch_py39_cu11.8/bin/python        584MiB |
|    0   N/A  N/A   2665195      C   ...envs/pytorch_py39_cu11.8/bin/python        498MiB |
|    0   N/A  N/A   2683103      C   ...envs/pytorch_py39_cu11.8/bin/python        584MiB |
|    1   N/A  N/A      1557      G   /usr/lib/xorg/Xorg                              4MiB |
|    1   N/A  N/A   2365880      C   ...envs/pytorch_py39_cu11.8/bin/python        308MiB |
|    1   N/A  N/A   2683103      C   ...envs/pytorch_py39_cu11.8/bin/python        308MiB |
+-----------------------------------------------------------------------------------------+

@JohannesGaessler
Collaborator

According to the HuggingFace repository, the model was made with llama.cpp revision 629f917. Do you get correct results with that revision?
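A sketch of checking out and rebuilding that revision (assuming the Makefile build; if I recall correctly, the CUDA flag in that older tree was still called LLAMA_CUBLAS rather than LLAMA_CUDA, which matches the ggml_init_cublas lines later in this thread):

git checkout 629f917
make clean
LLAMA_CUBLAS=1 make -j
./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -ngl 40 -n 64 \
    -p "What is the capital city of France?"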

@phishmaster
Author

Here is my current repo.

(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
warning: not compiled with GPU offload support, --n-gpu-layers option will be ignored                             
warning: see main README.md for information on enabling GPU BLAS support                                          
Log start                                                                                                         
main: build = 1477 (629f917c)                                                                                     
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu                                    
main: seed  = 1715190618                                                                                          
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))
...

llama_new_context_with_model: n_ctx      = 16192
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  = 2024.00 MB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 2141.88 MB

...

What is the capital city of France?
Paris. The French government has its seat in Paris, which also serves as an important center for culture and business within Europe due to being one if not THE most famous cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here!
Paris is located in the northern part of France, near Normandy and Brittany. It has a population over 2 million people making it one if not THE largest cities globally known across different industries such arts or fashion among others that contribute significantly towards global economy with their unique products like haute couture designs from Christian Dior house based right here!
Paris is often referred to as the “City of Love”, and for good reason. The city boasts some amazing architecture, including Notre-Dame Cathedral which has been featured in countless films; it’s also home base during fashion weeks where designers showcase their latest collections on runways around town!
Paris was founded by Celtic tribes known as Parisii back before Christ when they settled along the Seine River. Later conquered and ruled successively through Roman, Frankish (Merovingian), Carolingians

@phishmaster
Author

Recompiled with the right option to enable CUDA, and the same problem occurs:

(base) hvu@Kaui:~/Demo_v1/llama.cpp_629f917$ ./main -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf -c 16192 -b 1024 -n 256 --keep 48 --repeat_penalty 5.0 --color -i -r "User:" -p "What is the capital city of France?" -ngl 40
Log start                                                                                                         
main: build = 1477 (629f917c)                                                                                     
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu                                    
main: seed  = 1715191104                                                                                          
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf (version GGUF V3 (latest))

...
llm_load_tensors: ggml ctx size =    0.11 MB
llm_load_tensors: using CUDA for GPU acceleration
llm_load_tensors: mem required  =   86.05 MB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 4807.06 MB
...
What is the capital city of France?###############################################################################
##################################################################################################################
###############################################################

JohannesGaessler added the bug (Something isn't working) and Nvidia GPU (Issues specific to Nvidia GPUs) labels and removed the bug-unconfirmed label on May 8, 2024
@phishmaster
Author

Experimenting with various -ngl values, keeping it below ~20 seems to help for many models. At a certain point the output just flips from working to garbage. In this experiment, the model llama-2-13b/ggml-model-q5_K_M.bin works with -ngl at 22 or below (a quick way to bisect that threshold is sketched below).
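A rough sketch of such a bisection loop (the model path is the one from the experiment; the prompt, token count, and candidate values are placeholders):

for ngl in 16 20 21 22 23 24; do
    echo "=== -ngl $ngl ==="
    ./main -m /data/llama2/llama-2-13b/ggml-model-q5_K_M.bin -ngl $ngl -n 32 \
        -p "Q: Name all the planets in the solar system? A:"
done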

Here is an example of it working with the Python binding:

...
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.55 MiB
llm_load_tensors: offloading 16 repeating layers to GPU
llm_load_tensors: offloaded 16/41 layers to GPU
llm_load_tensors:        CPU buffer size =  5362.94 MiB
llm_load_tensors:      CUDA0 buffer size =  1700.92 MiB
llm_load_tensors:      CUDA1 buffer size =  1737.77 MiB
...
output = llm(
      "Q: Name all the planets in the solar system? A:", # Prompt
      max_tokens=256, # Generate up to 256 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
...
{'id': 'cmpl-6ddf6146-082d-40ed-9188-7acea7ee3f6d', 'object': 'text_completion', 'created': 1715260091, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: Mercury, Venus, Earth, Mars, Jupiter, Saturn, Uranus, Neptune', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 24, 'total_tokens': 39}}

With 16 layers offloaded, only about 2.2 GB x 2 GPUs of VRAM was used.
With 22 layers (3 GB x 2 GPUs), it still works.

At 23 layers, the answer comes back as garbage, with 3 GB x 2 GPUs of VRAM usage, which is well below the 16 GB x 2 GPUs available.

llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   yes
ggml_cuda_init: CUDA_USE_TENSOR_CORES: no
ggml_cuda_init: found 2 CUDA devices:
  Device 0: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
  Device 1: Tesla V100-SXM2-16GB, compute capability 7.0, VMM: yes
llm_load_tensors: ggml ctx size =    0.55 MiB
llm_load_tensors: offloading 23 repeating layers to GPU
llm_load_tensors: offloaded 23/41 layers to GPU
llm_load_tensors:        CPU buffer size =  3882.31 MiB
llm_load_tensors:      CUDA0 buffer size =  2545.23 MiB
llm_load_tensors:      CUDA1 buffer size =  2374.08 MiB
...
output = llm(
      "Q: Name all the planets in the solar system? A:", # Prompt
      max_tokens=256, # Generate up to 256 tokens, set to None to generate up to the end of the context window
      stop=["Q:", "\n"], # Stop generating just before the model would generate a new question
      echo=True # Echo the prompt back in the output
) # Generate a completion, can also call create_completion
print(output)
...
{'id': 'cmpl-9fc9e4b3-b052-401a-8984-67b72fbe5b31', 'object': 'text_completion', 'created': 1715260540, 'model': '/data/llama2/llama-2-13b/ggml-model-q5_K_M.bin', 'choices': [{'text': 'Q: Name all the planets in the solar system? A: 23,495', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 15, 'completion_tokens': 8, 'total_tokens': 23}}

If the user goes beyond the supported value, shouldn't there be a warning or error? Also, is there a deterministic way of knowing what value of -ngl will work versus when it will return garbage?

@slaren
Collaborator

slaren commented May 9, 2024

It should work with any value. You could try running the eval-callback example with CPU and with full offload and see what the first operation is that produces significantly different values.
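A sketch of that comparison (the prompt and log file names are placeholders; the model path follows the quantize command later in the thread):

# CPU reference
./eval-callback -m models/Mistral-7B-v0.1/ggml-model-q5_K_M.bin -ngl 0 --prompt hello > cpu.log 2>&1
# full offload
./eval-callback -m models/Mistral-7B-v0.1/ggml-model-q5_K_M.bin -ngl 99 --prompt hello > gpu.log 2>&1
# walk through the tensor dumps side by side, looking for the first divergence
diff cpu.log gpu.log | less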

@phishmaster
Author

phishmaster commented May 9, 2024

The eval-callback logs are attached:
eval-callback_gpu_f16.log - uses the f16 model instead of the quantized one
eval-callback.log - uses the Q5 model
eval-callback_cpu.log - CPU only (-ngl 0)

@slaren
Collaborator

slaren commented May 9, 2024

Sorry, eval-callback was broken and the numbers are useless. Please try again with #7184 or after it is merged.

@phishmaster
Author

phishmaster commented May 14, 2024

Pulled from master and reran eval-callback. Logs are attached:
eval-callback_q5km_gpu.log
eval-callback_q5km_cpu.log
eval-callback_f16_gpu.log

eval-callback_f16_cpu.log

@phishmaster
Author

Additionally, for version 4f02636: -ngl 20 and below seems to work fine; anything above that and the results are garbage.
..

<s> What is the capital city of France?
▅! "# 

<s</s>
##$</s>▅        ▅

</s>
        "       "</s> ▅ 

@slaren
Collaborator

slaren commented May 15, 2024

Can you share the full command line that you used to generate the eval-callback logs? What f16 model did you use?
It seems to break down at the end of layer 11. Did you try enabling ECC? (nvidia-smi -e 1)
Also try using the environment variable CUDA_LAUNCH_BLOCKING=1.

ggml_debug:          ffn_gate_par-11 = (f32)        MUL(ffn_silu-11{13824, 2, 1, 1}, ffn_up-11{13824, 2, 1, 1}}) = {13824, 2, 1, 1}
                                     [
                                      [
                                       [     -0.0039,       0.0012,      -0.0001, ...,       0.0058,      -0.0061,       0.0020],
                                       [     -0.0036,      -0.0129,      -0.0303, ...,       0.0178,      -0.0390,       0.0095],
                                      ],
                                     ]
                                     sum = -0.059607
ggml_debug:               ffn_out-11 = (f32)    MUL_MAT(blk.11.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-11{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
                                     [
                                      [
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                      ],
                                     ]
                                     sum = nan
ggml_debug:                 l_out-11 = (f32)        ADD(ffn_out-11{5120, 2, 1, 1}, ffn_inp-11{5120, 2, 1, 1}}) = {5120, 2, 1, 1}
                                     [
                                      [
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                      ],
                                     ]
                                     sum = nan
ggml_debug:                  norm-12 = (f32)   RMS_NORM(l_out-11{5120, 2, 1, 1}, }) = {5120, 2, 1, 1}
                                     [
                                      [
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                       [         nan,          nan,          nan, ...,          nan,          nan,          nan],
                                      ],
                                     ]
                                     sum = nan

@slaren
Collaborator

slaren commented May 15, 2024

My guess is that this is a hardware failure of some sort. Are you using a custom build for these V100s that might not provide enough power or cooling?

@phishmaster
Author

I highly doubt that power or cooling is the source. Mainly, that would imply a lot more randomness, versus the very deterministic failures at ~20 layers offloaded.

As for cooling, the server is housed in a rack and air conditioned.

Let me try enabling ECC and send results.

root@Kaui:/data# nvidia-smi -e 1
Enabled ECC support for GPU 00000000:1D:00.0.
Enabled ECC support for GPU 00000000:1E:00.0.
All done.
Reboot required

@slaren
Collaborator

slaren commented May 15, 2024

It's not likely to be an incompatibility with the GPU architecture; in fact, the ggml-ci tests every commit on master on a PCIe V100. Whatever the issue is, it seems to be specific to your system. I know that some people have been trying to use V100s in custom builds since they are relatively cheap when bought used, and if this is the case here, I think the most likely cause is some issue with the build.

@phishmaster
Author

We don't have anything "custom" that I am aware of; it's a pretty standard server with 2 V100 GPUs. As for software, it is Ubuntu 22.04 LTS with pre-built drivers.

@phishmaster
Author

I am also going to ask our IT folks to run a complete VRAM diagnostic.

@phishmaster
Author

My command line:
./eval-callback -m models/Mistral-7B-v01/ggml-model-f16.gguf --prompt hello --seed 1023
The f16 model is from the command:
python3 convert.py models/llama-2-13b --outtype f16
The Q5 version is from quantization:
./quantize models/Mistral-7B-v0.1/ggml-model-f16.gguf models/Mistral-7B-v0.1/ggml-model-q5_K_M.bin q5_K_M

@slaren
Collaborator

slaren commented May 15, 2024

I figured that you used -ngl 33 with llama-2-13b f16, and tried to reproduce the eval-callback result. The first significant difference I see is this:

ggml_debug:                ffn_out-7 = (f32)    MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
                                     [
                                      [
                                       [     -0.2148,      -0.5171,       0.0000, ...,       0.1681,      -0.2013,      -0.0247],
                                       [      0.0996,      -0.0289,      -0.1234, ...,      -0.1000,       0.1619,      -0.0583],
                                      ],
                                     ]
                                     sum = -0.838913


ggml_debug:                ffn_out-7 = (f32)    MUL_MAT(blk.7.ffn_down.weight{13824, 5120, 1, 1}, ffn_gate_par-7{13824, 2, 1, 1}}) = {5120, 2, 1, 1}
                                     [
                                      [
                                       [     -0.2151,      -0.5142,      -0.0017, ...,       0.1688,      -0.2012,      -0.0216],
                                       [      0.0499,      -0.0531,      -0.1272, ...,      -0.0833,       0.1644,      -0.0731],
                                      ],
                                     ]

With each matrix multiplication, the results get progressively worse, until eventually a matrix multiplication produces only NaN. I can only explain this as either data corruption or a hardware error.

@phishmaster
Author

Running pytorch-gpu-benchmark, we are at over 6 GB of VRAM usage (way more than the ~2 GB for the test above) and it's humming right along without any issues so far. I will post the final benchmark result when it completes.
Screen Shot 2024-05-15 at 5 34 10 PM

@phishmaster
Author

Maybe it is the C/C++ equivalent of torch.cuda.synchronize() that is missing somewhere?

@slaren
Collaborator

slaren commented May 15, 2024

You can test for that by using the CUDA_LAUNCH_BLOCKING=1 env variable.
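CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, which is the closest CUDA-level analogue to inserting a torch.cuda.synchronize() after each call. A sketch, reusing the earlier command line:

CUDA_LAUNCH_BLOCKING=1 ./main_gpu -m /data/DemoV1/Model4Demo/openhermes-2.5-mistral-7b.Q5_K_M.gguf \
    -ngl 40 -n 64 -p "What is the capital city of France?"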

@phishmaster
Author

phishmaster commented May 16, 2024

Is there a way to force single-GPU usage? The PyTorch benchmark seems to run fine on one GPU but has issues when both GPUs are used.

@phishmaster
Author

Using the CUDA_LAUNCH_BLOCKING=1 env variable yielded the same results.

@phishmaster
Author

Thank you for your help. After running GPU VRAM tests, we found that there may indeed be hardware issues.

...
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.54.15  Tue Mar  5 22:23:56 UTC 2024
[05/17/2024 13:59:52][Kaui][0]:ERROR: The unit serial number is 0320918003682
[05/17/2024 13:59:52][Kaui][0]:ERROR: (move_inv_read) 16504 errors found in block 3200
[05/17/2024 13:59:52][Kaui][0]:ERROR: the last 10 error addresses are:  0x76d48b5fcbec  0x76d48b5fcbf4  0x76d48b5fcbfc  0x76d48b1fe110 0x76d48b5fcb5c   0x76d48b5fcba4  0x76d48b5fcbac  0x76d48b5fcbb4  0x76d48b5fcbbc  0x76d48b5fcbe4
[05/17/2024 13:59:52][Kaui][0]:ERROR: 0th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 1th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 2th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 3th error, expected value=0x850f3f1b, current value=0x850f3f1f, diff=0x4 (second_read=0x850f3f1f, expect=0x850f3f1b, diff with expected value=0x4)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 4th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 5th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 6th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 7th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 8th error, expected value=0x850f3f1b, current value=0xd50a3f1b, diff=0x50050000 (second_read=0xd50a3f1b, expect=0x850f3f1b, diff with expected value=0x50050000)
[05/17/2024 13:59:52][Kaui][0]:ERROR: 9th error, expected value=0x850f3f1b, current value=0x850a3f1b, diff=0x50000 (second_read=0x850a3f1b, expect=0x850f3f1b, diff with expected value=0x50000)
