
Segfault from /embedding endpoint #404

Closed

k8si opened this issue May 7, 2024 · 0 comments

k8si (Contributor) commented May 7, 2024

Getting a segfault from the server.cpp /embedding endpoint when using e5-mistral-7b-instruct-f16 and tinyllama-1.1b-chat-v1.0.Q5_K_M. Both of these models work when run with llama.cpp.

Tested with:

  • llamafile commit: a2d159e
  • llama.cpp commit: 947d3ad2
  • MacBook Pro with Apple M2 Pro (32 GB)
  • macOS 14.2.1
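
For comparison, the llama.cpp check was done roughly as follows (a minimal sketch; the "embedding" example binary and its -m/-p flags are assumed from llama.cpp around commit 947d3ad2, and the paths are illustrative):

# Build llama.cpp at the commit noted above and embed the same prompt with both GGUFs
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && git checkout 947d3ad2 && make -j8
./embedding -m ../models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf -p "Apples are red."
./embedding -m ../models/e5-mistral-7b-instruct-f16.gguf -p "Apples are red."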

Error trace:

...
Apple Metal GPU support successfully loaded
{"build":1500,"commit":"a30b324","function":"server_cli","level":"INFO","line":2853,"msg":"build info","tid":"1099517343568","timestamp":1715097635}
{"function":"server_cli","level":"INFO","line":2856,"msg":"system info","n_threads":8,"n_threads_batch":-1,"system_info":"AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | ","tid":"1099517343568","timestamp":1715097635,"total_threads":12}
llama_model_loader: loaded meta data with 23 key-value pairs and 201 tensors from models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = tinyllama_tinyllama-1.1b-chat-v1.0
llama_model_loader: - kv   2:                       llama.context_length u32              = 2048
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   4:                          llama.block_count u32              = 22
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 64
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 4
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  11:                          general.file_type u32              = 17
llama_model_loader: - kv  12:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  16:                      tokenizer.ggml.merges arr[str,61249]   = ["▁ t", "e r", "i n", "▁ a", "e n...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  20:            tokenizer.ggml.padding_token_id u32              = 2
llama_model_loader: - kv  21:                    tokenizer.chat_template str              = {% for message in messages %}\n{% if m...
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   45 tensors
llama_model_loader: - type q5_K:  135 tensors
llama_model_loader: - type q6_K:   21 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 22
llm_load_print_meta: n_rot            = 64
llm_load_print_meta: n_embd_head_k    = 64
llm_load_print_meta: n_embd_head_v    = 64
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: n_embd_k_gqa     = 256
llm_load_print_meta: n_embd_v_gqa     = 256
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 1B
llm_load_print_meta: model ftype      = Q5_K - Medium
llm_load_print_meta: model params     = 1.10 B
llm_load_print_meta: model size       = 745.11 MiB (5.68 BPW)
llm_load_print_meta: general.name     = tinyllama_tinyllama-1.1b-chat-v1.0
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 2 '</s>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.20 MiB
ggml_backend_metal_log_allocated_size: allocated buffer, size =   745.12 MiB, (  745.19 / 21845.34)
llm_load_tensors: offloading 22 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 23/23 layers to GPU
llm_load_tensors:      Metal buffer size =   745.12 MiB
llm_load_tensors:        CPU buffer size =    42.97 MiB
.........................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2 Pro
ggml_metal_init: picking default device: Apple M2 Pro
ggml_metal_init: default.metallib not found, loading from source
ggml_metal_init: GGML_METAL_PATH_RESOURCES = nil
ggml_metal_init: loading '/var/folders/xb/95_yf2vx3nld70rq_0zsbx9r0000gn/T/.llamafile/ggml-metal.metal'
ggml_metal_init: GPU name:   Apple M2 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 22906.50 MB
llama_kv_cache_init:      Metal KV buffer size =    11.00 MiB
llama_new_context_with_model: KV self size  =   11.00 MiB, K (f16):    5.50 MiB, V (f16):    5.50 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.13 MiB
llama_new_context_with_model:      Metal compute buffer size =    66.50 MiB
llama_new_context_with_model:        CPU compute buffer size =     5.01 MiB
llama_new_context_with_model: graph nodes  = 710
llama_new_context_with_model: graph splits = 2
{"function":"initialize","level":"INFO","line":489,"msg":"initializing slots","n_slots":1,"tid":"1099517343568","timestamp":1715097635}
{"function":"initialize","level":"INFO","line":498,"msg":"new slot","n_ctx_slot":512,"slot_id":0,"tid":"1099517343568","timestamp":1715097635}
{"function":"server_cli","level":"INFO","line":3074,"msg":"model loaded","tid":"1099517343568","timestamp":1715097635}

llama server listening at http://127.0.0.1:8080

{"function":"server_cli","hostname":"127.0.0.1","level":"INFO","line":3197,"msg":"HTTP server listening","port":"8080","tid":"1099517343568","timestamp":1715097635}
{"function":"update_slots","level":"INFO","line":1656,"msg":"all slots are idle and system prompt is empty, clear the KV cache","tid":"1099517343568","timestamp":1715097635}
{"function":"launch_slot_with_data","level":"INFO","line":879,"msg":"slot is processing task","slot_id":0,"task_id":0,"tid":"1099517343568","timestamp":1715097655}
{"function":"update_slots","level":"INFO","line":1907,"msg":"kv cache rm [p0, end)","p0":0,"slot_id":0,"task_id":0,"tid":"1099517343568","timestamp":1715097655}
llama_get_embeddings_ith: invalid embeddings id 0, reason: batch.logits[0] != true
{"function":"send_embedding","level":"ERR","line":1313,"msg":"failed to get embeddings","seq_id":0,"tid":"1099517343568","timestamp":1715097655,"token":1}
llama_get_embeddings_ith: invalid embeddings id 1, reason: batch.logits[1] != true
{"function":"send_embedding","level":"ERR","line":1313,"msg":"failed to get embeddings","seq_id":0,"tid":"1099517343568","timestamp":1715097655,"token":12113}
llama_get_embeddings_ith: invalid embeddings id 2, reason: batch.logits[2] != true
{"function":"send_embedding","level":"ERR","line":1313,"msg":"failed to get embeddings","seq_id":0,"tid":"1099517343568","timestamp":1715097655,"token":29879}
llama_get_embeddings_ith: invalid embeddings id 3, reason: batch.logits[3] != true
{"function":"send_embedding","level":"ERR","line":1313,"msg":"failed to get embeddings","seq_id":0,"tid":"1099517343568","timestamp":1715097655,"token":526}
llama_get_embeddings_ith: invalid embeddings id 4, reason: batch.logits[4] != true
{"function":"send_embedding","level":"ERR","line":1313,"msg":"failed to get embeddings","seq_id":0,"tid":"1099517343568","timestamp":1715097655,"token":2654}

error: Uncaught SIGSEGV (SEGV_ACCERR) on unknown pid 9396 tid 9396
 /Users/ksilverstein/dev/embs/mk-llamafiles/minimal/llamafile/bin/llamafile
 Operation not permitted
  Cosmopolitan 3.3.4 MODE=aarch64
 cosmoaddr2line /Users/ksilverstein/dev/embs/mk-llamafiles/minimal/llamafile/bin/llamafile 1000027acec 100000408c0 100000669c8 100000710b0 1000006e25c 1000002af2c 10000004a9c 100000140f8 10000000148
 faulting address is 0000000000000006
 0000000000000006 x0 0000000000000020 x8  0000000000000004 x16 0000000000000028 x24
 0000000000000006 x1 0000000000000000 x9  00000001eaf63220 x17 0000000000000006 x25
 0000000000000000 x2 0000000000000002 x10 0000000000000000 x18 aaaaaaaaaaaaaaab x26
 00000000000fb421 x3 0000100080529e80 x11 000010008052e1b0 x19 000000000000000c x27
 0000000000000006 x4 0000000000000290 x12 0000000000000006 x20 0000010000570a00 x28
 0000000000000400 x5 0000000000002263 x13 0000000000000006 x21 000000016b7b5770 x29
 0000000000000004 x6 0000000000000000 x14 0000000000000002 x22 0000010000015f04 x30
 0000010000570698 x7 00000000000000e5 x15 0000010000570a38 x23 000000016b7b5770 x31
 000000016b7b5770 sp 1000027acec pc strlen+20
 000000016b7b5770 sp 10000015f04 lr std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>::basic_string<std::nullptr_t>(char const*)+36
 000000016b7b5770 fp 100000408c0 lr nlohmann::json_abi_v3_11_3::detail::json_ref<nlohmann::json_abi_v3_11_3::basic_json<std::__1::map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long, unsigned long, double, std::__1::allocator, nlohmann::json_abi_v3_11_3::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void>>::json_ref<char const (&) [10], 0>(char const (&) [10])+140
 000000016b7b5790 fp 100000669c8 lr llama_server_context::send_embedding(llama_client_slot&)+1508
 000000016b7b5ad0 fp 100000710b0 lr llama_server_context::update_slots() (.isra.0)+11120
 000000016b7b5f10 fp 1000006e25c lr llama_server_queue::start_loop()+760
 000000016b7b6610 fp 1000002af2c lr server_cli(int, char**)+8456
 000000016b7b6a30 fp 10000004a9c lr main+464
 000000016b7b8120 fp 100000140f8 lr cosmo+1196
 000000016b7ba480 fp 10000000148 lr _start
curl: (52) Empty reply from server
./test.sh: line 30:  9396 Segmentation fault: 11  "${LLAMAFILE}" --model "${MODEL}" --embedding --server --nobrowser --port 8080 2>&1
./test.sh: line 33: kill: (9396) - No such process
error

Test script:

#!/bin/bash

# BUILD
build/download-cosmocc.sh .cosmocc/3.3.4 3.3.4 98e5b361c525603f5296351e0c11820fd25908b52fe1ce8ff394d66b1537a259
export PATH=.cosmocc/3.3.4/bin:${PATH}
make -j8
make install PREFIX=.
LLAMAFILE="bin/llamafile"

# DOWNLOAD MODEL
mkdir models
cd models
wget -nc https://huggingface.co/second-state/E5-Mistral-7B-Instruct-Embedding-GGUF/resolve/main/e5-mistral-7b-instruct-f16.gguf
wget -nc https://huggingface.co/TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF/resolve/main/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf
cd -

# RUN TEST
#MODEL="models/e5-mistral-7b-instruct-f16.gguf"
MODEL="models/tinyllama-1.1b-chat-v1.0.Q5_K_M.gguf"

"${LLAMAFILE}" --model "${MODEL}" --embedding --server --nobrowser --port 8080 2>&1 &
serverpid=$!
[ $? -ne 0 ] && exit 1
sleep 20

curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"content": "Apples are red."}' \
http://localhost:8080/embedding
err=$?

kill $serverpid

if [ "${err}" -ne 0 ]; then echo "error"; exit 1; fi
k8si added the bug label May 7, 2024
jart closed this as completed in 0e2845a May 8, 2024