
[Bug] Crash in item(). #82

Open · tqtifnypmb opened this issue May 4, 2024 · 16 comments

@tqtifnypmb commented May 4, 2024

Code:

let tokenId: Int = logits[idx].item()

This occurs occasionally and is not 100% reproducible.

Crash Stacks:

0   libsystem_kernel.dylib        	0x00000001e9a86974 __pthread_kill + 8 (:-1)
1   libsystem_pthread.dylib       	0x00000001fd5090ec pthread_kill + 268 (pthread.c:1717)
2   libsystem_c.dylib             	0x00000001a94d7b80 abort + 180 (abort.c:118)
3   libc++abi.dylib               	0x00000001fd428ccc abort_message + 132 (abort_message.cpp:78)
4   libc++abi.dylib               	0x00000001fd418e68 demangling_terminate_handler() + 320 (cxa_default_handlers.cpp:72)
5   libobjc.A.dylib               	0x00000001994860cc _objc_terminate() + 160 (objc-exception.mm:499)
6   libc++abi.dylib               	0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib               	0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib               	0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9  Demo              	                 0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)
10  Demo              	                 0x00000001031cb78c mlx_reshape + 108 (ops.cpp:633)
11  Demo              	                 0x00000001036c2cd8 MLXArray.reshaped(_:stream:) + 256 (MLXArray+Ops.swift:1974)
12  Demo              	                 0x00000001036b2610 getItemND(src:operations:stream:) + 2404 (MLXArray+Indexing.swift:722)
13  Demo              	                 0x00000001036b5190 MLXArray.subscript.getter + 436

Any hints about what is happening behind the scenes or how to fix it?

Thanks

@davidkoski (Collaborator)

Sure, there may be more to the message -- it is throwing an exception here:

6   libc++abi.dylib               	0x00000001fd428090 std::__terminate(void (*)()) + 16 (cxa_handlers.cpp:59)
7   libc++abi.dylib               	0x00000001fd42b2e4 __cxxabiv1::failed_throw(__cxxabiv1::__cxa_exception*) + 88 (cxa_exception.cpp:152)
8   libc++abi.dylib               	0x00000001fd42b228 __cxa_throw + 308 (cxa_exception.cpp:283)
9   AI Transcription              	0x000000010360acf0 mlx::core::reshape(mlx::core::array const&, std::__1::vector<int, std::__1::allocator<int>>, std::__1::variant<std::__1::monostate, mlx::core::Stream, mlx::core::Device>) + 916 (ops.cpp:328)

That is this line:

https://github.com/ml-explore/mlx/blob/main/mlx/ops.cpp#L328

    msg << "[reshape] Cannot reshape array of size " << a.size()
        << " into shape " << shape << ".";
    throw std::invalid_argument(msg.str());

You are doing an array index operation, not a reshape, but there are reshapes inside the indexing code. It would be good to see what the exception message is -- if there is a crash log it might be in there.

If you get this in Xcode you can go to frame 9 and:

p msg.str()

(I think that is the right call, but it might be msg.c_str().)

@tqtifnypmb (Author)

Here are another two cases. They appear to be due to the same issue (a crash during eval).

I wonder if running it on the CPU would make a difference?


Case 1:

#0	0x00000001084a0428 in pthread_mutex_lock ()
#1	0x000000018c7f0300 in std::__1::mutex::lock() ()
#2	0x00000001035be5c8 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:41
#3	0x00000001035be504 in std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) at /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX14.4.sdk/usr/include/c++/v1/__mutex/unique_lock.h:40
#4	0x000000010425fd10 in void mlx::core::scheduler::StreamThread::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:61
#5	0x000000010425fce0 in void mlx::core::scheduler::Scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /gsebzzaktrijzzdiyqoncozizhfk/SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:150
#6	0x000000010424d908 in void mlx::core::scheduler::enqueue<mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3>(mlx::core::Stream const&, mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool)::$_3&&) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/scheduler.h:157
#7	0x000000010424d2cc in mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:165
#8	0x000000010424da74 in mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/transforms.cpp:176
#9	0x00000001035c0034 in mlx::core::array::eval() at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/mlx/mlx/array.cpp:101
#10	0x0000000103554b84 in ::mlx_array_eval(mlx_array) at /SourcePackages/checkouts/mlx-swift/Source/Cmlx/include/mlx/c/array.cpp:189
#11	0x00000001042f3578 in MLXArray.asArray<τ_0_0>(_:) at /SourcePackages/checkouts/mlx-swift/Source/MLX/MLXArray.swift:264

Case 2:

0   libsystem_pthread.dylib       	0x00000001ffd2f400 pthread_mutex_lock + 12 (pthread_mutex.c:1709)
1   libc++.1.dylib                	0x00000001a5e3ad20 std::__1::mutex::lock() + 16 (mutex.cpp:38)
2   Demo              	0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:41)
3   Demo              	0x0000000100f9505c std::__1::unique_lock<std::__1::mutex>::unique_lock[abi:ue170006](std::__1::mutex&) + 4 (unique_lock.h:40)
4   Demo              	0x0000000100f9505c void mlx::core::scheduler::StreamThread::enqueue<std::__1::function<void ()>>(std::__1::function<void ()>&&) + 28 (scheduler.h:61)
5   Demo              	0x0000000100f958b0 void mlx::core::scheduler::Scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 16 (scheduler.h:150)
6   Demo              	0x0000000100f958b0 void mlx::core::scheduler::enqueue<std::__1::function<void ()>>(mlx::core::Stream const&, std::__1::function<void ()>&&) + 20 (scheduler.h:157)
7   Demo              	0x0000000100f958b0 mlx::core::eval_impl(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>, bool) + 1788 (transforms.cpp:144)
8   Demo              	0x0000000100f95e54 mlx::core::eval(std::__1::vector<mlx::core::array, std::__1::allocator<mlx::core::array>>) + 56 (transforms.cpp:176)
9   Demo             	0x0000000100b0567c mlx::core::array::eval() + 196 (array.cpp:101)
10  Demo              	0x0000000101003de8 MLXArray.item<A>(_:) + 236 (MLXArray.swift:169)
11  Demo              	0x00000001009b8408 specialized MLXArray.item<A>() + 24 (<compiler-generated>:0)

@davidkoski (Collaborator)

These look a little different -- I think it might be an assertion failure inside the pthread mutex. Are you using threads or async?

@davidkoski (Collaborator)

If you have a snippet that can reproduce the error I can investigate it.

@tqtifnypmb (Author) commented May 5, 2024

All three cases above occur on a thread.

I can't provide a meaningful snippet: I simply call item() and it occurs occasionally. I've found no pattern that reproduces it.

@davidkoski (Collaborator)

MLXArray is not thread safe -- you can't create one on one thread and evaluate it (e.g. by calling item()) on another. This includes loading a model and doing inference on different threads.

Try it without threads and see how it fares, if you can.
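
For illustration, a minimal sketch of one way to do that while keeping background work: funnel every MLX call through a single serial queue so no two threads ever touch an array concurrently. The queue label and decodeStep function here are hypothetical, not from this thread:

import Foundation
import MLX

// Hypothetical serial queue: all MLX work is serialized through it, so an
// array is never built on one thread while being evaluated on another.
let mlxQueue = DispatchQueue(label: "com.example.mlx")

func decodeStep(model: (MLXArray) -> MLXArray, x: MLXArray, idx: Int) -> Int {
    mlxQueue.sync {
        let logits = model(x)              // graph built under the queue
        return logits[idx].item(Int.self)  // evaluated under the same queue
    }
}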

@davidkoski (Collaborator)

The exception being thrown is:

            throw std::invalid_argument(
                "[eval] Attempting to eval an array without a primitive.");

I can reproduce that in Python -- it looks like an issue in mlx.core.

@davidkoski (Collaborator)

Filed ml-explore/mlx#1083.

@davidkoski (Collaborator)

Strange, there used to be a snippet of code in here! Anyway, per the discussion in the other issue, this is the problem:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0].item(Float.self)
    return MLX.full([1, 100], values: MLXArray(logits))
}

The call to item() inside this function is illegal inside a compile (and it doesn't give a clear message) -- compile requires that the code produce an unbroken graph. Change it like this:

func foo(x: MLXArray) -> MLXArray {
    let y = MLX.softmax(x, axis: -1)
    let logits = y[0,0,0]
    return MLX.full([1, 100], values: logits)
}
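
For context, a sketch of how foo would typically be wrapped in compile (using mlx-swift's compile for ([MLXArray]) -> [MLXArray] functions; the zeros shape is made up for illustration). compile traces the function symbolically, and item() forces a concrete host value mid-trace, which is what breaks the graph:

// compile traces the closure into a graph; every value inside must stay symbolic.
let compiledFoo = MLX.compile { (inputs: [MLXArray]) -> [MLXArray] in
    [foo(x: inputs[0])]
}

let out = compiledFoo([MLX.zeros([1, 1, 10])])[0]  // works with the fixed foo above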

@tqtifnypmb (Author)

Oops! I deleted the snippet because it didn't accurately reflect my question.

@tqtifnypmb (Author)

More crash info:

Exception Type:        EXC_BAD_ACCESS (SIGSEGV)
Exception Codes:       KERN_INVALID_ADDRESS at 0x00005fff69a438b0
Exception Codes:       0x0000000000000001, 0x00005fff69a438b0

Termination Reason:    Namespace SIGNAL, Code 11 Segmentation fault: 11
Terminating Process:   exc handler [3477]

VM Region Info: 0x5fff69a438b0 is not in any region.  Bytes after previous region: 105069557332145  Bytes before following region: 2522597200
      REGION TYPE                    START - END         [ VSIZE] PRT/MAX SHRMOD  REGION DETAIL
      commpage (reserved)        1000000000-7000000000   [384.0G] ---/--- SM=NUL  ...(unallocated)
--->  GAP OF 0x5f9000000000 BYTES
      MALLOC_NANO              600000000000-600020000000 [512.0M] rw-/rwx SM=PRV  

@davidkoski (Collaborator)

OK, that is trying to access memory that is not mapped. The most likely cause is out of bounds indexing: the underlying mlx::core::array does not do bounds checking.

What does idx look like here and what is the shape of your array?
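
In the meantime, a cheap way to test the out-of-bounds hypothesis is a guard around the original call site; this checkedItem wrapper is hypothetical, since the subscript itself won't check:

import MLX

// Hypothetical wrapper: fail loudly on a bad index instead of reading
// unmapped memory through the unchecked mlx::core::array access.
func checkedItem(_ logits: MLXArray, _ idx: Int) -> Int {
    precondition(idx >= 0 && idx < logits.dim(0),
                 "idx \(idx) out of bounds for shape \(logits.shape)")
    return logits[idx].item(Int.self)
}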

@tqtifnypmb (Author) commented May 13, 2024

I am pretty sure that idx is logically within bounds (disregarding the possibility that MLX's array operations themselves produce incorrect results).

I think there are some data races, and changing the default device seems to trigger some synchronization mechanism. After testing for a few days, I have found the following workaround (at least so far, there have been no further crashes):

// run on the gpu
let logits = model(x)

MLX.Device.setDefault(device: .cpu)

// all array operations on `logits` that call `eval` under the hood now run on the cpu

@davidkoski (Collaborator)

Well, running everything on the CPU probably isn't getting all the advantages of mlx :-) If you ever have a repro case I would be happy to investigate further.

If not, are we OK to close this?

@awni (Member) commented May 20, 2024

It might be an issue with core as well -- hard to tell because there is no code to repro here. But @tqtifnypmb, if you are able to provide the Swift repro code (or similar code in Python), that would be useful.

@tqtifnypmb (Author)

I would very much like to help, and I've tried to extract a minimal repro. In the process I found that the logic is the same as whisper-mlx, so reproducing with whisper should behave the same way. It's worth noting that most of the crashes come from iOS; the frequency on macOS is very low.
