```julia
using Metal

function local_kernel(a, expected::AbstractArray{T}, desired::T) where T
    i = thread_position_in_grid_1d()
    b = MtlThreadGroupArray(T, 16)
    # b[i] = a[i]
    Metal.atomic_store_explicit(pointer(b, i), Metal.atomic_load_explicit(pointer(a, i)))
    while Metal.atomic_compare_exchange_weak_explicit(pointer(b, i), expected[i], desired) != expected[i]
        # keep on trying
    end
    # a[i] = b[i]
    Metal.atomic_store_explicit(pointer(a, i), Metal.atomic_load_explicit(pointer(b, i)))
    return
end

function main(; T=Int32, n=16)
    a = Metal.zeros(T, n)
    expected = copy(a)
    desired = T(42)
    @metal threads=n local_kernel(a, expected, desired)
    Array(a)
end
```
Note how the loads and stores that initialize the threadgroup memory and copy it back to global memory need to be atomic for this example to work, even though every thread has its own dedicated memory address to act upon. Demoting those operations to regular array operations results in the final array containing all zeros.
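For intuition, the compare-exchange loop in the kernel can be sketched on the CPU with `Base.Threads` atomics (a hypothetical analogue, not Metal's device intrinsics; `cas_loop!` and the `Atomic` cells standing in for threadgroup memory are illustrative):

```julia
using Base.Threads

# CPU-side sketch of the kernel's compare-exchange loop, using Base.Threads
# atomics instead of Metal's device intrinsics. Names here are illustrative.
function cas_loop!(slots::Vector{Atomic{Int32}}, expected::Vector{Int32}, desired::Int32)
    for i in eachindex(slots)
        # atomic_cas! returns the old value; retry until it matched `expected[i]`
        while atomic_cas!(slots[i], expected[i], desired) != expected[i]
            # keep on trying
        end
    end
    return slots
end

slots = [Atomic{Int32}(0) for _ in 1:16]
cas_loop!(slots, zeros(Int32, 16), Int32(42))
```

Since every slot starts at zero and `expected` is all zeros, the CAS succeeds on the first attempt and every slot ends up holding 42, mirroring what `main()` returns.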
This smells like an upstream bug, especially because the above pattern is impossible to replicate in Metal C (where `atomic_int` is used as the element type, promoting all operations to atomic).
Hmm, this seems to apply to other atomics as well:
```julia
using Metal

function local_kernel(f, a, val::T) where T
    i = thread_position_in_grid_1d()
    b = MtlThreadGroupArray(T, 128)
    # b[i] = a[i]
    val = Metal.atomic_load_explicit(pointer(a, i))
    Metal.atomic_store_explicit(pointer(b, i), val)
    f(pointer(b, i), val)
    # a[i] = b[i]
    val = Metal.atomic_load_explicit(pointer(b, i))
    Metal.atomic_store_explicit(pointer(a, i), val)
    return
end

function main(; T=Int32, n=16)
    a = ones(T, n)
    b = MtlArray(a)
    val = one(T)
    @metal threads=n local_kernel(Metal.atomic_fetch_add_explicit, b, val)
    @show a .+ val
    @show Array(b)
    return
end
```
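The same load/fetch-add/store sequence can be mimicked on the CPU with `Base.Threads` atomics (a hypothetical analogue; `fetch_add_demo` is an illustrative name, and `Atomic` cells stand in for the threadgroup array):

```julia
using Base.Threads

# Hypothetical CPU analogue of the fetch-add kernel: each "thread" loads its
# global value, atomically adds `val` into its own slot, and stores it back.
function fetch_add_demo(a::Vector{Int32}, val::Int32)
    b = [Atomic{Int32}(0) for _ in eachindex(a)]  # stand-in for threadgroup memory
    for i in eachindex(a)
        atomic_xchg!(b[i], a[i])  # atomic store of the loaded global value
        atomic_add!(b[i], val)    # the atomic_fetch_add_explicit step
        a[i] = b[i][]             # copy back to "global" memory
    end
    return a
end

fetch_add_demo(ones(Int32, 16), Int32(1))
```

Starting from all ones and adding one, each element should come out as 2, which is what `Array(b)` is expected to show in the kernel version.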
maleadt changed the title from "Atomic compare-exchange with threadgroup memory requires additional atomics" to "Threadgroup atomics require all-atomic operation" on Jun 26, 2023
> Note how the load and stores that initialize the threadgroup memory and copy it back to global memory need to be atomics for this example to work, even though every thread has its own dedicated memory address to act upon. Demoting those operations to regular array operations results in the final array containing all zeros.
I think this is a general truth (and why we don't have atomics for arrays yet): if you mix atomic operations with non-atomic operations, you will get issues.

But I would have expected that the loads and stores to thread-local memory, i.e. `b[i]`, could be non-atomic, and only the loads and stores to global memory would need atomics.
> If you mix atomic operations with non-atomic operations you will get issues.
Why is that? Every thread is accessing its own memory locations, so why would mixing atomics with regular loads and stores not work? Note that removing atomics altogether works fine here.
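For comparison, the disjoint-index pattern is race-free without any atomics on the CPU, since each task only touches its own element (a sketch using `Threads.@threads`; `per_slot!` is an illustrative name):

```julia
using Base.Threads

# Each task increments only its own slot; the indices are disjoint, so plain
# (non-atomic) loads and stores involve no data race.
function per_slot!(a::Vector{Int32})
    @threads for i in eachindex(a)
        a[i] += one(Int32)  # plain load/store; safe because indices are disjoint
    end
    return a
end

per_slot!(zeros(Int32, 16))
```

This is the behavior one would naively expect from the threadgroup-memory kernel as well, which is what makes the all-atomic requirement surprising.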