Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition: Test.@test_logs is not thread safe. #54439

Closed
NHDaly opened this issue May 10, 2024 · 1 comment · Fixed by #54497
Closed

Race Condition: Test.@test_logs is not thread safe. #54439

NHDaly opened this issue May 10, 2024 · 1 comment · Fixed by #54497
Labels
domain:multithreading Base.Threads and related functionality good first issue Indicates a good issue for first-time contributors to Julia kind:bug Indicates an unexpected problem or unintended behavior testsystem The unit testing framework and Test stdlib

Comments

@NHDaly
Copy link
Member

NHDaly commented May 10, 2024

EDIT: changed to a much simpler example that shows the issue.

The @test_logs macro is not thread safe, if the code under test uses multiple threads:

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto

This is definitely a bug, and should be fixed.

We see this showing up in our CI build farm from time to time.

The solution is almost certainly to make that vector thread-safe, by introducing a lock around it, and then copying the vector out before returning.

@NHDaly NHDaly added kind:bug Indicates an unexpected problem or unintended behavior domain:multithreading Base.Threads and related functionality good first issue Indicates a good issue for first-time contributors to Julia testsystem The unit testing framework and Test stdlib labels May 10, 2024
@NHDaly NHDaly changed the title Race Condition: Test.collect_test_logs() is not thread safe. Race Condition: Test.@test_logs is not thread safe. May 16, 2024
@NHDaly
Copy link
Member Author

NHDaly commented May 16, 2024

EDIT: I've edited the original message to the much more compelling example @test_logs example.
Moving the previous description here, as a secondary example:

Any Task spawned during a Test, which logs after Test.collect_test_logs() returns can trigger a data race, where the spawned task logs into the logs vector, while the caller is reading from that vector.

Here is a contrived example that I managed to get to reproduce the issue:

julia> function code_to_test()
           # do the work
           @info "starting"
           start_transaction()
           # ...
           @info "done"
           # Clean up
           end_transaction()
       end
code_to_test (generic function with 1 method)

julia> start_transaction() = nothing
start_transaction (generic function with 1 method)

julia> function end_transaction()
           # spawn a task to do some cleanup in the background, so we can
           # return to the client immediately
           Threads.@spawn begin
               # clean stuff up
               @info "Background Cleanup Task: All done cleaning up!"
               for _ in 1:1000
                   @info "now it's clean, goodbye"
               end
           end
       end
end_transaction (generic function with 1 method)

julia> using Test

julia> for _ in 1:10000
           let (logs,_) = Test.collect_test_logs() do
                   @info "hey"
                   code_to_test()
                   @warn "hi"
               end

               for log in logs
                   @test length(log.message) > 1
               end
           end
       end

ERROR: UndefRefError: access to undefined reference
Stacktrace:
 [1] getindex
   @ ./essentials.jl:13 [inlined]
 [2] iterate(A::Vector{LogRecord}, i::Int64)
   @ Base ./array.jl:945
 [3] top-level scope
   @ REPL[5]:10

This also sometimes segfaults, of course:

[66213] signal (11.2): Segmentation fault: 11
in expression starting at REPL[5]:1
signal (11) thread (1) sig_match_fast at /Users/nathandaly/builds/julia-RAI/src/gf.c:2914 [inlined]
signal (11) thread (1) jl_lookup_generic_ at /Users/nathandaly/builds/julia-RAI/src/gf.c:2994 [inlined]
signal (11) thread (1) ijl_apply_generic at /Users/nathandaly/builds/julia-RAI/src/gf.c:3073

We see this showing up in our CI build farm from time to time.

The solution is almost certainly to make that vector thread-safe, by introducing a lock around it, and then copying the vector out before returning.

NHDaly added a commit to RelationalAI/julia that referenced this issue May 22, 2024
Fixes JuliaLang#54439.

- Lock around concurrent accesses to .logs and .message_limits
- Copy the vector out of the logger at the end, to shield against dangling Tasks.

Before:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto
```
After:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
```
(no crash) :)
NHDaly added a commit that referenced this issue May 22, 2024
Fixes #54439.

- Lock around concurrent accesses to .logs and .message_limits
- Copy the vector out of the logger at the end, to shield against dangling Tasks.

Before:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto
```
After:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
```
(no crash) :)
NHDaly added a commit that referenced this issue May 22, 2024
Fixes #54439.

- Lock around concurrent accesses to .logs, .message_limits, and
   .shouldlog_args.
- Copy the vector out of the logger at the end, to shield against
   dangling Tasks.

Before:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto
```
After:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
```
(no crash) :)
NHDaly added a commit to RelationalAI/julia that referenced this issue May 22, 2024
Fixes JuliaLang#54439.

- Lock around concurrent accesses to .logs and .message_limits
- Copy the vector out of the logger at the end, to shield against dangling Tasks.

Before:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
julia+RAI(56155,0x1f5157ac0) malloc: double free for ptr 0x128248000
julia+RAI(56155,0x1f5157ac0) malloc: *** set a breakpoint in malloc_error_break to debug

[56155] signal (6): Abort trap: 6
in expression starting at REPL[8]:1

signal (6) thread (1) __pthread_kill at /usr/lib/system/libsystem_kernel.dylib (unknown line)
Allocations: 54370881 (Pool: 54363911; Big: 6970); GC: 119
[1]    56154 abort      julia -tauto
```
After:
```julia
julia> Threads.nthreads()
8

julia> function foo(n)
           @info "Doing foo with n=$n"
           @sync for i=1:n
               Threads.@Spawn @info "Iteration $i"
           end
           42
       end
foo (generic function with 1 method)

julia> for _ in 1:1000
           @test_logs (:info,"Doing foo with n=10000") match_mode=:any foo(10_000)
       end
```
(no crash) :)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:multithreading Base.Threads and related functionality good first issue Indicates a good issue for first-time contributors to Julia kind:bug Indicates an unexpected problem or unintended behavior testsystem The unit testing framework and Test stdlib
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant