
TaskRunner's concurrent performance exhibits a significant decline under poor network conditions #8388

Open
bringyou opened this issue Apr 26, 2024 · 19 comments
Labels
bug Bug in existing code

Comments

@bringyou

bringyou commented Apr 26, 2024

Our server needs to call some external APIs, and the stability of the API providers is quite poor and unreliable.
We found that during network fluctuations, OkHttp's internal TaskRunner causes many threads to keep waiting on a condition, eventually leaving every thread that attempted an HTTP call suspended. See the jstack screenshot below.
[jstack screenshot]
After debugging we found that all OkHttpClient instances use TaskRunner.INSTANCE by default, and nearly every method in TaskRunner acquires an instance-level lock.
In other words, all OkHttpClient instances with the default configuration compete for the same lock in some scenarios, especially when the external network is poor and unreliable.
Is there any plan to improve the concurrent performance of TaskRunner, e.g. using atomic values rather than ReentrantLock?

bringyou added the bug (Bug in existing code) label on Apr 26, 2024
@bringyou
Author

How to reproduce: call an external API that can only serve a limited number of TCP connections and takes a long time to establish each connection.
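
Below is a hypothetical repro sketch (not from the original report; the non-routable address, timeout, and call count are illustrative). It approximates a peer that is slow to complete the TCP handshake, while a few concurrent calls go through a default-configured client, which shares TaskRunner.INSTANCE:

import java.util.concurrent.TimeUnit
import kotlin.concurrent.thread
import okhttp3.OkHttpClient
import okhttp3.Request

fun main() {
  // Default-configured client: uses the shared TaskRunner.INSTANCE internally.
  val client = OkHttpClient.Builder()
    .connectTimeout(30, TimeUnit.SECONDS) // keep connects pending for a while
    .build()

  // Non-routable address: SYNs go unanswered, simulating a peer that is slow to ACK.
  val request = Request.Builder().url("http://10.255.255.1/").build()

  repeat(4) { i ->
    thread {
      runCatching { client.newCall(request).execute() }
        .onFailure { println("call $i failed: $it") }
    }
  }
  // While the connects hang, take a jstack of the process and look for
  // "OkHttp TaskRunner" threads parked on the shared ReentrantLock.
}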

@yschimke
Collaborator

yschimke commented Apr 26, 2024

Can you show a stack trace or the thread that holds the lock?

Or attach the entire jstack output

@yschimke
Collaborator

yschimke commented Apr 26, 2024

Are you sure those stack traces aren't just threads waiting for work to do?
https://product.hubspot.com/blog/understanding-thread-dumps

Can you point to other profiling showing the slowdown?

As far as locks vs. synchronized goes, I looked at this recently in #8366.

It seems most detailed investigations show that, for contended work, using Locks is the more performant option. So before we switch to synchronized, do you have links or benchmarks showing a benefit from synchronized?

@swankjesse
Member

This particular lock is not expected to be held for more than a few cycles – every acquire reads & writes a few lock-guarded fields and then releases the lock.

But under highly-concurrent use, the fact that we’re using locking at all is likely to cause some contention.

Can you tell me how many concurrent calls you are making?

If it’s not very many (say, 10 or fewer) and you’re seeing this problem, then it’s likely we’ve got a bug in our implementation. We’re holding the lock longer than we’re expecting to.

If you’re doing lots of concurrent calls and this lock is a bottleneck, then I think we should invest in finding a lock-free alternative to TaskRunner, either by writing one or by using something else from the JDK.

@bringyou
Author

bringyou commented Apr 27, 2024

@yschimke Hi, this is the full stack trace. We can see that threads were blocked at TaskRunner.runTask, waiting for the lock.
jstatck-2.txt

at java.util.concurrent.locks.ReentrantLock$Sync.lock(java.base@21/ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock(java.base@21/ReentrantLock.java:322)
at okhttp3.internal.concurrent.TaskRunner.runTask(TaskRunner.kt:126)

Running on OkHttp alpha-14.

It seems most detailed investigations show for contended work using Locks is the more performant option. So before switching to synchronized, do you have some links or benchmarks showing the benefit of synchronized?

I wasn't suggesting changing back to synchronized; synchronized is even slower than Lock.
In my honest opinion, there are two ways to improve the performance:
First, share the RealBackend instead of sharing the TaskRunner: RealBackend holds the thread pool, and multiple TaskRunner instances would reduce the lock contention. (A sketch of this idea follows below.)
Second, use lock-free components, like ConcurrentHashSet.
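
A minimal sketch of the first suggestion (hypothetical, not OkHttp code; it relies on the internal TaskRunner, RealBackend, and Builder.taskRunner members, so it may break between versions): one shared RealBackend thread pool, but a separate TaskRunner, and therefore a separate lock, per client.

@file:Suppress("INVISIBLE_MEMBER", "INVISIBLE_REFERENCE")

import java.util.concurrent.ThreadFactory
import okhttp3.OkHttpClient
import okhttp3.internal.concurrent.TaskRunner

// One shared thread pool for every client's background tasks.
private val sharedBackend = TaskRunner.RealBackend(
  ThreadFactory { r -> Thread(r, "OkHttp TaskRunner (shared)").apply { isDaemon = true } },
)

// Each client gets its own TaskRunner (and lock) on top of the shared backend.
fun clientWithOwnTaskRunner(): OkHttpClient.Builder =
  OkHttpClient.Builder().apply { taskRunner = TaskRunner(sharedBackend) }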

@bringyou
Author

bringyou commented Apr 27, 2024

Can you tell me how many concurrent calls are you making?

If it’s not very many (say, 10 or fewer) and you’re seeing this problem, then it’s likely we’ve got a bug in our implementation. We’re holding the lock longer than we’re expecting to.

Hi @swankjesse, it's about 4 concurrent calls. But as mentioned before, the external server's network is not stable or reliable; sometimes the server takes a long time to ACK at the TCP level.

@yschimke
Collaborator

Thanks for the stack trace, it's really helpful. Apologies for not taking this seriously enough initially; it does look like a legit issue.

The move from synchronized to ReentrantLock does lose us the thread-dump info about who holds the lock. I can't see any thread in your trace that would hold the lock and be blocked.

I can, however, trivially reproduce what you see, not via a slowdown but by taking the lock and just not releasing it. But it sounds like that is not what you are seeing.

I'll try to reproduce this with some additional tracing and socket delays. Another option would be to give you a slightly tweaked build with some tracing, to capture the thread that holds the lock at these times.

@yschimke
Collaborator

yschimke commented Apr 27, 2024

I'm not seeing anything concrete. In the creative testing I'm doing, the two things holding locks and taking a non-trivial amount of time are:

TaskRunner.awaitTaskToRun
Http2Connection.applyAndAckSettings

I'll see if I can work out where it's going.

RealBackend.execute occasionally seems to take 1-3 ms on my laptop. But I'm really not confident about what I'm seeing here.

@swankjesse
Member

I think we might just be missing a signal() call on our monitor? This thread is waiting on the condition and it shouldn’t be.

"OkHttp TaskRunner" #1545 [255] daemon prio=5 os_prio=0 cpu=0.13ms elapsed=1117.04s tid=0x00007f8320012f40 nid=255 waiting on condition  [0x00007f82f09c8000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@21/Native Method)
	- parking to wait for  <0x00000000f14c1118> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@21/LockSupport.java:221)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@21/AbstractQueuedSynchronizer.java:754)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@21/AbstractQueuedSynchronizer.java:1761)
	at okhttp3.internal.concurrent.TaskRunner$RealBackend.coordinatorWait(TaskRunner.kt:320)
	at okhttp3.internal.concurrent.TaskRunner.awaitTaskToRun(TaskRunner.kt:229)
	at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:67)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@21/ThreadPoolExecutor.java:1144)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@21/ThreadPoolExecutor.java:642)
	at java.lang.Thread.runWith(java.base@21/Thread.java:1596)
	at java.lang.Thread.run(java.base@21/Thread.java:1583)

   Locked ownable synchronizers:
	- <0x00000000f33ff110> (a java.util.concurrent.ThreadPoolExecutor$Worker)
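
For reference, here is a minimal generic sketch of the lock-plus-condition coordinator pattern that coordinatorWait relies on (illustrative names, not OkHttp's actual code): if the producing side ever enqueues work without calling signal(), a waiter like the thread above stays parked until its timeout elapses.

import java.util.concurrent.locks.ReentrantLock
import kotlin.concurrent.withLock

class Coordinator {
  private val lock = ReentrantLock()
  private val condition = lock.newCondition()
  private val readyTasks = ArrayDeque<() -> Unit>()

  fun enqueue(task: () -> Unit) {
    lock.withLock {
      readyTasks.addLast(task)
      condition.signal() // forgetting this leaves the waiter below parked
    }
  }

  fun awaitTaskToRun(timeoutNanos: Long): (() -> Unit)? {
    lock.withLock {
      if (readyTasks.isEmpty()) condition.awaitNanos(timeoutNanos)
      return readyTasks.removeFirstOrNull()
    }
  }
}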

@yschimke
Collaborator

@swankjesse I think you know this code a lot better than I do. I'll leave it to you. Let me know if I can help.

@bringyou
Author

I think we might just be missing a signal() call on our monitor? This thread is waiting on the condition and it shouldn’t be.

Maybe TaskRunner.coordinatorWaiting should be annotated with @Volatile?

@swankjesse
Member

My hypothesis is that when TaskRunner is busy, it potentially launches unnecessary threads here:

// Also start another thread if there's more work or scheduling to do.
if (multipleReadyTasks || !coordinatorWaiting && readyQueues.isNotEmpty()) {
  backend.execute(this@TaskRunner, runnable)
}

These threads exit quickly, but they contend for the lock in the process, and that exacerbates the problem.

Changing this code to start only the threads it needs should be a straightforward fix.

swankjesse added a commit that referenced this issue Apr 28, 2024
We've got a race where we'll start a thread when we need
one, even if we've already started a thread. This changes
TaskRunner's behavior to never add a thread if we're
still waiting for a recently-added one to start running.

This is intended to reduce the number of threads contending
for the TaskRunner lock as reported in this issue:

#8388
swankjesse added a commit that referenced this issue Apr 29, 2024
We've got a race where we'll start a thread when we need
one, even if we've already started a thread. This changes
TaskRunner's behavior to never add a thread if we're
still waiting for a recently-added one to start running.

This is intended to reduce the number of threads contending
for the TaskRunner lock as reported in this issue:

#8388
@Bennett-Lynch

Bennett-Lynch commented May 21, 2024

I encountered what appears to be the same issue and wanted to share some extra data/anecdotes. The contention around okhttp3.internal.concurrent.TaskRunner and okhttp3.internal.connection.RealConnection caused my server worker threads (all using OkHttp synchronously) to back up.

In the profile comparisons below, A is healthy and B is unhealthy:

[profile comparison screenshots]

And below is a single host profile timeline:

[single-host profile timeline screenshots]

(OkHttp version 4.10.0)

Some non-OkHttp elements are hidden for privacy, but OkHttp was a significant outlier in both locking and thread creation.

@swankjesse Does this corroborate your theory? How much confidence do we have that #8391 might have fixed it?

@Bennett-Lynch

Bennett-Lynch commented May 22, 2024

Is it possible to remove the lock usage in favor of a single AtomicInteger that tracks outstanding execute() calls requested? The startAnotherThread logic could then use CAS semantics to attempt a decrement-if-positive, and only start a thread if the value was successfully decremented, all without acquiring a lock:

/** Start another thread, unless a new thread is already scheduled to start. */
private fun startAnotherThread() {
  lock.assertHeld()
  if (executeCallCount > runCallCount) return // A thread is still starting.
  executeCallCount++
  backend.execute(this@TaskRunner, runnable)
}

fun AtomicInteger.decrementIfPositive(): Boolean {
    while (true) {
        val current = get()
        if (current <= 0) {
            return false
        }
        if (compareAndSet(/* expectedValue = */ current, /* newValue = */ current - 1)) {
            return true
        }
        // If another thread modified the value in between, retry
    }
}
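
One possible way to combine these pieces (a hypothetical sketch, not OkHttp's implementation; it assumes the surrounding TaskRunner context for backend and runnable, and uses the decrementIfPositive helper above):

import java.util.concurrent.atomic.AtomicInteger

// Counts threads that have been requested via execute() but have not started running yet.
private val pendingStarts = AtomicInteger(0)

private fun startAnotherThreadLockFree() {
  // Only request a new thread if none is already pending; the CAS replaces the lock.
  if (pendingStarts.compareAndSet(0, 1)) {
    backend.execute(this@TaskRunner, runnable)
  }
}

// The worker runnable would call this as soon as it begins running, allowing a
// later startAnotherThreadLockFree() call to request another thread.
private fun onWorkerStarted() {
  pendingStarts.decrementIfPositive()
}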

@swankjesse
Member

@Bennett-Lynch doing this lock-free is a good idea, but I’d prefer to do that holistically. Right now we use locks throughout this thing and it’s easy to reason about this way. Mixing and matching concurrency primitives is more difficult to do well!

@bringyou
Author

bringyou commented May 22, 2024

Just sharing a code snippet that I am currently using to reduce the contention:

@file:Suppress("INVISIBLE_MEMBER", "INVISIBLE_REFERENCE")

fun dispatcherPoweredByVirtualThread() = Dispatcher(
    Executors.newThreadPerTaskExecutor(Thread.ofVirtual().name("OkHttp Dispatcher(Virtual Thread)", 0).factory())
)
fun virtualThreadClient(dispatcherCustomizer: Dispatcher.() -> Unit = {}): OkHttpClient.Builder {
    val ret = OkHttpClient.Builder()
    ret.dispatcher(dispatcherPoweredByVirtualThread().apply(dispatcherCustomizer))
    ret.taskRunner = TaskRunner(RealBackend(Thread.ofVirtual().name("OkHttp TaskRunner(Virtual Thread)", 0).factory()))
    return ret
}

With the power of virtual threads (since JDK 21), we can easily switch OkHttpClient's background threads to virtual threads, and we can create many more OkHttpClient instances without additional cost. More instances also mean more lock instances, which reduces contention.
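
For example, a hypothetical usage of the helper above (the URL and maxRequests value are illustrative):

import okhttp3.Request

val client = virtualThreadClient { maxRequests = 256 }.build()
val response = client.newCall(Request.Builder().url("https://example.com/").build()).execute()
println(response.code)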

@Bennett-Lynch

Bennett-Lynch commented May 22, 2024

@bringyou Nice, good option to have in our back pocket. :)

@swankjesse Understood on the need for a holistic approach. A few questions if you can spare the time:

  1. If synchronous requests use the "bring your own thread" model, why are we creating extra threads? Is there any added value in not running them on the caller's thread?

  2. What type of tasks are being run on these threads? TCP connection establishment? Do you have any idea what scenarios may originally contribute to delayed task execution (thus triggering the cascading failure case of more threads, causing more contention/delays)?

  3. How much confidence do we have that Start fewer threads in TaskRunner #8391 might have resolved the issue? If there is app-wide contention on every new connection establishment, might that still be too much contention, even without the extra threads?

@swankjesse
Member

  1. For Happy Eyeballs, we started attempting multiple connections concurrently, on different threads.
  2. We also use background threads for HTTP2 PING responses, WINDOW_UPDATE responses, and similar.
  3. I’m optimistic? If you’d like to try out a SNAPSHOT build and see if it addresses your problem, that’d be awesome.

@Bennett-Lynch

Bennett-Lynch commented May 22, 2024

Thanks, @swankjesse.

Do the different threads hold the lock during the entire connection attempt, or just until the thread starts processing?

I see here that it will just attempt to acquire the lock to fetch the task to be run:

val task =
  this@TaskRunner.lock.withLock {
    awaitTaskToRun()
  } ?: return

But how does that explain the extremely high contention on the lock?

@bringyou seemed to suggest that this could be reproduced at very low concurrency if the remote TCP peer is slow to ACK. That seems to imply that the runnable task itself is likely acquiring the lock somehow?
