Memory leak in akka.actor.LocalActorRef #5431

Open
Tracked by #5442
joni-jones opened this issue Aug 3, 2023 · 14 comments
Comments

@joni-jones
Contributor

joni-jones commented Aug 3, 2023

Summary

I'm working on upgrading OpenWhisk to Akka 2.6.20 and Scala 2.13 and ran into an issue where OpenWhisk invokers consume all of the available G1 Old Gen heap after running for a couple of days with active traffic.

While profiling the heap, I got the following suggestion from Heap Hero:

One instance of akka.actor.LocalActorRef loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018
occupies 20,136,784 (18.14%) bytes.
The memory is accumulated in one instance of scala.collection.immutable.RedBlackTree$Tree,
loaded by jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x3c05c5018, which occupies 20,132,728 (18.14%) bytes.

Further analysis with Eclipse Memory Analyzer shows the following:
[Eclipse Memory Analyzer screenshots, Aug 3, 2023]

Environment details:

  • Scala 2.13
  • Akka 2.6.20
  • Akka HTTP 10.2.10
  • Akka Management 1.1.4

Any suggestions on where I should look to find the root cause of this memory leak?

@dgrove-oss
Member

dgrove-oss commented Aug 3, 2023

I believe akka 2.6.20 is the first release under the non-open source BSL license, not Apache v2. Therefore changes to update OpenWhisk to akka 2.6.20 cannot be accepted by the Apache OpenWhisk project.

@bdoyle0182
Contributor

bdoyle0182 commented Aug 3, 2023

@dgrove-oss 2.6.20 is still Apache; 2.7.x and later are BSL. They actually released another patch, 2.6.21, a couple of months ago to fix a TLS bug.

Apache Pekko has started doing official releases over the last month. Once we get on to 2.6.20 we can start discussing migrating the project to Pekko. So far the core modules, http, and kafka have been released. They’re about to do management and then the rest of the connectors. I think there should be releases for everything by September at the pace they’re going.

On the topic of this memory leak, more information is needed. Is the memory leak only with 2.6.20? Can you reproduce it off master? Are you using the new scheduler, which uses the v2 FPCInvoker, or the original invokers?

@dgrove-oss
Member

dgrove-oss commented Aug 3, 2023

Cool, @bdoyle0182 thanks for clarifying. I had found an old post that said 2.6.19 was the last Apache version and 2.6.20 and beyond were going to be BSL.

A strategy of getting to the most recent Apache licensed version from Lightbend and then switching to Pekko sounds right to me.

@joni-jones
Contributor Author

@bdoyle0182 we are migrating our project from Akka 2.5.26; on that version there is no memory leak. Since our project has some slight modifications to OpenWhisk, I'm not able to use the OpenWhisk master branch to run the same load and collect heap dumps. We use the original invokers.

@pjfanning

Apache Pekko, a fork of Akka 2.6, has been released. v1.0.1 is out, and it is very similar to Akka 2.6.21.

https://pekko.apache.org/docs/pekko/current/project/migration-guides.html

@He-Pin
Member

He-Pin commented Aug 5, 2023

@joni-jones Is there any chance you could provide a self-contained reproducer?

@pjfanning

If you want to raise a Pekko issue about this, someone may be able to help.

https://github.com/apache/incubator-pekko

@jrudolph

jrudolph commented Aug 7, 2023

Since the strings are all IP addresses and it sits below the stream materializer, this could be incoming connections that are hanging and not being cleaned up. Hard to say without knowing anything about the OpenWhisk setup.
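If that hypothesis holds, one thing worth checking is whether the server-side timeouts are tight enough to reap idle or hanging incoming connections. A minimal sketch with illustrative values only (these are standard Akka HTTP server settings, nothing OpenWhisk-specific):

```scala
import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Illustrative values only: with finite timeouts, Akka HTTP closes idle or
// stuck incoming connections instead of leaving their stream actors around.
val config = ConfigFactory.parseString(
  """
  akka.http.server.idle-timeout = 60 s
  akka.http.server.request-timeout = 20 s
  """).withFallback(ConfigFactory.load())

val system = ActorSystem("sketch", config)
```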

@joni-jones
Contributor Author

joni-jones commented Aug 7, 2023

@jrudolph I'm looking at these graphs, and the strings with IPs show 0% compared to the RedBlackTree allocation, but I'm still checking whether they could be an issue.

I see that these RedBlackTree entries have flow-*-0-ignoreSink as a value.

@jrudolph

What you are probably looking at is the child actors of the materializer actor, where one actor is spawned for every stream you run. So it might be a bit hard to see what the actual issue is, because the memory might be spread over all these actors. One way to go about it would be to look at a class histogram of just the elements referenced by that children tree and see what kind of data is in there.
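To illustrate the point, a minimal self-contained sketch (not OpenWhisk code): every stream you run gets its own stage actor registered as a child of the materializer's supervisor, so streams that never complete and are never cancelled keep piling up in exactly this kind of children tree:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}

implicit val system: ActorSystem = ActorSystem("sketch")

// Each run materializes its own stage actor (named like "flow-N-0-ignoreSink"
// when the stream ends in Sink.ignore) under the materializer's supervisor.
// A stream that never completes keeps that actor, and everything it
// references, reachable.
(1 to 10000).foreach { _ =>
  Source.maybe[Int].runWith(Sink.ignore) // never completes, never cancelled
}
```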

@joni-jones
Contributor Author

Thanks @jrudolph. Yes, I tried going down through these trees, and the leaves point to child actors and the ignore-sink.

[Eclipse Memory Analyzer screenshot of the children tree, Aug 14, 2023]

I don't know if it's related, but some time ago, when OpenWhisk was upgraded from Akka 2.5.x to 2.6.12 and the actor materializer was removed, there used to be a materializer.shutdown() call: https://github.com/apache/openwhisk/pull/5065/files#diff-e0bd51cbcd58c3894e1ffa4894de22ddfd47ae87352912de0e30cd60db315758L131-R130. I don't know all the internals of Materializer, but if that method was used to destroy all related actors, is it possible that after its removal from connection shutdown some actors hang around?

The version we are upgrading from still uses Akka 2.5.x, and we don't have memory issues there.
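For context, a rough sketch of the pre-2.6.12 pattern referred to above (class and method names here are assumed, not the actual OpenWhisk code): the client owned a dedicated ActorMaterializer, so one shutdown() call destroyed every stream actor it had spawned:

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer

// Old (pre-2.6.12) style: a per-client materializer. ActorMaterializer is
// deprecated in Akka 2.6 in favour of the shared system materializer.
class LegacyClient(implicit system: ActorSystem) {
  private implicit val materializer: ActorMaterializer = ActorMaterializer()

  def shutdown(): Unit =
    materializer.shutdown() // tears down every stream materialized by this client
}
```

With the shared system materializer there is no equivalent blanket call to make on connection shutdown, so each long-lived stream has to be completed or cancelled explicitly instead.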

@joni-jones
Contributor Author

It seems the issue is in https://github.com/apache/openwhisk/blob/master/common/scala/src/main/scala/org/apache/openwhisk/http/PoolingRestClient.scala#L76: without the materializer.shutdown() that was removed by the Akka 2.6.12 upgrade, it leaks memory. Also, OverflowStrategy.dropNew was deprecated in 2.6.11, and underneath, the queue implementing that behavior changed from SourceQueueWithComplete to BoundedSourceQueueStage, which doesn't appear to free memory without proper cleanup of the materialized resources.

In our implementation, we use a wrapper on top of PoolingRestClient for HTTP communication between invokers and action pods instead of the OpenWhisk ApacheBlockingContainerClient.

I tried a few different implementations:

  1. Using OverflowStrategy.dropHead to keep SourceQueueWithComplete instead of the new BoundedSourceQueueStage, with extra logic on shutdown: no memory leaks were observed.
  2. Continuing to use OverflowStrategy.dropNew with no shutdown changes: this seems to leak memory.
  3. Using the queue backed by BoundedSourceQueueStage, but with proper cleanup on shutdown via a KillSwitch and queue.complete (see the sketch after this list): this also works fine with no memory issues.
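A minimal sketch of option 3 (the class and method names here are assumed for illustration, not the actual PoolingRestClient code): the single-argument Source.queue materializes a BoundedSourceQueue, and shutdown both completes the queue and tears down the rest of the materialized stream through a KillSwitch:

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model.{HttpRequest, HttpResponse}
import akka.stream.KillSwitches
import akka.stream.scaladsl.{Keep, Sink, Source}

import scala.concurrent.{Future, Promise}

// Sketch only: a bounded request queue in front of a host connection pool,
// with explicit cleanup of the materialized stream on shutdown.
class QueueingClient(host: String, port: Int, queueSize: Int)(implicit system: ActorSystem) {

  private val pool = Http().cachedHostConnectionPool[Promise[HttpResponse]](host, port)

  private val ((queue, killSwitch), done) =
    Source.queue[(HttpRequest, Promise[HttpResponse])](queueSize) // materializes a BoundedSourceQueue
      .viaMat(KillSwitches.single)(Keep.both)
      .via(pool)
      .toMat(Sink.foreach { case (result, p) => p.tryComplete(result) })(Keep.both)
      .run()

  def request(req: HttpRequest): Future[HttpResponse] = {
    val p = Promise[HttpResponse]()
    queue.offer(req -> p) // returns QueueOfferResult; drop handling omitted in this sketch
    p.future
  }

  def shutdown(): Future[Done] = {
    queue.complete()      // stop accepting new requests and complete the source
    killSwitch.shutdown() // make sure the rest of the materialized stream terminates
    done                  // completes once the stream has fully shut down
  }
}
```

Option 1 reaches the same outcome by staying on the older SourceQueueWithComplete materialization; either way, the key point seems to be that the materialized queue and stream are explicitly terminated when the client shuts down.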

@He-Pin
Member

He-Pin commented Aug 22, 2023

@joni-jones Thanks for sharing the update.

@joni-jones
Contributor Author

It looks like I was able to fix the memory leak, and it has been stable in our production so far.
I will be working on a PR shortly, as I believe the leak happens due to improper resource cleanup in PoolingRestClient.
