Memory leak issue with Benthos memory cache #2404
Labels: bug, caches, needs investigation
Benthos Version: 4.24.0 (As pulled from the jeffail/benthos docker repo)
We have recently had an issue where our high-volume Benthos dedupe pods (around 2-3k messages per second) slowly use more and more memory until they eventually crash. The growth is gradual: for example, the graph below shows memory usage over a 12-hour period for a pair of pods that each dedupe ~1.5-2k messages per second, and so each make 1.5-2k dedupe checks per second against the cache.
These pods run very little besides a dedupe against the Kafka key. I have given an example of our Benthos config below, with the specifics of our input/output and two (simple) Bloblang processors removed. We have several very similar pipelines with slightly different Bloblang processors, but all of them use this style of cache and all suffer the same slow growth in memory usage until an OOM occurs.
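A minimal sketch of this style of pipeline, for illustration only. It is not the exact config from the affected pods: the resource label `dedupe_cache`, the Kafka addresses, topic and consumer group, the `drop` output, and the `compaction_interval` value are all placeholder assumptions. Only the ~15 minute TTL and the dedupe on the Kafka key come from the description in this report.

```yaml
# Illustrative sketch only; names and values marked "placeholder" are assumptions.
input:
  kafka:
    addresses: [ "localhost:9092" ]   # placeholder
    topics: [ "example_topic" ]       # placeholder
    consumer_group: example_group     # placeholder

pipeline:
  processors:
    # Drop any message whose Kafka key has already been seen in the cache.
    - dedupe:
        cache: dedupe_cache
        key: '${! meta("kafka_key") }'

cache_resources:
  - label: dedupe_cache               # placeholder label
    memory:
      default_ttl: 15m                # roughly the TTL described in this report
      compaction_interval: 60s        # placeholder; the interval we tried is not recorded here

output:
  drop: {}                            # placeholder output
```

With a setup like this, the expectation would be that the cache holds at most about 15 minutes' worth of keys, so its size should plateau rather than keep growing.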
To diagnose why this was happening, we spun the pods up in debug mode and ran pprof against the debug endpoints to track memory usage over time. The culprit was the memory cache, as shown below. It sits at 147.75MB in this screenshot, but after being left overnight it grew by well over a hundred megabytes, despite the TTL on the cache being ~15 minutes. (Unfortunately I did not screenshot the later, higher memory usage. I can spin the pods back up in debug mode if that would help.)
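For anyone wanting to reproduce the profiling, a minimal sketch of the relevant config section is below. It assumes the standard Benthos `http` block; the listen address is a placeholder and may differ in your deployment.

```yaml
# Sketch: expose the extra debug endpoints (including pprof) on the Benthos HTTP server.
http:
  enabled: true
  address: 0.0.0.0:4195     # placeholder; use whatever address your pods listen on
  debug_endpoints: true     # registers the /debug/pprof/* handlers
```

With this enabled, a heap profile can typically be pulled by pointing `go tool pprof` at the `/debug/pprof/heap` path on that address. Comparing two heap snapshots taken some hours apart is one way to see which allocation sites are growing.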
We tried adjusting the memory cache settings (e.g. explicitly setting a compaction interval) but did not see any change in the growth of the cache size. I also looked at one of our other Benthos pods, which performs a much larger set of operations and uses a small memory cache (at around 40 messages per second throughput). After tracking it for a day, I noticed that it grew too, although at a much, much slower rate.
Given this, I believe (unless I've missed something stupid, please correct me if so) that there is some form of memory leak in the Benthos memory cache, and that its rate of growth scales with the throughput going through the cache.