feat(core): add RSS memory limit to prevent OOM kills from the OS #4480

Merged
merged 167 commits into from
May 20, 2024

@mtopolnik mtopolnik commented May 7, 2024

Adds the setting cairo.rss.memory.limit and enforces it in Unsafe.checkAllocLimit(). When an allocation would exceed the limit, the check throws a non-critical CairoException with a new flag, isOutOfMemory, raised.
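To illustrate the idea, here is a minimal sketch of a limit check performed before each allocation. It is not the code from this PR: `LimitedAllocator` and `OutOfMemoryDemoException` are hypothetical stand-ins for `Unsafe` and `CairoException`, and the accounting is simplified (the check and the counter update are not atomic together).

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-in for an exception carrying an "out of memory" flag.
class OutOfMemoryDemoException extends RuntimeException {
    private final boolean outOfMemory;

    OutOfMemoryDemoException(String message, boolean outOfMemory) {
        super(message);
        this.outOfMemory = outOfMemory;
    }

    boolean isOutOfMemory() {
        return outOfMemory;
    }
}

// Hypothetical allocator wrapper that tracks usage against a configurable limit.
final class LimitedAllocator {
    private static final AtomicLong used = new AtomicLong();
    private static volatile long limit; // 0 means "no limit"

    static void setLimit(long bytes) {
        limit = bytes;
    }

    static long used() {
        return used.get();
    }

    // Simulated malloc: reject the request before any accounting happens.
    static void malloc(long size) {
        checkAllocLimit(size);
        used.addAndGet(size);
    }

    static void free(long size) {
        used.addAndGet(-size);
    }

    private static void checkAllocLimit(long size) {
        long lim = limit;
        if (lim > 0 && used.get() + size > lim) {
            throw new OutOfMemoryDemoException(
                    "memory limit exceeded [limit=" + lim + ", requested=" + size + "]", true);
        }
    }
}
```

Because the check runs before the counter is updated, a rejected allocation leaves the usage figure unchanged, so callers can catch the exception and continue serving other work.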

Also fixes #4510.

Future work

We need to show the proper error message to the user. It must say which query ran out of memory.

We need to implement a heuristic to set the default RSS limit automatically.

Discussion

Most of the PR deals with the new failure point that this limit introduces, and with making the code robust to it.

There are two main categories of problems: initializing objects, and reopening objects.

1. Initializing objects

Construction is a critical moment in an object's lifecycle: many constructors create several Closeable child objects, but the constructor body typically isn't wrapped in a try-catch that closes the objects already constructed. If the constructor throws an exception, these child objects become unreachable and leak all the resources they allocated.
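A minimal sketch of the defensive constructor pattern described here, assuming hypothetical classes `NativeBuffer` and `TwoBufferOwner` (neither exists in the codebase). The constructor closes whatever it already built before rethrowing.

```java
import java.io.Closeable;

// Hypothetical child resource; the counter makes leaks visible.
class NativeBuffer implements Closeable {
    static int openCount = 0;

    NativeBuffer(boolean fail) {
        if (fail) {
            throw new IllegalStateException("simulated out-of-memory");
        }
        openCount++;
    }

    @Override
    public void close() {
        openCount--;
    }
}

// Hypothetical parent that owns two child resources.
class TwoBufferOwner implements Closeable {
    private NativeBuffer first;
    private NativeBuffer second;

    TwoBufferOwner(boolean failSecond) {
        try {
            first = new NativeBuffer(false);
            second = new NativeBuffer(failSecond);
        } catch (Throwable t) {
            // Release the children constructed so far, then propagate.
            close();
            throw t;
        }
    }

    @Override
    public void close() {
        if (first != null) {
            first.close();
            first = null;
        }
        if (second != null) {
            second.close();
            second = null;
        }
    }
}
```

Making `close()` null-safe and idempotent is what lets the constructor call it on a partially built object.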

The other case sits higher up the call stack, in the complex business logic that prepares the object graph to serve a query. The prime example is SqlCodeGenerator. This code creates some objects, keeps them in local variables, and eventually passes them to a RecordCursorFactory. It is highly complex and branchy, and it's difficult to keep track of each individual closeable object along the many different code paths.
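One way to keep such code leak-free is to hold every intermediate object in a local variable declared before the try block, and free the locals on any failure that occurs before ownership is handed over. The sketch below is only an illustration under that assumption; `TrackedResource`, `DemoFactory` and `DemoCodeGenerator` are hypothetical names, not classes from the codebase.

```java
import java.io.Closeable;

// Hypothetical closeable resource; the counter makes leaks visible.
class TrackedResource implements Closeable {
    static int openCount = 0;

    TrackedResource() {
        openCount++;
    }

    @Override
    public void close() {
        openCount--;
    }
}

// Hypothetical factory that owns its arguments once it is constructed.
class DemoFactory implements Closeable {
    private final TrackedResource a;
    private final TrackedResource b;

    DemoFactory(TrackedResource a, TrackedResource b) {
        this.a = a;
        this.b = b;
    }

    @Override
    public void close() {
        a.close();
        b.close();
    }
}

final class DemoCodeGenerator {
    static DemoFactory generate(boolean failMidway) {
        TrackedResource a = null;
        TrackedResource b = null;
        try {
            a = new TrackedResource();
            if (failMidway) {
                throw new IllegalStateException("simulated out-of-memory");
            }
            b = new TrackedResource();
            // Ownership passes to the factory only when this line succeeds.
            return new DemoFactory(a, b);
        } catch (Throwable t) {
            // Nothing was handed over, so the locals are still ours to free.
            if (a != null) {
                a.close();
            }
            if (b != null) {
                b.close();
            }
            throw t;
        }
    }
}
```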

2. Reopening objects

Reopenable objects are typically stored in object pools and reused through their of() method. In many of these objects, close() has an if guard that performs cleanup only if the object is open, most typically if (isOpen) { cleanup }. The critical problem arises in reopen(), where this flag may be raised at the very end instead of first, before any native allocation. Raising it last makes intuitive sense, but it introduces a leak: an OOM exception thrown from one of the initialization steps triggers a call to close(), which is a no-op because isOpen is still false.
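The difference between the two orderings can be shown with a small sketch. `ReopenableBuffer` is a hypothetical class, and its allocations are simulated with counters rather than native memory.

```java
import java.io.Closeable;

// Hypothetical pooled object with a guarded close(), as described above.
class ReopenableBuffer implements Closeable {
    static int liveAllocations = 0; // simulated native allocations still alive
    private boolean isOpen;
    private int allocated;

    // Leaky ordering: the flag is raised only after all allocations succeed.
    void ofLeaky(boolean failSecond) {
        allocate(failSecond);
        isOpen = true;
    }

    // Safe ordering: the flag is raised before the first allocation.
    void ofSafe(boolean failSecond) {
        isOpen = true;
        allocate(failSecond);
    }

    private void allocate(boolean failSecond) {
        allocated++;
        liveAllocations++;
        if (failSecond) {
            throw new IllegalStateException("simulated out-of-memory");
        }
        allocated++;
        liveAllocations++;
    }

    @Override
    public void close() {
        if (isOpen) {
            liveAllocations -= allocated;
            allocated = 0;
            isOpen = false;
        }
    }
}
```

With `ofLeaky`, a failure on the second allocation leaves the first one alive even after `close()`, because the guard sees `isOpen == false`. With `ofSafe`, the same failure is cleaned up.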

This is a very subtle problem and it's likely to keep getting re-introduced to the codebase.

Testing

In order to find failure points in the codebase, I made some temporary changes in assertMemoryLeak():

  1. Add Unsafe.setRssMemLimit(8_750_000) just above code.run(), then reset it to zero right after.
  2. Catch and ignore any exception the test body throws. Unless there are memory leaks, the test should pass.
  3. Perform all the leak checks, and report any errors as test failure.
  4. Add diagnostic code to Unsafe.checkAllocLimit() such as printing the stack trace of the OOM exception. Exceptions sometimes get translated and the original stack trace is lost.
  5. As a further refinement, introduce an allocation operation counter, and throw an OOM when a specific count is reached. This allows more precise targeting of an individual malloc() call.
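The steps above can be sketched roughly as follows. This is not the code from the temporary commits: `FaultInjectingAllocator` and `LeakCheck` are hypothetical, allocations are simulated with a counter, and the injected failure uses the allocation-count variant from step 5.

```java
// Hypothetical allocator that fails at the N-th allocation since the last reset.
final class FaultInjectingAllocator {
    static long outstanding;          // allocations not yet freed
    private static long allocCount;   // allocations attempted since last reset
    private static long failAt = -1;  // -1 disables fault injection

    static void failAtAllocation(long n) {
        allocCount = 0;
        failAt = n;
    }

    static void malloc() {
        if (++allocCount == failAt) {
            throw new IllegalStateException("injected OOM at allocation #" + allocCount);
        }
        outstanding++;
    }

    static void free() {
        outstanding--;
    }
}

// Hypothetical counterpart of the modified leak assertion.
final class LeakCheck {
    static void assertNoLeak(Runnable body, long failAt) {
        long before = FaultInjectingAllocator.outstanding;
        FaultInjectingAllocator.failAtAllocation(failAt);
        try {
            body.run();
        } catch (Throwable ignored) {
            // The test body is allowed to fail; only leaks count as failures.
        } finally {
            FaultInjectingAllocator.failAtAllocation(-1);
        }
        if (FaultInjectingAllocator.outstanding != before) {
            throw new AssertionError("leak detected with failure injected at allocation #" + failAt);
        }
    }
}
```

Running the same test body in a loop with `failAt` set to 1, 2, 3 and so on would exercise each allocation site in turn.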

These are the commits that make the temporary changes:

7e5d4a4
4d88403
63edda9
a3b453d

These commits are reverted later on in the PR.

We should use a similar approach to create an automated test run that will keep monitoring our codebase for leaks.

@mtopolnik mtopolnik marked this pull request as ready for review May 7, 2024 12:02
@mtopolnik mtopolnik requested a review from ideoma May 7, 2024 12:03
@mtopolnik mtopolnik marked this pull request as draft May 8, 2024 10:59
bluestreak01 previously approved these changes May 20, 2024
ideoma commented May 20, 2024

[PR Coverage check]

😍 pass : 2181 / 2539 (85.90%)

file detail

Columns: path, covered lines, new lines, coverage
🔵 io/questdb/mp/AlertedException.java 0 1 00.00%
🔵 io/questdb/network/IODispatcherOsx.java 0 7 00.00%
🔵 io/questdb/network/Kqueue.java 0 12 00.00%
🔵 io/questdb/cairo/CairoConfigurationWrapper.java 0 1 00.00%
🔵 io/questdb/mp/TimeoutException.java 0 1 00.00%
🔵 io/questdb/std/ex/ZLibException.java 0 1 00.00%
🔵 io/questdb/network/IOContextFactoryImpl.java 3 12 25.00%
🔵 io/questdb/std/MemoryPages.java 2 5 40.00%
🔵 io/questdb/griffin/engine/union/UnionAllRecordCursorFactory.java 2 5 40.00%
🔵 io/questdb/griffin/engine/window/WindowRecordCursorFactory.java 2 5 40.00%
🔵 io/questdb/griffin/engine/LimitRecordCursorFactory.java 2 5 40.00%
🔵 io/questdb/std/Unsafe.java 20 42 47.62%
🔵 io/questdb/griffin/engine/union/UnionRecordCursorFactory.java 3 6 50.00%
🔵 io/questdb/cairo/O3PartitionJob.java 7 14 50.00%
🔵 io/questdb/ServerMain.java 1 2 50.00%
🔵 io/questdb/griffin/engine/table/SelectedRecordCursorFactory.java 3 6 50.00%
🔵 io/questdb/griffin/engine/orderby/SortedRecordCursorFactory.java 3 6 50.00%
🔵 io/questdb/griffin/engine/groupby/CountRecordCursorFactory.java 3 6 50.00%
🔵 io/questdb/griffin/engine/table/LatestByValuesIndexedFilteredRecordCursorFactory.java 4 7 57.14%
🔵 io/questdb/griffin/engine/table/LatestByAllSymbolsFilteredRecordCursorFactory.java 4 7 57.14%
🔵 io/questdb/griffin/engine/union/ExceptRecordCursorFactory.java 8 13 61.54%
🔵 io/questdb/griffin/engine/union/IntersectRecordCursorFactory.java 8 13 61.54%
🔵 io/questdb/cairo/VacuumColumnVersions.java 5 8 62.50%
🔵 io/questdb/griffin/engine/union/ExceptAllRecordCursorFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/table/LatestByAllFilteredRecordCursorFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/functions/catalogue/PgClassFunctionFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/union/IntersectAllRecordCursorFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/orderby/RecordTreeChain.java 5 8 62.50%
🔵 io/questdb/griffin/engine/table/LatestByAllIndexedRecordCursorFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/functions/catalogue/PgAttrDefFunctionFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/groupby/AbstractSampleByFillRecordCursorFactory.java 5 8 62.50%
🔵 io/questdb/griffin/engine/groupby/SampleByFillNoneRecordCursorFactory.java 6 9 66.67%
🔵 io/questdb/griffin/engine/table/LatestByRecordCursorFactory.java 6 9 66.67%
🔵 io/questdb/griffin/engine/groupby/GroupByRecordCursorFactory.java 6 9 66.67%
🔵 io/questdb/griffin/engine/functions/window/AvgDoubleWindowFunctionFactory.java 24 35 68.57%
🔵 io/questdb/cutlass/text/SerialCsvFileImporter.java 7 10 70.00%
🔵 io/questdb/griffin/engine/join/HashOuterJoinLightRecordCursorFactory.java 14 20 70.00%
🔵 io/questdb/cairo/sql/async/PageFrameReduceTask.java 7 10 70.00%
🔵 io/questdb/cutlass/text/TextDelimiterScanner.java 7 10 70.00%
🔵 io/questdb/griffin/engine/groupby/DistinctTimeSeriesRecordCursorFactory.java 15 21 71.43%
🔵 io/questdb/std/DirectIntList.java 8 11 72.73%
🔵 io/questdb/log/LogRollingFileWriter.java 8 11 72.73%
🔵 io/questdb/griffin/engine/groupby/DistinctRecordCursorFactory.java 8 11 72.73%
🔵 io/questdb/cutlass/line/tcp/LineTcpNetworkIOJob.java 8 11 72.73%
🔵 io/questdb/std/DirectLongList.java 8 11 72.73%
🔵 io/questdb/griffin/engine/join/HashJoinRecordCursorFactory.java 14 19 73.68%
🔵 io/questdb/griffin/engine/join/HashJoinLightRecordCursorFactory.java 17 23 73.91%
🔵 io/questdb/cairo/TableReaderMetadata.java 9 12 75.00%
🔵 io/questdb/griffin/engine/groupby/vect/GroupByNotKeyedVectorRecordCursorFactory.java 9 12 75.00%
🔵 io/questdb/griffin/engine/join/HashOuterJoinRecordCursorFactory.java 15 20 75.00%
🔵 io/questdb/cutlass/line/tcp/LineTcpReceiver.java 10 13 76.92%
🔵 io/questdb/griffin/engine/functions/window/SumDoubleWindowFunctionFactory.java 26 34 76.47%
🔵 io/questdb/mp/RingQueue.java 19 25 76.00%
🔵 io/questdb/cairo/O3PartitionPurgeJob.java 11 14 78.57%
🔵 io/questdb/cutlass/text/CopyJob.java 11 14 78.57%
🔵 io/questdb/griffin/engine/join/HashOuterJoinFilteredRecordCursorFactory.java 20 25 80.00%
🔵 io/questdb/cairo/ColumnPurgeOperator.java 24 30 80.00%
🔵 io/questdb/TelemetryJob.java 13 16 81.25%
🔵 io/questdb/griffin/SqlOptimiser.java 9 11 81.82%
🔵 io/questdb/griffin/engine/join/HashOuterJoinFilteredLightRecordCursorFactory.java 13 16 81.25%
🔵 io/questdb/griffin/engine/join/LtJoinLightRecordCursorFactory.java 14 17 82.35%
🔵 io/questdb/griffin/engine/join/AsOfJoinLightRecordCursorFactory.java 14 17 82.35%
🔵 io/questdb/griffin/engine/table/LatestBySubQueryRecordCursorFactory.java 15 18 83.33%
🔵 io/questdb/cairo/RecordChain.java 15 18 83.33%
🔵 io/questdb/griffin/engine/join/SpliceJoinLightRecordCursorFactory.java 15 18 83.33%
🔵 io/questdb/cutlass/line/tcp/LineTcpConnectionContext.java 16 19 84.21%
🔵 io/questdb/griffin/engine/table/AbstractDeferredTreeSetRecordCursorFactory.java 16 19 84.21%
🔵 io/questdb/griffin/engine/join/LtJoinRecordCursorFactory.java 16 19 84.21%
🔵 io/questdb/griffin/engine/join/AsOfJoinRecordCursorFactory.java 17 20 85.00%
🔵 io/questdb/cutlass/text/TextLoader.java 17 20 85.00%
🔵 io/questdb/TelemetryConfigLogger.java 20 23 86.96%
🔵 io/questdb/cairo/wal/seq/TableSequencerImpl.java 65 74 87.84%
🔵 io/questdb/griffin/engine/window/CachedWindowRecordCursorFactory.java 50 57 87.72%
🔵 io/questdb/cutlass/text/CopyRequestJob.java 23 26 88.46%
🔵 io/questdb/cairo/ColumnPurgeJob.java 25 28 89.29%
🔵 io/questdb/griffin/engine/groupby/SampleByInterpolateRecordCursorFactory.java 76 83 91.57%
🔵 io/questdb/griffin/SqlCompilerImpl.java 68 74 91.89%
🔵 io/questdb/griffin/SqlCodeGenerator.java 262 280 93.57%
🔵 io/questdb/cutlass/line/tcp/LineTcpMeasurementScheduler.java 56 59 94.92%
🔵 io/questdb/cutlass/http/processors/JsonQueryProcessor.java 48 51 94.12%
🔵 io/questdb/cutlass/text/CsvFileIndexer.java 55 58 94.83%
🔵 io/questdb/cutlass/text/ParallelCsvFileImporter.java 47 50 94.00%
🔵 io/questdb/cairo/map/OrderedMap.java 70 73 95.89%
🔵 io/questdb/cutlass/pgwire/PGConnectionContext.java 79 83 95.18%
🔵 io/questdb/MessageBusImpl.java 81 84 96.43%
🔵 io/questdb/cairo/map/Unordered2Map.java 41 42 97.62%
🔵 io/questdb/griffin/engine/functions/test/TestSumTDoubleGroupByFunction.java 2 2 100.00%
🔵 io/questdb/std/str/Path.java 1 1 100.00%
🔵 io/questdb/griffin/engine/orderby/SortedRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/map/Unordered4Map.java 52 52 100.00%
🔵 io/questdb/cairo/TableReader.java 5 5 100.00%
🔵 io/questdb/griffin/engine/functions/constants/LongConstant.java 2 2 100.00%
🔵 io/questdb/griffin/engine/join/AsOfJoinNoKeyRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/AvgDoubleVectorAggregateFunction.java 5 5 100.00%
🔵 io/questdb/griffin/engine/join/NestedLoopLeftJoinRecordCursorFactory.java 5 5 100.00%
🔵 io/questdb/PropServerConfiguration.java 2 2 100.00%
🔵 io/questdb/std/Misc.java 2 2 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFillNullNotKeyedRecordCursorFactory.java 3 3 100.00%
🔵 io/questdb/griffin/engine/table/VirtualRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/LatestByAllSymbolsFilteredRecordCursor.java 2 2 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/AvgShortVectorAggregateFunction.java 5 5 100.00%
🔵 io/questdb/griffin/model/QueryColumn.java 1 1 100.00%
🔵 io/questdb/cairo/map/Unordered8Map.java 52 52 100.00%
🔵 io/questdb/griffin/engine/table/LatestByAllRecordCursor.java 2 2 100.00%
🔵 io/questdb/cairo/TableUtils.java 2 2 100.00%
🔵 io/questdb/griffin/engine/join/AsOfJoinNoKeyFastRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/griffin/engine/ops/OperationDispatcher.java 3 3 100.00%
🔵 io/questdb/griffin/engine/orderby/SortedLightRecordCursorFactory.java 2 2 100.00%
🔵 io/questdb/cutlass/http/HttpResponseSink.java 1 1 100.00%
🔵 io/questdb/std/ObjectPool.java 1 1 100.00%
🔵 io/questdb/cairo/wal/seq/TableTransactionLogV2.java 4 4 100.00%
🔵 io/questdb/griffin/engine/orderby/LimitedSizeSortedLightRecordCursorFactory.java 2 2 100.00%
🔵 io/questdb/griffin/engine/orderby/LimitedSizeSortedLightRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/join/CrossJoinRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/cairo/BitmapIndexFwdReader.java 1 1 100.00%
🔵 io/questdb/griffin/engine/orderby/SortedLightRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/union/ExceptRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/DefaultCairoConfiguration.java 1 1 100.00%
🔵 io/questdb/griffin/model/QueryModel.java 25 25 100.00%
🔵 io/questdb/griffin/engine/functions/catalogue/WalTransactionsFunctionFactory.java 3 3 100.00%
🔵 io/questdb/mp/WorkerPoolUtils.java 1 1 100.00%
🔵 io/questdb/griffin/model/ExpressionNode.java 1 1 100.00%
🔵 io/questdb/cutlass/http/processors/TextImportProcessor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/functions/rnd/RndStringMemory.java 21 21 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/AvgLongVectorAggregateFunction.java 5 5 100.00%
🔵 io/questdb/griffin/engine/orderby/LimitedSizePartiallySortedLightRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/join/LtJoinNoKeyFastRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/cairo/PartitionBy.java 3 3 100.00%
🔵 io/questdb/griffin/engine/functions/test/TestSumStringGroupByFunction.java 2 2 100.00%
🔵 io/questdb/std/NumericException.java 1 1 100.00%
🔵 io/questdb/cairo/map/Unordered16Map.java 53 53 100.00%
🔵 io/questdb/cairo/CairoEngine.java 25 25 100.00%
🔵 io/questdb/griffin/engine/join/LtJoinNoKeyRecordCursorFactory.java 4 4 100.00%
🔵 io/questdb/Telemetry.java 5 5 100.00%
🔵 io/questdb/griffin/engine/RegisteredRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/cutlass/text/types/InputFormatConfiguration.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/AbstractTreeSetRecordCursorFactory.java 1 1 100.00%
🔵 io/questdb/griffin/engine/union/IntersectCastRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFirstLastRecordCursorFactory.java 23 23 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/AvgIntVectorAggregateFunction.java 5 5 100.00%
🔵 io/questdb/PropertyKey.java 1 1 100.00%
🔵 io/questdb/log/LogFileWriter.java 2 2 100.00%
🔵 io/questdb/cairo/pool/WriterPool.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFillNoneRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/wal/seq/TableSequencerAPI.java 14 14 100.00%
🔵 io/questdb/cairo/AbstractIndexReader.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/vect/GroupByRecordCursorFactory.java 44 44 100.00%
🔵 io/questdb/griffin/engine/union/IntersectRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/sql/NetworkSqlExecutionCircuitBreaker.java 1 1 100.00%
🔵 io/questdb/griffin/engine/union/ExceptCastRecordCursor.java 1 1 100.00%
🔵 io/questdb/mp/Worker.java 1 1 100.00%
🔵 io/questdb/griffin/engine/union/IntersectAllRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/union/ExceptAllRecordCursor.java 1 1 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFillValueRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/TableWriter.java 6 6 100.00%
🔵 io/questdb/mp/OpenBarrier.java 1 1 100.00%
🔵 io/questdb/griffin/engine/table/LatestByAllFilteredRecordCursor.java 2 2 100.00%
🔵 io/questdb/cairo/map/UnorderedVarcharMap.java 39 39 100.00%
🔵 io/questdb/cutlass/http/processors/JsonQueryProcessorState.java 1 1 100.00%
🔵 io/questdb/cairo/IDGenerator.java 2 2 100.00%
🔵 io/questdb/griffin/engine/groupby/SampleByFillPrevRecordCursor.java 1 1 100.00%
🔵 io/questdb/cairo/CairoException.java 3 3 100.00%

@bluestreak01 bluestreak01 merged commit 65f84ec into master May 20, 2024
24 checks passed
@bluestreak01 bluestreak01 deleted the mt_mem-cap branch May 20, 2024 16:28
Labels: Core (related to storage, data type, etc.), Enhancement (enhance existing functionality)
Successfully merging this pull request may close these issues.

NPE in window function with partition by clause