fix(sql): fix several bugs in aggregation (avg/sum/first_value) window functions #4429

sivukhin · 2024-04-19T23:03:29Z

Context

New weekend, new PR to QuestDB!

I read the docs and noticed the nice support for window aggregation functions. I started experimenting with queries in the live demo (it's a really GREAT invention of QuestDB; I don't think I would ever start touching the code without this live demo console!), and the sample from the docs with both bounds specified started to return weird results:

SELECT symbol, price, timestamp,
       sum(price) OVER (
           PARTITION BY symbol
           ORDER BY timestamp
           RANGE BETWEEN 2 second PRECEDING AND 1 second PRECEDING
       ) AS moving_avg
FROM trades

This query doesn't pass a basic sanity check, which I performed by eye: some values in the moving_avg column were negative (but all prices are positive for sure!). So there is definitely something wrong with the aggregation functions.
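For reference, the frame used in the query above can be checked with a brute-force computation. The sketch below is an illustration only and is not QuestDB code: the class and method names are hypothetical, timestamps are assumed to be strictly increasing microsecond values within a single partition, and an empty frame is represented as NaN (an assumption standing in for SQL NULL).

```java
import java.util.Arrays;

// Hypothetical reference implementation, not part of QuestDB.
final class RangeFrameReference {
    // For each row i, sums price[j] over rows whose timestamp lies in
    // [ts[i] - lo, ts[i] - hi], where lo >= hi >= 0 (same unit as ts).
    static double[] sumOverRange(long[] ts, double[] price, long lo, long hi) {
        double[] out = new double[ts.length];
        for (int i = 0; i < ts.length; i++) {
            double sum = 0;
            int rows = 0;
            for (int j = 0; j <= i; j++) {
                if (ts[j] >= ts[i] - lo && ts[j] <= ts[i] - hi) {
                    sum += price[j];
                    rows++;
                }
            }
            // Assumption: NaN stands in for NULL when the frame is empty.
            out[i] = rows == 0 ? Double.NaN : sum;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ts = {0L, 1_000_000L, 2_000_000L, 3_500_000L};
        double[] price = {10, 20, 30, 40};
        // RANGE BETWEEN 2 second PRECEDING AND 1 second PRECEDING
        double[] out = sumOverRange(ts, price, 2_000_000L, 1_000_000L);
        System.out.println(Arrays.toString(out)); // [NaN, 10.0, 30.0, 30.0]
    }
}
```

With strictly positive prices, every non-empty frame must sum to a positive value, which is exactly the property the query result violated.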

Changes

I started by implementing unit tests, because the current SQL-based tests are definitely too heavy for testing many edge cases, and I also implemented some basic fuzzing functionality. So you can find WindowFunctionUnitTest.java with 2 fuzz tests for window functions (with and without partitioning) plus a bunch of basic unit tests from which I started to investigate the bug.
From the bug-fix perspective, there are the following changes:
- All window aggregation implementations (partitioned and not partitioned) with a specified right bound suffered from a bug where values were subtracted from the frame even when they didn't belong to it (a frameSize > 0 check was added)
- UNBOUNDED partitioned window aggregation had incorrect initialization of the buffer parameters. Because the code is written in such a way that, for the unbounded case, only the suffix outside of the frame is stored in the buffer, QuestDB needs to be careful and differentiate between the cases rangeHi = 0 and rangeHi != 0
- On buffer resize, firstIdx can be updated to zero, but it could then be overwritten with a stale value of newFirstIdx
- Buffer resize for non-partitioned queries didn't work well because of an incorrect new capacity value
- There was an out-of-bounds buffer memory access in the FirstValueOverRowsFrameFunction impl
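To illustrate the first item, here is a minimal sketch of an incremental RANGE-frame sum in which a row can be evicted from the buffer without ever having entered the frame. It is not the QuestDB implementation: the class name, the growable arrays (used instead of a ring buffer) and the NaN-for-empty-frame convention are all assumptions made for the example. Only the frameSize > 0 guard mirrors the check described above.

```java
import java.util.Arrays;

// Hypothetical sketch, not QuestDB code.
final class RangeFrameSum {
    private final long lo;      // frame starts at (current ts - lo)
    private final long hi;      // frame ends at (current ts - hi), lo >= hi >= 0
    private long[] ts = new long[16];
    private double[] val = new double[16];
    private int size;           // rows stored so far
    private int first;          // oldest row still within lo of the current row
    private int frameSize;      // rows [first, first + frameSize) are in sum
    private double sum;

    RangeFrameSum(long lo, long hi) {
        this.lo = lo;
        this.hi = hi;
    }

    double next(long t, double v) {
        if (size == ts.length) {
            ts = Arrays.copyOf(ts, size * 2);
            val = Arrays.copyOf(val, size * 2);
        }
        ts[size] = t;
        val[size] = v;
        size++;
        // Evict rows that fell behind the left bound. A row is subtracted
        // only if it was actually added to the sum earlier.
        while (first < size && ts[first] < t - lo) {
            if (frameSize > 0) {
                sum -= val[first];
                frameSize--;
            }
            first++;
        }
        // Admit buffered rows that have now reached the right bound.
        while (first + frameSize < size && ts[first + frameSize] <= t - hi) {
            sum += val[first + frameSize];
            frameSize++;
        }
        return frameSize == 0 ? Double.NaN : sum;
    }

    public static void main(String[] args) {
        RangeFrameSum f = new RangeFrameSum(2_000_000L, 1_000_000L);
        f.next(0L, 5);
        // The row at ts=0 is evicted here without ever being in the frame.
        System.out.println(f.next(10_000_000L, 7)); // NaN
    }
}
```

Without the guard, the second call in main would subtract 5 from an empty sum and report a negative value for strictly positive inputs, which is the kind of symptom described in the Context section.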

Questions

It seems like QuestDB almost never uses any "containers" for working with several fields of the same data structure (e.g. sum, frameSize, startOffset, ...). Instead, the QuestDB code is organized in such a way that all these fields are "embedded" in the root class which performs the actual computation. This approach leads to the following main issues (from my point of view):
- It's hard to extract functions for individual steps of the algorithms
- The code mutates internal state a lot
After investigating this bug I started to feel that this is not only premature optimization, but that it also really hurts during development, as these pretty simple bugs could be avoided with a different structure of the implementation.
So my question is: does QuestDB really need to enforce this style of coding, or are there other options (which maybe rely on some clever Java compiler optimization techniques) that would allow the code to be structured more nicely while still having zero overhead at runtime?

bluestreak01 · 2024-04-21T23:31:09Z

thank you for the contribution once again, good find!

The fuzz test is very useful; we need more of those. It looks to me, though, that the fuzz test might be made to use SQL. What do you think? Outside of SQL, tests are fragile. For example, you can generate a random number of data partitions. If you know the partition boundaries, you can run vanilla sum() on these boundaries and compare the result with the window function. Hope this makes sense?

Regarding the fuzz tests you have:

- we try to emphasise test execution time and reduce the iteration count
- to make the test fuzzy, each run of the test uses different random seeds
- we catch errors over time (fuzz tests can run on CI constantly throughout the day), but we catch a wider range of errors
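The points above can be sketched as a small seeded fuzz loop: a fresh seed per run, few iterations, and the seed included in any failure message so the case can be replayed. This is only an illustration using java.util.Random. It does not use QuestDB's test utilities, and the class name, the ROWS-style frame and the prefix-sum implementation under test are assumptions chosen to keep the example self-contained.

```java
import java.util.Random;

// Hypothetical sketch, not QuestDB test code.
final class SeededFuzzSketch {
    // Reference: sum of v[max(0, i - lo) .. i - hi], NaN for an empty frame.
    static double brute(double[] v, int i, int lo, int hi) {
        double sum = 0;
        int rows = 0;
        for (int j = Math.max(0, i - lo); j <= i - hi; j++) {
            sum += v[j];
            rows++;
        }
        return rows == 0 ? Double.NaN : sum;
    }

    // Implementation under test: same frame, computed from prefix sums.
    static double fast(double[] prefix, int i, int lo, int hi) {
        int from = Math.max(0, i - lo);
        int to = i - hi;
        return to < from ? Double.NaN : prefix[to + 1] - prefix[from];
    }

    static void run(long seed, int iterations) {
        Random rnd = new Random(seed);
        for (int it = 0; it < iterations; it++) {
            int n = 1 + rnd.nextInt(32);
            int hi = rnd.nextInt(4);
            int lo = hi + rnd.nextInt(4);
            double[] v = new double[n];
            double[] prefix = new double[n + 1];
            for (int i = 0; i < n; i++) {
                v[i] = rnd.nextInt(100); // small integers keep sums exact
                prefix[i + 1] = prefix[i] + v[i];
            }
            for (int i = 0; i < n; i++) {
                double expected = brute(v, i, lo, hi);
                double actual = fast(prefix, i, lo, hi);
                if (Double.compare(expected, actual) != 0) {
                    throw new AssertionError("seed=" + seed + " iteration=" + it
                            + " row=" + i + " expected=" + expected + " actual=" + actual);
                }
            }
        }
    }

    public static void main(String[] args) {
        long seed = System.nanoTime(); // different seed on every run
        System.out.println("seed=" + seed);
        run(seed, 100);                // few iterations; CI repetition adds coverage
    }
}
```

The trade-off is the one described above: a single run explores little, but many cheap runs with different seeds cover a wide range of cases over time.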

sivukhin · 2024-04-22T06:54:22Z

Hi @bluestreak01

I'm not sure that fuzz tests in this specific scenario should be at the SQL level. I see the following benefits of the current "low-level" implementation of the fuzz tests:

- Tests have full and explicit control of all important parameters of the window aggregation function. This makes it easy to fuzz the aggregation parameters too, and it makes it harder to forget about some important parameter now or in the future (so when a new parameter for some tricky optimization is added to the constructor, the developer will need to add it to the fuzz tests too, and will likely think about the best set of values for fuzzing).
  - This helped me to identify the bug with capacity for range aggregations. The default capacity is pretty large, which makes fuzzing infeasible (because it runs in linear time), but when I lowered the capacity in the constructor, fuzzing quickly identified the bug
  - Sure, we can do the same with SQL-based tests, but it requires heavier work on the test side and it's actually pretty easy to forget to tune some important parameter
- Tests run faster when they are written against low-level abstractions. Faster tests mean more cases for fuzzing.
- Tests are easier to debug. In total I think I spent hours debugging different edge cases found by fuzzing in this PR, and the fact that all steps (computeNext / getDouble) were explicit in the tests helped me to speed up this process.

But I agree that the current tests are pretty fragile (at least f.getDouble(null) really bothers me). But maybe, instead of making the tests SQL-centric, we can extract a more isolated component for window aggregations and make it more stable in terms of its exposed API? At the least, this component can have very few methods (update(record), get()) without the whole bunch of methods from the Function interface.
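As a rough illustration of what such an isolated component could look like, here is a sketch with just the two methods mentioned above. The interface and class names are hypothetical, update takes a plain double instead of a record to keep the example self-contained, and the running average over an unbounded frame is only a stand-in for a real window aggregation.

```java
// Hypothetical narrow API for a window aggregation; not an existing QuestDB type.
interface WindowAggregate {
    // Feed the next row's value into the aggregation.
    void update(double value);

    // Current aggregate for the most recently fed row.
    double get();
}

// Example implementation: average over all rows seen so far.
final class RunningAvg implements WindowAggregate {
    private double sum;
    private long count;

    @Override
    public void update(double value) {
        sum += value;
        count++;
    }

    @Override
    public double get() {
        // Assumption: NaN stands in for NULL before any row is seen.
        return count == 0 ? Double.NaN : sum / count;
    }

    public static void main(String[] args) {
        WindowAggregate avg = new RunningAvg();
        avg.update(2.0);
        avg.update(4.0);
        System.out.println(avg.get()); // 3.0
    }
}
```

A test written against an interface this small does not need to pass a null record to a getter, which is the fragility noted above.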

Let me know your opinion about this suggestion and points above, @bluestreak01.

Regarding the general fuzz test structure, I agree with you. I guess it's better for me to look at some good examples of existing fuzz tests and make mine similar to them. Can you point me to some good references in the codebase?

(a bit off-topic): also, I kind of like how fuzzing is implemented in the Go stdlib. At a high level it has 2 modes: test mode and fuzz mode.

- In test mode, fuzzing just takes all recorded problems (which are stored in a directory near the test definition), runs the test on them without any brute force or randomization, and fails if any of them fails.
- In fuzz mode, the test runs for a specified duration on randomized seeds provided by the Go stdlib's internal machinery. When a new edge case is found, it is recorded in the directory and the test fails.

I'm not sure if the same DX can be achieved in Java (in Go a single test can dynamically generate sub-tests, e.g. based on the directory content, which helps with debugging), but I think that 2 modes for fuzz tests can be useful (not sure, maybe QuestDB already has this).

sivukhin · 2024-04-24T09:40:00Z

@bluestreak01, what do you think?

sivukhin · 2024-05-04T17:00:47Z

BTW, we can extract the fuzz test from this branch and merge the window function patch, then continue the work/discussion on fuzz tests in a separate PR.

@bluestreak01, @nwoolmer - thoughts?

bluestreak01 · 2024-05-04T18:15:25Z

hi Nikita, can you keep the fuzz test but start it from a random seed, TestUtils.generateRandom(), and reduce the number of iterations so the test runs much faster? We're going to lean on the number of times CI runs your test (which will be a lot) to go through the variants.

The fix itself is very good and useful!

sivukhin · 2024-05-04T18:58:27Z

@bluestreak01, yes, sure. Done. Now all tests in WindowFunctionUnitTest pass within ~7 seconds (so it's approximately 1.5 sec per fuzz test).

Is it enough or should I reduce it even further?

bluestreak01 · 2024-05-07T12:13:33Z

tests can be optimised further, to make execution time negligible

Could you also format the modified files using IntelliJ formatter?

sivukhin · 2024-05-13T12:55:17Z

@bluestreak01, what blocks us from merging the branch 😛 ?

bluestreak01 · 2024-05-13T13:12:17Z

PRs from forks run limited CI unfortunately. We're still dealing with the aftermath of merging previous PRs. Perhaps for the next release. We need to figure out a way to run full CI.

sivukhin · 2024-05-13T13:45:40Z

Oh, ok, got it( That's fine.

sivukhin changed the title ~~fix(sql): fix several bugs in aggregation (avg/sum) window functions~~ fix(sql): fix several bugs in aggregation (avg/sum/first_value) window functions Apr 20, 2024

sivukhin added 4 commits May 7, 2024 02:39

fix(sql): fix several bugs in aggregation (avg/sum) window functions

0a80f47

fix(sql): add more tests and fix bug in FirstValueOverRowsFrameFuncti…

651e8be

…on impl

fix: adjust docstring comment for SwarUtils

85152c0

randomize fuzz tests on each run and speed them up

5e15f43

- total run time of WindowFunctionUnitTest now ~7sec (on my local laptop)

sivukhin force-pushed the eval/fix-window-agg-bugs branch from dbd058a to 5e15f43 Compare May 6, 2024 22:39

bluestreak01 reviewed May 7, 2024

core/src/test/java/io/questdb/test/griffin/engine/window/WindowFunctionUnitTest.java (outdated)

nwoolmer added the SQL and ready for review labels May 9, 2024

sivukhin and others added 5 commits May 11, 2024 01:39

simplify fuzz test a bit

cdeedcc

reformat

ff163cc

Merge branch 'master' into eval/fix-window-agg-bugs

652accf

Merge branch 'master' into eval/fix-window-agg-bugs

9e43e5c

Merge branch 'master' into eval/fix-window-agg-bugs

d7f34af

bluestreak01 approved these changes May 17, 2024

bluestreak01 merged commit d13c4e1 into questdb:master May 17, 2024
17 checks passed
