Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sql): fix several bugs in aggregation (avg/sum/first_value) window functions #4429

Merged
merged 9 commits into from
May 17, 2024

Conversation

sivukhin
Copy link
Contributor

@sivukhin sivukhin commented Apr 19, 2024

Context

New weekend - new PR to the QuestDB!

I read the doc and noticed nice support for window aggregation functions. I started to experimenting with the queries in the live demo (it's really GREAT invention of QuestDB; I don't think I will ever start to touch the code without this live demo console!) and sample from the doc with both bounds specified started to get weird results:

SELECT symbol, price, timestamp,
       sum(price) OVER (
        PARTITION BY symbol
        ORDER BY timestamp
        RANGE BETWEEN 2 second PRECEDING AND 1 second PRECEDING)
as moving_avg
FROM trades

image

This query doesn't pass basic sanity check which I performed by eye - some values in the moving_agg column were negative (but all prices are positive for sure!). So, there is definitely something wrong with aggregation functions.

Changes

  • I started with unit test implementation because definitely current SQL-based tests are too heavy for testing many edges cases + implement some basic fuzzing functionality. So, you can find WindowFunctionUnitTest.java with 2 fuzz tests for window functions with and without partitioning + bunch of basic unit tests from which I started to investigate the bug
  • From the bug fixes perspective there are following changes:
    • All window aggregation implementations (partitioned and not partitioned) with specified right bound suffered from the bug where values were subtracted from the frame even when they doesn't belong to it (frameSize > 0 check were added)
    • UNBOUNDED partitioned window aggregation had incorrect initialization of the buffer parameters. As code written in such a way that for unbounded case only suffix outside of the frame stored in the buffer - QuestDB need to be careful and differentiate between cases when rangeHi = 0 and rangeHi != 0
    • In case of the buffer resize firstIdx can be updated to zero but it can be suddenly overwritten with some stale value of newFirstIdx
    • Buffer resize for non partitioned queries didn't worked well because of incorrect new capacity value
    • Out of buffer memory access were in the FirstValueOverRowsFrameFunction impl

Questions

  • It seems like QuestDB almost don't use any "containers" for operating with several fields of same data structures (e.g. sum, frameSize, startOffset, ...). Instead, QuestDB code organized in such a way that all these fields "embedded" in the root class which performs actual computation. This approach leads to the following main issues (from my point of view):
    • It's hard to extract functions for some steps of the algorithms
    • Code mutate internal state a lot
  • After investigation of this bug I started to feel that this is not only premature optimization, but also it really hurts during development as these pretty simple bugs can be avoided with another structure of implementation.
  • So, my question is - do QuestDB really need to enforce such style of coding or there are other options (which maybe rely on some clever Java compiler optimization techniques) which will allow to structure code more nicely but also have zero overhead at the runtime?

@sivukhin sivukhin changed the title fix(sql): fix several bugs in aggregation (avg/sum) window functions fix(sql): fix several bugs in aggregation (avg/sum/first_value) window functions Apr 20, 2024
@bluestreak01
Copy link
Member

thank you for the contribution once again, good find!

Fuzz test is very useful, we need more of those. It looks to me though that fuzz test just might be made to use SQL. What do you think? Outside of SQL tests are fragile. For example, you can generate random number of data partitions. If you know partition boundaries, you can run vanilla sum() on these boundaries and compare result with window function. Hope this makes sense?

Regarding fuzz tests you have:

  • we try to emphasise test execution time and reduce iteration count
  • to make test fuzzy each run on of the test uses different random seeds
  • we catch errors over time (fuzz tests can run on CI constantly throughout the day) but we catch wider range of errors

@sivukhin
Copy link
Contributor Author

sivukhin commented Apr 22, 2024

Hi @bluestreak01

I'm not sure if fuzz tests in this specific scenario should be on the SQL level. I see following benefits from current "low-level" implementation of fuzz-tests

  1. Test have full and explicit control of all important parameters for window aggregation function. This fact make it easy to fuzz test aggregation parameters too + it makes harder to forget about some important parameter now or in future (so, when new parameter for some tricky optimization will be added to the constructor - developer will need to add it in the fuzz tests too and likely will think about best set of values for fuzzing).
    • This helped me to identify bug with capacity for range aggregations. Default capacity is pretty large and it makes fuzzing infeasible (because it runs in linear time) - but when I lowered capacity in the constructor fuzzing quickly identified the bug
    • Sure, we can do the same with SQL based tests - but it require more heavy work on the test side and it's actually pretty easy to forget about some tuning of important parameter
  2. Tests run faster when they written in low-level abstractions. Faster tests - more cases for fuzzing.
  3. Tests are easier to debug. In total I think I spend hours debugging different edge cases found by fuzzing in this PR and the fact that all steps (computeNext / getDouble) were explicit in the tests helped me to speed up this process.

But, I agreed that current tests are pretty fragile (at least f.getDouble(null) really bothers me). But maybe instead of making tests SQL-centric we can extract more isolated component for window aggregations and make it more stable in terms of exposed API? At least this component can have very few methods (update(record), get()) without whole bunch of methods from Function interface.

Let me know your opinion about this suggestion and points above, @bluestreak01.


Regarding the general fuzz-tests structure I agreed with you. I guess it's better for me to look at some good example of existing fuzz tests and make mine similar to them. Can you point me to some good references in the codebase?

(a bit of off-topic): also, I kind of like how fuzzing implemented in Go stdlib. On the high level it has 2 modes: test mode & fuzz mode.

  • In test mode fuzzing just take all recorded problems (which stored in the directory near the test definition) and just runs test on them without any brute force or randomization and fails if any of them fails.
  • In fuzz mode test runs for specified duration of time on randomized seeds provided by go stdlib internal machinery. When new edge case is found - it recorded in the directory and test failed.

Not sure if same DX can be achieved in Java (in Go single test can dynamically generate sub-tests, e.g. based on the directory content, which help in the debugging), but I think that 2 modes for fuzz tests can be useful (not sure, maybe QuestDB already have this).

@sivukhin
Copy link
Contributor Author

@bluestreak01, what do you think?

@sivukhin
Copy link
Contributor Author

sivukhin commented May 4, 2024

BTW, we can extract fuzz test out from this branch and merge patch of window functions. And continue work/discussion on fuzz tests in separate PR.

@bluestreak01, @nwoolmer - thoughts?

@bluestreak01
Copy link
Member

hi Nikita, can you keep the fuzz test but start it from random seed TestUtils.generateRandom() and reduce the number of iterations to get test run much faster. We're going to lean towards the number of times CI runs your test (which will be a lot) to go thru the variants.

The fix itself is very good and useful!

@sivukhin
Copy link
Contributor Author

sivukhin commented May 4, 2024

@bluestreak01, yes, sure. Done. Now all tests in WindowFunctionUnitTest passing within ~7 seconds (so it's approximately ~1.5sec per fuzz test).

Is it enough or should I reduce it even further?

@bluestreak01
Copy link
Member

tests can be optimised further, to make execution time neglegable

Could you also format the modified files using IntelliJ formatter?

@nwoolmer nwoolmer added SQL Issues or changes relating to SQL execution ready for review labels May 9, 2024
@sivukhin
Copy link
Contributor Author

@bluestreak01, what blocks us from merging the branch 😛 ?

@bluestreak01
Copy link
Member

PRs from forks run limited CI unfortunately. We're still dealing with the aftermath of merging previous PRs. Perhaps for the next release. We need to figure out a way to run full CI.

@sivukhin
Copy link
Contributor Author

Oh, ok, got it( That's fine.

@bluestreak01 bluestreak01 merged commit d13c4e1 into questdb:master May 17, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for review SQL Issues or changes relating to SQL execution
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants