[C++][Python] Sporadic asof join test failure #40675

Closed
pitrou opened this issue Mar 19, 2024 · 5 comments · Fixed by #41614

Comments

@pitrou
Member

pitrou commented Mar 19, 2024

Describe the bug, including details regarding any error messages, version, and platform.

I sporadically get this failure when running the PyArrow tests locally:

__________________________________________________________________________ test_table_join_asof ___________________________________________________________________________
Traceback (most recent call last):
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/runner.py", line 340, in from_call
    result: Optional[TResult] = func()
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/runner.py", line 240, in <lambda>
    lambda: runtest_hook(item=item, **kwds), when=when, reraise=reraise
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 181, in _multicall
    return outcome.get_result()
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_result.py", line 99, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 166, in _multicall
    teardown.throw(outcome._exception)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/threadexception.py", line 87, in pytest_runtest_call
    yield from thread_exception_runtest_hook()
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/threadexception.py", line 63, in thread_exception_runtest_hook
    yield
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 166, in _multicall
    teardown.throw(outcome._exception)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 90, in pytest_runtest_call
    yield from unraisable_exception_runtest_hook()
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/unraisableexception.py", line 65, in unraisable_exception_runtest_hook
    yield
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 166, in _multicall
    teardown.throw(outcome._exception)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/logging.py", line 849, in pytest_runtest_call
    yield from self._runtest_for(item, "call")
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/logging.py", line 832, in _runtest_for
    yield
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 166, in _multicall
    teardown.throw(outcome._exception)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/capture.py", line 883, in pytest_runtest_call
    return (yield)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 166, in _multicall
    teardown.throw(outcome._exception)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/skipping.py", line 256, in pytest_runtest_call
    return (yield)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/runner.py", line 182, in pytest_runtest_call
    raise e
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/runner.py", line 172, in pytest_runtest_call
    item.runtest()
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/python.py", line 1772, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_hooks.py", line 501, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_manager.py", line 119, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 138, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/pluggy/_callers.py", line 102, in _multicall
    res = hook_impl.function(*args)
  File "/home/antoine/mambaforge/envs/pyarrow/lib/python3.10/site-packages/_pytest/python.py", line 195, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/home/antoine/arrow/dev/python/pyarrow/tests/test_table.py", line 2800, in test_table_join_asof
    assert r.combine_chunks() == pa.table({
AssertionError: assert pyarrow.Table\ncolA: int64\ncol2: string\ncolC: double\n----\ncolA: [[1,1,5,6,7]]\ncol2: [["a","b","a","b","f"]]\ncolC: [[null,null,null,null,null]] == pyarrow.Table\ncolA: int64\ncol2: string\ncolC: double\n----\ncolA: [[1,1,5,6,7]]\ncol2: [["a","b","a","b","f"]]\ncolC: [[1,null,null,null,null]]
  
  Full diff:
    pyarrow.Table
    colA: int64
    col2: string
    colC: double
    ----
    colA: [[1,1,5,6,7]]
    col2: [["a","b","a","b","f"]]
  - colC: [[1,null,null,null,null]]
  ?         ^
  + colC: [[null,null,null,null,null]]
  ?         ^^^^
========================================================================= short test summary info =========================================================================
FAILED pyarrow/tests/test_table.py::test_table_join_asof - assert pyarrow.Table\ncolA: int64\ncol2: string\ncolC: double\n----\ncolA: [[1,1,5,6,7]]\ncol2: [["a","b","a","b","f"]]\ncolC: [[null,null,null,null,null]] == pyarrow...

Component(s)

C++, Python

@pitrou
Member Author

pitrou commented Mar 19, 2024

cc @JerAguilon

@pitrou changed the title from "[Python] Sporadic asof join test failure" to "[C++][Python] Sporadic asof join test failure" on Mar 19, 2024
@jorisvandenbossche
Member

This also happened in one nightly build in the last run: https://github.com/ursacomputing/crossbow/actions/runs/8351957186/job/22861244165

@zanmato1984
Collaborator

Reproduced in C++. The test below fails roughly once in a few dozen runs.

TEST(AsofJoinTest, Flaky) {
  // Left input: colA is the "on" key, col2 is the "by" key.
  std::vector<TypeHolder> left_types = {int64(), utf8()};
  auto left_batch = ExecBatchFromJSON(
      left_types, R"([[1, "a"], [1, "b"], [5, "a"], [6, "b"], [7, "f"]])");
  // Right input: colB is the "on" key, col3 is the "by" key, colC is the payload.
  std::vector<TypeHolder> right_types = {int64(), utf8(), float64()};
  auto right_batch =
      ExecBatchFromJSON(right_types, R"([[2, "a", 1.0], [9, "b", 3.0], [15, "g", 5.0]])");

  Declaration left{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("colA", int64()), field("col2", utf8())}),
                                 {std::move(left_batch)})};
  Declaration right{
      "exec_batch_source",
      ExecBatchSourceNodeOptions(schema({field("colB", int64()), field("col3", utf8()),
                                         field("colC", float64())}),
                                 {std::move(right_batch)})};
  // Join colA against colB, grouped by col2/col3, with a tolerance of 1.
  AsofJoinNodeOptions asof_join_opts({{{"colA"}, {{"col2"}}}, {{"colB"}, {{"col3"}}}}, 1);
  Declaration asof_join{"asofjoin", {left, right}, asof_join_opts};

  ASSERT_OK_AND_ASSIGN(auto result, DeclarationToExecBatches(asof_join));

  // Only the left row (1, "a") is expected to pick up a colC value (1.0);
  // every other row should get null. The flaky runs return null for it too.
  std::vector<TypeHolder> exp_types = {int64(), utf8(), float64()};
  auto exp_batch = ExecBatchFromJSON(
      exp_types,
      R"([[1, "a", 1.0], [1, "b", null], [5, "a", null], [6, "b", null], [7, "f", null]])");
  AssertExecBatchesEqualIgnoringOrder(result.schema, {exp_batch}, result.batches);
}
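
For readers checking the expected batch by hand, here is a minimal single-threaded sketch of the join semantics, not the Acero implementation. It assumes the behaviour described in the PyArrow `Table.join_asof` documentation for a non-negative tolerance (a right row matches when it shares the "by" key and `0 <= right.on - left.on <= tolerance`, and the nearest such row wins). `LeftRow`, `RightRow` and `NaiveAsofJoin` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical row types mirroring the columns of the repro above.
struct LeftRow {
  int64_t on;      // colA
  std::string by;  // col2
};
struct RightRow {
  int64_t on;      // colB
  std::string by;  // col3
  double value;    // colC
};

// Naive O(n*m) reference for a future as-of join with non-negative tolerance.
// Assumption: a right row matches a left row when the "by" keys are equal and
// 0 <= right.on - left.on <= tolerance; the nearest matching row is chosen.
std::vector<std::optional<double>> NaiveAsofJoin(const std::vector<LeftRow>& left,
                                                 const std::vector<RightRow>& right,
                                                 int64_t tolerance) {
  assert(tolerance >= 0);  // this sketch does not cover negative tolerances
  std::vector<std::optional<double>> out;
  out.reserve(left.size());
  for (const auto& l : left) {
    std::optional<double> match;            // stays empty (null) if nothing matches
    int64_t best_distance = tolerance + 1;  // sentinel just outside the allowed range
    for (const auto& r : right) {
      const int64_t distance = r.on - l.on;
      if (r.by == l.by && distance >= 0 && distance < best_distance) {
        best_distance = distance;
        match = r.value;
      }
    }
    out.push_back(match);
  }
  return out;
}
```

Under these assumptions, feeding in the left and right rows from the repro with a tolerance of 1 gives a value only for the left row `(1, "a")`, which is what the expected batch asserts.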

@zanmato1984
Collaborator

This produces wrong results, so I'm adding the critical label.

zanmato1984 added a commit to zanmato1984/arrow that referenced this issue May 10, 2024
zanmato1984 self-assigned this May 10, 2024
@zanmato1984
Collaborator

Duplicate of #41149 and will be fixed by #41614.

pitrou pushed a commit that referenced this issue May 14, 2024
### Rationale for this change

Sporadic asof join test failures have been frequently and annoyingly observed in pyarrow CI, as recorded in #40675 and #41149.

It turns out both issues share the same root cause: a logical race (as opposed to a physical data race, which sanitizers can detect). Injecting delays at various places in the asof join, as shown in zanmato1984@ea3b24c, reproduces the issue almost 100% of the time. That commit also includes a description of how the race happens.

### What changes are included in this PR?

Eliminate the logical race of emptiness by combining multiple call-sites of `Empty()`.
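
To illustrate what a "logical race of emptiness" can look like in general, here is a minimal sketch. It is not the Arrow code: `BatchQueue`, `StepView`, `RacyStep` and `SnapshotStep` are hypothetical, and the `between` callback stands in for a producer thread that happens to run between two checks (similar in spirit to injecting a delay). Every queue call is protected by a mutex, so there is no data race for a sanitizer to report, yet two separate `Empty()` calls can still disagree.

```cpp
#include <deque>
#include <functional>
#include <mutex>

// Hypothetical queue; each individual call is thread-safe.
class BatchQueue {
 public:
  void Push(int value) {
    std::lock_guard<std::mutex> lock(mu_);
    items_.push_back(value);
  }
  bool Empty() const {
    std::lock_guard<std::mutex> lock(mu_);
    return items_.empty();
  }

 private:
  mutable std::mutex mu_;
  std::deque<int> items_;
};

// What one consumer step concluded about the input. Exactly one of the two
// flags should be true for the step to be self-consistent.
struct StepView {
  bool skipped_as_empty;  // "this input has nothing, ignore it"
  bool consumed_row;      // "this input has a row, use it"
};

// Racy variant: two independent Empty() calls. If a producer pushes between
// them, the step both skips the input and consumes from it.
StepView RacyStep(BatchQueue& queue, const std::function<void()>& between) {
  StepView view{};
  view.skipped_as_empty = queue.Empty();  // call site 1
  between();                              // producer may run here
  view.consumed_row = !queue.Empty();     // call site 2
  return view;
}

// Snapshot variant: observe emptiness once and derive both decisions from
// that single observation. A row pushed afterwards is left for the next step.
StepView SnapshotStep(BatchQueue& queue, const std::function<void()>& between) {
  const bool empty = queue.Empty();  // single call site
  between();
  return StepView{empty, !empty};
}
```

Calling `RacyStep` with a callback that pushes one item yields a view where both flags are true, while `SnapshotStep` under the same interleaving stays consistent and simply picks the new item up on a later step.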

### Are these changes tested?

Yes. A unit test that reproduces the issue is included.

### Are there any user-facing changes?

None.

**This PR contains a "Critical Fix".**
In #40675 and #41149, incorrect results are produced.
* GitHub Issue: #41149 
* Also closes #40675

Authored-by: Ruoxi Sun <zanmato1984@gmail.com>
Signed-off-by: Antoine Pitrou <antoine@python.org>
vibhatha pushed a commit to vibhatha/arrow that referenced this issue May 25, 2024