GH-41834: [R] Better error handling in dplyr code #41576

nealrichardson · 2024-05-07T15:35:27Z

I started out trying to make it so that arrow_eval() could just raise its errors, rather than catch them and have every caller inspect and re-raise. I ended up pulling on this further and ended up refactoring most of the error handling in the dplyr code paths. Summary of changes, from the bottom up:

We have two wrappers that raise classed errors: arrow_not_supported() (which previously existed but just called stop()) and validation_error(). They raise arrow_not_supported and validation_error, respectively. Function bindings now raise one or the other, never just stop/abort.
arrow_eval() modifies the errors raised by function bindings, inserting the expression as the call attribute of the error, which lets rlang handle the printing cleaner, and catching any non-classed errors and re-raising them as arrow_not_supported or validation_error, as appropriate.
New try_arrow_dplyr() wrapper around everything inside (most*) dplyr verb implementations, which only calls abandon_ship() on arrow_not_supported errors, and lets all other errors just raise. For datasets, it just adds an additional note to the error message advising you that you can call collect(). So errors generally bubble up, and each of these wrappers adds some context to the message.

The ultimate results of all of this:

We now don't tell people to collect() (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
Nicer error printing across the board, using rlang/cli for formatting, and cleaner calls and tracebacks. No more Error: Error : messages.
Adds the ability to provide helpful suggestions in error messages in bindings, for cases where there is an alternative available other than just collect(). In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
For us, it should be easier to work with arrow_eval() and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).

Some concrete examples:

Invalid input in a binding. Retry with dplyr won't help, so don't automatically do it (if Table) or suggest it (if Dataset).

# Before: 
mtcars |> 
  arrow_table() |> 
  transmute(case_when())
#> Warning: Expression case_when() not supported in Arrow; pulling data into R
#> Error:
#> ℹ In argument: `case_when()`.
#> Caused by error in `case_when()`:
#> ! At least one condition must be supplied.

# After:
mtcars |>
  arrow_table() |>
  transmute(case_when())
#> Error in `case_when()`:
#> ! No cases provided

Dealing with unsupported features outside of the bindings. This example is something that is checked in summarize() but not caught inside arrow_eval() because it's not about the expressions.

# Before:
mtcars |> 
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error: Error : .groups = "rowwise" not supported in Arrow
#> Call collect() first to pull data into R.

# After:
mtcars |>
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error in `summarise.arrow_dplyr_query()`:
#> ! .groups = "rowwise" not supported in Arrow
#> → Call collect() first to pull data into R.

When there are ways to solve the issue other than calling collect(), we give the user options:

# After:
mtcars |>
  InMemoryDataset$create() |> 
  transmute(date = as.Date(mpg, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")))
#> Error in `as.Date()`:
#> ! `as.Date()` with multiple `tryFormats` not supported in Arrow
#> → Consider using the lubridate specialised parsing functions `ymd()`, `ymd()`, etc.
#> → Or, call collect() first to pull data into R.

GitHub Issue: [R] Improve error handling in the dplyr NSE code #41834

github-actions · 2024-05-07T15:35:52Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

…apping; make arrow_eval error

…add docs

github-actions · 2024-05-26T19:59:42Z

⚠️ GitHub issue #41834 has been automatically assigned in GitHub to PR creator.

github-actions bot added Component: R Component: Documentation awaiting review Awaiting review labels May 7, 2024

nealrichardson added 9 commits May 26, 2024 15:33

arrow_not_supported() raises a classed error

5166cf7

Better distinguish invalid from not supported; add try_arrow_dplyr wr…

4e2c735

…apping; make arrow_eval error

More classed error raising

fefbc0c

Use cli formatting so we can add bullets; wrap more in try_arrow_dplyr

be79caa

Rename error functions. Start updating tests

66ff9c0

Handle assert_that and match.arg; implement alternative suggestions; …

670e9e5

…add docs

Add some tests, fix some tests

0bda4e3

More test updating

36bf23a

Add more direct tests of dplyr-eval; update remaining expectations

0cd2ff3

nealrichardson force-pushed the better-errors branch from 178648a to 0cd2ff3 Compare May 26, 2024 19:34

nealrichardson marked this pull request as ready for review May 26, 2024 19:34

nealrichardson requested review from paleolimbot and thisisnic as code owners May 26, 2024 19:34

Add some cases with alternatives suggested

df9d081

nealrichardson mentioned this pull request May 26, 2024

[R] Improve error handling in the dplyr NSE code #41834

Open

nealrichardson changed the title ~~WIP [R] Better error handling in dplyr code~~ GH-41834: [R] Better error handling in dplyr code May 26, 2024

nealrichardson requested a review from jonkeane May 26, 2024 19:59

Tidy up match.call

4df3a7d

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GH-41834: [R] Better error handling in dplyr code #41576

GH-41834: [R] Better error handling in dplyr code #41576

nealrichardson commented May 7, 2024 •

edited by github-actions bot

github-actions bot commented May 7, 2024

github-actions bot commented May 26, 2024

GH-41834: [R] Better error handling in dplyr code #41576

Are you sure you want to change the base?

GH-41834: [R] Better error handling in dplyr code #41576

Conversation

nealrichardson commented May 7, 2024 • edited by github-actions bot

github-actions bot commented May 7, 2024

github-actions bot commented May 26, 2024

nealrichardson commented May 7, 2024 •

edited by github-actions bot