Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41834: [R] Better error handling in dplyr code #41576

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented May 7, 2024

I started out trying to make it so that arrow_eval() could just raise its errors, rather than catch them and have every caller inspect and re-raise. I ended up pulling on this further and ended up refactoring most of the error handling in the dplyr code paths. Summary of changes, from the bottom up:

  • We have two wrappers that raise classed errors: arrow_not_supported() (which previously existed but just called stop()) and validation_error(). They raise arrow_not_supported and validation_error, respectively. Function bindings now raise one or the other, never just stop/abort.
  • arrow_eval() modifies the errors raised by function bindings, inserting the expression as the call attribute of the error, which lets rlang handle the printing cleaner, and catching any non-classed errors and re-raising them as arrow_not_supported or validation_error, as appropriate.
  • New try_arrow_dplyr() wrapper around everything inside (most*) dplyr verb implementations, which only calls abandon_ship() on arrow_not_supported errors, and lets all other errors just raise. For datasets, it just adds an additional note to the error message advising you that you can call collect(). So errors generally bubble up, and each of these wrappers adds some context to the message.

The ultimate results of all of this:

  • We now don't tell people to collect() (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
  • Nicer error printing across the board, using rlang/cli for formatting, and cleaner calls and tracebacks. No more Error: Error : messages.
  • Adds the ability to provide helpful suggestions in error messages in bindings, for cases where there is an alternative available other than just collect(). In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
  • For us, it should be easier to work with arrow_eval() and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).

Some concrete examples:

  1. Invalid input in a binding. Retry with dplyr won't help, so don't automatically do it (if Table) or suggest it (if Dataset).
# Before: 
mtcars |> 
  arrow_table() |> 
  transmute(case_when())
#> Warning: Expression case_when() not supported in Arrow; pulling data into R
#> Error:
#> ℹ In argument: `case_when()`.
#> Caused by error in `case_when()`:
#> ! At least one condition must be supplied.

# After:
mtcars |>
  arrow_table() |>
  transmute(case_when())
#> Error in `case_when()`:
#> ! No cases provided
  1. Dealing with unsupported features outside of the bindings. This example is something that is checked in summarize() but not caught inside arrow_eval() because it's not about the expressions.
# Before:
mtcars |> 
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error: Error : .groups = "rowwise" not supported in Arrow
#> Call collect() first to pull data into R.

# After:
mtcars |>
  InMemoryDataset$create() |> 
  group_by(cyl) |> 
  summarize(mean(hp), .groups = "rowwise")
#> Error in `summarise.arrow_dplyr_query()`:
#> ! .groups = "rowwise" not supported in Arrow
#> → Call collect() first to pull data into R.
  1. When there are ways to solve the issue other than calling collect(), we give the user options:
# After:
mtcars |>
  InMemoryDataset$create() |> 
  transmute(date = as.Date(mpg, tryFormats = c("%Y-%m-%d", "%Y/%m/%d")))
#> Error in `as.Date()`:
#> ! `as.Date()` with multiple `tryFormats` not supported in Arrow
#> → Consider using the lubridate specialised parsing functions `ymd()`, `ymd()`, etc.
#> → Or, call collect() first to pull data into R.

Copy link

github-actions bot commented May 7, 2024

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}

In the case of PARQUET issues on JIRA the title also supports:

PARQUET-${JIRA_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@nealrichardson nealrichardson changed the title WIP [R] Better error handling in dplyr code GH-41834: [R] Better error handling in dplyr code May 26, 2024
Copy link

⚠️ GitHub issue #41834 has been automatically assigned in GitHub to PR creator.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant