GH-41834: [R] Better error handling in dplyr code #41576
I started out trying to make it so that `arrow_eval()` could just raise its errors, rather than catch them and have every caller inspect and re-raise. I kept pulling on this thread and ended up refactoring most of the error handling in the dplyr code paths. Summary of changes, from the bottom up:

- `arrow_not_supported()` (which previously existed but just called `stop()`) and `validation_error()`. They raise `arrow_not_supported` and `validation_error` errors, respectively. Function bindings now raise one or the other, never just stop/abort.
- `arrow_eval()` modifies the errors raised by function bindings, inserting the expression as the `call` attribute of the error, which lets `rlang` handle the printing more cleanly. It also catches any non-classed errors and re-raises them as `arrow_not_supported` or `validation_error`, as appropriate.
- A `try_arrow_dplyr()` wrapper around everything inside (most*) dplyr verb implementations, which only calls `abandon_ship()` on `arrow_not_supported` errors, and lets all other errors just raise. For datasets, it just adds an additional note to the error message advising you that you can call `collect()`.

So errors generally bubble up, and each of these wrappers adds some context to the message.

The ultimate results of all of this:
- We no longer suggest that you call `collect()` (or, if on in-memory data, just do it) in cases where it would also fail in regular dplyr because the input is invalid.
- No more `Error: Error :` messages.
- […] `collect()`. In fact, if there are suggestions with the ">" (arrow) bullet, we don't just add "Call collect()", we say "Or, call collect()".
- […] `arrow_eval()` and the dplyr verbs in general. There's less bookkeeping you have to do to catch and rethrow errors, and it's consistent across the various parts of the evaluation (i.e. the same thing works inside the dplyr verbs as in the bindings).

Some concrete examples:
- […] `summarize()` but not caught inside `arrow_eval()` because it's not about the expressions.
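To make the layering in the summary of changes easier to picture, here is a rough base-R sketch of the three levels: classed errors, an evaluation wrapper that attaches the expression as the `call`, and a verb-level wrapper that only falls back on `arrow_not_supported`. Every function with a `_sketch` suffix is a hypothetical stand-in, not the code in this PR. The note text, the `fallback` argument (standing in for `abandon_ship()`), and the choice to treat every unclassed error as a `validation_error` are assumptions made for illustration only.

```r
# Hypothetical stand-ins for the helpers described in this PR.
# These are NOT the arrow package's implementations.

# Build and raise an error condition carrying an extra class.
raise_classed <- function(class, msg, call = NULL) {
  cond <- structure(
    class = c(class, "error", "condition"),
    list(message = msg, call = call)
  )
  stop(cond)
}

arrow_not_supported_sketch <- function(msg, call = NULL) {
  raise_classed("arrow_not_supported", paste(msg, "not supported in Arrow"), call)
}

validation_error_sketch <- function(msg, call = NULL) {
  raise_classed("validation_error", msg, call)
}

# Layer 2: run a binding, attach the user's expression as the call,
# and give unclassed errors a class before re-raising.
arrow_eval_sketch <- function(expr, binding) {
  tryCatch(
    binding(),
    arrow_not_supported = function(e) {
      e$call <- expr
      stop(e)
    },
    validation_error = function(e) {
      e$call <- expr
      stop(e)
    },
    error = function(e) {
      # Assumption: every unclassed error is treated as a validation error.
      validation_error_sketch(conditionMessage(e), call = expr)
    }
  )
}

# Layer 3: wrap a verb body. Only arrow_not_supported triggers the fallback;
# any other error propagates unchanged.
try_arrow_dplyr_sketch <- function(verb_body, on_dataset = FALSE, fallback = NULL) {
  tryCatch(
    verb_body(),
    arrow_not_supported = function(e) {
      if (on_dataset) {
        # Hypothetical wording for the extra note added for datasets.
        e$message <- paste0(conditionMessage(e), "\nCall collect() first to pull data into R.")
        stop(e)
      }
      fallback()
    }
  )
}

# An unsupported function on in-memory data falls back.
res <- try_arrow_dplyr_sketch(
  function() arrow_eval_sketch(quote(foo(x)), function() arrow_not_supported_sketch("`foo()`")),
  fallback = function() "fell back to dplyr"
)

# Invalid input is re-raised as a validation_error and never triggers the fallback.
err <- tryCatch(
  try_arrow_dplyr_sketch(
    function() arrow_eval_sketch(quote(bar(x)), function() stop("bad input"))
  ),
  error = function(e) e
)
```

In this sketch `res` holds the fallback's return value, while `err` is a `validation_error` whose `call` is the original expression `bar(x)`, which mirrors the idea that each wrapper adds context as the error bubbles up.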