GH-34484: [Substrait] add an option to disable augmented fields #41583

Merged (5 commits) on May 14, 2024

@EpsilonPrime (Contributor) commented May 8, 2024

Rationale for this change

Augmented fields interfere with schema passing between nodes. When enabled, they cause a mismatch between the names and the schema at the end of the plan.
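To make the mismatch concrete, here is a minimal, self-contained sketch that does not use the Arrow API. `CheckRootNames` is a hypothetical helper standing in for whatever validation runs at the end of a plan, and the four `__`-prefixed names are assumed to be the augmented columns the dataset scanner appends.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical end-of-plan check (not an Arrow function): the plan declares
// one output name per column, so the two counts have to agree.
// Returns an empty string on success, or a description of the mismatch.
std::string CheckRootNames(const std::vector<std::string>& declared_names,
                           const std::vector<std::string>& output_columns) {
  if (declared_names.size() == output_columns.size()) {
    return "";
  }
  return "plan declares " + std::to_string(declared_names.size()) +
         " names but the output has " +
         std::to_string(output_columns.size()) + " columns";
}

// Usage sketch: the same two declared names pass or fail depending on
// whether the scan appended the (assumed) augmented columns.
inline void CheckRootNamesExample() {
  const std::vector<std::string> declared = {"id", "value"};
  const std::vector<std::string> plain = {"id", "value"};
  const std::vector<std::string> augmented = {
      "id", "value", "__fragment_index", "__batch_index",
      "__last_in_fragment", "__filename"};
  assert(CheckRootNames(declared, plain).empty());
  assert(!CheckRootNames(declared, augmented).empty());
}
```

Only the column count is compared here; a real check would presumably look at types as well.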

What changes are included in this PR?

Adds an option to disable augmented fields (they are still added by default), threads it through every call site, and turns augmented fields off in the ReadRel conversion.
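The shape of the option can be sketched roughly as follows. This is a toy model, not the Arrow implementation: `ToyProjection` and `ToyFromNames` are hypothetical stand-ins for `ProjectionDescr` and `ProjectionDescr::FromNames`, a projection is reduced to a list of column names, and the augmented column names are assumptions.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a projection: only the ordered output column names.
struct ToyProjection {
  std::vector<std::string> columns;
};

// Hypothetical analogue of ProjectionDescr::FromNames (not the Arrow code).
// The flag defaults to true, mirroring "still added by default".
ToyProjection ToyFromNames(std::vector<std::string> names,
                           bool add_augmented_fields = true) {
  // Assumed names of the augmented columns.
  static const std::vector<std::string> kAugmented = {
      "__fragment_index", "__batch_index", "__last_in_fragment",
      "__filename"};
  ToyProjection projection{std::move(names)};
  if (add_augmented_fields) {
    projection.columns.insert(projection.columns.end(), kAugmented.begin(),
                              kAugmented.end());
  }
  return projection;
}

// Usage sketch: a ReadRel-style caller opts out, everyone else is unchanged.
inline void ToyFromNamesExample() {
  const std::vector<std::string> names = {"id", "value"};
  assert(ToyFromNames(names).columns.size() == 6);
  assert(ToyFromNames(names, false).columns == names);
}
```

With the flag off, the projection carries exactly the requested columns, which is what lets a read relation feed the rest of the plan without a cleanup step.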

Are these changes tested?

Yes.

Are there any user-facing changes?

There are no API-related changes. However, Substrait plans that consume local files now work without a project/emit relation after the read relation to remove the unexpected fields.

@@ -287,10 +290,12 @@ struct ARROW_DS_EXPORT ProjectionDescr {

   /// \brief Create a default projection referencing fields in the dataset schema
   static Result<ProjectionDescr> FromNames(std::vector<std::string> names,
-                                           const Schema& dataset_schema);
+                                           const Schema& dataset_schema,
+                                           bool add_augmented_fields);
Member

Should we set a default value so that we don't break any existing callers of this function?


   /// \brief Make a projection that projects every field in the dataset schema
-  static Result<ProjectionDescr> Default(const Schema& dataset_schema);
+  static Result<ProjectionDescr> Default(const Schema& dataset_schema,
+                                         bool add_augmented_fields);
Member

Same as above.

Contributor Author

Done. An alternative would be to pass a ScanOptions data structure through instead (that would still require a default value, though).
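For comparison, the options-struct alternative might look roughly like this. It is a sketch only: `ToyScanOptions` and `ToyDefaultColumns` are hypothetical names, not Arrow's `ScanOptions` or `ProjectionDescr::Default`, and a single assumed augmented column (`__filename`) stands in for the full set.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical options bundle for illustration; not Arrow's ScanOptions.
struct ToyScanOptions {
  // Defaulting to true preserves the pre-existing behavior.
  bool add_augmented_fields = true;
};

// Hypothetical analogue of ProjectionDescr::Default taking the bundle.
// The parameter itself still needs a default so old call sites compile.
std::vector<std::string> ToyDefaultColumns(
    std::vector<std::string> dataset_columns,
    const ToyScanOptions& options = ToyScanOptions{}) {
  if (options.add_augmented_fields) {
    // Assumed augmented column name; one is enough for the sketch.
    dataset_columns.push_back("__filename");
  }
  return dataset_columns;
}

// Usage sketch: old call sites are untouched, new ones opt out explicitly.
inline void ToyDefaultColumnsExample() {
  const std::vector<std::string> columns = {"id", "value"};
  assert(ToyDefaultColumns(columns).size() == 3);
  ToyScanOptions options;
  options.add_augmented_fields = false;
  assert(ToyDefaultColumns(columns, options) == columns);
}
```

A struct leaves room for further settings without another signature change, at the cost of one more type for callers to construct; a plain `bool` is the smaller change.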

@github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on May 8, 2024
Member

@zeroshade zeroshade left a comment

In general this looks fine to me, with just a single nitpick from my end. But I'll leave final approval to someone more familiar with this section of the code.

@github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on May 8, 2024
Member

@westonpace westonpace left a comment

Awesome, thanks :)

plan->StartProducing();
ASSERT_FINISHES_OK(plan->finished());
}

TEST(Substrait, RelWithHint) {
Member

Minor nit: I think we have one or two spots in Python where we have to do a column selection to work around this issue. We can probably remove these now.

e.g. https://github.com/apache/arrow/blob/main/python/pyarrow/tests/test_substrait.py#L93

@github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on May 8, 2024
@zeroshade merged commit a4a5cf1 into apache:main on May 14, 2024
40 of 41 checks passed
@zeroshade removed the "awaiting merge" label on May 14, 2024
@github-actions bot added the "awaiting merge" label on May 14, 2024
After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit a4a5cf1.

There was 1 benchmark result indicating a performance regression:

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…apache#41583)

* GitHub Issue: apache#34484

Authored-by: David Sisson <EpsilonPrime@users.noreply.github.com>
Signed-off-by: Matt Topol <zotthewizard@gmail.com>