Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-41609: [Java] Initialize non empty offset buffer for variable-size layout before exporting #41610

Closed
wants to merge 3 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented May 9, 2024

Rationale for this change

This is a follow up of #40038. In #40038, we fixed null offset buffer issue for BaseVariableWidthVector. For empty vector, instead of a empty offset buffer which turns to be a null buffer through C Data Interface, we export a non-empty offset which is supposed to contain zero value.

But the initialization code has a bug in the PR #40043, so the offset buffer is not initialized. Note that this is not a regression because the exported vector never works due to null offset buffer.

What changes are included in this PR?

Fixed the uninitialized offset buffer of empty BaseVariableWidthVector.

Are these changes tested?

Added test cases.

Are there any user-facing changes?

No

@viirya viirya requested a review from lidavidm as a code owner May 9, 2024 21:52
Copy link
Member Author

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the tests first to run them in CI to verify.

The fix is ready and will be submitted later.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels May 9, 2024
@viirya viirya marked this pull request as draft May 9, 2024 21:58
@viirya
Copy link
Member Author

viirya commented May 9, 2024

Marked it as a draft and will make it ready for review once I submit the fix.

@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 9, 2024
@vibhatha
Copy link
Collaborator

@viirya just to make sure I understand the change here. Are we introducing the required change in this PR itself right?

And these test cases supposed to fail without that fix?

@viirya
Copy link
Member Author

viirya commented May 10, 2024

@vibhatha I have some fix locally but I am unable to run the C module tests locally on a Mac. I am trying to write these tests that are supposed to fail in CI to verify that.

Once the tests are verified and the fix are submitted, I will make this ready for review.

@vibhatha
Copy link
Collaborator

Thanks for explaining @viirya . If you need help please let me know.

@viirya
Copy link
Member Author

viirya commented May 10, 2024

Thank you @vibhatha

@viirya
Copy link
Member Author

viirya commented May 10, 2024

Hmm, I think I know why the tests are not failed in Java Arrow. This is how Java Arrow imports an Utf8 array:

try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
      final int start = offsets.getInt(0);
      final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
      final int len = end - start;
      ...
}

So even the offset buffer is not initialized, for empty array with one element offset buffer, end - start is always 0 that is the length of data buffer. That is why the added roundtrip tests are passed.

But in arrow-rs, it takes the last value of the offset buffer as the length of data buffer, i.e., end. If the value is not initialized to zero, the computed length of data buffer is incorrect.

That is what I found for the first offset value from the spec:

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.

It looks like the first value doesn't have to be 0, although generally it is. So seems Java Arrow's approach is correct without issue?

@viirya
Copy link
Member Author

viirya commented May 10, 2024

Let me open an issue at arrow-rs and see if it makes more sense to fix it there.

@vibhatha
Copy link
Collaborator

@viirya just for my curiosity, so the error you experience comes when Arrow Java vector is imported in Rust?

@viirya
Copy link
Member Author

viirya commented May 10, 2024

@viirya just for my curiosity, so the error you experience comes when Arrow Java vector is imported in Rust?

Yes

@viirya
Copy link
Member Author

viirya commented May 13, 2024

I fixed the issue at arrow-rs by apache/arrow-rs#5756. So I am closing this.

Thanks @vibhatha for review.

@viirya viirya closed this May 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants