Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-39129 [Python] pa.array: add check for byte-swapped numpy arrays inside python objects #41549

Merged
merged 1 commit into from May 14, 2024

Conversation

hombit
Copy link
Contributor

@hombit hombit commented May 6, 2024

What changes are included in this PR?

This PR introduces a check to verify if the dtype of the input numpy array is byte-swapped. If it is, a not-implemented exception is raised. This precaution prevents the data from being cast incorrectly as if it were in the correct byte order, which would lead to wrong data values.

Are these changes tested?

I added a new test to check if not-implemented exception is raised - for both old (primitive types) and new (composed types) code.

Are there any user-facing changes?

No changes in API, but old code which gave incorrect results now would fail with a not-implemented exception

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the contribution!

@@ -3896,3 +3896,26 @@ def test_list_view_slice(list_view_type):
j = sliced_array.offsets[1].as_py()

assert sliced_array[0].as_py() == sliced_array.values[i:j].to_pylist() == [4]


@pytest.mark.parametrize('numpy_dtype', ['>u2', '>i4', '>f8'])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do have a nightly test runner on a big endian machine, so in that case this hardcoded test will fail?
So either we have to skip this test in that case, or rewrite the test to work in both cases (numpy has functionality to byteswap the data and change the dtype to the opposite byteorder, so to create non-native data regardless of the platform you are running the test on)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it seems we no longer have such CI coverage because our s390x build was dropped when Travis CI support was dropped ..

But still, we have regularly reports from people testing on big endian machines, so at least adding a skipif based on sys.byteorder would be good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jorisvandenbossche thank you, it is a good point. I've fixed the test to always use non-native byte order

@github-actions github-actions bot added awaiting review Awaiting review awaiting changes Awaiting changes and removed awaiting review Awaiting review awaiting changes Awaiting changes labels May 8, 2024
@hombit hombit force-pushed the py-fix-byte-swapped-check branch from 13b248a to f1b92d9 Compare May 8, 2024 13:32
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 8, 2024
@hombit hombit force-pushed the py-fix-byte-swapped-check branch from f1b92d9 to baeafc1 Compare May 9, 2024 15:09
@hombit hombit force-pushed the py-fix-byte-swapped-check branch from baeafc1 to e4c612b Compare May 10, 2024 15:03
@hombit
Copy link
Contributor Author

hombit commented May 13, 2024

@jorisvandenbossche could you please look again?

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perfect, thank!

@jorisvandenbossche jorisvandenbossche merged commit fd84ec0 into apache:main May 14, 2024
13 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting change review Awaiting change review label May 14, 2024
@github-actions github-actions bot added the awaiting merge Awaiting merge label May 14, 2024
Copy link

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit fd84ec0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…rays inside python objects (apache#41549)

### What changes are included in this PR?

This PR introduces a check to verify if the dtype of the input numpy array is byte-swapped. If it is, a not-implemented exception is raised. This precaution prevents the data from being cast incorrectly as if it were in the correct byte order, which would lead to wrong data values.

### Are these changes tested?

I added a new test to check if not-implemented exception is raised - for both old (primitive types) and new (composed types) code.

### Are there any user-facing changes?

No changes in API, but old code which gave incorrect results now would fail with a not-implemented exception

* GitHub Issue: apache#39129

Authored-by: Konstantin Malanchev <hombit@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants