GH-39129 [Python] pa.array: add check for byte-swapped numpy arrays inside python objects #41549

hombit · 2024-05-06T15:17:13Z

What changes are included in this PR?

This PR introduces a check to verify if the dtype of the input numpy array is byte-swapped. If it is, a not-implemented exception is raised. This precaution prevents the data from being cast incorrectly as if it were in the correct byte order, which would lead to wrong data values.
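To illustrate why reinterpreting byte-swapped data "as if it were in the correct byte order" produces wrong values, here is a minimal sketch using only the standard library. It does not reproduce the check added in this PR; it only shows the failure mode the check guards against, with a 16-bit unsigned value chosen for illustration.

```python
import struct

# The value 1 stored as a big-endian unsigned 16-bit integer.
value = 1
big_endian_bytes = struct.pack(">H", value)

# Reading the same bytes as little-endian misinterprets them:
# the two bytes are taken in the opposite significance order.
misread = struct.unpack("<H", big_endian_bytes)[0]

# Reading them with the byte order they were written in is correct.
correct = struct.unpack(">H", big_endian_bytes)[0]

print(misread, correct)
```

The misread value differs from the original even though no byte was changed, which is why raising an error is safer than casting the buffer unchanged.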

Are these changes tested?

I added a new test that checks that a not-implemented exception is raised, for both the old (primitive types) and new (composed types) code.

Are there any user-facing changes?

No changes to the API, but old code that gave incorrect results will now fail with a not-implemented exception.
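For code that starts failing after this change, one possible workaround is to convert arrays to the native byte order before handing them over. The sketch below uses only NumPy; the helper name `to_native_byte_order` is hypothetical, and this PR does not state whether this is the recommended approach.

```python
import numpy as np


def to_native_byte_order(arr):
    """Return `arr` unchanged if its dtype is native, else a converted copy.

    Hypothetical helper for illustration; not part of pyarrow or NumPy.
    """
    if arr.dtype.isnative:
        return arr
    # astype converts the stored bytes, so the logical values are preserved.
    return arr.astype(arr.dtype.newbyteorder("="))


# Build an array whose dtype has the opposite byte order of this machine.
swapped = np.array([1, 2, 3], dtype=np.dtype("i4").newbyteorder())
native = to_native_byte_order(swapped)
```

Unlike a plain `view`, `astype` rewrites the underlying buffer, so the values stay the same while the dtype becomes native.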

GitHub Issue: [Python] pa.array doesn't check for byte-swapped list-arrays #39129

jorisvandenbossche

Thanks a lot for the contribution!

jorisvandenbossche · 2024-05-08T07:03:13Z

python/pyarrow/tests/test_array.py

@@ -3896,3 +3896,26 @@ def test_list_view_slice(list_view_type):
    j = sliced_array.offsets[1].as_py()

    assert sliced_array[0].as_py() == sliced_array.values[i:j].to_pylist() == [4]
+
+
+@pytest.mark.parametrize('numpy_dtype', ['>u2', '>i4', '>f8'])


I think we do have a nightly test runner on a big-endian machine, so in that case this hardcoded test will fail?
So either we have to skip this test in that case, or rewrite the test to work in both cases (numpy has functionality to byteswap the data and change the dtype to the opposite byte order, so we can create non-native data regardless of the platform the test runs on).

Actually, it seems we no longer have such CI coverage, because our s390x build was dropped when Travis CI support was dropped.

But still, we regularly get reports from people testing on big-endian machines, so at least adding a skipif based on sys.byteorder would be good.

@jorisvandenbossche thank you, that is a good point. I've fixed the test to always use a non-native byte order.
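A minimal sketch of the approach discussed in this thread: swap the bytes and flip the dtype's byte order so the data is non-native on whatever platform runs the test. The helper name `make_non_native` is hypothetical, and this is not necessarily how the merged test is written.

```python
import numpy as np


def make_non_native(values, dtype_str):
    """Build an array holding `values` in the non-native byte order.

    Hypothetical helper for illustration. `dtype_str` should be a
    multi-byte type without an explicit byte order, e.g. "u2" or "f8".
    """
    native = np.array(values, dtype=np.dtype(dtype_str))
    # newbyteorder() with no argument swaps to the opposite byte order.
    swapped_dtype = native.dtype.newbyteorder()
    # byteswap() reorders the bytes; view() relabels them with the
    # swapped dtype, so the logical values are unchanged.
    return native.byteswap().view(swapped_dtype)


arr = make_non_native([1, 2, 3], "u2")
```

Because the byte order is derived from the platform's native dtype rather than hardcoded as `>`, the same code yields non-native input on both little-endian and big-endian machines.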

hombit · 2024-05-13T21:39:02Z

@jorisvandenbossche could you please look again?

jorisvandenbossche

Perfect, thanks!

conbench-apache-arrow · 2024-05-14T11:45:18Z

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit fd84ec0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 3 possible false positives for unstable benchmarks that are known to sometimes produce them.

…rays inside python objects (apache#41549)

### What changes are included in this PR?

This PR introduces a check to verify if the dtype of the input numpy array is byte-swapped. If it is, a not-implemented exception is raised. This precaution prevents the data from being cast incorrectly as if it were in the correct byte order, which would lead to wrong data values.

### Are these changes tested?

I added a new test to check if not-implemented exception is raised - for both old (primitive types) and new (composed types) code.

### Are there any user-facing changes?

No changes in API, but old code which gave incorrect results now would fail with a not-implemented exception

* GitHub Issue: apache#39129

Authored-by: Konstantin Malanchev <hombit@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

github-actions bot added the Component: Python and awaiting review labels May 6, 2024

hombit force-pushed the py-fix-byte-swapped-check branch from ca290d8 to 13b248a May 7, 2024 13:07

jorisvandenbossche reviewed May 8, 2024

github-actions bot added the awaiting changes label and removed the awaiting review label May 8, 2024

hombit force-pushed the py-fix-byte-swapped-check branch from 13b248a to f1b92d9 May 8, 2024 13:32

github-actions bot added the awaiting change review label and removed the awaiting changes label May 8, 2024

hombit requested a review from jorisvandenbossche May 8, 2024 14:02

hombit force-pushed the py-fix-byte-swapped-check branch from f1b92d9 to baeafc1 May 9, 2024 15:09

Check for numpy byte-swapped array

e4c612b

hombit force-pushed the py-fix-byte-swapped-check branch from baeafc1 to e4c612b May 10, 2024 15:03

jorisvandenbossche approved these changes May 14, 2024

jorisvandenbossche merged commit fd84ec0 into apache:main May 14, 2024
13 checks passed

jorisvandenbossche removed the awaiting change review label May 14, 2024

jorisvandenbossche mentioned this pull request May 14, 2024

[Python] pa.array doesn't check for byte-swapped list-arrays #39129

Closed

github-actions bot added the awaiting merge label May 14, 2024
