
[SPARK-48220][PYTHON] Allow passing PyArrow Table to createDataFrame() #46529

Open
wants to merge 29 commits into base: master

Conversation

ianmcook
Member

@ianmcook ianmcook commented May 10, 2024

What changes were proposed in this pull request?

  • Add support for passing a PyArrow Table to createDataFrame().
  • Document this on the Apache Arrow in PySpark user guide page.
  • Fix an issue with timestamp and struct columns in toArrow().

Why are the changes needed?

This seems like a logical next step after the addition of a toArrow() DataFrame method in #45481.

Does this PR introduce any user-facing change?

Users will have the ability to pass PyArrow Tables to createDataFrame(). There are no changes to the parameters of createDataFrame(). The only difference is that data can now be a PyArrow Table.
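For example (a minimal, hypothetical sketch; the column names and values are illustrative, not taken from the PR's tests):

import pyarrow as pa
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A PyArrow Table can now be passed directly as the data argument.
table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df = spark.createDataFrame(table)
df.show()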

How was this patch tested?

Many tests were added for both Spark Classic and Spark Connect. I ran the tests locally with older versions of PyArrow installed (going back to 10.0).

Was this patch authored or co-authored using generative AI tooling?

No

@ianmcook ianmcook force-pushed the SPARK-48220 branch 8 times, most recently from c408d0b to 795d01a on May 13, 2024 22:14
@alippai

alippai commented May 14, 2024

This makes the usage so much easier, thanks!

What will happen with the nanosecond timestamps? Truncated to milliseconds?

@ianmcook ianmcook force-pushed the SPARK-48220 branch 4 times, most recently from 005c4f7 to 7e34472 on May 15, 2024 16:15
@ianmcook
Member Author

ianmcook commented May 15, 2024

What will happen with the nanosecond timestamps? Truncated to milliseconds?

Truncated to microseconds for now.
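To illustrate what that truncation means on the Arrow side (a hedged sketch of the behavior, not the PR's internal code path):

import pyarrow as pa

# A nanosecond-precision timestamp value...
ns = pa.array([1_700_000_000_123_456_789], type=pa.timestamp("ns"))
# ...loses its sub-microsecond digits when reduced to microsecond precision,
# which is what Spark's TimestampType supports; safe=False permits the lossy cast.
us = ns.cast(pa.timestamp("us"), safe=False)
print(ns[0], "->", us[0])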

@alippai

alippai commented May 15, 2024

I wish we had ns support in Spark 4.0, as Java, Parquet, Arrow, etc. use it natively now, but that's certainly a different discussion and set of MRs :)

Thanks for the new API!

Contributor

@zhengruifeng zhengruifeng left a comment


This PR is too big to follow and has many unrelated changes. Can we break it into multiple PRs?

python/pyspark/sql/connect/session.py (resolved thread)
@@ -46,6 +46,7 @@

 if TYPE_CHECKING:
     from py4j.java_gateway import JavaObject
+    import pyarrow as pa
Contributor

I think we cannot assume pyarrow is installed by default in Spark Classic:

https://spark.apache.org/docs/latest/api/python/getting_started/install.html#dependencies

@HyukjinKwon

Member Author

@ianmcook ianmcook May 21, 2024


I think it is OK when TYPE_CHECKING is True. There is another file in the repo that has had import pyarrow as pa inside an if TYPE_CHECKING: conditional since 2021: https://github.com/apache/spark/pull/34101/files#diff-a4f1631a18d1b4921b8727e1d78059a1014c433239eb107962c473c6466214e5R29
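As a minimal standalone illustration of the pattern (not PySpark code; the function name is made up):

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only evaluated by static type checkers such as mypy; never executed at
    # runtime, so pyarrow remains an optional dependency.
    import pyarrow as pa

def num_cells(table: "pa.Table") -> int:
    # The quoted annotation is never resolved at runtime, so importing this
    # module succeeds even when pyarrow is not installed.
    return table.num_rows * table.num_columns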

@@ -343,28 +344,29 @@ def createDataFrame(

     @overload
     def createDataFrame(
-        self, data: "PandasDataFrameLike", samplingRatio: Optional[float] = ...
+        self, data: Union["PandasDataFrameLike", "pa.Table"], samplingRatio: Optional[float] = ...
Contributor

can we include pa.Table in PandasDataFrameLike?

Member Author

@ianmcook ianmcook May 21, 2024

I don't think so. PandasDataFrameLike is just an alias for pd.DataFrame.

Details

In pyspark.sql.pandas._typing, DataFrameLike is defined like this:

from pandas.core.frame import DataFrame as PandasDataFrame
DataFrameLike = PandasDataFrame

Then in other parts of PySpark, PandasDataFrameLike is defined like this:

from pyspark.sql.pandas._typing import DataFrameLike as PandasDataFrameLike

If we define PandasDataFrameLike as a Union that includes pa.Table, that will cause other problems. For example, then we can't use it as the return type of toPandas().
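To make the problem concrete (hypothetical code, not part of this PR): widening the alias to a Union would also widen the declared return type of toPandas():

from typing import Union

import pandas as pd
import pyarrow as pa

# Hypothetical widened alias:
PandasDataFrameLike = Union[pd.DataFrame, pa.Table]

# This signature would now claim that toPandas() may return a pyarrow Table,
# which it never does, forcing callers into unnecessary isinstance checks.
def toPandas() -> PandasDataFrameLike:
    ...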

python/pyspark/sql/pandas/types.py (outdated, resolved thread)
python/pyspark/sql/pandas/types.py (outdated, resolved thread)
@ianmcook
Member Author

This PR is too big to follow and has many unrelated changes. Can we break it into multiple PRs?

Yes, I am happy to break out the non-required changes into separate PRs.

@ianmcook
Member Author

@zhengruifeng I broke out the non-required changes into separate PRs and I simplified the implementations. I think it should be easier to follow now. More than half of the added code is tests.

@ianmcook
Member Author

ianmcook commented May 21, 2024

Here is a PDF of the Apache Arrow in PySpark user guide page rendered from this PR, with the new and modified sections highlighted: Apache Arrow in PySpark — PySpark master documentation.pdf

python/pyspark/sql/connect/session.py (outdated, resolved thread)
python/pyspark/sql/pandas/types.py (resolved thread)
Comment on lines 365 to 368
    return pa.ListArray.from_arrays(
        a.offsets,
        _check_arrow_array_timestamps_localize(a.values, at.elementType, truncate, timezone),
    )
Member

Also, this will not preserve nulls? (Similar to the issue you raised for the map type, although ListArray already has a mask keyword. We should also add something like apache/arrow#23380 to simply apply an existing validity bitmap buffer.)

Member Author

Thanks for catching this. Fixed in 5db5e4b. Added tests in fc758e6 to fail if nulls are not preserved here.

Member Author

I also confirmed that it's not necessary to pass mask when creating the dictionary array; the nulls in the dictionary indices pass through fine and are preserved.

Member

I don't know how much more complex you want to make this part of the code, but one disadvantage of the current mask keyword is that it is not zero-copy (since it inverts the validity bitmap twice), so you might want to avoid having to use it as much as possible:

  • you could check whether the list's value_type is either timestamp or nested, and only in that case call _check_arrow_array_timestamps_localize on the values, and otherwise just return a (we have pa.types.is_nested that could help with that). That should already make any simple list of a numeric type fully zero-copy
  • if you do have to recreate the ListArray, you can probably do mask=a.is_null() if a.null_count else None to avoid allocating a full bitmap in case there are no missing values (see the sketch below)
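For illustration, a minimal sketch of what the two suggestions could look like together. The helper _check_arrow_array_timestamps_localize is the one from this PR and is assumed to be in scope; the wrapper function itself is hypothetical, not the actual implementation:

import pyarrow as pa
import pyarrow.types as pat

def _localize_list_array(a: pa.ListArray, element_type, truncate, timezone) -> pa.Array:
    value_type = a.type.value_type
    # Fast path: simple value types (e.g. int64, string) cannot contain
    # timestamps, so return the array unchanged (fully zero-copy).
    if not (pat.is_timestamp(value_type) or pat.is_nested(value_type)):
        return a
    # Recurse into the child values with the PR's helper.
    new_values = _check_arrow_array_timestamps_localize(a.values, element_type, truncate, timezone)
    # Only pass a mask when there actually are nulls to preserve, avoiding an
    # unnecessary bitmap allocation (and the double inversion) otherwise.
    mask = a.is_null() if a.null_count else None
    return pa.ListArray.from_arrays(a.offsets, new_values, mask=mask)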

Member Author

@ianmcook ianmcook May 22, 2024

I think it is worthwhile if it keeps things zero-copy. Added in 7ed45e0. I did it for MapArrays and StructArrays too.

python/pyspark/sql/pandas/types.py (resolved thread)

 # Convert the Spark DataFrame to a PyArrow Table
-table = df.select("*").toArrow()
+result_table = df.select("*").toArrow()
Member

Out of curiosity, why do we explicitly call .select("*")?

Member Author

I followed the pandas example (see below on line 69 of this same file). I was wondering this too, but I kept it just to match the pandas example. I'm happy to remove both if that would be better.

Member Author

@ianmcook ianmcook May 22, 2024

I suspect that the original purpose of the .select("*") was to represent some arbitrary transformations being lazily performed on the dataframe. That way readers will know that this works when there are transformations.
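A small sketch of that reading (illustrative data, not the docs example itself): any chain of lazy transformations can precede toArrow(), not just select("*").

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])

# An arbitrary transformation in place of select("*"):
result_table = df.filter(df.id > 1).toArrow()
print(result_table.num_rows)  # 2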

# If no schema supplied by user then get the names of columns only
if schema is None:
    _cols = data.column_names
if isinstance(schema, (list, tuple)) and cast(int, _num_cols) < len(data.columns):
Member

nit:

            if isinstance(schema, (list, tuple)) and cast(int, _num_cols) < len(data.columns):
                assert isinstance(_cols, list)
                _cols.extend([f"_{i + 1}" for i in range(cast(int, _num_cols), len(data.columns))])
                _num_cols = len(_cols)

is hard to follow and duplicated. Shall we extract and reuse it, or add a comment to both copies to help understanding? Feel free to do it in a separate PR.
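To spell out what the snippet does (a standalone rerun of the logic outside PySpark; variable names mirror the PR):

import pyarrow as pa

table = pa.table({"c0": [1], "c1": [2], "c2": [3], "c3": [4]})
schema = ["a", "b"]          # the user supplied only two names
_cols = list(schema)
_num_cols = len(_cols)
if isinstance(schema, (list, tuple)) and _num_cols < len(table.columns):
    # Extra columns get positional placeholder names: "_3", "_4", ...
    _cols.extend([f"_{i + 1}" for i in range(_num_cols, len(table.columns))])
    _num_cols = len(_cols)
print(_cols)  # ['a', 'b', '_3', '_4']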

Member Author

This is borrowed directly from the pandas section above (line 516). I would be happy to open a subsequent PR to try to deduplicate the logic and make it clearer.
