[BUG] NVTabular Dataset constructor cannot process cudf.StructType values. #1808

Open
drobison00 opened this issue May 1, 2023 · 5 comments
Labels: bug (Something isn't working), P1

@drobison00

Describe the bug
Attempting to create an NVT Dataset using a cudf DataFrame containing a struct dtype fails.

Steps/Code to reproduce bug

Create a test file:

echo '[{"properties":{"id":"bddc03c8-8da3-4ef6-8ef9-a50324639100", "location":{"city":"Port Denisetown","state":"Smithton","countryOrRegion":"XR","geoCoordinates":{"latitude":3.5518965,"longitude":131.871582}}}}]' > example.json

reproducer.py

import cudf
import nvtabular as nvt

df = cudf.read_json("example.json")
print(f"DataFrame dtypes: \n{df.dtypes}")

ds = nvt.Dataset(df)

output

properties    struct
dtype: object

Traceback (most recent call last):
... SNIP ...
TypeError: Merlin doesn't provide a mapping from struct (<class 'cudf.core.dtypes.StructDtype'>) to a Merlin dtype. If you'd like to provide one, you can use `merlin.dtype.register()`.

Expected behavior
Since it's a standard cuDF data type, I'd expect NVT either to process it correctly or to fall back gracefully.

Environment details (please complete the following information):

  • Conda environment
conda list | grep -E 'nvtabular|merlin'
merlin-core               23.02.01                   py_0    nvidia
merlin-dataloader         23.02.01                   py_0    nvidia
nvtabular                 23.02.00                 py38_0    nvidia


@drobison00 drobison00 added the bug Something isn't working label May 1, 2023
@karlhigley karlhigley self-assigned this May 2, 2023
@karlhigley karlhigley added this to the Merlin 23.05 milestone May 2, 2023
@drobison00
Author

Updated repro that illustrates workflow issues in addition to Dataset creation.

import cudf
import nvtabular as nvt
from nvtabular.ops import LambdaOp
from nvtabular.ops.operator import ColumnSelector


def f_to_pandas(col, df):
    # Round-trip the column through pandas and back to cuDF.
    pd_series = col.to_pandas()

    return cudf.from_pandas(pd_series)


def test_cudf_struct_type_conversion():
    # Reading with pd.read_json instead produces a different error.
    input_df = cudf.read_json("example.json")

    # Apply the round-trip to the struct column only.
    single_op = ColumnSelector("properties") >> LambdaOp(f=f_to_pandas)
    workflow = nvt.Workflow(single_op)

    ds = nvt.Dataset(input_df)
    result = workflow.fit_transform(ds).to_ddf().compute()

    print(result)

@rnyak rnyak added the P1 label May 5, 2023
@karlhigley
Contributor

This is related to a lower-level issue that occurs when converting cuDF struct columns containing both nulls and empty structs to pandas. As a workaround, explode the structs into separate columns with series.struct.explode() before passing the data into NVT.
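
For anyone who needs the workaround right away, here is a rough sketch of what flattening could look like. It is not code from NVT or cuDF: the helper names `flatten_record` and `explode_struct_columns` and the dotted column naming are assumptions made for illustration. The first helper flattens the raw JSON records before they reach cuDF and needs no GPU; the second applies `series.struct.explode()` repeatedly, since the example data nests structs inside structs.

```python
import json


def flatten_record(record, prefix=""):
    """Flatten nested dicts into one level, joining keys with dots.

    Hypothetical helper: flattens JSON records before loading them,
    so no struct columns are created in the first place.
    """
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten_record(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def explode_struct_columns(df):
    """Explode cuDF struct columns until none remain.

    Hypothetical helper built on Series.struct.explode(). cuDF is
    imported lazily so flatten_record stays usable without a GPU.
    """
    import cudf

    while True:
        struct_cols = [
            c for c in df.columns if isinstance(df[c].dtype, cudf.StructDtype)
        ]
        if not struct_cols:
            return df
        for col in struct_cols:
            fields = df[col].struct.explode()
            # Prefix field names with the parent column to avoid clashes.
            fields.columns = [f"{col}.{name}" for name in fields.columns]
            # Keep rows aligned with the source frame before concatenating.
            fields.index = df.index
            df = cudf.concat([df.drop(columns=[col]), fields], axis=1)


if __name__ == "__main__":
    raw = '[{"properties": {"id": "a1", "location": {"city": "X"}}}]'
    print([flatten_record(r) for r in json.loads(raw)])
```

With cuDF available, the intended use would be along the lines of `nvt.Dataset(explode_struct_columns(cudf.read_json("example.json")))`. I have not verified that path against the versions listed in this issue.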

@drobison00
Author

This issue should be fully resolved when rapidsai/cudf#13315 goes in.

@rnyak
Contributor

rnyak commented May 19, 2023

@drobison00 Hello! Is the issue solved on your end? It looks like rapidsai/cudf#13315 was merged.

@drobison00
Author

@rnyak I'll double check today.
