Batched mapping does not raise an error if values for an existing column are empty #6879

Open
felix-schneider opened this issue May 7, 2024 · 0 comments

Describe the bug

Using Dataset.map(fn, batched=True) allows resizing the dataset by returning a dict of lists, all of which must be the same size. If they are not the same size, an error like pyarrow.lib.ArrowInvalid: Column 1 named x expected length 1 but got length 0 is raised.

This is not the case if the function returns an empty list for an existing column in the dataset. In that case, the dataset is silently resized to 0 rows.

Steps to reproduce the bug

MWE:

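import datasets

# One-row dataset with a single existing column "test"
data = datasets.Dataset.from_dict({"test": [1]})

def mapping_fn(examples):
    # Empty list for the existing column "test", one value for the new column "y"
    return {"test": [], "y": [1]}

data = data.map(mapping_fn, batched=True)
print(len(data))  # no error is raised; the dataset now has 0 rows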

Note that the error is raised correctly when the empty list is returned under a new column name ("x": [] instead of "test": []), and also when returning "test": [1,2].

Expected behavior

I expected an exception, either pyarrow.lib.ArrowInvalid: Column 1 named test expected length 1 but got length 0 or pyarrow.lib.ArrowInvalid: Column 2 named y expected length 0 but got length 1.

Any exception would be acceptable.
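
Until the library raises here, one possible workaround is to validate the returned batch before handing it back to map. The following is only a sketch: check_equal_lengths is a hypothetical helper, not part of datasets, and it only compares the columns the function returns with each other (it does not look at columns that are left untouched).

```python
import functools


def check_equal_lengths(fn):
    """Hypothetical helper: wrap a batched mapping function and raise
    if the returned columns do not all have the same length."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        batch = fn(*args, **kwargs)
        # Record the length of every returned column
        lengths = {name: len(values) for name, values in batch.items()}
        if len(set(lengths.values())) > 1:
            raise ValueError(f"Columns have mismatched lengths: {lengths}")
        return batch

    return wrapper


def mapping_fn(examples):
    # Same function as in the MWE: empty "test", one value for "y"
    return {"test": [], "y": [1]}


checked_fn = check_equal_lengths(mapping_fn)
```

With this wrapper, data.map(checked_fn, batched=True) should fail with a ValueError on the first batch instead of silently producing an empty dataset.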

Environment info

  • datasets version: 2.19.1
  • Platform: Linux-5.4.0-153-generic-x86_64-with-glibc2.31
  • Python version: 3.11.8
  • huggingface_hub version: 0.22.2
  • PyArrow version: 15.0.2
  • Pandas version: 2.2.1
  • fsspec version: 2024.2.0