
Add return_file_name in load_dataset #6310

Open · wants to merge 5 commits into main
Conversation


@juliendenize juliendenize commented Oct 17, 2023

Proposition to fix #5806.

Added an optional parameter return_file_name to the dataset builder config. When set to True, the builder includes the file name corresponding to each sample in the returned output.

Arrow-based and folder-based datasets return the file name differently:

  • arrow-based: a file_name column is concatenated to the table after it is cast.
  • folder-based: dataset.info.features gains a file_name entry, and the original file name is passed through the sample_metadata dictionary.

This difference in behavior might be a concern; I am also not sure whether file_name should hold the original file path or the downloaded one for folder-based datasets.

I added some tests for the datasets that already had a test file.

src/datasets/builder.py (outdated review thread, resolved)
@lhoestq (Member) left a comment

Thanks for the change!

Since `return` in Python usually refers to what a function actually returns (here, load_dataset), I think we can use another word for the parameter. Maybe name it with_file_names?

cc @mariosasko in case you have an opinion

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@juliendenize (Author)

> Thanks for the change!
>
> Since `return` in Python usually refers to what a function actually returns (here, load_dataset), I think we can use another word for the parameter. Maybe name it with_file_names?
>
> cc @mariosasko in case you have an opinion

I changed the argument name to your suggestion; I agree that it should be less confusing :)

@mariosasko (Collaborator) left a comment

Thanks! I've left some comments.

@lhoestq WDYT about returning a data file's name (the last part) instead of the full path? This way we could have the same values in the streaming and the non-streaming mode. (In the non-streaming mode, we would also have to iterate over remote files to not output the files' hash (from the HF cache))

@@ -64,10 +64,13 @@ def _generate_tables(self, files):
    try:
        for batch_idx, record_batch in enumerate(pa.ipc.open_stream(f)):
            pa_table = pa.Table.from_batches([record_batch])
            pa_table = self._cast_table(pa_table)
            if self.config.return_file_name:
                pa_table = pa_table.append_column("file_name", pa.array([file] * len(pa_table)))
@mariosasko (Collaborator)

Let's check here if the file_name column is already present in the table and raise an error if it is.

datasets does not support accessing columns with integers as pyarrow does, so multiple columns with the same name are not allowed.

@juliendenize (Author)

I added the test for the different datasets, let me know if you want to change the error message.

Comment on lines +127 to +137
if len(pa_table) > 0
else pa.nulls(0, pa.string()),
@mariosasko (Collaborator)

I think we ensure there is some data in the table before yielding it, so this is not needed

@juliendenize (Author)

If I remove the condition here, the tests fail, so I left it in.

src/datasets/packaged_modules/text/text.py (outdated review thread, resolved)
src/datasets/packaged_modules/arrow/arrow.py (outdated review thread, resolved)
@@ -19,6 +19,7 @@ class ParquetConfig(datasets.BuilderConfig):
    batch_size: int = 10_000
    columns: Optional[List[str]] = None
    features: Optional[datasets.Features] = None
    with_file_names: bool = False
@mariosasko (Collaborator)

The parquet builder performs a cast if features are defined in the file's metadata. So you need to declare a feature for the file-name column in the self.info.features dictionary to avoid an error in this scenario.

PS: you can reproduce the error by creating a dataset, saving it with to_parquet("path/to/parquet_file"), and then loading it with load_dataset("parquet", data_files="path/to/parquet_file", with_file_names=True).

@juliendenize (Author)

This particular error is fixed. However, I did not test every use case, as I am not familiar enough with the library; if you have more suggestions to improve the tests, I'll work on them.

@juliendenize (Author)

juliendenize commented Nov 12, 2023

> Thanks! I've left some comments.
>
> @lhoestq WDYT about returning a data file's name (the last part) instead of the full path? This way we could have the same values in the streaming and the non-streaming mode. (In the non-streaming mode, we would also have to iterate over remote files to not output the files' hash (from the HF cache))

Concerning returning only the last part of the file name: do you have suggestions on how to do that? Files located in different folders can share the same name, so I am wondering what the way to go would be.
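The collision concern can be illustrated in a couple of lines (the paths are made up):

```python
import os

paths = ["data/en/train.txt", "data/fr/train.txt"]
names = [os.path.basename(p) for p in paths]
# Two distinct files collapse to the same name once directories are dropped.
assert names == ["train.txt", "train.txt"]
```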

Successfully merging this pull request may close these issues.

Return the name of the currently loaded file in the load_dataset function.
4 participants