
Add return_file_name in load_dataset #6310

Open · wants to merge 5 commits into main
Conversation


@juliendenize juliendenize commented Oct 17, 2023

Proposition to fix #5806.

Added an optional parameter return_file_name to the dataset builder config. When set to True, the builder includes the file name corresponding to each sample in the returned output.

Arrow-based and folder-based datasets return the file name differently:

  • arrow-based: a file_name column is concatenated to the table after it is cast.
  • folder-based: dataset.info.features gains a file_name entry, and the original file name is passed through the sample_metadata dictionary.

This difference in behavior might be a concern; I am also not sure whether file_name should hold the original file path or the downloaded one for folder-based datasets.

I added some tests for the datasets that already had a test file.

src/datasets/builder.py (outdated review thread, resolved)
@lhoestq (Member) left a comment

Thanks for the change!

Since `return` in Python usually refers to what a function actually returns (here, load_dataset), I think we can use another word for the parameter. Maybe name it with_file_names?

cc @mariosasko in case you have an opinion

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@juliendenize (Author)

> Thanks for the change!
>
> Since `return` in Python usually refers to what a function actually returns (here, load_dataset), I think we can use another word for the parameter. Maybe name it with_file_names?
>
> cc @mariosasko in case you have an opinion

I changed the argument name to your suggestion; I agree that it should be less confusing :)

@mariosasko (Collaborator) left a comment

Thanks! I've left some comments.

@lhoestq WDYT about returning a data file's name (the last part) instead of the full path? This way we could have the same values in the streaming and the non-streaming mode. (In the non-streaming mode, we would also have to iterate over remote files to not output the files' hash (from the HF cache))

@@ -64,10 +64,13 @@ def _generate_tables(self, files):
    try:
        for batch_idx, record_batch in enumerate(pa.ipc.open_stream(f)):
            pa_table = pa.Table.from_batches([record_batch])
            pa_table = self._cast_table(pa_table)
            if self.config.return_file_name:
                pa_table = pa_table.append_column("file_name", pa.array([file] * len(pa_table)))
@mariosasko (Collaborator)

Let's check here if the file_name column is already present in the table and raise an error if it is.

datasets does not support accessing columns with integers as pyarrow does, so multiple columns with the same name are not allowed.

@juliendenize (Author)

I added the test for the different datasets, let me know if you want to change the error message.

Comment on lines +127 to +137
if len(pa_table) > 0
else pa.nulls(0, pa.string()),
@mariosasko (Collaborator)

I think we ensure there is some data in the table before yielding it, so this is not needed

@juliendenize (Author)

If I remove the condition here, the tests fail, so I left it in.

src/datasets/packaged_modules/text/text.py (outdated review thread, resolved)
src/datasets/packaged_modules/arrow/arrow.py (outdated review thread, resolved)
@@ -19,6 +19,7 @@ class ParquetConfig(datasets.BuilderConfig):
    batch_size: int = 10_000
    columns: Optional[List[str]] = None
    features: Optional[datasets.Features] = None
    with_file_names: bool = False
@mariosasko (Collaborator)

The parquet builder performs a cast if features are defined in the file's metadata. So you need to declare a feature for the file-name column in the self.info.features dictionary to avoid an error in this scenario.

PS: you can reproduce the error by creating a dataset, saving it with to_parquet("path/to/parquet_file"), and then loading it with load_dataset("parquet", data_files="path/to/parquet_file", with_file_names=True).

@juliendenize (Author)

This particular error is fixed. However, I did not test every use case, as I am not familiar enough with the library; if you have more suggestions to improve the tests, I'll work on them.

@juliendenize (Author)

juliendenize commented Nov 12, 2023

> Thanks! I've left some comments.
>
> @lhoestq WDYT about returning a data file's name (the last part) instead of the full path? This way we could have the same values in the streaming and the non-streaming mode. (In the non-streaming mode, we would also have to iterate over remote files to not output the files' hash (from the HF cache))

Concerning returning only the last part of the file name: do you have suggestions on how to do that? Files located in different folders can share the same name, so I am wondering what the way to go would be.
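The collision concern can be illustrated in a couple of lines (the paths are made up):

```python
import os

paths = ["data/en/train.txt", "data/fr/train.txt"]
names = [os.path.basename(p) for p in paths]
# Two distinct files collapse to the same name once directories are dropped.
assert names == ["train.txt", "train.txt"]
```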

Successfully merging this pull request may close these issues.

Return the name of the currently loaded file in the load_dataset function.
4 participants