New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add return_file_name in load_dataset #6310
base: main
Are you sure you want to change the base?
Conversation
09c2ee6
to
00cd92f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change !
Since return
in python often refers to what is actually returned by the function (here load_dataset
), I think we can use another word for the parameter. Maybe name it with_file_names
?
cc @mariosasko in case you have an opinion
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. |
00cd92f
to
190d3de
Compare
I changed the argument name to your suggestion, I agree that it should be less confusing :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I've left some comments.
@lhoestq WDYT about returning a data file's name (the last part) instead of the full path? This way we could have the same values in the streaming and the non-streaming mode. (In the non-streaming mode, we would also have to iterate over remote files to not output the files' hash (from the HF cache))
@@ -64,10 +64,13 @@ def _generate_tables(self, files): | |||
try: | |||
for batch_idx, record_batch in enumerate(pa.ipc.open_stream(f)): | |||
pa_table = pa.Table.from_batches([record_batch]) | |||
pa_table = self._cast_table(pa_table) | |||
if self.config.return_file_name: | |||
pa_table = pa_table.append_column("file_name", pa.array([file] * len(pa_table))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's check here if the file_name
column is already present in the table and raise an error if it is.
datasets
does not support accessing columns with integers as pyarrow
does, so multiple columns with the same name are not allowed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test for the different datasets, let me know if you want to change the error message.
if len(pa_table) > 0 | ||
else pa.nulls(0, pa.string()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we ensure there is some data in the table before yielding it, so this is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here if I remove the condition it fails so I left it.
@@ -19,6 +19,7 @@ class ParquetConfig(datasets.BuilderConfig): | |||
batch_size: int = 10_000 | |||
columns: Optional[List[str]] = None | |||
features: Optional[datasets.Features] = None | |||
with_file_names: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parquet
builder performs a cast if features
are defined in the file's metadata. So, you need to specify a feature for with_file_names
in the self.info.features
dictionary to avoid an error in this scenario.
PS: You can reproduce the error by creating a dataset, saving it with to_parquet("path/to/parquet_file")
, and then loading it with load_dataset("parquet", data_files="path/to/parquet_file", with_file_name=True)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular error is fixed, however I did not test every use cases as I am not familiar enough with the library, if you have more suggestions to improve tests I'll work on it
190d3de
to
e9e157c
Compare
e9e157c
to
fb53e7e
Compare
Concerning the last part of the file name, do you have suggestions on how to do that? Because it can happen that the files are located in different folders with the same name so I am wondering what would be the way to go. |
Proposition to fix #5806.
Added an optional parameter
return_file_name
in the dataset builder config. When set toTrue
, the function will include the file name corresponding to the sample in the returned output.There is a difference between arrow-based and folder-based datasets to return the file name:
dataset.info.features
has the entryfile_name
and the original file name is passed to thesample_metadata
dictionary.The difference in behavior might be a concern, also I do not know whether the
file_name
should return the original file path or the downloaded one for folder-based datasets.I added some tests for the datasets that already had a test file.