Return the name of the currently loaded file in the load_dataset function. #5806

s-JoL · 2023-04-28T13:50:15Z

Feature request

Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.
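To make the request concrete, here is a minimal sketch of the desired output shape using only the standard library. It does not use the datasets API; the helper read_jsonl_with_file_name and the file shard_0.jsonl are hypothetical names chosen for illustration.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl_with_file_name(paths):
    """Yield each JSON Lines record with an extra 'file_name' key (hypothetical helper)."""
    for path in paths:
        path = Path(path)
        with path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():  # ignore blank lines
                    record = json.loads(line)
                    record["file_name"] = path.name  # the feature this issue asks for
                    yield record


# Demo on a throwaway shard written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "shard_0.jsonl"
    shard.write_text('{"text": "hello"}\n{"text": "world"}\n', encoding="utf-8")
    rows = list(read_jsonl_with_file_name([shard]))

print(rows)
```

Each row carries the name of the shard it was read from, which is roughly what return_file_name=True would be expected to produce.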

Motivation

When training large language models, machine failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint to resume training. I would like to be able to obtain the names of the data shards that have already been trained on, so that I can skip them when training continues, avoiding both overfitting and wasted training time.
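As a rough sketch of that resume logic, assuming each row already carries a file_name field and that the checkpoint stores the set of fully consumed shards (both are assumptions, and skip_finished_shards is a hypothetical helper, not part of datasets):

```python
def skip_finished_shards(rows, finished_files):
    """Drop rows whose 'file_name' belongs to a shard that was already trained on."""
    return (row for row in rows if row["file_name"] not in finished_files)


rows = [
    {"text": "a", "file_name": "shard_0.jsonl"},
    {"text": "b", "file_name": "shard_0.jsonl"},
    {"text": "c", "file_name": "shard_1.jsonl"},
]

# Assumption: the checkpoint recorded that shard_0.jsonl was fully consumed.
finished_files = {"shard_0.jsonl"}

remaining = list(skip_finished_shards(rows, finished_files))
print(remaining)
```

Only rows from shards not listed in finished_files are kept, so training would resume from shard_1.jsonl in this toy case.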

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.

mariosasko · 2023-05-08T17:10:26Z

Implementing this makes sense (e.g., tensorflow_datasets' imagefolder returns image filenames). Also, in Datasets 3.0, we plan only to store the bytes of an image/audio, not its path, so this feature would be useful when the path info is still needed.

tsabbir96 · 2023-05-16T10:49:30Z

Hey @mariosasko, can I work on this issue? It seems interesting to implement. I contributed to JupyterLab recently and would love to contribute here as well.

albertvillanova · 2023-05-16T13:02:41Z

@tsabbir96 if you are planning to start working on this, you can take on this issue by writing a comment with only the keyword: #self-assign

tsabbir96 · 2023-05-16T16:26:09Z

#self-assign

tsabbir96 · 2023-05-16T16:29:34Z

@albertvillanova thank you for letting me contribute here.
@albertvillanova @mariosasko As I am totally new to this repo, could you tell me something more about this issue or perhaps give me some idea on how I can proceed with it? Thanks!

EduardoPach · 2023-07-25T21:43:38Z

Hello there, is this issue resolved? @tsabbir96, are you still working on it? Otherwise, I would love to give it a try.

mariosasko · 2023-07-26T16:59:30Z

@EduardoPach This issue is still relevant, so feel free to work on it.

EduardoPach · 2023-07-28T22:08:17Z

Hey @mariosasko, I've taken the time to take a look at how we load the datasets usually. My main question now is about the final solution.

So the idea is that whenever we load a dataset, the builders' _generate_tables() method also adds a new column called filename (or file_name) holding the name of the file each row came from within its split, right?

Do you have any suggestions of where I could add that?

BattiniSandeep · 2023-09-28T18:15:54Z

Is this issue still open? If yes, I'd like to work on it. Thanks

EduardoPach · 2023-09-28T18:53:31Z

> Is this issue still open? If yes, I'd like to work on it. Thanks

Definitely still open. I gave it a try, but I didn't get any feedback on my last question, so I stopped. Feel free to work on it.

mariosasko · 2023-09-29T17:49:53Z

It's still open, so feel free to work on it. This can be implemented by adding a param to the packaged builders' configs that inserts a column with file names (in _generate_tables) when True. Naming this column file_name sounds good to me.

aniruddh-23 · 2024-01-21T16:38:28Z

Hi, is this issue still open? I see no activity since September, but it still shows as assigned to tsabbir96. If tsabbir96 is not planning to continue, can I get it assigned to me?

SWHL · 2024-04-12T01:37:51Z

Looking forward to your implementation. I also really need this feature.
Thanks

s-JoL added the enhancement New feature or request label Apr 28, 2023

mariosasko added the good first issue Good for newcomers label May 8, 2023

s-JoL mentioned this issue May 15, 2023

Problems with continued pre-training s-JoL/Open-Llama#59

Closed

github-actions bot assigned tsabbir96 May 16, 2023

Amitesh-Patel linked a pull request Aug 23, 2023 that will close this issue

feat: Return the name of the currently loaded file #6170

Open

juliendenize linked a pull request Oct 17, 2023 that will close this issue

Add return_file_name in load_dataset #6310

Open
