Return the name of the currently loaded file in the load_dataset function. #5806

Open
s-JoL opened this issue Apr 28, 2023 · 13 comments · May be fixed by #6170 or #6310
Assignees
Labels: enhancement (New feature or request), good first issue (Good for newcomers)

Comments

@s-JoL

s-JoL commented Apr 28, 2023

Feature request

Add an optional parameter return_file_name in the load_dataset function. When it is set to True, the function will include the name of the file corresponding to the current line as a feature in the returned output.

Motivation

When training large language models, machine failures may interrupt the training process. In such cases, it is common to load a previously saved checkpoint and resume training. I would like to be able to obtain the names of the data shards that have already been trained on, so that I can skip them when training continues and avoid both overfitting and wasted training time.

Your contribution

I currently use a dataset in jsonl format, so I am primarily interested in the json format. I suggest adding the file name to the returned table here https://github.com/huggingface/datasets/blob/main/src/datasets/packaged_modules/json/json.py#L92.
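Until such a parameter exists, the intended behavior can be approximated outside the library. The following is only a minimal sketch using the Python standard library, not the `datasets` implementation: the function name `iter_jsonl_with_file_name` and its `skip` argument are hypothetical, and it assumes each shard is a JSON Lines file with one JSON object per line.

```python
import json
from pathlib import Path
from typing import Collection, Iterable, Iterator


def iter_jsonl_with_file_name(
    files: Iterable[str], skip: Collection[str] = ()
) -> Iterator[dict]:
    """Yield each JSON Lines record with a ``file_name`` key added.

    Hypothetical helper: shards whose base name appears in ``skip``
    (for example, shards consumed before a checkpoint) are never opened.
    """
    for path in files:
        name = Path(path).name
        if name in skip:
            continue  # shard already trained on, so skip it entirely
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # ignore blank lines
                    record = json.loads(line)
                    record["file_name"] = name
                    yield record
```

When resuming, the set of shard names recorded with the checkpoint would be passed as `skip`, so only unseen shards are read.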

s-JoL added the enhancement label on Apr 28, 2023
@mariosasko
Collaborator

Implementing this makes sense (e.g., tensorflow_datasets' imagefolder returns image filenames). Also, in Datasets 3.0, we plan only to store the bytes of an image/audio, not its path, so this feature would be useful when the path info is still needed.

@tsabbir96

Hey @mariosasko, can I work on this issue? It seems interesting to implement. I have contributed to JupyterLab recently and would love to contribute here as well.

@albertvillanova
Member

albertvillanova commented May 16, 2023

@tsabbir96 if you are planning to start working on this, you can take on this issue by writing a comment with only the keyword: #self-assign

@tsabbir96

#self-assign

@tsabbir96

@albertvillanova thank you for letting me contribute here.
@albertvillanova @mariosasko As I am totally new to this repo, could you tell me something more about this issue or perhaps give me some idea on how I can proceed with it? Thanks!

@EduardoPach

Hello there, is this issue resolved? @tsabbir96, are you still working on it? If not, I would love to give it a try.

@mariosasko
Collaborator

@EduardoPach This issue is still relevant, so feel free to work on it.

@EduardoPach

Hey @mariosasko, I've taken some time to look at how datasets are usually loaded. My main question now is about the final solution.

So the idea is that whenever a dataset is loaded, the builders' _generate_tables() method also adds a new column called filename (or file_name) that records which file in the split each row came from, right?

Do you have any suggestions on where I could add that?

@BattiniSandeep

Is this issue still open? If yes, I'd like to work on it. Thanks

@EduardoPach

EduardoPach commented Sep 28, 2023

> Is this issue still open? If yes, I'd like to work on it. Thanks

Definitely still open. I gave it a try, but I didn't get any feedback on my last question, so I stopped. Feel free to work on it.

@mariosasko
Collaborator

It's still open, so feel free to work on it. This can be implemented by adding a param to the packaged builders' configs that inserts a column with file names (in _generate_tables) when True. Naming this column file_name sounds good to me.

juliendenize linked a pull request on Oct 17, 2023 that will close this issue
@aniruddh-23

Hi, is this issue still open? I see no activity since September, but it shows that it is still assigned to tsabbir96. If tsabbir96 is not planning to continue, can I get it assigned to me?

@SWHL

SWHL commented Apr 12, 2024

Looking forward to your implementation. I also really need this feature.
Thanks

8 participants