Using [text,labels] instead of just [text] in Datasets #21

Open
imthebilliejoe opened this issue Jan 3, 2023 · 1 comment

Comments

@imthebilliejoe

Hi, I'd like to start with a big thanks for your amazing work. I would like to use your library to fine-tune GPT-Neo on a text-to-text task instead of text generation. I'm trying to adapt your run_clm.py script to handle a dataset with a [text, label] structure rather than just [text].

So I'm now trying to create a train_dataset from two new tokenized datasets, built this way:

    def tokenize_function_text(examples):
        return tokenizer(examples["text"])

    tokenized_datasets_text = datasets.map(
        tokenize_function_text,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )

    def tokenize_function_label(examples):
        return tokenizer(examples["label"])

    tokenized_datasets_label = datasets.map(
        tokenize_function_label,
        batched=True,
        num_proc=data_args.preprocessing_num_workers,
        remove_columns=column_names,
        load_from_cache_file=not data_args.overwrite_cache,
    )

But I'm really struggling to combine them into a single train_dataset object that I can pass to the trainer. Do you have any tips or suggestions?

Thank you very much.

@imthebilliejoe
Author

In case someone else is trying to do the same: I solved the issue by adapting the code from this article:

http://mohitmayank.com/a_lazy_data_science_guide/natural_language_processing/GPTs/#finetuning-gpt-2-for-sentiment-classification
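
For anyone who wants a starting point without following the link, here is a minimal sketch of one common way to feed [text, label] pairs to a causal LM. It is not the code from the article above and not the thread author's final solution. Instead of tokenizing the two columns into separate datasets, a single batched function joins each text and label into one sequence and sets the `labels` entries for the text part to -100, the value PyTorch's cross-entropy loss ignores by default, so only the label tokens contribute to the loss. The names `build_features`, `toy_tokenize`, and `eos_token_id=0` are hypothetical and only there to keep the example self-contained; the toy whitespace tokenizer stands in for a real one.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def build_features(examples, tokenize, eos_token_id):
    """Batched map function (hypothetical helper).

    Merges the 'text' and 'label' columns into a single causal-LM example:
    input_ids = text tokens + label tokens + EOS, with the loss masked
    on the text tokens.
    """
    features = {"input_ids": [], "attention_mask": [], "labels": []}
    for text, label in zip(examples["text"], examples["label"]):
        prompt_ids = tokenize(text)
        target_ids = tokenize(label) + [eos_token_id]
        input_ids = prompt_ids + target_ids
        features["input_ids"].append(input_ids)
        features["attention_mask"].append([1] * len(input_ids))
        # Mask the prompt so only the label tokens are trained on.
        features["labels"].append([IGNORE_INDEX] * len(prompt_ids) + target_ids)
    return features


# Toy whitespace tokenizer (assumption, for illustration only):
# each new word gets the next free integer id, starting at 1.
vocab = {}


def toy_tokenize(text):
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]


batch = {"text": ["great movie", "bad"], "label": ["positive", "negative"]}
features = build_features(batch, toy_tokenize, eos_token_id=0)
print(features["input_ids"])  # [[1, 2, 3, 0], [4, 5, 0]]
print(features["labels"])     # [[-100, -100, 3, 0], [-100, 5, 0]]
```

With a real tokenizer, `tokenize` could be something like `lambda s: tokenizer(s)["input_ids"]`, and the function could be passed to `datasets.map(..., batched=True, fn_kwargs=...)` in place of the two separate tokenize functions. The sequences have different lengths, so batching still needs a collator that pads `labels` with -100 as well (for example `DataCollatorForSeq2Seq`); run_clm.py's fixed-size block grouping would not apply to this layout as-is.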
