
Improve Preprocessing Speed #27

Open
aneesha opened this issue Aug 31, 2021 · 4 comments

Comments

@aneesha

aneesha commented Aug 31, 2021

Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization. The Preprocessing class is very useful and can do a lot with just simple argument configuration.

```python
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]

    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)
```
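For context, here is a self-contained sketch of the batched-pipe idea. The model name `en_core_web_sm`, the sample `documents`, and the `stop_words`/`punctuations` definitions are illustrative assumptions, not what the library actually configures:

```python
# Minimal, runnable sketch of the batched spaCy pipe approach.
# Assumption: spaCy is installed; if the en_core_web_sm model is missing,
# fall back to a blank English pipeline (tokenization only, empty lemmas).
import string
import spacy

try:
    spacy_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
except OSError:
    spacy_nlp = spacy.blank("en")  # no lemmatizer; lemma_ comes back empty

stop_words = spacy_nlp.Defaults.stop_words
punctuations = set(string.punctuation)

documents = ["The cats are running quickly.", "Dogs chased the cats!"]

processed_documents = []
# pipe() batches documents internally; n_process > 1 adds multiprocessing,
# which only pays off on large corpora because of process start-up cost.
for doc in spacy_nlp.pipe(documents, batch_size=32):
    # spaCy 2.x marks pronoun lemmas as "-PRON-"; spaCy 3.x lemmatizes
    # pronouns directly, so the fallback branch is a no-op there.
    tokens = [w.lemma_.lower().strip() if w.lemma_ != "-PRON-" else w.lower_
              for w in doc]
    # Drop empties, stop words, and punctuation
    tokens = [t for t in tokens
              if t and t not in stop_words and t not in punctuations]
    processed_documents.append(tokens)

print(processed_documents)
```

Compared with calling `spacy_nlp(text)` once per document, `pipe()` avoids per-call overhead and lets spaCy batch work across documents, which is where the speedup for large datasets comes from.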

I'm happy to contribute code to make this change.

@silviatti
Collaborator

Hi!
Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes. They seem like the right solution for us, since we already rely on spaCy for lemmatization.

If you want to contribute, feel free to open a pull request :) Thanks,

Silvia

@aneesha
Author

aneesha commented Sep 3, 2021

Thanks - I'll work on this and submit a pull request.

@silviatti
Collaborator

Thank you! Let me know if you have any questions.

Silvia

@SaraAmd

SaraAmd commented Feb 1, 2023

How are we supposed to generate the vocabulary.tsx file in order to use the `dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')` method for preprocessing?
