
Improve Preprocessing Speed #27

Open
aneesha opened this issue Aug 31, 2021 · 4 comments

Comments

@aneesha

aneesha commented Aug 31, 2021

Preprocessing currently takes a long time for large datasets. One way to improve the speed is to use spaCy pipes, particularly for lemmatization. The Preprocessing class is very useful and can do a lot with just simple argument configuration.

```python
for doc in spacy_nlp.pipe(documents, batch_size=32, n_process=3, disable=["parser", "ner"]):
    # Lemmatize each token and convert to lower case if the token is not a pronoun
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in doc]

    # Remove stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]
    processed_documents.append(tokens)
```
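For context, here is a self-contained sketch of the batched-pipe idea. The model name `en_core_web_sm`, the sample `documents`, and the `stop_words`/`punctuations` definitions are illustrative assumptions, not what the library actually configures:

```python
# Minimal, runnable sketch of the batched spaCy pipe approach.
# Assumption: spaCy is installed; if the en_core_web_sm model is missing,
# fall back to a blank English pipeline (tokenization only, empty lemmas).
import string
import spacy

try:
    spacy_nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
except OSError:
    spacy_nlp = spacy.blank("en")  # no lemmatizer; lemma_ comes back empty

stop_words = spacy_nlp.Defaults.stop_words
punctuations = set(string.punctuation)

documents = ["The cats are running quickly.", "Dogs chased the cats!"]

processed_documents = []
# pipe() batches documents internally; n_process > 1 adds multiprocessing,
# which only pays off on large corpora because of process start-up cost.
for doc in spacy_nlp.pipe(documents, batch_size=32):
    # spaCy 2.x marks pronoun lemmas as "-PRON-"; spaCy 3.x lemmatizes
    # pronouns directly, so the fallback branch is a no-op there.
    tokens = [w.lemma_.lower().strip() if w.lemma_ != "-PRON-" else w.lower_
              for w in doc]
    # Drop empties, stop words, and punctuation
    tokens = [t for t in tokens
              if t and t not in stop_words and t not in punctuations]
    processed_documents.append(tokens)

print(processed_documents)
```

Compared with calling `spacy_nlp(text)` once per document, `pipe()` avoids per-call overhead and lets spaCy batch work across documents, which is where the speedup for large datasets comes from.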

I'm happy to contribute code to make this change.

@silviatti
Collaborator

Hi!
Lemmatization is definitely the biggest bottleneck in preprocessing. I didn't know about spaCy pipes. They seem like the right solution for us, since we already rely on spaCy for lemmatization.

If you want to contribute, feel free to open a pull request :) Thanks,

Silvia

@aneesha
Author

aneesha commented Sep 3, 2021

Thanks - I'll work on this and submit a pull request.

@silviatti
Collaborator

Thank you! Let me know if you have any questions.

Silvia

@SaraAmd

SaraAmd commented Feb 1, 2023

How are we supposed to generate the vocabulary.tsx file in order to use the `dataset = preprocessor.preprocess_dataset(documents_path=r'..\corpus.txt', labels_path=r'..\labels.txt')` method for preprocessing?
