This repository has been archived by the owner on Feb 20, 2024. It is now read-only.

A simple Ruby script that contains a GUI desktop application providing typical NLP tasks ready to apply on English or German text files. Available for macOS, Windows and Linux.


joh-ga/RubyCrumbler


⚠️ Note: This project was archived in February 2024 and is no longer maintained.

Ready to crumble your text for common NLP tasks? This repository is home to RubyCrumbler, a simple downloadable script that provides a GUI desktop application written in Ruby for applying common Natural Language Processing (NLP) tasks to your English or German text files.

Requirements

The script may also run with older Ruby versions. It was successfully tested with Ruby 2.7 on Linux. You're welcome to give us feedback on whether it runs with other older versions.
Note: Before using RubyCrumbler, make sure you have downloaded the respective spaCy models (EN: en_core_web_lg, DE: de_core_news_lg).

Linux:

  • If an error occurs while installing the tk gem on Linux, try installing the tk-dev package.
  • If an error occurs while installing ruby-spacy, make sure that Python and the spaCy library are installed.
  • Make sure that the ruby-dev package is installed.

GUI

[Screenshots of the GUI on macOS, Windows, and Linux]

Issues & Future Tasks

General:

  • The width of the GUI window cannot currently be reduced. We recommend using the application in full-screen mode.
  • Using threads to run multiple tasks concurrently.
  • Adding stemming as a feature in the NLP pipeline.
  • We recommend that texts are encoded in UTF-8.
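
The UTF-8 recommendation above can be checked before loading a file. The following is a minimal sketch using only core Ruby; the helper names `utf8_text?` and `to_utf8` are hypothetical and not part of RubyCrumbler, and ISO-8859-1 as the fallback source encoding is an assumption.

```ruby
# Hypothetical helper (not part of RubyCrumbler):
# returns true if the raw bytes form valid UTF-8.
def utf8_text?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Hypothetical helper: re-encode bytes from an assumed source encoding
# (ISO-8859-1 by default, which is only a guess) to UTF-8.
def to_utf8(bytes, from: Encoding::ISO_8859_1)
  bytes.dup.force_encoding(from).encode(Encoding::UTF_8)
end
```

A file could be checked with `utf8_text?(File.binread(path))`, where `path` is whatever text file you plan to load.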

macOS:

  • The URL in the File Upload area can only be inserted into the field via right click and "paste". The shortcut "cmd/ctrl + v" does not work.

Description of Features

Pre-Processing
Data Cleaning: This includes removing redundant whitespace, redundant punctuation (repeated dots), special symbols (e.g., line break, new line), hashtags, HTML tags, and URLs.
Normalization: This includes removing punctuation symbols (dot, colon, comma, semicolon, exclamation and question mark).
Normalization (lowercase): This includes removing punctuation symbols (dot, colon, comma, semicolon, exclamation and question mark) as well as converting the text into lowercase.
Normalization (contractions): This includes removing punctuation symbols (dot, colon, comma, semicolon, exclamation and question mark) as well as converting contractions (shortened forms of a sequence of words, such as “don’t”) into their full form (e.g., “do not”). Note: German contractions are always expanded with the definite article, and only very colloquial contractions are covered (e.g., “unterm” becomes “unter dem”). Contractions like “zum” are not expanded to “zu dem”, because expressions like “zum Beispiel” usually need to remain unchanged. The list of contractions can be found in the source code and can be customized as needed.
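
The pre-processing steps described above can be approximated with plain Ruby string operations. The sketch below is illustrative only and is not RubyCrumbler's actual implementation: the method names `clean_text` and `normalize`, the regular expressions, and the three-entry contraction table are all assumptions made for this example.

```ruby
# Illustrative contraction table (an assumption for this sketch);
# RubyCrumbler keeps its own, longer list in the source code.
CONTRACTIONS = {
  "don't"  => "do not",
  "can't"  => "cannot",
  "unterm" => "unter dem"
}.freeze

# Hypothetical data-cleaning helper.
def clean_text(text)
  cleaned = text.dup
  cleaned.gsub!(/<[^>]+>/, " ")          # HTML tags (before URLs, so href values go too)
  cleaned.gsub!(%r{https?://\S+}, " ")   # URLs
  cleaned.gsub!(/#\w+/, " ")             # hashtags
  cleaned.gsub!(/\.{2,}/, ".")           # redundant dots
  cleaned.gsub!(/\s+/, " ")              # line breaks, tabs, repeated spaces
  cleaned.strip
end

# Hypothetical normalization helper covering the three variants.
def normalize(text, lowercase: false, contractions: false)
  result = text.dup
  if contractions
    result.tr!("’", "'")                 # treat curly and straight apostrophes alike
    CONTRACTIONS.each do |short, long|
      # Case-insensitive match; the expansion is inserted in lowercase.
      result.gsub!(/\b#{Regexp.escape(short)}\b/i, long)
    end
  end
  result.gsub!(/[.:,;!?]/, "")           # dot, colon, comma, semicolon, ! and ?
  result = result.downcase if lowercase
  result
end

cleaned = clean_text("<p>I don’t   know...</p>\nSee https://example.com #nlp")
puts normalize(cleaned, lowercase: true, contractions: true)
```

Contractions are expanded before punctuation is stripped so that the apostrophe is still present when the table is consulted.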

Natural Language Processing – Tasks
Tokenization: This includes splitting the pre-processed data into individual characters or tokens.
Stopword Removal: Stopwords are words that do not carry much meaning but are important grammatically, for example “to” or “but”. This feature includes the removal of stopwords.
Lemmatization: This involves the reduction of words to their semantic base forms by the elimination of inflectional suffixes such as plural markers on nouns or verb form markers. Irregular verb roots are replaced by the infinitive form. Word classes derived from a base form (e.g. adverbs derived from adjectives) are allocated to their respective lemmas. Examples: computing – compute, sung – sing, obviously – obvious.
Part-of-Speech Tagging (POS): This includes identifying and labeling the parts of speech of text data.
Named Entity Recognition (NER): This includes labeling the so-called named entities in the data such as persons, organizations, and places. Note: In order to better identify named entities, it is recommended not to convert the text to only lowercase letters during pre-processing (i.e., do not apply "Normalization (lowercase)").
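
To make the Tokenization and Stopword Removal steps above concrete, here is a deliberately simplified sketch in plain Ruby. RubyCrumbler relies on spaCy models for its NLP tasks; the whitespace tokenizer, the tiny stopword list, and the method names below are stand-ins chosen for illustration, not the project's code.

```ruby
require "set"

# Tiny illustrative stopword list (an assumption, not the list used by spaCy).
STOPWORDS = Set.new(%w[to but the a an and or of]).freeze

# Hypothetical stand-in tokenizer: splits on whitespace only.
def tokenize(text)
  text.split
end

# Drops every token whose lowercase form is in the stopword set.
def remove_stopwords(tokens, stopwords = STOPWORDS)
  tokens.reject { |token| stopwords.include?(token.downcase) }
end

tokens = tokenize("I want to go but the bus left")
p remove_stopwords(tokens)
```

A real tokenizer also separates punctuation and handles language-specific cases, which is why the pre-processing steps are applied first.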

File Naming Convention

To make it easy to identify and locate your converted document for each applied feature, the following file naming convention is used.
Abbreviations are added to the source file name to indicate the features that have been applied to the document. The suffix of the new file name indicates the output file for the corresponding feature. For example, the file named "myfirsttext_cl_nlc_tok.txt" is the output file of the tokenization step.

Overview of the feature abbreviations:

  • Data cleaning = cl
  • Normalization = n
  • Normalization (lowercase) = l
  • Normalization (contractions) = c
  • Tokenization = tok
  • Stopword Removal = sw
  • Lemmatization = lem
  • Part-of-Speech Tagging = pos
  • Named Entity Recognition = ner
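
The naming convention can be sketched as a small helper that builds an output file name from the applied features. This is an illustration, not RubyCrumbler's code: the symbol names and the method `output_file_name` are hypothetical, and the rule that the normalization abbreviations are joined into one segment ("nlc") while other features are separated by underscores is inferred from the single example "myfirsttext_cl_nlc_tok.txt".

```ruby
# Abbreviations from the overview above, in assumed pipeline order.
FEATURE_ABBREVIATIONS = {
  cleaning: "cl",
  normalization: "n",
  lowercase: "l",
  contractions: "c",
  tokenization: "tok",
  stopword_removal: "sw",
  lemmatization: "lem",
  pos_tagging: "pos",
  ner: "ner"
}.freeze

NORMALIZATION_FEATURES = %i[normalization lowercase contractions].freeze

# Hypothetical helper: builds the output file name for a set of applied features.
def output_file_name(source, features, ext: ".txt")
  base = File.basename(source, ".*")
  ordered = FEATURE_ABBREVIATIONS.keys & features   # keep pipeline order
  segments = ordered
    .chunk_while { |a, b| NORMALIZATION_FEATURES.include?(a) && NORMALIZATION_FEATURES.include?(b) }
    .map { |group| group.map { |f| FEATURE_ABBREVIATIONS.fetch(f) }.join }
  return "#{base}#{ext}" if segments.empty?
  "#{base}_#{segments.join('_')}#{ext}"
end

puts output_file_name(
  "myfirsttext.txt",
  %i[cleaning normalization lowercase contractions tokenization]
)
```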

For each feature step the output format is TXT. POS tagging and NER are additionally saved in CSV and XML output format.
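
As a rough illustration of the additional CSV output, the sketch below serializes token/tag pairs with Ruby's `csv` library. The column names `token` and `pos` and the method `pos_tags_to_csv` are assumptions; the actual column layout of RubyCrumbler's CSV files is not documented here.

```ruby
require "csv"

# Hypothetical helper: turns [token, tag] pairs into a CSV string.
# The header row ("token", "pos") is an assumed layout.
def pos_tags_to_csv(tagged_tokens)
  CSV.generate do |csv|
    csv << %w[token pos]
    tagged_tokens.each { |token, tag| csv << [token, tag] }
  end
end

puts pos_tags_to_csv([%w[Ruby PROPN], %w[rocks VERB]])
```

The resulting string can be written to disk with `File.write`, using a file name that follows the convention above.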

Pipeline Structure of RubyCrumbler

The program is built based on the following pipeline structure.
[Figure: pipeline structure of RubyCrumbler]
