Multi-pronged, multi-stage analysis of a 3.5M-sentence science fiction corpus using optimized NLP, with NER techniques, LDA modeling, and LLM integration. After the final commit, a main file can be run to generate a visualization of the results on demand. Modularized and documented code that can easily be reused/refitted for other kinds of corpora.

Deep NLP: SF Literature Analysis

Introduction

Welcome πŸ‘‹

The dataset I began with for this project, sourced largely from the Pulp Magazine Archive, is a huge collection of science fiction stories in a single-file text corpus, 149.33MB in raw size. Here's a link to the first book in the corpus, and here's a snapshot of how the text looks in PyCharm:

(Screenshot: the raw corpus text as it appears in PyCharm)

The stories span multiple decades and contain a variety of writing styles, themes, and ideas. They represent a good snapshot of 20th-century SF literature, and have demonstrated their usefulness before for awesome projects like Robin Sloan's Autocomplete.

I wanted to analyze the corpus itself, and in the process gain insights into the era of SF literature it represents. I decided to use a multi-pronged, multi-stage approach, in each step focusing on making my code as generalizable and well-documented as possible.

Key Highlights

  • Processed texts using customized methods, NLTK, and spaCy
  • Performed domain-specific named entity recognition in multiple stages
  • Fine-tuned a RoBERTa model on annotated data generated with GPT
  • Implemented multicore LDA for efficient topic modeling and theme extraction
  • Modularized code to make it highly reusable for other domain-specific literature tasks: it can easily be refitted for legal datasets, a corpus of classics, etc.

An example of GPT-generated annotated data (ain't it interesting? πŸ™‚):

[
 ('buddha', 'MISCELLANEOUS SIGNIFICANT'), ('pistol shots', 'TECHNOLOGY'),
 ('chart', 'TECHNOLOGY'), ('torpedo', 'TECHNOLOGY'),
 ('steel shelves', 'TECHNOLOGY'), ('southeast asia', 'MISCELLANEOUS SIGNIFICANT')
]

Some basic NER output:

                                                               Sentence         Entity   Entity_type
2294  plane cloud degree climb dozen mile towards philippine anyone yat...    dozen mile    QUANTITY
2295                                                    seen kyoto buddha          kyoto         GPE
2296                                 would make building quarter mile long  quarter mile    QUANTITY

An example of a cleaned string vs an uncleaned string:

         Sentences                                             Cleaned Sentences
    122  yet in the way they moved and in the way they stood   yet way moved way stood

How to use your own corpus

Step 0:

  • Insert a plain text file in the Input Files directory. There is no specific format it has to be in, but it works best with as plain a text file as possible. Check out small_sample_text_for_testing.txt to see what your corpus file should look like.

  • Go to CorpusProcessor_with_NER.py. Change the FILEPATH variable, right at the top of the code, to say "../Input Files/[your_file_name].txt". Replace your_file_name with your actual file name.

  • Make sure to uncomment the line saying nltk.download(), right above FILEPATH, if you don't have the NLTK data packs on your system. Here's how the code snippet at the top of the CorpusProcessor module should then look:

    nltk.download()  # Run this line whenever the project is to be executed for the first time. 
                     # Triggers nltk data-packs to download to the local system.
    
    # Path to the default corpus
    FILEPATH = '../Input Files/my_file.txt'

    Note how nltk.download() is not commented out now.

Step 1 πŸš€:

  • And that's it! You're ready to go. Run CorpusProcessor_with_NER as a script, and see your data get cleaned 🧼, ready for further processing. Then follow the exploration guide below.
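
If you'd rather drive CorpusProcessor from your own script instead of running the module directly, a minimal sketch might look like the following. The constructor call is an assumption for illustration; initial_prep and the attribute names are the ones described later in this README.

    # Hypothetical usage sketch -- the constructor signature is assumed, not taken from the repo
    from CorpusProcessor_with_NER import CorpusProcessor

    FILEPATH = '../Input Files/small_sample_text_for_testing.txt'

    processor = CorpusProcessor(FILEPATH)  # assumed: takes the path to a plain-text corpus
    processor.initial_prep()               # cleans/lemmatizes and fills the object's attributes

    # The prepped text lives in attributes, not in a return value:
    print(processor.list_of_sentences_cleaned[:5])
    print(len(processor.text_as_a_wordlist))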

Overview of the project

The steps I went through were as follows:

Step 1: Rigorous data preparation

You can find the code for this in CorpusProcessor_with_NER.py, here.

  • I split the text into units of sentences and tokens using NLTK. Tokens are analogous to words in that they are units of meaning: they make text better suited for Natural Language Processing (NLP). Here are some examples.

  • I clean the tokenized text by removing extra whitespace, very frequent words like "a", "an", and "the" that add little meaning, and so on. (A minimal standalone sketch of this kind of cleaning appears at the end of this list.) The function doing most of this work is clean_string. Its header looks like this in CorpusProcessor:

    def clean_string(self, text, only_remove_line_breaks=False, pos_tokens_if_lemmatizing=None, find_pos=False,
                     stem="None", working_on_stories=True):
        """Takes in a string and removes line breaks, punctuation, stop-words, numbers, and proceeds to stem/lemmatize.
        Returns the "cleaned" text finally. Capable of nuances depending on inputs."""

    Note that we have a working_on_stories parameter that defaults to True. The reason I made my cleaning methods so versatile and customizable is that for domain-specific datasets, merely removing "a, an, the" etc. is not sufficient. For example, the words "said," "replied," and "asked" might be just as common in a story corpus as a stop-word like "and". (I have not statistically modeled this. Feel free to drop me a comment about whether this is true πŸ™‚)

  • Besides cleaning, CorpusProcessor's initial_prep also lemmatizes the text, then generates representations of the corpus as a (humongous) list of words, as a list of sentences, and as a list of cleaned sentences. Code snippet:

    self.text_as_a_wordlist = list(itertools.chain.from_iterable(list_of_lists))  # Now we can easily use this wordlist
                                                                                      # for further processing
    self.list_of_sentences_original = list_of_sentences
    self.list_of_sentences_cleaned = cleaned_sentences

    By storing these various representations of the text in attributes, the CorpusProcessor object then allows us to flexibly perform numerous tasks with the text as the need arises. Indeed, as initial_prep itself will summarize for you if you are mistaken enough to try to get a return value from it ... πŸ™‚

    # Having some fun :)
    return ("This function does not return prepped text, but rather just preps the text to now be contained "
            "as the CorpusProcessor's attributes. Please call those attributes if you wish to see the cleaned text :-)")

    The main things I use CorpusProcessor's attributes for are to split the corpus into individual books (by splitting around valid appearances of 'Copyright' in the wordlist), so that I have separate documents for topic modeling via LDA, and to create a Pandas dataframe containing the 3.5 million (cleaned) sentences of the corpus: the dataframe is very useful for NER.
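
As promised above, here is a minimal standalone sketch of the kind of cleaning Step 1 performs, using NLTK directly. It is illustrative only: CorpusProcessor's clean_string handles more nuances, and the extra story-specific stop-words are an assumption.

    # Illustrative sketch, not CorpusProcessor itself
    import string

    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import sent_tokenize, word_tokenize

    nltk.download('punkt')
    nltk.download('stopwords')
    nltk.download('wordnet')

    text = "Yet in the way they moved, and in the way they stood."

    # Story-specific stop-words ("said", "replied", ...) are assumed here, mirroring
    # the working_on_stories idea described above.
    stop_words = set(stopwords.words('english')) | {"said", "replied", "asked"}
    lemmatizer = WordNetLemmatizer()

    cleaned_sentences = []
    for sentence in sent_tokenize(text):
        tokens = [t.lower() for t in word_tokenize(sentence)]
        tokens = [t for t in tokens
                  if t not in stop_words and t not in string.punctuation and not t.isdigit()]
        cleaned_sentences.append(" ".join(lemmatizer.lemmatize(t) for t in tokens))

    print(cleaned_sentences)  # roughly: ['yet way moved way stood'], matching the example above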

Step 2: Multi-stage named entity recognition (NER)

This turned out to be the most involved part of the project. You can begin exploring in NER.py, here.

  • NER is typically the process of extracting words or series of words from a text and categorizing them under common labels like "Person," "Organization," "Location," or custom labels like "Healthcare Terms," "Programming Languages" etc. For my corpus, I was primarily interested in extracting terms that represented SF technologies, SF concepts, and miscellaneous significant terms that contained plot- or theme-related meaning.

  • I began by performing a blanket NER pass across the whole corpus, using a Pandas dataframe, created earlier by CorpusProcessor, as my input. The dataframe contained the corpus as a list of sentences, and my optimized NER routine, using the spaCy library and its parallelization capabilities, generated a new dataframe that looks like this (a minimal sketch of such a spaCy pipeline appears after this list):

                                              Sentence          Entity Entity_type
    0  galaxy far, far away, Luke Skywalker tra...       Luke Skywalker      Person
    1  year 2050 marked new era humanity.                          2050        Date

    This gave me a lot of meaning-containing sentences with their general entities. spaCy by itself of course cannot do something as specialized as extracting SF technology- and concept-terms from the text (without a significant amount of training, at least), which took me to the next stage.

  • I prepared a module (check sentences_of_entities.py) to help me extract all original (uncleaned/unmodified) sentences from the corpus which contain any of the named entities found by spaCy: essentially, the original versions of the sentences which I fed to spaCy. For example, the above dataframe would now be modified to look like:

                                              Sentence          Entity Entity_type
    0  In a galaxy far, far away, Luke Skywalker tra...  Luke Skywalker      Person
    1  The year 2050 marked a new era for humanity.                2050        Date

    Note the return of words like "a", "the", "in", "for" etc., which for the previous stage had been removed as stop-words.

    • The reason for this is that I planned to integrate an LLM (large language model) like GPT-3.5 into my next steps, and LLMs generally do better on text that looks exactly like what a human would write than on bare-bones word strings.
  • At this point, I had a set of ~135k sentences. I further used scikit-learn's TfidfVectorizer and KMeans clustering to extract a diverse batch of ~10k sentences to represent this set.

    • Find the TF-IDF code in TF-IDF.
    • The reason we want to use ~10k sentences rather than ~135k (or 3.5 million, for that matter) is that we would have to pay for that API usage (OpenAI's software is proprietary) - not to mention the huge increases in processing time that come with talking with an LLM in real-time (it took ~2 hours to run 10k sentences past GPT-3.5! That was the longest processing time for any single stage of my code).
  • I then ran GPT_NER_Round_1.py, which you can find in this directory along with its prep and configuration modules. GPT_NER_Round_1 integrates few-shot learning with spacy-llm and the OpenAI API to extract terms under the TECHNOLOGY, CONCEPT, and MISCELLANEOUS SIGNIFICANT labels from our 10k-sentence sample dataset. The output is stored in GPTResponses.json, which is a list of Python dictionaries, each in this basic format:

{
  "span_of_para_in_sentence_list": [ "Beginning:0", "End:10" ],
  "entities_in_para": [ [ "second class matter", "CONCEPT" ], [ "howard browne", "MISCELLANEOUS SIGNIFICANT" ] ]
}
  • After getting this GPT-generated data, I use it as training data for fine-tuning a RoBERTa model. RoBERTa, unlike GPT-3.5, is an open-source model, which makes it perfectly suited to my goal of keeping the project independent and the budget as low as possible. Find RoBERTa documentation here.
  • The fine-tuned RoBERTa model then performs the remainder of our SF technology-, concept- and miscellaneous-term extraction for our ~135k entity-containing sentences from across the corpus.
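
As referenced above, here is a minimal sketch of a first-stage spaCy NER pass of this kind. It is not the repository's NER.py; the model name, batch size, and process count are assumptions.

    # Illustrative sketch of blanket NER with spaCy's parallelized pipe
    # (requires: python -m spacy download en_core_web_sm)
    import pandas as pd
    import spacy

    nlp = spacy.load("en_core_web_sm")  # model choice is an assumption

    sentences = [
        "In a galaxy far, far away, Luke Skywalker travelled alone.",
        "The year 2050 marked a new era for humanity.",
    ]

    rows = []
    # n_process and batch_size control spaCy's parallelization; tune them to your machine
    for sentence, doc in zip(sentences, nlp.pipe(sentences, n_process=2, batch_size=1000)):
        for ent in doc.ents:
            rows.append({"Sentence": sentence, "Entity": ent.text, "Entity_type": ent.label_})

    ner_df = pd.DataFrame(rows, columns=["Sentence", "Entity", "Entity_type"])
    ner_df.to_csv("NER_output2.csv", index=False)
    print(ner_df.head())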

The next steps, Step 3 and Step 4, are "features to come"; I am currently in the process of completing my RoBERTa fine-tuning.

Step 3: LDA topic modeling, time series analyses, and general collation

  • After RoBERTa-powered NER, we shall use Latent Dirichlet Allocation via the gensim library, along with scikit-learn, to model topics, themes, etc. for further analysis of this "substantial essence" of the corpus (represented by the technology-, concept-, and character-containing sentences).
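
A minimal sketch of what multicore LDA with gensim could look like at this stage, assuming each "document" is one book's cleaned tokens (the toy documents below are made up purely for illustration):

    # Illustrative sketch of multicore LDA with gensim
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore

    documents = [
        ["rocket", "mars", "colony", "oxygen", "dome"],
        ["robot", "brain", "circuit", "law", "machine"],
        ["galaxy", "empire", "fleet", "warp", "star"],
    ]

    dictionary = Dictionary(documents)
    bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

    lda = LdaMulticore(bow_corpus, id2word=dictionary, num_topics=3, passes=10, workers=3)
    for topic_id, words in lda.print_topics(num_words=5):
        print(topic_id, words)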

Step 4: Creation of final graphs and visualization of results, final commit

  • An app.py or main.py file will be created, containing code that a user or explorer of the repo can simply run to visualize our results. This shall complete the project.

Exploration guide

As of this writing, I am in the process of fine-tuning a RoBERTa model for my last stage of NER. If you wish to explore the project code, I suggest beginning with CorpusProcessor_with_NER.py in main files (which does not actually contain any NER implementation within itself yet, despite being fully functional otherwise). It contains comprehensive data-preprocessing methods for tasks ranging from lemmatization to custom string-cleaning. If you run it as a script, it generates a processed object for small_sample_text_for_testing.txt. I have not yet hosted my original corpus file on GitHub because of its size, but if you follow the documentation in CorpusProcessor itself, you can easily use your own corpus to test it out.

Most of the action in the project's current version is contained in the directories main files, TF-IDF, and LLM. In main files, after you've used CorpusProcessor, go to NER.py, which performs first-stage NER on the corpus using spaCy and stores the output in "NER_output2.csv" by default.

After performing basic NER, you can check out sentences_of_entities.py. If you run it as a script, it (attempts to) extract the famed 3.5M sentences from sci-fi_text_cleaned.csv, the CSV file representing a Pandas dataframe that contains the corpus as a list of sentences with their cleaned versions. From this list, sentences_of_entities then extracts all the sentences that contain any of the entities found by spaCy's NER. (Note that these entities are brought in from entity_list.txt, a file I created with a separate script after running NER.py. entity_list.txt is small enough to be included in the "main files" directory, so it is. Also note that in the current version of the code, "sci-fi_text_cleaned.csv" has not yet been added to main files. You may have to create this file, or slightly modify the code in sentences_of_entities, to have it work as a script.)
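
A minimal sketch of this extraction idea (not the repository's exact code; the column names are assumptions based on the dataframe example shown earlier):

    # Illustrative sketch: keep every original sentence whose cleaned version mentions a known entity
    import pandas as pd

    df = pd.read_csv("sci-fi_text_cleaned.csv")  # assumed columns: 'Sentences', 'Cleaned Sentences'
    with open("entity_list.txt", encoding="utf-8") as f:
        entities = {line.strip().lower() for line in f if line.strip()}

    def mentions_entity(sentence):
        lowered = str(sentence).lower()
        return any(entity in lowered for entity in entities)

    # For millions of sentences a compiled regex or set-based lookup would be faster;
    # this naive scan is just to show the idea.
    mask = df["Cleaned Sentences"].apply(mentions_entity)
    df.loc[mask, "Sentences"].to_csv("sents_containing_named_entities.txt",  # kept under LLM/ in the project
                                     index=False, header=False)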

Now the results of sentences_of_entities.py will be stored in the LLM directory, in sents_containing_named_entities.txt. But don't go to the LLM directory just yet: first go to TF-IDF and run TF-IDF.py as a script. This will use scikit-learn's TfidfVectorizer and KMeans clustering to extract a diverse batch of 8,000-10,000 sentences from sents_containing_named_entities.txt (see the sketch after the note below). The diverse batch of sentences will be stored in representative_sents_for_GPT_labelling.txt, which stays in the TF-IDF directory.

  • The reason we use a representative sample, rather than the full 114,000+ entity-containing sentences, is that otherwise we would need a lot of OpenAI API tokens to run our round of custom NER using GPT-3.5 Turbo.
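
As referenced above, a minimal sketch of this representative-sampling step (not the repository's TF-IDF.py; the feature count, cluster count, and file handling are assumptions):

    # Illustrative sketch: TF-IDF vectors + KMeans, one representative sentence per cluster
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import pairwise_distances_argmin_min

    with open("sents_containing_named_entities.txt", encoding="utf-8") as f:
        sentences = [line.strip() for line in f if line.strip()]

    vectorizer = TfidfVectorizer(max_features=5000)
    X = vectorizer.fit_transform(sentences)

    n_clusters = min(10_000, len(sentences))  # target sample size (assumed)
    # For corpora of this size, MiniBatchKMeans would run much faster than vanilla KMeans
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)

    # Take the sentence closest to each cluster centroid as that cluster's representative
    closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, X)
    representatives = [sentences[i] for i in sorted(set(closest))]

    with open("representative_sents_for_GPT_labelling.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(representatives))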

Finally, you can now go to the LLM directory, open GPT_NER_round_1.py, and run it as a script. I have already created data you can use, so when the script prompts you, you can press any key except "Y" to generate RoBERTa training data from the existing GPT responses. If you want to see the real magic, however, first go to api_key.py and insert a valid OpenAI API key. Then run GPT_NER_round_1 again and answer "Y" when it asks you for input. You can then watch as we use spacy-llm and the OpenAI API to label and extract terms from the text as TECHNOLOGY, CONCEPT, or MISCELLANEOUS SIGNIFICANT. Then we tokenize this output appropriately and prepare to use it for fine-tuning a RoBERTa model, so that RoBERTa (documentation here) can perform the remainder of our required NER - for free.
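
For orientation, here is a minimal sketch of what fine-tuning RoBERTa for token classification can look like with Hugging Face Transformers. This is not the repository's fine-tuning code: the BIO label scheme, the toy training example, and all hyperparameters are assumptions; in the project the training data would come from GPTResponses.json.

    # Illustrative sketch of RoBERTa token-classification fine-tuning (assumed setup)
    from datasets import Dataset
    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              DataCollatorForTokenClassification, Trainer, TrainingArguments)

    labels = ["O", "B-TECHNOLOGY", "I-TECHNOLOGY", "B-CONCEPT", "I-CONCEPT",
              "B-MISC_SIGNIFICANT", "I-MISC_SIGNIFICANT"]
    label2id = {label: i for i, label in enumerate(labels)}
    id2label = {i: label for label, i in label2id.items()}

    tokenizer = AutoTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
    model = AutoModelForTokenClassification.from_pretrained(
        "roberta-base", num_labels=len(labels), id2label=id2label, label2id=label2id)

    # Toy example; the project would build this from the GPT-labelled spans
    train = Dataset.from_dict({
        "tokens": [["the", "torpedo", "struck", "the", "steel", "shelves"]],
        "ner_tags": [[0, 1, 0, 0, 1, 2]],
    })

    def tokenize_and_align(batch):
        enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
        # Sub-word pieces and special tokens get -100 so they are ignored by the loss
        enc["labels"] = [
            [tags[w] if w is not None else -100 for w in enc.word_ids(batch_index=i)]
            for i, tags in enumerate(batch["ner_tags"])
        ]
        return enc

    train = train.map(tokenize_and_align, batched=True, remove_columns=train.column_names)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="roberta_sf_ner", num_train_epochs=3),
        train_dataset=train,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )
    trainer.train()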

Note on the LLM directory: LLM_Interactor.py is the module that handles the creation of - you guessed it - an LLM interaction object. It configures spacy-llm - please check the documentation - using the configuration strings residing in config_strings_2.py, which in turn refers to fewshot.json for its few-shot training info. This means that LLM_Interactor and GPT_NER_round_1 are both very independent and malleable modules. By modifying fewshot.json, you can train GPT-3.5 to perform NER for any other kind of domain-specific task. You can also change the spacy-llm configuration by modifying config_strings_2.py, where you can modify the "labels" to be fed for your own domain-specific task. For example, you can change these lines:

    [components.llm.task]
    @llm_tasks = "spacy.NER.v3"
    labels = ["TECHNOLOGY", "CONCEPT", "MISCELLANEOUS SIGNIFICANT"]

to these lines:

    [components.llm.task]
    @llm_tasks = "spacy.NER.v3"
    labels = ["FINANCIAL ENTITY", "NON-FINANCIAL ENTITY"]

if you were doing NLP on finance-related texts. You can even change the LLM model used to any other model supported by spacy-llm.
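
For reference, here is a minimal sketch of assembling and running a spacy-llm pipeline from such a config. The config filename and the example sentence are placeholders, not the project's files.

    # Illustrative sketch of running a spacy-llm NER pipeline built from a config file
    import os

    from spacy_llm.util import assemble

    os.environ["OPENAI_API_KEY"] = "sk-..."  # the project loads its key from api_key.py instead

    nlp = assemble("ner_config.cfg")  # a config containing [components.llm.task] blocks like those above
    doc = nlp("The torpedo's guidance chart glowed on the steel shelves.")
    print([(ent.text, ent.label_) for ent in doc.ents])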

About the LDA modules: In main files, note that we also have LDA.py and LDATrainer.py. These are complete modules in themselves, which you can repurpose, test, or reuse as you wish.

  • To run LDATrainer as a script, change FILEPATH to "../Input Files/small_sample_text_for_testing.txt" or to a corpus of your own. The code snippet (before you edit it) should look as below:

    from time import time
    from CorpusProcessor_with_NER import CorpusProcessor

    # Path to SF corpus
    FILEPATH = '../Data Files (Readable)/Input Files/internet_archive_scifi_v3.txt'

After I have completed NER with fine-tuned RoBERTa, I shall integrate LDA modeling and topic extraction into the project workflow.

Note on main.py: Prior to the final commit, a main.py/app.py file will be created, containing code to generate a visualization of our results on-demand, thus completing the project.

For more about me, see my LinkedIn profile.
