Machine Learning to Identify Negative Sentiment Tweets about JWST Mission


  1. About
  2. Pre-Requisites
  3. Usage
  4. Notebooks
  5. Notes
  6. Project Organization

About

This project trains a binary ML classification model to classify sentiment in tweets regarding the James Webb Space Telescope (JWST) mission that were posted between December 30, 2021 and January 10, 2022. The model predicts whether tweets

  • need support
    • this is the minority class
    • corresponds to negative and neutral sentiment tweets
  • do not need support
    • this is the majority class
    • corresponds to positive sentiment tweets

from the mission support/communications team.

Tweets were streamed using AWS Kinesis Firehose and then

  • combined by date and hour
  • filtered to only capture tweets relating to the mission
  • text-processed using PySpark (see the sketch after this list)
  • divided into training, validation and testing splits
  • used to fine-tune a pre-trained transformers model to predict the above-mentioned binary outcome, i.e. whether a tweet needs support or not
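
The text-processing step uses PySpark string manipulation methods (see 5_process_data.ipynb below). As a rough sketch only, with assumed paths, column names and cleaning rules rather than the project's actual ones, it might look like:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("process-tweets").getOrCreate()

    # read the filtered tweets written by the previous step (path/column names assumed)
    tweets = spark.read.parquet("data/processed/filtered_tweets")

    processed = (
        tweets
        .withColumn("text", F.lower(F.col("text")))                         # lowercase
        .withColumn("text", F.regexp_replace("text", r"http\S+", ""))       # strip URLs
        .withColumn("text", F.regexp_replace("text", r"@\w+", ""))          # strip user mentions
        .withColumn("text", F.regexp_replace("text", r"[^a-z#\s]", " "))    # keep letters/hashtags
        .withColumn("text", F.trim(F.regexp_replace("text", r"\s+", " ")))  # collapse whitespace
    )
    processed.write.mode("overwrite").parquet("data/processed/processed_tweets")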

The value of using a ML-based approach to flag tweets needing support was estimated by calculating how much time would be

  • missed
  • wasted

if the fine-tuned transformer model was used to predict whether tweets in the test split needed support, compared to the corresponding predictions made using a naive alternative approach that did not use ML (i.e. randomly guessing whether tweets needed support). The ML-based approach was shown to deliver value by reducing both time missed and time wasted relative to the non-ML (naive, random guessing) approach.
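
As a loose illustration of how such time costs could be computed (the per-tweet time values and stand-in predictions below are invented for the example; the project's actual business-metric definitions live in the training notebook):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # hypothetical per-tweet time costs (minutes); the project's real values may differ
    MINUTES_MISSED_PER_FALSE_NEGATIVE = 5  # tweet needed support but was not flagged
    MINUTES_WASTED_PER_FALSE_POSITIVE = 2  # tweet was flagged but did not need support

    def support_time_cost(y_true, y_pred):
        """Estimate time missed and time wasted (minutes) for a set of predictions."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return {
            "time_missed": int(fn) * MINUTES_MISSED_PER_FALSE_NEGATIVE,
            "time_wasted": int(fp) * MINUTES_WASTED_PER_FALSE_POSITIVE,
        }

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1_000)                        # stand-in test-split labels
    y_ml = np.where(rng.random(1_000) < 0.05, 1 - y_true, y_true)  # stand-in model predictions
    y_naive = rng.integers(0, 2, size=1_000)                       # naive approach: random guessing

    print("ML-based approach:", support_time_cost(y_true, y_ml))
    print("Naive approach:   ", support_time_cost(y_true, y_naive))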

For full details about the background, motivation and implementation overview, please see the full project scope.

Pre-Requisites

  1. The following AWS (1, 2) and Twitter Developer API credentials

    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
    • AWS_REGION
    • AWS_S3_BUCKET_NAME
    • TWITTER_API_KEY
    • TWITTER_API_KEY_SECRET
    • TWITTER_ACCESS_TOKEN
    • TWITTER_ACCESS_TOKEN_SECRET

    must be stored in a .env file located one level up from the root directory of this project, i.e. one level up from the directory containing this README.md file.
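
    For reference, a minimal sketch of loading these credentials from that .env file in Python, assuming the python-dotenv package (the project code may load them differently):

      import os

      from dotenv import load_dotenv  # assumes the python-dotenv package is installed

      # the .env file sits one level above the project root, per the note above
      load_dotenv("../.env")

      aws_credentials = {
          name: os.environ[name]
          for name in [
              "AWS_ACCESS_KEY_ID",
              "AWS_SECRET_ACCESS_KEY",
              "AWS_REGION",
              "AWS_S3_BUCKET_NAME",
          ]
      }
      twitter_credentials = {
          name: os.environ[name]
          for name in [
              "TWITTER_API_KEY",
              "TWITTER_API_KEY_SECRET",
              "TWITTER_ACCESS_TOKEN",
              "TWITTER_ACCESS_TOKEN_SECRET",
          ]
      }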

Usage

  1. Create AWS resources

    make -f Makefile-stream aws-create

    In this step, if the code in the See EC2 Public IP Address in Ansible Inventory section has not been manually executed, then edit

    inventories/production/host_vars/ec2host

    and replace ... in ansible_host: ... with the public IP address of the newly created EC2 instance, which can be found in the EC2 section of the AWS Console.

  2. Provision the EC2 host, excluding Python package installation

    make -f Makefile-stream provision-pre-python
  3. Install Python packages on the EC2 host

    make -f Makefile-stream provision-post-python
  4. Start the Twitter streaming script locally

    make -f Makefile-stream stream-local-start
  5. Stop the Twitter streaming script running locally

    make -f Makefile-stream stream-local-stop
  6. Start the Twitter streaming script on the EC2 instance

    make -f Makefile-stream stream-start
  7. Stop the Twitter streaming script running on the EC2 instance

    make -f Makefile-stream stream-stop
  8. (optional) Run the Twitter streaming script locally, saving to a local CSV file but not to S3

    make -f Makefile-stream stream-check

    Pre-Requisites

    • the eight environment variables listed above must be set manually before running this script, using
      export AWS_ACCESS_KEY_ID=...
      export AWS_SECRET_ACCESS_KEY=...
      export AWS_REGION=...
      export AWS_S3_BUCKET_NAME=...
      export TWITTER_API_KEY=...
      export TWITTER_API_KEY_SECRET=...
      export TWITTER_ACCESS_TOKEN=...
      export TWITTER_ACCESS_TOKEN_SECRET=...
  9. Combine data (tweets) by hour

    make combine-data combine-data-logs
    
  10. Filter hourly data (tweets) to remove unwanted / irrelevant tweets

    make filter-data filter-data-logs
    
  11. Process text of all filtered data (tweets)

    make process-data process-data-logs
    
  12. Split data

    make split-data split-data-logs
    
  13. Fine-tune pre-trained model

    make train train-logs
    
  14. Evaluate prediction probabilities on unseen data

    make inference inference-logs
    
  15. Destroy AWS resources

    make -f Makefile-stream aws-destroy
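
AWS resources are created and destroyed with the boto3 AWS Python SDK (see the Notes and Notebooks sections). As a rough, hedged sketch of the kind of teardown performed in this step, with placeholder resource identifiers rather than the project's actual ones:

    import boto3

    region = "us-east-1"  # placeholder; the project reads AWS_REGION from the .env file

    # terminate the EC2 instance used to host the streaming script (placeholder instance ID)
    ec2 = boto3.client("ec2", region_name=region)
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])

    # delete the Kinesis Firehose delivery stream (placeholder stream name)
    firehose = boto3.client("firehose", region_name=region)
    firehose.delete_delivery_stream(DeliveryStreamName="twitter-delivery-stream")

    # empty and delete the S3 bucket that received the streamed tweets (placeholder bucket name)
    s3 = boto3.resource("s3", region_name=region)
    bucket = s3.Bucket("my-jwst-tweets-bucket")
    bucket.objects.all().delete()
    bucket.delete()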
Notebooks

  1. 1_create_aws_resources.ipynb (view)
  2. 2_delete_aws_resources.ipynb (view)
    • uses boto3 to delete all AWS resources
  3. 3_combine_data.ipynb (view)
    • combines raw data (streamed tweets) by hour
    • since each hour's data files were small enough to be read into a single data object (DataFrame), in-memory tools were used to combine each hourly folder of streamed data
  4. 4_filter_data.ipynb (view)
    • filters hourly tweets to remove tweets unrelated to the JWST mission
    • filters out unwanted tweets based on a list of words that are not relevant to the subject of this project
  5. 5_process_data.ipynb (view)
    • processes text in all filtered tweets using PySpark string manipulation methods
  6. 6_split_data.ipynb (view)
    • divides processed data into training, validation and testing splits
  7. 7_train.ipynb (view)
    • fine-tunes a pre-trained transformers model to flag tweets that need and do not need support (see the sketch after this list)
    • the pre-trained model is trained using the training and validation splits
    • the fine-tuned model is then exported to disk and evaluated on the test split
      • model evaluation is performed using ML and business metrics
  8. 8_inference.ipynb (view)
    • trends in the fine-tuned model's prediction probabilities are examined so that they can be compared against the equivalent trends after re-training in production
    • this is necessary to ensure the model performs as expected in production when making inference predictions
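
As a rough, hedged sketch of the fine-tuning step referenced in 7_train.ipynb above, using an assumed checkpoint, column names, paths and hyperparameters rather than the project's actual choices:

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    checkpoint = "distilbert-base-uncased"  # assumed pre-trained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # label 1 = needs support (minority class), label 0 = does not need support
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # assumed CSV splits with 'text' and 'label' columns produced by 6_split_data.ipynb
    splits = load_dataset(
        "csv",
        data_files={
            "train": "data/processed/train.csv",
            "validation": "data/processed/validation.csv",
        },
    )
    tokenized = splits.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="model", num_train_epochs=2),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
    )
    trainer.train()
    trainer.save_model("model")  # export the fine-tuned model to disk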
Notes

  1. When running the script locally (step 8 from Usage above), there is no dedicated functionality to stop the twitter_s3.py script. It stops only when
    • Ctrl + C is pressed manually, or
    • the number of tweets specified in max_num_tweets_wanted on line 217 of twitter_s3.py has been streamed (a rough sketch of this counter-based stop follows this list)
  2. Running the notebooks to create and destroy AWS resources non-interactively has not been verified; it is not currently known whether this is possible.
  3. AWS resources are created and destroyed using the boto3 AWS Python SDK. The AWS EC2 instance that is used to host the Twitter streaming (Python) code is provisioned using Ansible playbooks.
  4. The AWS credentials must be associated with an IAM user group whose users have been granted programmatic access to AWS resources. To configure this for the IAM user group from the AWS Console, see the documentation here. For this project, this was done before creating any AWS resources using the AWS Python SDK.
  5. The Twitter credentials must be for a user account with elevated access to the Twitter Developer API.
  6. Data used for this project was collected between December 30, 2021 and January 10, 2022.
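
As a rough illustration of the counter-based stop described in note 1 (this sketch assumes tweepy's Stream class; the actual twitter_s3.py implementation may differ):

    import tweepy

    class CountingStream(tweepy.Stream):
        """Stream that disconnects after a fixed number of tweets has been received."""

        def __init__(self, *args, max_num_tweets_wanted=100, **kwargs):
            super().__init__(*args, **kwargs)
            self.max_num_tweets_wanted = max_num_tweets_wanted
            self.num_tweets_seen = 0

        def on_status(self, status):
            self.num_tweets_seen += 1
            # ... hand the tweet off (e.g. to Kinesis Firehose or a local CSV) ...
            if self.num_tweets_seen >= self.max_num_tweets_wanted:
                self.disconnect()  # stop streaming once the cap is reached

    # credentials come from the environment variables listed in the pre-requisites, e.g.
    # stream = CountingStream(api_key, api_key_secret, access_token, access_token_secret,
    #                         max_num_tweets_wanted=500)
    # stream.filter(track=["JWST", "James Webb Space Telescope"])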
Project Organization

├── LICENSE
├── .gitignore                          <- files and folders to be ignored by version control system
├── .pre-commit-config.yaml             <- configuration file for pre-commit hooks
├── .github
│   ├── workflows
│       └── main.yml                    <- configuration file for CI build on Github Actions
├── Makefile                            <- Makefile with commands like `make lint` or `make build`
├── Makefile-stream                     <- Makefile for streaming tweets
├── README.md                           <- The top-level README for developers using this project.
├── scoping.md                          <- Project scope.
├── ansible.cfg                         <- configuration file for Ansible
├── environment.yml                     <- configuration file to create environment to run project on Binder
├── manage_host.yml                     <- manage provisioning of EC2 host
├── read_data.py                        <- Python script to read streamed Twitter data that has been saved locally
├── streamer.py                         <- Wrapper script to control local or remote Twitter streaming
├── stream_twitter.yml                  <- stream Twitter data on EC2 instance
├── twitter_s3.py                       <- Python script to stream Twitter data locally or on EC2 instance
├── variables_run.yaml                  <- Ansible playbook variables
├── tox.ini                             <- tox file with settings for running tox; see https://tox.readthedocs.io/en/latest/
├── utils.sh                            <- shell convenience utilities when calling `make`
├── data
│   ├── raw                             <- The original, immutable data dump.
│   └── processed                       <- Intermediate (transformed) data and final, canonical data sets for modeling.
├── notebooks                           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                          the creator's initials, and a short `-` delimited description, e.g.
│                                          `1.0-jqp-initial-data-exploration`.
├── inventories
│   ├── production
│       ├── host_vars                   <- variables to inject into Ansible playbooks, per target host
│           └── ec2_host
│       └── hosts                       <- Ansible inventory
├── v1                                  <- previous version of project - topic modeling using big-data tools only
├── src                                 <- Source code for use in this project
│   └── __init__.py                     <- Makes src a Python module

Project based on the cookiecutter data science project template. #cookiecutterdatascience