Machine Learning to Identify Negative Sentiment Tweets about JWST Mission


  1. About
  2. Pre-Requisites
  3. Usage
  4. Notebooks
  5. Notes
  6. Project Organization

About

This project trains a binary ML classification model to classify sentiment in tweets regarding the James Webb Space Telescope (JWST) mission that were posted between December 30, 2021 and January 10, 2022. The model predicts whether tweets

  • need support
    • this is the minority class
    • corresponds to negative and neutral sentiment tweets
  • do not need support
    • this is the majority class
    • corresponds to positive sentiment tweets

from the mission support/communications team.

Tweets were streamed using AWS Kinesis Firehose and then

  • combined by date and hour
  • filtered to only capture tweets relating to the mission
  • text-processed using PySpark (see the sketch after this list)
  • divided into training, validation and testing splits
  • used to fine-tune a pre-trained transformers model to predict the above-mentioned binary outcome, i.e. whether a tweet needs support or not
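
The text-processing step uses PySpark string manipulation methods (see 5_process_data.ipynb below). As a rough sketch only, with assumed paths, column names and cleaning rules rather than the project's actual ones, it might look like:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("process-tweets").getOrCreate()

    # read the filtered tweets written by the previous step (path/column names assumed)
    tweets = spark.read.parquet("data/processed/filtered_tweets")

    processed = (
        tweets
        .withColumn("text", F.lower(F.col("text")))                         # lowercase
        .withColumn("text", F.regexp_replace("text", r"http\S+", ""))       # strip URLs
        .withColumn("text", F.regexp_replace("text", r"@\w+", ""))          # strip user mentions
        .withColumn("text", F.regexp_replace("text", r"[^a-z#\s]", " "))    # keep letters/hashtags
        .withColumn("text", F.trim(F.regexp_replace("text", r"\s+", " ")))  # collapse whitespace
    )
    processed.write.mode("overwrite").parquet("data/processed/processed_tweets")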

The value of using a ML-based approach to flag tweets needing support was estimated by calculating how much time would be

  • missed
  • wasted

if the fine-tuned transformer model was used to predict whether tweets in the test split needed support, compared to the corresponding predictions made using a naive alternative approach that did not use ML (i.e. randomly guessing whether tweets needed support). The ML-based approach was shown to deliver value by reducing both time missed and time wasted relative to the non-ML (naive, random guessing) approach.
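
As a loose illustration of how such time costs could be computed (the per-tweet time values and stand-in predictions below are invented for the example; the project's actual business-metric definitions live in the training notebook):

    import numpy as np
    from sklearn.metrics import confusion_matrix

    # hypothetical per-tweet time costs (minutes); the project's real values may differ
    MINUTES_MISSED_PER_FALSE_NEGATIVE = 5  # tweet needed support but was not flagged
    MINUTES_WASTED_PER_FALSE_POSITIVE = 2  # tweet was flagged but did not need support

    def support_time_cost(y_true, y_pred):
        """Estimate time missed and time wasted (minutes) for a set of predictions."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return {
            "time_missed": int(fn) * MINUTES_MISSED_PER_FALSE_NEGATIVE,
            "time_wasted": int(fp) * MINUTES_WASTED_PER_FALSE_POSITIVE,
        }

    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=1_000)                        # stand-in test-split labels
    y_ml = np.where(rng.random(1_000) < 0.05, 1 - y_true, y_true)  # stand-in model predictions
    y_naive = rng.integers(0, 2, size=1_000)                       # naive approach: random guessing

    print("ML-based approach:", support_time_cost(y_true, y_ml))
    print("Naive approach:   ", support_time_cost(y_true, y_naive))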

For full details about the background, motivation and implementation overview, please see the full project scope.

Pre-Requisites

  1. The following AWS (1, 2) and Twitter Developer API credentials

    • AWS_ACCESS_KEY_ID
    • AWS_SECRET_ACCESS_KEY
    • AWS_REGION
    • AWS_S3_BUCKET_NAME
    • TWITTER_API_KEY
    • TWITTER_API_KEY_SECRET
    • TWITTER_ACCESS_TOKEN
    • TWITTER_ACCESS_TOKEN_SECRET

    must be stored in a .env file located one level up from the root directory of this project, i.e. one level up from the directory containing this README.md file.
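
    For reference, a minimal sketch of loading these credentials from that .env file in Python, assuming the python-dotenv package (the project code may load them differently):

      import os

      from dotenv import load_dotenv  # assumes the python-dotenv package is installed

      # the .env file sits one level above the project root, per the note above
      load_dotenv("../.env")

      aws_credentials = {
          name: os.environ[name]
          for name in [
              "AWS_ACCESS_KEY_ID",
              "AWS_SECRET_ACCESS_KEY",
              "AWS_REGION",
              "AWS_S3_BUCKET_NAME",
          ]
      }
      twitter_credentials = {
          name: os.environ[name]
          for name in [
              "TWITTER_API_KEY",
              "TWITTER_API_KEY_SECRET",
              "TWITTER_ACCESS_TOKEN",
              "TWITTER_ACCESS_TOKEN_SECRET",
          ]
      }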

Usage

  1. Create AWS resources

    make -f Makefile-stream aws-create

    In this step, if the code in the See EC2 Public IP Address in Ansible Inventory section has not been manually executed, then edit

    inventories/production/host_vars/ec2host

    and replace ... in ansible_host: ... with the public IP address of the newly created EC2 instance, which can be found in the EC2 section of the AWS Console.

  2. Provision the EC2 host, excluding Python package installation

    make -f Makefile-stream provision-pre-python
  3. Install Python packages on the EC2 host

    make -f Makefile-stream provision-post-python
  4. Start the Twitter streaming script locally

    make -f Makefile-stream stream-local-start
  5. Stop the Twitter streaming script running locally

    make -f Makefile-stream stream-local-stop
  6. Start the Twitter streaming script on the EC2 instance

    make -f Makefile-stream stream-start
  7. Stop the Twitter streaming script running on the EC2 instance

    make -f Makefile-stream stream-stop
  8. (optional) Run the Twitter streaming script locally, saving to a local CSV file but not to S3

    make -f Makefile-stream stream-check

    Pre-Requisites

    • the eight environment variables listed above must be set manually before running this script, using
      export AWS_ACCESS_KEY_ID=...
      export AWS_SECRET_ACCESS_KEY=...
      export AWS_REGION=...
      export AWS_S3_BUCKET_NAME=...
      export TWITTER_API_KEY=...
      export TWITTER_API_KEY_SECRET=...
      export TWITTER_ACCESS_TOKEN=...
      export TWITTER_ACCESS_TOKEN_SECRET=...
  9. Combine data (tweets) by hour

    make combine-data combine-data-logs
    
  10. Filter hourly data (tweets) to remove unwanted / irrelevant tweets

    make filter-data filter-data-logs
    
  11. Process text of all filtered data (tweets)

    make process-data process-data-logs
    
  12. Split data

    make split-data split-data-logs
    
  13. Fine-tune pre-trained model

    make train train-logs
    
  14. Evaluate prediction probabilities on unseen data

    make inference inference-logs
    
  15. Destroy AWS resources

    make -f Makefile-stream aws-destroy
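
AWS resources are created and destroyed with the boto3 AWS Python SDK (see the Notes and Notebooks sections). As a rough, hedged sketch of the kind of teardown performed in this step, with placeholder resource identifiers rather than the project's actual ones:

    import boto3

    region = "us-east-1"  # placeholder; the project reads AWS_REGION from the .env file

    # terminate the EC2 instance used to host the streaming script (placeholder instance ID)
    ec2 = boto3.client("ec2", region_name=region)
    ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])

    # delete the Kinesis Firehose delivery stream (placeholder stream name)
    firehose = boto3.client("firehose", region_name=region)
    firehose.delete_delivery_stream(DeliveryStreamName="twitter-delivery-stream")

    # empty and delete the S3 bucket that received the streamed tweets (placeholder bucket name)
    s3 = boto3.resource("s3", region_name=region)
    bucket = s3.Bucket("my-jwst-tweets-bucket")
    bucket.objects.all().delete()
    bucket.delete()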
Notebooks

  1. 1_create_aws_resources.ipynb (view)
  2. 2_delete_aws_resources.ipynb (view)
    • uses boto3 to delete all AWS resources
  3. 3_combine_data.ipynb (view)
    • combines raw data (streamed tweets) by hour
    • since each hour's data files were small enough to be read into a single data object (DataFrame), in-memory tools were used to combine each hourly folder of streamed data
  4. 4_filter_data.ipynb (view)
    • filters hourly tweets to remove tweets unrelated to the JWST mission
    • filters out unwanted tweets based on a list of words that are not relevant to the subject of this project
  5. 5_process_data.ipynb (view)
    • processes text in all filtered tweets using PySpark string manipulation methods
  6. 6_split_data.ipynb (view)
    • divides processed data into training, validation and testing splits
  7. 7_train.ipynb (view)
    • fine-tunes a pre-trained transformers model to flag tweets that need and do not need support (see the sketch after this list)
    • the pre-trained model is trained using the training and validation splits
    • the fine-tuned model is then exported to disk and evaluated on the test split
      • model evaluation is performed using ML and business metrics
  8. 8_inference.ipynb (view)
    • trends in the fine-tuned model's prediction probabilities are examined so that they can be compared against the equivalent trends after re-training in production
    • this is necessary to ensure the model performs as expected in production when making inference predictions
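
As a rough, hedged sketch of the fine-tuning step referenced in 7_train.ipynb above, using an assumed checkpoint, column names, paths and hyperparameters rather than the project's actual choices:

    from datasets import load_dataset
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        Trainer,
        TrainingArguments,
    )

    checkpoint = "distilbert-base-uncased"  # assumed pre-trained checkpoint
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    # label 1 = needs support (minority class), label 0 = does not need support
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # assumed CSV splits with 'text' and 'label' columns produced by 6_split_data.ipynb
    splits = load_dataset(
        "csv",
        data_files={
            "train": "data/processed/train.csv",
            "validation": "data/processed/validation.csv",
        },
    )
    tokenized = splits.map(
        lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
        batched=True,
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="model", num_train_epochs=2),
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
    )
    trainer.train()
    trainer.save_model("model")  # export the fine-tuned model to disk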
Notes

  1. When running the script locally (step 8 from Usage above), there is no dedicated functionality to stop the twitter_s3.py script. It stops only when
    • Ctrl + C is pressed manually, or
    • the number of tweets specified in max_num_tweets_wanted on line 217 of twitter_s3.py has been streamed (a rough sketch of this counter-based stop follows this list)
  2. Running the notebooks to create and destroy AWS resources non-interactively has not been verified; it is not currently known whether this is possible.
  3. AWS resources are created and destroyed using the boto3 AWS Python SDK. The AWS EC2 instance that is used to host the Twitter streaming (Python) code is provisioned using Ansible playbooks.
  4. The AWS credentials must be associated with an IAM user group whose users have been granted programmatic access to AWS resources. To configure this for the IAM user group from the AWS Console, see the documentation here. For this project, this was done before creating any AWS resources using the AWS Python SDK.
  5. The Twitter credentials must be for a user account with elevated access to the Twitter Developer API.
  6. Data used for this project was collected between December 30, 2021 and January 10, 2022.
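
As a rough illustration of the counter-based stop described in note 1 (this sketch assumes tweepy's Stream class; the actual twitter_s3.py implementation may differ):

    import tweepy

    class CountingStream(tweepy.Stream):
        """Stream that disconnects after a fixed number of tweets has been received."""

        def __init__(self, *args, max_num_tweets_wanted=100, **kwargs):
            super().__init__(*args, **kwargs)
            self.max_num_tweets_wanted = max_num_tweets_wanted
            self.num_tweets_seen = 0

        def on_status(self, status):
            self.num_tweets_seen += 1
            # ... hand the tweet off (e.g. to Kinesis Firehose or a local CSV) ...
            if self.num_tweets_seen >= self.max_num_tweets_wanted:
                self.disconnect()  # stop streaming once the cap is reached

    # credentials come from the environment variables listed in the pre-requisites, e.g.
    # stream = CountingStream(api_key, api_key_secret, access_token, access_token_secret,
    #                         max_num_tweets_wanted=500)
    # stream.filter(track=["JWST", "James Webb Space Telescope"])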
Project Organization

├── LICENSE
├── .gitignore                          <- files and folders to be ignored by version control system
├── .pre-commit-config.yaml             <- configuration file for pre-commit hooks
├── .github
│   ├── workflows
│       └── main.yml                    <- configuration file for CI build on Github Actions
├── Makefile                            <- Makefile with commands like `make lint` or `make build`
├── Makefile-stream                     <- Makefile for streaming tweets
├── README.md                           <- The top-level README for developers using this project.
├── scoping.md                          <- Project scope.
├── ansible.cfg                         <- configuration file for Ansible
├── environment.yml                     <- configuration file to create environment to run project on Binder
├── manage_host.yml                     <- manage provisioning of EC2 host
├── read_data.py                        <- Python script to read streamed Twitter data that has been saved locally
├── streamer.py                         <- Wrapper script to control local or remote Twitter streaming
├── stream_twitter.yml                  <- stream Twitter data on EC2 instance
├── twitter_s3.py                       <- Python script to stream Twitter data locally or on EC2 instance
├── variables_run.yaml                  <- Ansible playbook variables
├── tox.ini                             <- tox file with settings for running tox; see https://tox.readthedocs.io/en/latest/
├── utils.sh                            <- shell convenience utilities when calling `make`
├── data
│   ├── raw                             <- The original, immutable data dump.
│   └── processed                       <- Intermediate (transformed) data and final, canonical data sets for modeling.
├── notebooks                           <- Jupyter notebooks. Naming convention is a number (for ordering),
│                                          the creator's initials, and a short `-` delimited description, e.g.
│                                          `1.0-jqp-initial-data-exploration`.
├── inventories
│   ├── production
│       ├── host_vars                   <- variables to inject into Ansible playbooks, per target host
│           └── ec2_host
│       └── hosts                       <- Ansible inventory
├── v1                                  <- previous version of project - topic modeling using big-data tools only
├── src                                 <- Source code for use in this project
│   └── __init__.py                     <- Makes src a Python module

Project based on the cookiecutter data science project template. #cookiecutterdatascience