batch_gh_archive

Data Engineering Project with Terraform, Spark, AWS, Docker, Airflow and other tools

Problem statement

The Research and Development Team of GitHub wants to know at what time of day they get the most traffic and which resources are not popular enough. They will send these details to the Marketing Team.

  • You have been hired to give insights on GitHub developer activity for June 2022.
  • Here are some visualizations you need to produce (a rough sketch of the underlying aggregations follows this list):
    • Traffic per hour
    • Event popularity chart
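
For orientation, here is a minimal, hypothetical PySpark sketch of those two aggregations. The bucket path is a placeholder and only the type and created_at fields of the GH Archive event schema are assumed; the actual job in this repo may differ.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("gh-archive-insights").getOrCreate()

    # Raw GH Archive events, one JSON object per line (placeholder path)
    events = spark.read.json("s3://<your-bucket>/gh_archive/2022-06-*.json.gz")

    # Traffic per hour: number of events for each hour of the day
    traffic_per_hour = (
        events
        .withColumn("hour", F.hour(F.to_timestamp("created_at")))
        .groupBy("hour")
        .count()
        .orderBy("hour")
    )

    # Event popularity: number of events per event type (PushEvent, ForkEvent, ...)
    event_popularity = events.groupBy("type").count().orderBy(F.desc("count"))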

About the Dataset

GitHub Archive is a project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis.
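
Each hour of the timeline is published as a single gzipped, newline-delimited JSON file that can be downloaded over HTTP. As a quick way to peek at one hour of data (the URL pattern follows the gharchive.org documentation; the date is just an example):

    import gzip, json, urllib.request

    # One hour of events: the file name pattern is YYYY-MM-DD-H.json.gz
    url = "https://data.gharchive.org/2022-06-01-0.json.gz"
    urllib.request.urlretrieve(url, "2022-06-01-0.json.gz")

    with gzip.open("2022-06-01-0.json.gz", "rt", encoding="utf-8") as f:
        first_event = json.loads(f.readline())
    print(first_event["type"], first_event["created_at"])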

Architecture

  • Create a dashboard

Data Pipeline

The pipeline could be stream or batch: this is the first thing you'll need to decide.

  • If you want to run things periodically (e.g. hourly/daily), go with batch.

Technologies / Tools

  • AWS (S3, EMR, Redshift, EC2)
  • Terraform
  • Apache Spark
  • Apache Airflow
  • Docker and docker-compose

About the Project

  • GitHub Archive data is ingested daily into the AWS S3 buckets, starting from the 1st of May.
  • A Spark job is run on the data stored in the S3 bucket using AWS Elastic MapReduce (EMR).
  • The results are written to a table defined in Redshift (see the sketch after this list).
  • A dashboard is created from the Redshift tables.
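
As a rough sketch of the hand-off between the Spark job and Redshift (bucket, table, and IAM role names are placeholders; the job in this repo may use a different mechanism):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("gh-archive-emr-job").getOrCreate()
    events = spark.read.json("s3://<your-bucket>/gh_archive/*.json.gz")

    # Aggregate, then persist the result to S3 as Parquet
    traffic_per_hour = (
        events.withColumn("hour", F.hour(F.to_timestamp("created_at")))
        .groupBy("hour").count().orderBy("hour")
    )
    traffic_per_hour.write.mode("overwrite").parquet(
        "s3://<your-bucket>/output/traffic_per_hour/"
    )

    # Then load it into Redshift (e.g. from the query editor or psql):
    #   COPY traffic_per_hour
    #   FROM 's3://<your-bucket>/output/traffic_per_hour/'
    #   IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-s3-role>'
    #   FORMAT AS PARQUET;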

Dashboard

(dashboard screenshot)

Reproducibility

Prerequisites

AWS Platform Account

  1. Create an AWS account if you do not have one. AWS offers a free tier for some services, such as S3 and Redshift.

Create an IAM user (optional but advised)

  1. Open the IAM console here
  2. In the navigation pane, choose Users and then choose Add users. More information here
  3. Select Programmatic access. For Console password, create a custom password.
  4. On the Set permissions page, attach the AdministratorAccess policy.
  5. Download the credentials.csv file with the login information and store it at ${HOME}/.aws/credentials.csv

Pre-Infrastructure Setup

Terraform is used to set up most of the services used for this project, i.e. the S3 buckets and the Redshift cluster. This section contains the steps to set up these parts of the project.

Setting up a Virtual Machine

You can use a virtual machine from any provider (Azure, GCP, etc.), but an AWS EC2 instance is preferable because of faster upload and download speeds to other AWS services. To set up an AWS EC2 VM that works for this project, you will need to pay for it. Here is a link to help. Ubuntu is the preferred OS.

AWS CLI

To download and set up the AWS CLI, run:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install

AWS credentials

  1. To configure AWS credentials, run
$ aws configure
AWS Access Key ID [None]: fill with the value from credentials.csv
AWS Secret Access Key [None]: fill with the value from credentials.csv
Default region name [None]: your region
Default output format [None]: json
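An optional sanity check (a sketch, assuming boto3 is installed with pip install boto3): if the credentials were picked up correctly, this prints your account ID and IAM identity.

    import boto3

    # Prints the AWS account and IAM identity the configured credentials belong to
    identity = boto3.client("sts").get_caller_identity()
    print(identity["Account"], identity["Arn"])
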
Docker
  1. Connect to your VM
  2. Install Docker
    sudo apt-get update
    sudo apt-get install docker.io
  3. Docker needs to be configured so that it can run without sudo
    sudo groupadd docker
    sudo gpasswd -a $USER docker
    sudo service docker restart
    • Logout of your SSH session and log back in
    • Test that docker works successfully by running docker run hello-world
Docker-Compose
  1. Check and copy the latest release for Linux from the official Github repository
  2. Create a folder called bin/ in the home directory. Navigate into the bin/ directory and download the binary file there
    wget <copied-file> -O docker-compose
  3. Make the file executable
    chmod +x docker-compose
  4. Add the bin/ directory to PATH permanently
    • Open the .bashrc file in the HOME directory
    nano .bashrc
    • Go to the end of the file and paste this there
    export PATH="${HOME}/bin:${PATH}"
    • Save the file (CTRL-O) and exit nano (CTRL-X)
    • Reload the PATH variable
    source .bashrc
  5. You should be able to run docker-compose from anywhere now. Test this with docker-compose --version
Terraform
  1. Navigate to the bin/ directory that you created and run this
    wget https://releases.hashicorp.com/terraform/1.1.7/terraform_1.1.7_linux_amd64.zip
  2. Unzip the file
    unzip terraform_1.1.7_linux_amd64.zip

    You might have to install unzip first: sudo apt-get install unzip

  3. Remove the zip file
    rm terraform_1.1.7_linux_amd64.zip
  4. Terraform is now installed. Test it with terraform -v
Remote-SSH

To work with folders on a remote machine in Visual Studio Code, you need this extension. This extension also simplifies the forwarding of ports.

  1. Install the Remote-SSH extension from the Extensions Marketplace
  2. At the bottom left-hand corner, click the Open a Remote Window icon
  3. Click Connect to Host, then click the name of the host from your SSH config file.
  4. In the Explorer tab, open any folder on your virtual machine. Now you can use VS Code entirely to run this project.

back to index

Main

Clone the repository

    git clone https://github.com/Nerdward/batch_gh_archive

Create remaining infrastructure with Terraform

We use Terraform to create an S3 bucket and a Redshift cluster.

  1. Navigate to the terraform folder

    Change the names of the variables in the variables file to suit your project.

    Set the username and password for your Redshift cluster using environment variables:

    # Set secrets via environment variables
    export TF_VAR_username=(the username)
    export TF_VAR_password=(the password)
    
  2. Initialise terraform

    terraform init
  3. Check infrastructure plan

    terraform plan
  4. Create new infrastructure

    terraform apply
  5. Confirm that the infrastructure has been created on the AWS console.
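
    Besides the console, you can also confirm the resources from the VM with a short boto3 check (a sketch; substitute the identifiers from your Terraform variables):

    import boto3

    # List S3 buckets in the account; the Terraform-created bucket should appear
    s3 = boto3.client("s3")
    print([b["Name"] for b in s3.list_buckets()["Buckets"]])

    # Check the status of the Redshift cluster created by Terraform
    redshift = boto3.client("redshift")
    cluster = redshift.describe_clusters(
        ClusterIdentifier="<your-cluster-identifier>"
    )["Clusters"][0]
    print(cluster["ClusterStatus"])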

Initialise Airflow

Airflow is run in a Docker container. This section contains the steps for initialising the Airflow resources.

  1. Navigate to the airflow folder

  2. Create a logs folder airflow/logs/

    mkdir logs/
  3. Build the docker image

    docker-compose build
  4. The names of some project resources are hardcoded in the docker-compose.yaml file. Change these values to suit your use case.

  5. Initialise Airflow resources

    docker-compose up airflow-init
  6. Kick up all other services

    docker-compose up
  7. Open another terminal instance and check docker running services

    docker ps
    • Check if all the services are healthy
  8. Forward port 8080 from VS Code. Open localhost:8080 in your browser and sign in to Airflow

    Both the username and the password are airflow

Run the pipeline

You are already signed in to Airflow. Now it's time to run the pipeline.
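
To give a rough idea of the shape of such a pipeline DAG, here is a heavily simplified, hypothetical sketch: the names, dates, and ingest logic are illustrative only, and the actual DAG you will trigger is the one shipped in this repo's airflow folder, which also spins up EMR and submits the Spark job.

    from datetime import datetime
    import urllib.request

    import boto3
    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def ingest_day_to_s3(ds, **_):
        """Download the 24 hourly GH Archive files for the execution date and push them to S3."""
        s3 = boto3.client("s3")
        for hour in range(24):
            name = f"{ds}-{hour}.json.gz"
            urllib.request.urlretrieve(f"https://data.gharchive.org/{name}", name)
            s3.upload_file(name, "<your-bucket>", f"gh_archive/{name}")


    with DAG(
        dag_id="Batch_Github_Archives",
        start_date=datetime(2022, 5, 1),
        schedule_interval="@daily",
        catchup=True,
    ) as dag:
        ingest = PythonOperator(task_id="ingest_to_s3", python_callable=ingest_day_to_s3)
        # ...followed in the real DAG by tasks that create the EMR cluster,
        # run the Spark job, and terminate the cluster.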

  1. Click on the DAG Batch_Github_Archives that you see there

  2. You should see a tree-like structure of the DAG you're about to run

  3. You can also check the graph structure of the DAG

  4. At the top right-hand corner, trigger the DAG. Make sure Auto-refresh is turned on before doing this

    The DAG will run from May 1 at 12:00 a.m. UTC until May 7.
    This should take a while.

  5. While this is going on, check the AWS console to confirm that everything is working as expected

    The EMR clusters should be starting up.

    If you face any problem or error, confirm that you have followed all the above instructions religiously. If the problem persists, raise an issue.

  6. When the pipeline is finished and you've confirmed that everything went well, shut down docker-compose with CTRL-C and kill all containers with docker-compose down

  7. Take a well-deserved break to rest. This has been a long ride.

back to index

Going the extra mile

Here are a few things that could still be done:

  • Add tests
  • Use make
  • Add CI/CD pipeline

Some links to refer to:
