
Million Song Dataset JSON -> AWS RedShift

Load data from the Million Song Dataset into a final dimensional model in RedShift utilizing Apache Airflow.

Introduction

A fictional music streaming startup, Sparkify, has grown its user base and song database and wants to move its processes and data onto the cloud. Its data resides in S3: a directory of JSON logs of user activity on the app, and a directory of JSON metadata on the songs in the app. The files in this repository load that data into a final dimensional model in AWS RedShift.

Apache Airflow Operators

Data is extracted from S3 buckets and loaded into staging tables; Apache Airflow coordinates the data transfer and transformations. Four custom operators are created (one is sketched after the list):

  • StageToRedshiftOperator
    • Stages data from both the event and song source JSON files
  • LoadFactOperator
    • Loads data into the songplays fact table
  • LoadDimensionOperator
    • Loads data into all dimension tables
    • Optionally truncates a dimension table before loading
  • DataQualityOperator
    • Generic operator that runs any SQL data quality check passed to it
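As a rough illustration, here is a minimal sketch of what a generic data quality operator like the one above could look like. The constructor arguments (`redshift_conn_id`, `checks`) and the check format are illustrative assumptions, not the repository's exact API:

```python
# Hedged sketch of a generic data-quality operator; argument names and the
# check format are assumptions, not this repository's exact signature.
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class DataQualityOperator(BaseOperator):
    """Runs arbitrary SQL checks and fails the task if any expectation is unmet."""

    def __init__(self, redshift_conn_id="redshift", checks=None, **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        # Each check pairs a SQL statement with the result it should return, e.g.
        # {"sql": "SELECT COUNT(*) FROM users WHERE userid IS NULL", "expected": 0}
        self.checks = checks or []

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        for check in self.checks:
            records = redshift.get_records(check["sql"])
            if not records or not records[0]:
                raise ValueError(f"Check returned no results: {check['sql']}")
            actual = records[0][0]
            if actual != check["expected"]:
                raise ValueError(
                    f"Data quality check failed: {check['sql']} "
                    f"returned {actual}, expected {check['expected']}"
                )
            self.log.info("Data quality check passed: %s", check["sql"])
```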

Apache Airflow DAG

An Apache Airflow DAG was created utilizing the above operators.

(Figure: Sparkify Airflow DAG)
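For orientation, the sketch below shows one plausible wiring of such a DAG. Task names and the dependency layout are assumptions based on the operator list above (EmptyOperator placeholders stand in for the custom operators so the sketch runs on a stock Airflow install):

```python
# Hedged topology sketch; task names and dependencies are assumptions,
# and EmptyOperator placeholders stand in for the custom operators.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    start = EmptyOperator(task_id="begin_execution")
    stage_events = EmptyOperator(task_id="stage_events")
    stage_songs = EmptyOperator(task_id="stage_songs")
    load_songplays = EmptyOperator(task_id="load_songplays_fact_table")
    load_users = EmptyOperator(task_id="load_user_dim_table")
    load_songs = EmptyOperator(task_id="load_song_dim_table")
    load_artists = EmptyOperator(task_id="load_artist_dim_table")
    load_time = EmptyOperator(task_id="load_time_dim_table")
    quality_checks = EmptyOperator(task_id="run_data_quality_checks")
    end = EmptyOperator(task_id="stop_execution")

    # Stage both sources in parallel, load the fact table, then fan out to
    # the dimension loads before the final quality checks.
    start >> [stage_events, stage_songs]
    [stage_events, stage_songs] >> load_songplays
    load_songplays >> [load_users, load_songs, load_artists, load_time]
    [load_users, load_songs, load_artists, load_time] >> quality_checks
    quality_checks >> end
```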

Installation

Clone this repository:

git clone https://github.com/rigganni/Million-Song-Apache-Airflow-AWS-RedShift

Set up the following variables in dwh.cfg:

| variable | description |
| --- | --- |
| DWH_CLUSTER_TYPE | multi-node |
| DWH_NUM_NODES | number of nodes for the cluster |
| DWH_NODE_TYPE | e.g. dc2.large |
| DWH_HOST | written to by create_redshift_cluster_database.py |
| DWH_IAM_ROLE_NAME | e.g. dwhRoleS3 |
| DWH_CLUSTER_IDENTIFIER | dwhCluster |
| DWH_DB | dev |
| DWH_DB_USER | aws |
| DWH_DB_PASSWORD | your password |
| DWH_PORT | port to use (e.g. 5439) |
| DWH_REGION | us-west-2, since the required S3 files are stored there |
| DWH_CLUSTER_SUBNET_NAME | set if the RedShift cluster must run in a specific VPC |
| DWH_ROLE_ARN | written to by create_redshift_cluster_database.py |
| DWH_CLUSTER_SUBNET_GROUP_NAME | set if the RedShift cluster must run in a specific VPC |
| KEY | AWS key |
| SECRET | AWS secret |
| ARN | written to by create_redshift_cluster_database.py |
| LOG_DATA | s3://udacity-dend/log_data |
| LOG_JSONPATH | s3://udacity-dend/log_json_path.json |
| SONG_DATA | s3://udacity-dend/song_data |
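A quick way to catch a misconfigured dwh.cfg before running anything is to read it with Python's configparser and check for missing values. This is a minimal sketch, not part of the repository; the section name "CLUSTER" is an assumption, so adjust it to match the actual layout of your dwh.cfg:

```python
# Hedged sketch: validates dwh.cfg before use. The section name "CLUSTER"
# is an assumption; adjust to match the real layout of the file.
import configparser

config = configparser.ConfigParser()
config.read("dwh.cfg")

required = [
    "DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_HOST",
    "DWH_IAM_ROLE_NAME", "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER",
    "DWH_DB_PASSWORD", "DWH_PORT", "DWH_REGION",
]

section = config["CLUSTER"]  # assumed section name
missing = [key for key in required if not section.get(key)]
if missing:
    raise SystemExit(f"dwh.cfg is missing values for: {', '.join(missing)}")
```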

Usage

Ensure the variables in dwh.cfg are set as described in Installation above.

Create RedShift IAM role, cluster, and database:

python create_redshift_cluster_database.py
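To make concrete what a script like this does, here is a hedged boto3 sketch of the core cluster-creation call. It is not the script's actual code: the parameter values mirror the dwh.cfg table above but are hard-coded here for brevity, and the real script also creates the IAM role and waits for the cluster to become available:

```python
# Hedged boto3 sketch of the kind of call create_redshift_cluster_database.py
# makes; values are illustrative and would normally be read from dwh.cfg.
import boto3

redshift = boto3.client(
    "redshift",
    region_name="us-west-2",
    aws_access_key_id="YOUR_KEY",          # KEY in dwh.cfg
    aws_secret_access_key="YOUR_SECRET",   # SECRET in dwh.cfg
)

response = redshift.create_cluster(
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dev",
    ClusterIdentifier="dwhCluster",
    MasterUsername="aws",
    MasterUserPassword="YOUR_PASSWORD",    # DWH_DB_PASSWORD in dwh.cfg
    # ARN of the IAM role granting S3 read access (placeholder account id)
    IamRoles=["arn:aws:iam::123456789012:role/dwhRoleS3"],
)
print(response["Cluster"]["ClusterStatus"])
```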

Deploy the DAG and operator files to the appropriate directories (typically dags/ and plugins/) in your Apache Airflow setup.
