Sparkify Songplay Data Extract Transform Load (ETL)

Introduction

This project is intended to create a postgres database for Sparkify platform song and artist data and combine that information with Sparkify songplay data to create a descriptive set of records of Sparkify platform song plays.

Installation

Python 3

pip install pyscopg2

Usage

From the command line, run these commands:

$ python create_tables.py

$ python etl.py

Database Design

The Sparkify database consists of only 5 separate tables linked by primary and foreign keys. Below are descriptions of the tables and schema.

Table: songplay

Contains data to show the songs listened to by Sparkify platform users. Also shows the time and location of when and where the user listened to the song.

Type	Column	Type
PK	songplay_id	int
FK	start_time	timestamp
FK	user_id	int
null	level	varchar
FK	song_id	varchar
FK	artist_id	varchar
null	session_id	varchar
null	location	varchar
null	user_agent	varchar

Table: user

Contains data on the Sparkify platform user.

Type	Column	Type
PK	user_id	int
null	first_name	varchar
null	last_name	varchar
null	gender	varchar
null	level	varchar

Table: song

Contains song data.

Type	Column	Type
PK	song_id	varchar
null	title	varchar
FK	artist_id	varchar
null	year	int
null	duration	decimal

Table: artist

Contains artist data.

Type	Column	Type
PK	artist_id	varchar
null	name	varchar
null	location	varchar
null	latitude	varchar
null	longitude	varchar

Table: time

Contains timestamp data from song listens as well as other time related.

Type	Column	Type
PK	start_time	timestamp
null	hour	int
null	day	int
null	week	int
null	month	int
null	year	int
null	weekday	int

ETL Process

song_data => insert song data into song table song_data => insert artist data into artist data

log_data => extract timestamp and time information from songplay, insert into time table => find artist_id and song_id from artist and song tables => insert artist, song, and songplay data into songplay table

Files

data (folder)

log_data
- json records of Sparkify song plays
song_data
- json records of Sparkify song and artist information

create_tables.py

Python module that drops tables if they exist and creates them in the database using queries from sql_queries.

etl.py

Python module that moves songplay, song and artist data from the song_data and log_data folders and inserts data into the songplay table.

etl.ipynb

Notebook for developing etl.py module.

sql_queries.py

Python module that contains all SQL queries for table creation and inserts.

test.ipynb

Notebook for testing etl.py module.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README.md		README.md
create_tables.py		create_tables.py
data.zip		data.zip
etl.ipynb		etl.ipynb
etl.py		etl.py
sql_queries.py		sql_queries.py
test.ipynb		test.ipynb
weston-mackinnon-3pCRW_JRKM8-unsplash.jpg		weston-mackinnon-3pCRW_JRKM8-unsplash.jpg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

create_tables.py

create_tables.py

data.zip

data.zip

etl.ipynb

etl.ipynb

etl.py

etl.py

sql_queries.py

sql_queries.py

test.ipynb

test.ipynb

weston-mackinnon-3pCRW_JRKM8-unsplash.jpg

weston-mackinnon-3pCRW_JRKM8-unsplash.jpg

Repository files navigation

Sparkify Songplay Data Extract Transform Load (ETL)

Introduction

Table of Contents

Installation

Usage

Database Design

ETL Process

Files

About

Releases

Packages

Languages

caseywhorton/Sparkify-Data-Model-ETL

Folders and files

Latest commit

History

Repository files navigation

Sparkify Songplay Data Extract Transform Load (ETL)

Introduction

Table of Contents

Installation

Usage

Database Design

ETL Process

Files

About

Topics

Resources

Stars

Watchers

Forks

Languages