Spotify Capstone Project - Data Engineering

Almost 10 years back — that the then-nascent term “data engineering” started to pop up in modern highly data-driven scaleups and fast-growing tech companies such as Facebook, Netflix, LinkedIn and Airbnb. As these companies harnessed massive amounts of real-time data that could provide high business value, software engineers at these companies had to develop tools, platforms and frameworks to manage all this data with speed, scalability and reliability. From this, the data engineer job started evolving to a role that transitioned away from using traditional ETL tooling to developing their own tooling to manage the increasing data volumes.

As Data Engineers we built an ETL out of Spotify API using Apache Airflow and performed analysis using PowerBI. The analysis mainly focused on user activity. We developed a python script that extracts data from user spotify account by using Spotipy API. Once the data is extracted we stored the data in SQL Server Management Studio (SSMS). We followed Snowflake Schema method to store the data. Then we connected the database engine to PowerBI, and developed dashboard in PowerBI for Analysis. At Last we performed scheduling using Apache Airflow. Our Project is a complete On-Premise solution.

Overall Architecture

Directory Structure and Description

About Spotify API

We extracted data out of the Spotify API using this endpoint to get the 25 most recently played tracks. The result of calling this endpoint is a dictionary which we will then take and create multiple dataframes after cleaning it up a bit first.

Load Python and SQL Server Management Studio (SSMS)

We can connect to SSMS directly from Python using SQLalchemy library. Once the necessary libraries are installed we can connect to SSMS using Windows Authentication or using username and password. We used Windows authentication to connect SSMS, while connecting to SSMS we have to mention the user name and Database name. After a successful connection one can load the data directly from python to SSMS.

We followed Snowflake Schema approach to store the data in SSMS.

PowerBI for Data Visualization

Initially we started the reporting phase with Tableau and we were able to generate worksheets with the data extracted. Then we switched to PowerBI as we were not able to display images dynamically from the ImageURL provided by Spotify. PowerBI was able to display the images once the data was selected. In PowerBI we used MS SQL connector to connect to the server and authenticate to the database. We created a simple dashboard which shows user details and information related to the users music taste.

Scheduling and Automation using Apache Airflow

We used Apache Airflow for Scheduling and Automation. Airflow is a workflow management tool for scheduling data engineering pipelines. We developed the scripts and created DAG file to schedule the developed ETL scripts to run automatically. We installed Airflow in the Windows Subsystem for Linux(Ubuntu) so that we can orchestrate and monitor the jobs.

Commands used in WSL:

pip3 install apache-airflow

Save the ETL python files and .py in $AIRFLOW_HOME/dags (/root/airflow/dags/).

airflow db init

airflow dags list

airflow tasks list spotify_dag

airflow scheduler

airflow webserver

Airflow DAG Run History

Concepts Implemented

Database Normalization
SQLalchemy Connection
Apache Airflow Scheduling
Object Oriented Programming
Data Modelling
Data Visualization
Orchestration and Scheduling

Tools Used

Pycharm
PowerBI
Apache Airflow
SQL Server Management Studio

Programming Language - Python
Database Query Language - T-SQL
Libraries Used - Pandas, SQLalchemy, urllib, Spotipy, Airflow, datetime
Environment - Windows, Ubuntu

Limitations

Spotify won’t allow to pull data for multiple users because of their security reasons. They allow users to pull data for single user account.
We started with Tableau then we used PowerBI because Tableau was not able to display Artist images directly from web URLs.

Future work

The report will aim to display more insights about the user's taste for music.
Machine Learning model will be applied on the data extracted to recommend tracks to the user based on the past history.

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
PowerBI Reports		PowerBI Reports
Python Files		Python Files
SQL_Server_Database		SQL_Server_Database
Screenshots		Screenshots
.DS_Store		.DS_Store
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PowerBI Reports

PowerBI Reports

Python Files

Python Files

SQL_Server_Database

SQL_Server_Database

Screenshots

Screenshots

.DS_Store

.DS_Store

README.md

README.md

Repository files navigation

Spotify Capstone Project - Data Engineering

Overall Architecture

Directory Structure and Description

About Spotify API

Load Python and SQL Server Management Studio (SSMS)

PowerBI for Data Visualization

Scheduling and Automation using Apache Airflow

Airflow DAG Run History

Concepts Implemented

Tools Used

Limitations

Future work

References

Airflow references

Spotify API

Other references

About

Releases

Packages

Contributors 2

Languages

VishanthSurresh/Spotify-Capstone-Project---Data-Engineering

Folders and files

Latest commit

History

Repository files navigation

Spotify Capstone Project - Data Engineering

Overall Architecture

Directory Structure and Description

About Spotify API

Load Python and SQL Server Management Studio (SSMS)

PowerBI for Data Visualization

Scheduling and Automation using Apache Airflow

Airflow DAG Run History

Concepts Implemented

Tools Used

Limitations

Future work

References

Airflow references

Spotify API

Other references

About

Topics

Resources

Stars

Watchers

Forks

Languages