Skip to content

This repository is a working ETL framework which utilizes user data from Spotify API using ➲Python for Extraction and Transformation ➲SQL for Data Loading and Staging ➲Airflow for Data Orchestration and Monitoring ➲PowerBI for Reporting

VishanthSurresh/Spotify-Capstone-Project---Data-Engineering

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Spotify Capstone Project - Data Engineering

Introduction

Almost 10 years back — that the then-nascent term “data engineering” started to pop up in modern highly data-driven scaleups and fast-growing tech companies such as Facebook, Netflix, LinkedIn and Airbnb. As these companies harnessed massive amounts of real-time data that could provide high business value, software engineers at these companies had to develop tools, platforms and frameworks to manage all this data with speed, scalability and reliability. From this, the data engineer job started evolving to a role that transitioned away from using traditional ETL tooling to developing their own tooling to manage the increasing data volumes.

As Data Engineers we built an ETL out of Spotify API using Apache Airflow and performed analysis using PowerBI. The analysis mainly focused on user activity. We developed a python script that extracts data from user spotify account by using Spotipy API. Once the data is extracted we stored the data in SQL Server Management Studio (SSMS). We followed Snowflake Schema method to store the data. Then we connected the database engine to PowerBI, and developed dashboard in PowerBI for Analysis. At Last we performed scheduling using Apache Airflow. Our Project is a complete On-Premise solution.

Overall Architecture

overallArchitecture

Directory Structure and Description

image

About Spotify API

We extracted data out of the Spotify API using this endpoint to get the 25 most recently played tracks. The result of calling this endpoint is a dictionary which we will then take and create multiple dataframes after cleaning it up a bit first.

Load Python and SQL Server Management Studio (SSMS)

We can connect to SSMS directly from Python using SQLalchemy library. Once the necessary libraries are installed we can connect to SSMS using Windows Authentication or using username and password. We used Windows authentication to connect SSMS, while connecting to SSMS we have to mention the user name and Database name. After a successful connection one can load the data directly from python to SSMS.

We followed Snowflake Schema approach to store the data in SSMS.

Database_Schema

PowerBI for Data Visualization

Initially we started the reporting phase with Tableau and we were able to generate worksheets with the data extracted. Then we switched to PowerBI as we were not able to display images dynamically from the ImageURL provided by Spotify. PowerBI was able to display the images once the data was selected. In PowerBI we used MS SQL connector to connect to the server and authenticate to the database. We created a simple dashboard which shows user details and information related to the users music taste.

spotifydashboard

Scheduling and Automation using Apache Airflow

We used Apache Airflow for Scheduling and Automation. Airflow is a workflow management tool for scheduling data engineering pipelines. We developed the scripts and created DAG file to schedule the developed ETL scripts to run automatically. We installed Airflow in the Windows Subsystem for Linux(Ubuntu) so that we can orchestrate and monitor the jobs.

Commands used in WSL:

pip3 install apache-airflow

Save the ETL python files and .py in $AIRFLOW_HOME/dags (/root/airflow/dags/).

airflow db init

airflow dags list

airflow tasks list spotify_dag

airflow scheduler

airflow webserver

Airflow DAG Run History

Logs

Concepts Implemented

  1. Database Normalization
  2. SQLalchemy Connection
  3. Apache Airflow Scheduling
  4. Object Oriented Programming
  5. Data Modelling
  6. Data Visualization
  7. Orchestration and Scheduling

Tools Used

  1. Pycharm
  2. PowerBI
  3. Apache Airflow
  4. SQL Server Management Studio
  • Programming Language - Python
  • Database Query Language - T-SQL
  • Libraries Used - Pandas, SQLalchemy, urllib, Spotipy, Airflow, datetime
  • Environment - Windows, Ubuntu

Limitations

  1. Spotify won’t allow to pull data for multiple users because of their security reasons. They allow users to pull data for single user account.
  2. We started with Tableau then we used PowerBI because Tableau was not able to display Artist images directly from web URLs.

Future work

  1. The report will aim to display more insights about the user's taste for music.
  2. Machine Learning model will be applied on the data extracted to recommend tracks to the user based on the past history.

References

Airflow references

  1. https://towardsdatascience.com/job-scheduling-with-apache-airflow-2-0-in-10-minutes-16d19f548a46
  2. https://www.kirenz.com/post/2022-05-28-apache-airflow-installation-tutorial/

Spotify API

  1. https://spotipy.readthedocs.io/en/2.22.0/#
  2. https://developer.spotify.com/documentation/general/guides/authorization/scopes/

Other references

  1. https://github.com/culpgrant/Spotify_ETL

About

This repository is a working ETL framework which utilizes user data from Spotify API using ➲Python for Extraction and Transformation ➲SQL for Data Loading and Staging ➲Airflow for Data Orchestration and Monitoring ➲PowerBI for Reporting

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages