An end-to-end data pipeline that ingests simulated music stream data, structures, cleans and models the raw data, and visualizes clean data.

topefolorunso/musicaly-project

musicaly

An end-to-end data pipeline that ingests simulated music stream data, structures, cleans and models the raw data, and performs analytics on the clean data.

background

Eventsim is a top music streaming company. Its management is working on a new feature tailored to the preferences of its users. To aid the development of this feature, the developers need to understand the streaming habits of users. Hence, they came up with the following use cases and questions:

  1. What is the total number of active users, their total stream hours and their geographic distribution?
  2. What is the overall gender composition of users, and how is it reflected among listeners of the top artists?
  3. What are the top songs and who are the top artists that users listen to?
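As a rough illustration of what answering these questions involves, here is a minimal sketch in plain Python over a handful of made-up events. The field names (`userId`, `gender`, `state`, `artist`, `song`, `duration` in seconds) and the `summarize` helper are assumptions for illustration only; in this project the actual answers come from the dbt models in BigQuery, not from code like this.

```python
from collections import Counter, defaultdict

# Hypothetical cleaned listen events; the field names are assumptions.
events = [
    {"userId": 1, "gender": "F", "state": "CA", "artist": "Artist A", "song": "Song 1", "duration": 1800.0},
    {"userId": 1, "gender": "F", "state": "CA", "artist": "Artist B", "song": "Song 2", "duration": 1800.0},
    {"userId": 2, "gender": "M", "state": "NY", "artist": "Artist A", "song": "Song 1", "duration": 3600.0},
    {"userId": 3, "gender": "F", "state": "CA", "artist": "Artist A", "song": "Song 3", "duration": 900.0},
]


def summarize(events, top_n=1):
    """Compute the three groups of metrics from a list of event dicts."""
    # One entry per distinct user, so users are not counted once per play.
    users = {e["userId"]: (e["gender"], e["state"]) for e in events}

    artist_plays = Counter(e["artist"] for e in events)
    song_plays = Counter(e["song"] for e in events)

    # Plays per artist, split by the listener's gender.
    gender_by_artist = defaultdict(Counter)
    for e in events:
        gender_by_artist[e["artist"]][e["gender"]] += 1

    top_artists = artist_plays.most_common(top_n)
    return {
        # Question 1: active users, stream hours, geographic distribution
        "active_users": len(users),
        "stream_hours": sum(e["duration"] for e in events) / 3600,
        "users_by_state": dict(Counter(state for _, state in users.values())),
        # Question 2: gender composition, overall and for the top artists
        "users_by_gender": dict(Counter(gender for gender, _ in users.values())),
        "top_artist_gender_mix": {a: dict(gender_by_artist[a]) for a, _ in top_artists},
        # Question 3: top songs and top artists by play count
        "top_artists": top_artists,
        "top_songs": song_plays.most_common(top_n),
    }


summary = summarize(events)
print(summary)
```

Counting distinct `userId` values rather than rows is the design choice that matters here: a user with many plays still counts as one active user.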

data flow

  • The Eventsim API produces the streaming data, which is consumed by Kafka.
  • Stream data is read from Kafka with Spark Streaming.
  • Spark Streaming structures the data and writes it to the data lake (Cloud Storage) as flat files.
  • Data is loaded from the data lake (Cloud Storage) into the data warehouse (BigQuery) and transformed there with dbt (ELT), orchestrated with Airflow.
  • Stream analytics are built and published with Google Data Studio.
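The "structure and write as flat file" step can be pictured with a small standard-library sketch. This is a stand-in for the Spark Streaming job, not the job itself: the column set, the CSV format, the `listen_events/date=.../hour=...` partition layout, and the assumption that `ts` is epoch milliseconds are all illustrative guesses rather than details taken from this repo.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Assumed column set; the real schema is whatever the Spark job defines.
COLUMNS = ["ts", "userId", "artist", "song", "duration"]


def structure_event(raw: str) -> dict:
    """Parse one JSON message and coerce each field to a fixed type."""
    record = json.loads(raw)
    return {
        "ts": int(record["ts"]),
        "userId": int(record["userId"]),
        "artist": str(record.get("artist", "")),
        "song": str(record.get("song", "")),
        "duration": float(record.get("duration", 0.0)),
    }


def partition_path(event: dict) -> str:
    """Derive a date/hour partition from the timestamp (assumed epoch ms)."""
    when = datetime.fromtimestamp(event["ts"] / 1000, tz=timezone.utc)
    return when.strftime("listen_events/date=%Y-%m-%d/hour=%H")


def to_flat_file(events) -> str:
    """Serialize structured events as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(events)
    return buffer.getvalue()


# A made-up message, shaped like something a Kafka consumer might receive.
raw_messages = [
    '{"ts": 1640995200000, "userId": "7", "artist": "Artist A", '
    '"song": "Song 1", "duration": "215.5"}',
]

structured = [structure_event(m) for m in raw_messages]
print(partition_path(structured[0]))
print(to_flat_file(structured))
```

Partitioning by date and hour is a common layout for a data lake because downstream batch loads can then pick up one hour of files at a time.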

cloud architecture

data source

Eventsim is a program that generates event data to replicate page requests for a fake music website. The results look like real usage data, but are totally fake. The Docker image is borrowed from viirya's fork of Eventsim, as the original project has gone without maintenance for a few years now.

Eventsim uses song data from the Million Song Dataset to generate events. I have used a subset of 10,000 songs.

dashboard

Click here to view latest version on Data Studio

how to set up

⚠️ Note that GCP resources (which incur cost) are provisioned in this project

⚠️ This setup also assumes you are using a Linux or Bash environment

  1. clone this repo to the ~/musicaly-project directory

    git clone https://github.com/topefolorunso/musicaly-project.git ~/musicaly-project && \
    cd ~/musicaly-project
  2. set up a GCP account

  3. provision infrastructure

  4. SSH into the VMs and set them up

  5. proceed to run

how to run

  1. start up the Kafka service and start streaming here
  2. start up the Spark Streaming service here
  3. start up the Airflow service here
  4. connect BigQuery to Data Studio for analytics
