This project implements a real-time event streaming pipeline for a music streaming service, inspired by Spotify Wrapped and Billboard charts. The pipeline is powered by Apache Airflow, Apache Kafka, dbt, Docker, GCP, Spark-Streaming, and Terraform.

The-Algorist/Beatlytica


BEAT~LYTICA

This project implements a real-time event streaming pipeline for a music streaming service, inspired by Spotify Wrapped and Billboard charts. The pipeline is powered by Apache Airflow, Apache Kafka, dbt, Docker, GCP, Spark-Streaming, and Terraform.

Introduction

BEAT~LYTICA builds a live dashboard for a fictional music streaming service similar to Spotify. The pipeline streams events generated by user interactions, such as listening to a song, navigating the website, and authenticating. The streaming data is processed in real time and written to a data lake every two minutes. An hourly batch job then reads this data, applies transformations, and builds the tables that the dashboard uses for analytics. The goal is to analyze metrics such as popular songs, active users, and user demographics.
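The two-stage flow described above (two-minute writes followed by an hourly aggregation) can be illustrated with a minimal, standard-library-only sketch. It is not the project's Spark or dbt code. The function names and the `ts` (epoch milliseconds) and `song` fields are assumptions made for illustration.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 120  # two-minute write interval from the description above


def window_start(ts_ms, size_s=WINDOW_SECONDS):
    """Return the start (epoch seconds) of the window that contains ts_ms."""
    seconds = ts_ms // 1000
    return seconds - (seconds % size_s)


def micro_batches(events):
    """Group events into two-minute batches keyed by window start.

    Each event is a dict with a hypothetical "ts" field in epoch milliseconds.
    """
    batches = defaultdict(list)
    for event in events:
        batches[window_start(event["ts"])].append(event)
    return dict(batches)


def hourly_top_songs(batches, hour_start_s, n=3):
    """Count plays per song across the batches that fall inside one hour."""
    counts = Counter()
    for start, batch in batches.items():
        if hour_start_s <= start < hour_start_s + 3600:
            # Skip events without a song (page views, logins, and so on).
            counts.update(e["song"] for e in batch if e.get("song"))
    return counts.most_common(n)
```

In this sketch, `micro_batches` stands in for the streaming job that lands data in the lake, and `hourly_top_songs` stands in for the hourly batch transformation that feeds a "popular songs" table.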

Dataset

The project uses Eventsim, a program that generates event data to replicate page requests for a fake music website. The generated data mimics real user data but is entirely synthetic. The Docker image for Eventsim is borrowed from [viirya's fork], as the original project has not been maintained for several years.

Eventsim uses song data from the Million Song Dataset to generate events.
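As a rough illustration of how generated events might be sorted into the three interaction types mentioned in the introduction (listening, navigating, authenticating), here is a small sketch. The JSON field names (`page`, `song`) and the page values `Login` and `Logout` are assumptions for illustration and may not match Eventsim's actual output. `classify` is a hypothetical helper, not part of Eventsim or this project.

```python
import json

# Hypothetical page names treated as authentication events.
AUTH_PAGES = {"Login", "Logout"}


def classify(raw_line):
    """Classify one JSON-encoded event as "listen", "auth", or "page_view"."""
    event = json.loads(raw_line)
    if event.get("song"):
        # Events that carry a song are treated as listen events.
        return "listen"
    if event.get("page") in AUTH_PAGES:
        return "auth"
    # Everything else is treated as plain site navigation.
    return "page_view"
```

A classifier like this could decide which stream each event belongs to before it is processed further.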

Requirements

  • Apache Airflow
  • Apache Kafka
  • dbt
  • Docker
  • Google Cloud Platform (GCP)
  • Spark-Streaming
  • Terraform

Installation

Setup

NOTE: Google Cloud Platform charges for infrastructure based on usage. If you are new to GCP, you can create a free account and receive $300 in credit to run this project and others.

Pre-requisites

If you already have a Google Cloud account and a working Terraform setup, you can skip the prerequisite steps.

Get Going!

  • Provision infrastructure on GCP with Terraform - Setup

  • (Extra) SSH into your VMs and forward ports - Setup

  • Set up the Kafka compute instance and start sending messages from Eventsim - Setup

  • Set up the Spark cluster for stream processing - Setup

  • Set up Airflow on a compute instance to trigger the hourly data pipeline - Setup

Usage

Debug

If you run into issues, check the debug guide for a possible fix.

Contributing

Guidelines for contributing to the project, such as submitting issues, creating pull requests, or any other relevant information, will be provided here.
