Data Engineering Nanodegree

Projects and resources developed in the Data Engineering Nanodegree (DEND) from Udacity.

Project 1: Data Modeling with PostgreSQL

Project passed

Develop a relational database using PostgreSQL to model user activity data for a music streaming app. Skills include:

  • Created a relational database using PostgreSQL
  • Developed a Star Schema database using optimized definitions of Fact and Dimension tables, and normalized the tables.
  • Built out an ETL pipeline to optimize queries in order to understand what songs users listen to.

Technologies used: Python, PostgreSQL, Star Schema, ETL pipelines, Normalization
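
A minimal sketch of the fact-table load step, assuming psycopg2 and an illustrative `songplays` fact table; the table and column names are placeholders, not the project's exact schema.

```python
import psycopg2

# Hypothetical insert statement for a songplays fact table in the star schema;
# the real project schema may use different columns.
SONGPLAY_INSERT = """
    INSERT INTO songplays (start_time, user_id, song_id, artist_id, session_id)
    VALUES (%s, %s, %s, %s, %s);
"""

def load_songplay(conn, record):
    """Insert one user-activity record into the fact table."""
    with conn.cursor() as cur:
        cur.execute(SONGPLAY_INSERT, record)
    conn.commit()

if __name__ == "__main__":
    # Connection parameters are placeholders for a local development database.
    conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
    load_songplay(conn, ("2018-11-01 21:01:46", 8, "SOUPIRU12A6D4FA1E1", "ARJIE2Y1187B994AB7", 139))
    conn.close()
```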

Project 2: Data Modeling with Apache Cassandra

Project passed

Develop a NoSQL database with Apache Cassandra and build an ETL pipeline using Python, based on the original schema outlined in project one. The goal is to answer the following queries:

  • Get details of a song that was heard on the music app history during a particular session.
  • Get songs played by a user during a particular session on the music app.
  • Get all users from the music app history who listened to a particular song.

Technologies used: Python, Apache Cassandra, Denormalization
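
A minimal sketch of the first query, assuming the cassandra-driver package; the keyspace `sparkify`, the table `session_songs`, and its columns are illustrative placeholders chosen to show a query-first, denormalized table design.

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("sparkify")

# Denormalized table modeled for the query: partition by session_id and
# cluster by item_in_session so each row is directly addressable.
session.execute("""
    CREATE TABLE IF NOT EXISTS session_songs (
        session_id int,
        item_in_session int,
        artist text,
        song text,
        length float,
        PRIMARY KEY (session_id, item_in_session)
    )
""")

# Query 1: details of the song heard during a particular session.
rows = session.execute(
    "SELECT artist, song, length FROM session_songs "
    "WHERE session_id = %s AND item_in_session = %s",
    (338, 4),
)
for row in rows:
    print(row.artist, row.song, row.length)

cluster.shutdown()
```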

Project 3: Data Warehouse - Amazon Redshift

Project passed

Apply the Data Warehouse architectures covered in the course and build a Data Warehouse on Amazon Redshift.

  • Build an ETL pipeline to extract and transform data stored in JSON format from S3 buckets into staging tables.
  • Move the data to the Warehouse hosted on an Amazon Redshift cluster.
  • Develop the optimized queries required by the data analytics team.

Technologies used: Python, Amazon Redshift, AWS CLI, Amazon SDK, SQL, PostgreSQL
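
A minimal sketch of the S3-to-staging load, assuming psycopg2 can reach the Redshift cluster endpoint; the bucket path, IAM role ARN, connection string, and table name are placeholders.

```python
import psycopg2

# Redshift COPY loads JSON event logs from S3 into a staging table in bulk;
# the S3 path and IAM role below are placeholders.
COPY_STAGING_EVENTS = """
    COPY staging_events
    FROM 's3://<bucket>/log_data'
    IAM_ROLE '<redshift-s3-read-role-arn>'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
"""

conn = psycopg2.connect(
    "host=<cluster-endpoint> dbname=dev user=awsuser password=<password> port=5439"
)
with conn.cursor() as cur:
    cur.execute(COPY_STAGING_EVENTS)
conn.commit()
conn.close()
```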

Project 4: Data Lake - Spark

Project passed

Build a Data Lake on the AWS cloud using Spark and an AWS EMR cluster. The data lake serves as a Single Source of Truth (SSOT) for the analytics platform. Spark jobs scale up the ELT pipeline, moving data from the landing zone on S3 (the data warehouse), transforming it, and storing it in the processed zone on S3 (the data lake).

  • Create an EMR Hadoop Cluster
  • Further develop the ETL pipeline: copy datasets from S3 buckets, process the data with Spark, and write back to S3 buckets using efficient partitioning and Parquet formatting.
  • Fast-track the data lake buildout using serverless AWS Lambda and catalog tables with AWS Glue Crawler.

Technologies used: Python, Spark, AWS S3, EMR, Athena, AWS Glue, Parquet.
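
A minimal sketch of one Spark ELT step, assuming PySpark running on EMR; the S3 paths and column names are illustrative placeholders, not the project's exact layout.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Read raw JSON song data from the landing zone on S3.
songs = spark.read.json("s3a://<landing-bucket>/song_data/*/*/*/*.json")

# Write a songs dimension to the processed zone, partitioned for efficient scans.
(songs.select("song_id", "title", "artist_id", "year", "duration")
      .dropDuplicates(["song_id"])
      .write.mode("overwrite")
      .partitionBy("year", "artist_id")
      .parquet("s3a://<processed-bucket>/songs/"))

spark.stop()
```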