
udacity-nanodegree-data-engineering

Project files created during the 4-month nanodegree Data Engineering with AWS (Link).


Contents of this repository

Files and scripts from the project work.

1 - Data Modeling: Project 1 actually consists of two sub-projects, both dealing with data from the music-streaming startup Sparkify. In project 1A a relational Postgres database is built from JSON input files. In project 1B a number of business questions are explored using the NoSQL database Apache Cassandra.
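A minimal sketch of the kind of load step used in project 1A, assuming a local Postgres instance; the table, column names and file path are illustrative placeholders, not the project's exact code:

```python
# Sketch: load one song record from a JSON file into a Postgres table with psycopg2.
# Connection string, table name and columns are assumptions for illustration.
import json
import psycopg2

conn = psycopg2.connect("host=localhost dbname=sparkifydb user=student password=student")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS songs (
        song_id   VARCHAR PRIMARY KEY,
        title     VARCHAR,
        artist_id VARCHAR,
        year      INT,
        duration  FLOAT
    );
""")

with open("data/song_data/sample_song.json") as f:   # hypothetical input file
    song = json.load(f)

cur.execute(
    "INSERT INTO songs (song_id, title, artist_id, year, duration) "
    "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (song_id) DO NOTHING;",
    (song["song_id"], song["title"], song["artist_id"], song["year"], song["duration"]),
)

conn.commit()
conn.close()
```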

2 - Cloud Data Warehouses: The objective of this project was to set up a Redshift data warehouse on AWS for the music-streaming startup Sparkify. Data is supplied through a Python ETL that extracts data from S3, stages it on Redshift and transforms it into a star schema.
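A hedged sketch of the staging and transform steps described above; the bucket, IAM role, endpoint and column names are placeholders, not the project's configuration:

```python
# Sketch: stage S3 log data into Redshift via COPY, then load a fact table.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder endpoint
    dbname="dev", user="awsuser", password="***", port=5439,
)
cur = conn.cursor()

# COPY raw JSON logs from S3 into a staging table (bucket/role are placeholders).
cur.execute("""
    COPY staging_events
    FROM 's3://my-sparkify-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 's3://my-sparkify-bucket/log_json_path.json';
""")

# Transform into the star schema: fill the songplays fact table from staging
# (column names are assumptions).
cur.execute("""
    INSERT INTO songplays (start_time, user_id, level, song_id, artist_id,
                           session_id, location, user_agent)
    SELECT TIMESTAMP 'epoch' + e.ts / 1000 * INTERVAL '1 second',
           e.userId, e.level, s.song_id, s.artist_id,
           e.sessionId, e.location, e.userAgent
    FROM staging_events e
    LEFT JOIN staging_songs s
           ON e.song = s.title AND e.artist = s.artist_name
    WHERE e.page = 'NextSong';
""")

conn.commit()
conn.close()
```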

3 - Spark & Data Lakes: In this project a data lake containing songplay data from Sparkify and a corresponding ETL are built using EMR, S3 and Spark. Data is extracted from an S3 bucket, transformed into data warehouse tables with Spark on EMR and saved as parquet files.
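A short PySpark sketch of that flow, assuming S3 paths and a songs-table schema chosen for illustration only:

```python
# Sketch: read raw song JSON from S3, derive a songs table, write partitioned parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparkify-data-lake").getOrCreate()

# Placeholder input path; the real project reads from its own bucket layout.
song_df = spark.read.json("s3a://my-input-bucket/song_data/*/*/*/*.json")

songs_table = (
    song_df.select("song_id", "title", "artist_id", "year", "duration")
    .dropDuplicates(["song_id"])
)

# Write back to S3 as parquet, partitioned for later querying (placeholder output path).
(
    songs_table.write
    .mode("overwrite")
    .partitionBy("year", "artist_id")
    .parquet("s3a://my-output-bucket/songs/")
)

spark.stop()
```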

4 - Automate Data Pipelines: This project continues with the Sparkify dataset. Data is extracted from S3 and written into a Redshift database with a Python ETL. The main focus of this project is the orchestration of the ETL through Apache Airflow.
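A sketch of how such a pipeline can be wired up in Airflow; the task bodies are stubs and the schedule is an assumption (the real project uses custom operators for staging, loading and data-quality checks):

```python
# Sketch: an Airflow DAG chaining stage -> load -> quality check for the Sparkify ETL.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_events_to_redshift(**context):
    """Placeholder: run a Redshift COPY from the S3 log data."""


def load_songplays_fact(**context):
    """Placeholder: insert from the staging tables into the fact table."""


def run_quality_checks(**context):
    """Placeholder: e.g. fail the run if a target table is empty."""


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@hourly",
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_events", python_callable=stage_events_to_redshift)
    load = PythonOperator(task_id="load_songplays", python_callable=load_songplays_fact)
    check = PythonOperator(task_id="quality_checks", python_callable=run_quality_checks)

    stage >> load >> check
```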

5 - Capstone Project: This graduation project designs an analytical data warehouse and corresponding ETLs for the bicycle sharing system Citi Bike in New York City. The project combines bike trip data provided by Citi Bike with weather data for New York City. Data is extracted from an S3 bucket and an API, transformed into six data warehouse tables with Spark and saved as parquet files.
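A hedged sketch of that combination, assuming a weather API endpoint, bucket layout and column names invented for illustration:

```python
# Sketch: fetch weather data from an API, read trip data from S3, join and save parquet.
import requests
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("citibike-capstone").getOrCreate()

# 1. Weather: fetch JSON from a (hypothetical) API and turn it into a DataFrame.
resp = requests.get("https://example-weather-api.com/v1/nyc/daily")  # placeholder URL
weather_df = spark.createDataFrame(resp.json()["days"])              # assumed payload shape

# 2. Trips: read raw Citi Bike trip CSVs from S3 (placeholder path and schema).
trips_df = spark.read.csv("s3a://my-citibike-bucket/tripdata/*.csv", header=True)
trips_df = trips_df.withColumn("trip_date", F.to_date("starttime"))  # assumed column name

# 3. Combine trips with the weather dimension on the trip date and persist.
trips_with_weather = trips_df.join(
    weather_df, trips_df["trip_date"] == weather_df["date"], "left"  # assumed join keys
)
trips_with_weather.write.mode("overwrite").parquet("s3a://my-citibike-bucket/warehouse/trips/")

spark.stop()
```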


Program information

As cited from the official syllabus:

Students will learn to:

  • Create user-friendly relational and NoSQL data models.
  • Create scalable and efficient data warehouses.
  • Work efficiently with massive datasets.
  • Build and interact with a cloud-based data lake.
  • Automate and monitor data pipelines.
  • Develop proficiency in Spark, Airflow, and AWS tools.

Course contents

  1. Data Modeling
  • Introduction to Data Modeling
  • Relational Data Models
  • NoSQL Data Models
  2. Cloud Data Warehouses
  • Introduction to Data Warehouses
  • ELT and Data Warehouse Technology in the Cloud
  • AWS Data Technologies
  • Implementing Data Warehouses on AWS
  3. Spark & Data Lakes
  • Big Data Ecosystem, Data Lakes, & Spark
  • Spark Essentials
  • Using Spark & Data Lakes in the AWS Cloud
  • Ingesting & organizing data in lakehouse architecture on AWS
  4. Automate Data Pipelines
  • Data Pipelines
  • Airflow & AWS
  • Data Quality
  • Production Data Pipelines
  5. Capstone Project