dacosta-github/udacity-de

Data Engineering Nanodegree

Projects of Udacity's Data Engineering Nanodegree.


Projects

Data Modeling

In this course, I’ll learn to create relational and NoSQL data models to fit the diverse needs of data consumers. I’ll understand how the two kinds of data model differ and how to choose the appropriate one for a given situation. I’ll also build fluency in PostgreSQL and Apache Cassandra.

Course Project 1 - Data Modeling with Postgres

In this project, I’ll model user activity data for a music streaming app called Sparkify. I’ll create a relational database and ETL pipeline designed to optimize queries for understanding what songs users are listening to. In PostgreSQL I will also define Fact and Dimension tables and insert data into new tables.
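
As a rough illustration of the fact and dimension layout described above, here is a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a database server. The table names (`users`, `songs`, `songplays`), the columns, and the sample rows are all made up for illustration and are not necessarily the ones used in the project.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables (hypothetical names and columns).
cur.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, first_name TEXT, level TEXT)")
cur.execute("CREATE TABLE songs (song_id TEXT PRIMARY KEY, title TEXT, artist_name TEXT)")

# Fact table: one row per song play, referencing the dimensions.
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY AUTOINCREMENT,
        start_time  TEXT NOT NULL,
        user_id     INTEGER NOT NULL REFERENCES users (user_id),
        song_id     TEXT REFERENCES songs (song_id)
    )
""")

# Made-up sample rows, only there to exercise the schema.
cur.executemany("INSERT INTO users VALUES (?, ?, ?)",
                [(1, "Ada", "free"), (2, "Lin", "paid")])
cur.executemany("INSERT INTO songs VALUES (?, ?, ?)",
                [("S1", "Song One", "Artist A"), ("S2", "Song Two", "Artist B")])
cur.executemany(
    "INSERT INTO songplays (start_time, user_id, song_id) VALUES (?, ?, ?)",
    [("2018-11-01T10:00:00", 1, "S1"),
     ("2018-11-01T10:05:00", 2, "S1"),
     ("2018-11-01T10:09:00", 2, "S2")],
)
conn.commit()


def plays_per_song(connection):
    """Return (title, play_count) pairs, most played first."""
    query = """
        SELECT s.title, COUNT(*) AS plays
        FROM songplays sp
        JOIN songs s ON s.song_id = sp.song_id
        GROUP BY s.title
        ORDER BY plays DESC, s.title
    """
    return connection.execute(query).fetchall()


top_songs = plays_per_song(conn)
print(top_songs)
```

The analytical question ("what songs are users listening to?") becomes a single join between the fact table and one dimension, which is the main reason for choosing this layout.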

Course Project 2 - Data Modeling with Apache Cassandra

In this project, I’ll model user activity data for a music streaming app called Sparkify. I'll create a database and ETL pipeline in Apache Cassandra, designed to optimize queries for understanding what songs users are listening to. I'll model the data so I can run specific queries provided by the analytics team at Sparkify.
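
Modeling for Cassandra usually starts from the query rather than from the entities. The sketch below only illustrates that idea: a small helper builds a CQL `CREATE TABLE` statement whose partition key and clustering columns are chosen to match one query. The helper `create_table_cql`, the table name `songs_by_session`, and its columns are hypothetical, and the statement is only generated as text, not executed against a cluster.

```python
def create_table_cql(table, columns, partition_key, clustering=()):
    """Build a CQL CREATE TABLE statement.

    columns:       mapping of column name to CQL type
    partition_key: columns that decide which node stores a row
    clustering:    columns that order rows inside a partition
    """
    cols = ",\n    ".join(f"{name} {cql_type}" for name, cql_type in columns.items())
    key = f"({', '.join(partition_key)})"
    if clustering:
        key = f"{key}, {', '.join(clustering)}"
    return (
        f"CREATE TABLE IF NOT EXISTS {table} (\n"
        f"    {cols},\n"
        f"    PRIMARY KEY ({key})\n"
        f")"
    )


# Hypothetical query to serve: "which song was played at a given
# position within a given session?" The WHERE clause filters on
# session_id and item_in_session, so those become the primary key.
ddl = create_table_cql(
    table="songs_by_session",
    columns={
        "session_id": "int",
        "item_in_session": "int",
        "artist": "text",
        "song_title": "text",
    },
    partition_key=["session_id"],
    clustering=["item_in_session"],
)
print(ddl)
```

Each additional query with a different filter would typically get its own table, which is why the same data may be stored more than once.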


Cloud Data Warehouses

Course Project 3 - Data Modeling with AWS Redshift

In this project, I applied what I learned about data warehouses and AWS to build an ETL pipeline for a database hosted on Redshift. To complete the project, I loaded data from S3 into staging tables on Redshift and executed SQL statements that create the analytics tables from these staging tables. To manage the AWS resources, including the clusters and their access, I used the AWS SDK for Python.
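
The S3-to-staging step is typically done with Redshift's `COPY` command. The sketch below only assembles the SQL text for such a statement; it does not connect to a cluster. The helper `build_copy_statement`, the table name, the bucket path, the role ARN, and the region are placeholder assumptions, not values from the project.

```python
def build_copy_statement(table, s3_path, iam_role_arn, region, json_option="auto"):
    """Return a Redshift COPY statement that loads JSON files from S3.

    Values are interpolated directly into the SQL text, so they should
    come from trusted configuration, never from user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"REGION '{region}'\n"
        f"FORMAT AS JSON '{json_option}';"
    )


# All values below are placeholders for illustration.
copy_events = build_copy_statement(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    region="us-west-2",
)
print(copy_events)
```

Once the staging tables are filled, the analytics tables can be populated with plain `INSERT INTO ... SELECT` statements that read from them.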


Data Lakes with Spark

Course Project 4 - Data Lake with Apache Spark and AWS S3

In this project, I applied what I learned about Spark and data lakes to build an ETL pipeline for a data lake hosted on S3. To complete the project, I loaded data from S3, processed it into analytics tables using Spark, and loaded the tables back into S3. Afterwards, I deployed this Spark process on a cluster using AWS. I used the AWS SDK for Python.


Data Pipelines with Airflow

Course Project 5 - Data Pipelines with Apache Airflow

In this project, I applied what I learned about Apache Airflow data pipelines. To complete the project, I created my own custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step. I used the AWS SDK for Python.
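
The final data-check step can be sketched independently of Airflow. The example below shows only the kind of logic a custom operator might run when it executes: each check pairs a SQL statement with an expected result, and the task fails if any result differs. The names `QualityCheck` and `run_quality_checks`, the SQL strings, and the fake query results are assumptions made for illustration; a real operator would run the SQL through a database connection instead of a dictionary lookup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class QualityCheck:
    sql: str        # query that returns a single number
    expected: int   # value the query must return for the check to pass


def run_quality_checks(run_query: Callable[[str], int],
                       checks: Iterable[QualityCheck]) -> None:
    """Run every check and raise ValueError listing all failures."""
    failures: List[str] = []
    for check in checks:
        actual = run_query(check.sql)
        if actual != check.expected:
            failures.append(f"{check.sql!r}: expected {check.expected}, got {actual}")
    if failures:
        raise ValueError("Data quality checks failed: " + "; ".join(failures))


# Stand-in for a database: maps SQL text to a canned result.
fake_counts = {
    "SELECT COUNT(*) FROM users WHERE user_id IS NULL": 0,
    "SELECT COUNT(*) FROM songs WHERE song_id IS NULL": 3,
}

# This check passes, so no exception is raised.
run_quality_checks(
    fake_counts.get,
    [QualityCheck("SELECT COUNT(*) FROM users WHERE user_id IS NULL", 0)],
)
```

Raising an exception is what marks the task as failed, so a bad load stops the pipeline instead of passing silently.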


Data Engineering Capstone

Capstone Project - Data Platform for Analytics & Machine Learning - Financial companies complaints analysis

In this project, I applied what I've learned in Udacity's Nanodegree programs.


Acknowledgements

Data Engineering Nanodegree Program Syllabus