
Udacity Data Engineering Capstone Project


License: MIT | Code style: black | Imports: isort


Contents

  1. Project Summary
  2. Data Sources
  3. Data Model
  4. ETL Pipeline
  5. Other Scenarios
  6. Structure of the project

Project Summary

This project aims to create an ETL pipeline that takes data from 7 sources, processes it, and loads it into a data warehouse. The data warehouse facilitates the analysis of the US immigration phenomenon using Business Intelligence applications. With the help of the data stored in it, it is possible to identify:

  • whether there is any correlation between the global warming phenomenon and the origin of immigrants
  • whether the number of immigrants from certain states is increasing or decreasing
  • which are the main states immigrants go to
  • what is the age distribution of immigrants
  • etc.

This repository is the result of completing the Data Engineering Nanodegree on Udacity, so the code was tested in the Udacity Project Workspace.

Data Sources

As mentioned in the previous section, 7 data sources are used in this project. 4 of them are suggested by the Udacity-provided project and 3 of them are taken from various web pages. A short description of each can be found below:

  • I94 Immigration Data: This dataset comes from the US National Tourism and Trade Office and contains data about immigrants to the US. In addition to the actual data, it also comes with a file describing the codes used. For a better understanding of the column names in this dataset, I recommend the following projects: project A and project B.
  • World Temperature Data: This dataset came from Kaggle. It provides historical information about monthly average temperatures in different cities around the world.
  • U.S. City Demographic Data: This dataset contains information on the demographics of all US cities with a population greater than 63,000.
  • Airport Code Table: Provides information about airports around the world.
  • Country Codes: This site provides the name and 2-letter code of all countries in the world.
  • US States Codes: This site provides the name and 2-letter code of all US states.
  • Continent Codes: This site provides the name and 2-letter code of all continents.
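
The sources come in several formats (CSV files, HTML pages, and the immigration data with its description file), so they are read with different tools. The sketch below shows how the two CSV-based sources could be loaded with Spark; the file names and the semicolon separator for the demographics export are assumptions, not necessarily what etl.py does.

# Minimal sketch of loading the two CSV-based sources with Spark. The file
# names and the semicolon separator are assumptions, not the actual etl.py code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capstone-data-exploration").getOrCreate()

# World Temperature Data from Kaggle ships as a plain CSV file.
temperatures = spark.read.csv(
    "raw_data/GlobalLandTemperaturesByCity.csv", header=True, inferSchema=True
)

# U.S. City Demographic Data is typically distributed as a semicolon-delimited CSV.
demographics = spark.read.csv(
    "raw_data/us-cities-demographics.csv", header=True, sep=";", inferSchema=True
)

temperatures.printSchema()
demographics.printSchema()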

Data Model

The schema was created using dbdesigner. The SQL code to create the tables is in docs/immigration_db_postgres_create.sql. As can be seen from the image, a star schema was used to model the data because the ultimate goal is to analyze it with Business Intelligence applications. A brief description of each table follows:

  • country_temperature_evolution: is a dimension table whose data source is the World Temperature Data dataset. It stores the average monthly temperatures of each country from 1743 to 2013.
  • demographic: is a dimension table whose data source is the U.S. City Demographic Data dataset. It contains population data for each US state.
  • world_airports: is a dimension table whose data sources are the Airport Code Table and Continent Codes datasets. It contains data about all airports in the world.
  • us_states: is a dimension table whose data source is the US States Codes dataset. It contains the name and 2-letter code of all US states.
  • visa: is a dimension table whose data source is the I94 Immigration Data dataset and its description file. It contains all valid visa information.
  • applicant_origin_country: is a dimension table whose data source is the description file in the I94 Immigration Data dataset. It contains a 3-digit code and the name of each country from which an immigrant could come.
  • status_flag: is a dimension table whose data source is the I94 Immigration Data dataset. It contains the one-letter status for the different stages that the immigrant went through.
  • admission_port: is a dimension table whose data source is the description file in the I94 Immigration Data dataset. It contains the code and information about the admission port through which the immigrant passed.
  • arrival_mode: is a dimension table whose data source is the I94 Immigration Data dataset and its description file. It contains information about how the immigrant arrived in the US.
  • date: is a dimension table whose data source is the I94 Immigration Data dataset. It contains all possible dates from the columns in the source dataset.
  • immigran_application: is the fact table in the data model. Its data sources are the I94 Immigration Data dataset and the visa, status_flag and arrival_mode tables, from which it takes the id columns. This table contains information on the applications submitted by the immigrants.

More details related to the columns in the tables can be found in docs/data_dictionary.md.
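
As an illustration of how the star schema supports such BI questions, the sketch below counts applications per destination state by joining the fact table with the us_states dimension. The table names come from the model above, but the column names (application_id, state_code, state_name) are assumptions; the actual names are in docs/data_dictionary.md.

# Hypothetical BI-style query over the star schema: applications per state.
# Column names are assumptions -- see docs/data_dictionary.md for the real ones.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capstone-bi-sketch").getOrCreate()

# Register the processed Parquet outputs as temporary views.
spark.read.parquet("processed_data/immigran_application").createOrReplaceTempView("immigran_application")
spark.read.parquet("processed_data/us_states").createOrReplaceTempView("us_states")

applications_per_state = spark.sql("""
    SELECT s.state_name,
           COUNT(a.application_id) AS applications
    FROM immigran_application AS a
    JOIN us_states AS s
      ON a.state_code = s.state_code
    GROUP BY s.state_name
    ORDER BY applications DESC
""")
applications_per_state.show(10)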

ETL Pipeline

As can be seen in the previous image, the pipeline takes the source data from 3 different places, processes it and saves it locally in Parquet format. Once all the data has been processed, it is uploaded to Amazon S3. Each table is processed using the following pattern (a sketch of it follows the list):

  1. Raw data is read.
  2. The data is transformed.
  3. The correctness of the processed data is checked.
  4. The processed data is saved in Parquet format in the output folder.
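
The sketch below condenses these four steps for a single table; the transformation and the quality check are simplified assumptions, not the actual etl.py code.

# Illustrative run of the four-step pattern for one table.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("capstone-etl-sketch").getOrCreate()

# 1. Raw data is read.
raw_temperatures = spark.read.csv(
    "raw_data/GlobalLandTemperaturesByCity.csv", header=True, inferSchema=True
)

# 2. The data is transformed (monthly average temperature per country).
country_temperature_evolution = (
    raw_temperatures.withColumn("dt", F.to_date("dt"))
    .groupBy("Country", F.year("dt").alias("year"), F.month("dt").alias("month"))
    .agg(F.avg("AverageTemperature").alias("average_temperature"))
)

# 3. The correctness of the processed data is checked.
if country_temperature_evolution.count() == 0:
    raise ValueError("country_temperature_evolution is empty after processing")

# 4. The processed data is saved in Parquet format in the output folder.
country_temperature_evolution.write.mode("overwrite").parquet(
    "processed_data/country_temperature_evolution"
)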

The main tools used in this project are:

  • Apache Spark: was chosen for data processing, regardless of data volume.
  • Pandas: was chosen for how easily it reads HTML tables (see the sketch below).
  • Amazon S3: was chosen because it is highly scalable, reliable, fast and inexpensive data storage.
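
For example, pandas turns the 2-letter code pages into DataFrames with a single call, as sketched below; the URL is a placeholder, not necessarily the page this project scrapes.

# pandas reads an HTML <table> straight into a DataFrame, which is why it was
# chosen for the country/state/continent code pages. The URL is a placeholder.
import pandas as pd

tables = pd.read_html("https://example.com/country-codes")  # one DataFrame per <table>
country_codes = tables[0]
print(country_codes.head())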

To run the pipeline, follow these steps:

  1. Complete the dl.cfg configuration file. It is recommended that the KEY and SECRET fields hold the credentials of an IAM user that has only the AmazonS3FullAccess policy attached. The S3 field must contain the S3 bucket name. Be careful not to wrap the entered values in single or double quotes; a sketch of how the file is read is shown below.
  2. Run the following command from the project root:
python -m etl
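
For reference, the pipeline could read dl.cfg with the standard library as sketched below; the [AWS] section name is an assumption, while the KEY, SECRET and S3 fields are the ones described in step 1.

# Sketch of reading dl.cfg; the [AWS] section name is an assumption, the
# KEY/SECRET/S3 fields are described in step 1. Values must not be quoted.
import configparser
import os

config = configparser.ConfigParser()
config.read("dl.cfg")

os.environ["AWS_ACCESS_KEY_ID"] = config["AWS"]["KEY"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["AWS"]["SECRET"]
output_bucket = f"s3a://{config['AWS']['S3']}/"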

Other Scenarios

The data was increased by 100x.

For such a scenario, I would consider using an Amazon EMR cluster to run the ETL and uploading the data directly to Amazon S3. Besides this, I would partition the tables. For example, I would partition the country_temperature_evolution table by country, as in the sketch below.
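
A minimal sketch of that partitioning with Spark, assuming the table has a country column; the column name and the bucket path are assumptions.

# Write the table partitioned by country so queries that filter on a single
# country only scan that partition. Column name and bucket path are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("capstone-partition-sketch").getOrCreate()

country_temperature_evolution = spark.read.parquet(
    "processed_data/country_temperature_evolution"
)
country_temperature_evolution.write.mode("overwrite").partitionBy("country").parquet(
    "s3a://<bucket-name>/country_temperature_evolution/"
)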

The data populates a dashboard that must be updated daily by 7 am.

For this situation, the ETL can be refactored to run with Apache Airflow, which would make it much easier to automate the execution of the pipeline (a minimal DAG sketch is shown below).
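
The sketch schedules the existing command early enough to finish before 7 am; the DAG id, start date, schedule and project path are assumptions, not part of this repository.

# Minimal Airflow DAG sketch that runs the existing pipeline daily at 05:00,
# leaving a buffer before the 7 am dashboard deadline.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="immigration_etl",
    start_date=datetime(2023, 1, 1),
    schedule_interval="0 5 * * *",
    catchup=False,
) as dag:
    run_etl = BashOperator(
        task_id="run_etl",
        bash_command="cd /path/to/project && python -m etl",
    )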

The database needed to be accessed by 100+ people.

If the database starts to be used intensively, I would consider moving the data to Amazon Redshift.

Structure of the project

├── docs                                # Contains files about data model and ETL pipeline.
├── processed_data                      # The folder where the processed data is stored.
├── raw_data                            # The folder where the raw data is stored.
├── utils                               # Contains files with functions used in the ETL pipeline.
├── data_exploration.ipynb              # Jupyter notebook used for data exploration.
├── dl.cfg                              # The credentials and config used to manage the AWS resources.
├── etl.py                              # Code for the ETL pipeline.
├── LICENSE                             # Contains information about the project license.
├── pyproject.toml                      # Contains the configurations of the linting tools used.
├── README.md
└── requirements.txt                    # Contains the list of libraries used.