
Simple Airports Analysis - Malaysia



Introduction

In this project, we run a simple analysis on airport and arrival data for Malaysia.

We want to know:

  1. How many airports are there in Malaysia?
  2. What are the distances between the airports in Malaysia?
  3. How many flights will land at Malaysian airports in the day following the point of query?
  4. Which airport is the most congested, based on the information gathered in question 3 above?

To answer these questions, we built an end-to-end framework with ETL pipelines, a database, and an analytics platform as seen in the Project Framework diagram below. (Admittedly a little over-engineered but, hey! Why not...? :trollface:)

The relationship between the tables used to answer the questions above can be visualised using the Table Lineage.

We also hosted a version of the dbt documentation on GitHub; it contains the table descriptions, data dictionaries, and the SQL queries (both raw and compiled) used to generate the tables.

Finally, to visualise the data and help us answer the questions, we used Metabase—an open source business intelligence tool—to create an Analytics Dashboard.


Project Framework

Covered in Steps 1, 2, and 3.

[image: airports-analysis-framework]

Table Lineage (via dbt)

Covered in Step 4 (Optional) - view the dbt documentation.

[image: table-lineage-dbt]

Table Documentation (via dbt)

Covered in Step 4 (Optional) - view the dbt documentation.

[image: table-lineage-dbt]

Metabase Dashboard

Covered in Step 3 (Optional).

[image: airport-analysis-metabase]


Technology Used

| Database | (ETL/ELT) Pipeline | Workflow Orchestration | Analytics Platform |
| --- | --- | --- | --- |
| Postgres (using Docker) | Python 3.6, dbt (Data Transformation) | Luigi (using Python) | Metabase (using Docker) |

Developer notes

This project was created on macOS. It has not yet been tested on Windows.

Requirements

  • Python 3.6 or above

    1. (Optional) Use a virtual environment - see this guide
    2. IMPORTANT - For libraries used, see requirements.txt. Includes:
      • luigi - Workflow Orchestration
      • dbt - data build tool (Data Transformation)
  • Docker

    1. PostgreSQL - Database
    2. Metabase - Analytics Platform
  • Other Software

    1. DBeaver - SQL Client to view the database

Data Source

We primarily use Airport and Arrival data from these two sources:

Using this project

Step 1. Initial setup

  1. Install Python 3.6 using these instructions.

    (Optional) Use a Python virtual environment

  2. Install Poetry using the official documentation

    Suggestion: use Homebrew to install

  3. IMPORTANT - Install required libraries using Poetry, at the command line:

    $ poetry install

    This will install the Python libraries required for this project in a virtual environment.

    • luigi - for (one-script) workflow orchestration
    • dbt - for uploading raw data and transforming it in the database
    • requests - to scrape websites
    • SQLAlchemy - for database connections in scripts
    • pandas - for flattening the extracted (JSON) data and saving it to CSV
    • PyYAML - for configuration extraction
  4. Install Docker using these instructions.

  5. Install PostgreSQL via Docker using the instructions below (source), at the command line:

    $ docker pull postgres
    
    $ mkdir -p $HOME/docker/volumes/postgres
    
    $ docker run --rm --name pg-docker -e POSTGRES_PASSWORD=docker -d -p 5432:5432 -v $HOME/docker/volumes/postgres:/var/lib/postgresql/data postgres
  6. Install a suitable SQL Client such as DBeaver.

    Use the default connection configuration to verify the database connection and view the data that will be added in the later steps:

    Host:     localhost 
    Database: postgres
    Port:     5432
    User:     postgres
    Password: docker

    [image: database-sql-client-dbeaver]
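
    (Optional) To run the same check from Python, here is a minimal sketch using SQLAlchemy with the credentials above. It assumes a Postgres driver such as psycopg2 is available in the environment:

    # Minimal sanity check against the Dockerised Postgres instance,
    # using the default credentials from the step above.
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://postgres:docker@localhost:5432/postgres")
    with engine.connect() as conn:
        # Should print the Postgres server version string.
        print(conn.execute(text("SELECT version()")).scalar())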

Step 2. Running the workflow

This step uses Luigi workflow orchestration to automate the entire ELT pipeline.

To run each task in the workflow manually, see this guide.
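
For orientation, below is a minimal sketch of what a Luigi task chain of this shape can look like. The task names mirror the execution summary shown in step 3; the actual commands, dependencies, and targets in workflow.py may differ:

    import subprocess

    import luigi

    class DbtDeps(luigi.Task):
        """Install dbt package dependencies (dbt deps)."""

        def output(self):
            # Luigi checks for this marker file to decide whether the task
            # already ran; this mirrors the .output files noted in step 3.
            return luigi.LocalTarget("dbt_deps.output")

        def run(self):
            out = subprocess.check_output(["dbt", "deps", "--profiles-dir", "./"], cwd="./dbt")
            with self.output().open("w") as f:
                f.write(out.decode())

    class DbtRunAnalysis(luigi.Task):
        """Build the analysis models once dependencies are in place."""

        def requires(self):
            # The real workflow also chains the seed and run tasks here.
            return DbtDeps()

        def output(self):
            return luigi.LocalTarget("dbt_run_analysis.output")

        def run(self):
            out = subprocess.check_output(["dbt", "run", "--profiles-dir", "./"], cwd="./dbt")
            with self.output().open("w") as f:
                f.write(out.decode())

    if __name__ == "__main__":
        # Equivalent to: python workflow.py --local-scheduler DbtRunAnalysis
        luigi.build([DbtRunAnalysis()], local_scheduler=True)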

  1. Activate the virtual environment by running the following command:

    $ poetry shell
  2. Run the workflow.py file, which initiates the Luigi workflow orchestration, at the command line:

    $ python workflow/workflow.py --local-scheduler DbtRunAnalysis 

    Luigi will work through all the steps as defined in workflow.py.

    Note: this step may take some time since it runs the entire pipeline.

    (Optional) Luigi has a web interface which can be used to monitor workflows and view dependency graphs, as shown below. To replicate, follow this guide.

    [image: workflow-visualisation-luigi]

  3. The following will be returned when the previous step completes successfully:

    INFO: 
    ===== Luigi Execution Summary =====
    
    Scheduled 7 tasks of which:
    * 7 ran successfully:
        - 1 DbtDeps()
        - 1 DbtRunAirports()
        - 1 DbtRunAnalysis()
        - 1 DbtSeedAirports()
        - 1 DbtSeedArrivals()
        ...
    
    This progress looks :) because there were no failed tasks or missing dependencies
    
    ===== Luigi Execution Summary =====

    Note: you can view the logs for each step in the files with the .output extension that are generated in the root of the project directory.

  4. The data should now be ready in the database as tables and can be viewed using the SQL client that was installed in Step 1.6 above.

Step 3. Analysis on Metabase

Optional Step! Alternatively, the data can be viewed directly using the SQL client as mentioned in Step 1.6.

  1. Install Metabase via Docker using the instructions below (source), at the command line:

    $ docker run -d -p 3000:3000 --name metabase metabase/metabase

    Wait a little while as it takes time to initialise. You can use docker logs -f metabase to follow the rest of the initialisation progress.

  2. Once initialised, visit the local Metabase portal at http://localhost:3000/.

  3. You will be directed to http://localhost:3000/setup; click on Let's get started.

    [image: setup-metabase-0]

  4. Fill in the details for your admin account.

    [image: setup-metabase-1]

  5. Important step - Use the following database connection configuration to connect Metabase to our Postgres instance in Docker:

    Name: Local  (Or any other name) 
    Host: host.docker.internal 
    Port: 5432
    Database name: postgres
    Database username: postgres
    Database password: docker

    Please note that this has been tested to work on macOS. For Windows, if Metabase fails to connect to the above Host, please try referring to this official documentation on Docker Networking.

    [image: setup-metabase-2]

  6. Choose your usage data preference.

    [image: setup-metabase-3]

  7. Done! You should see the main Metabase page below, which lists the available tables in the Postgres database that was created in Step 2.

    [image: setup-metabase-4]

  8. Unfortunately, there is no export feature to share the dashboard and charts seen in the Introduction. However, the dashboard boils down to a few simple SELECT * FROM table queries against the main fct_ and some of the stg_ tables available in the Postgres database. 😄
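
    If you want to reproduce those numbers outside Metabase, a small sketch using pandas is shown below. The fct_ table name is illustrative only; check the dbt documentation for the real table names:

    # Same connection details as the Metabase/DBeaver configuration above.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://postgres:docker@localhost:5432/postgres")

    # "fct_arrivals" is a placeholder; list the real tables via the dbt docs or DBeaver.
    df = pd.read_sql("SELECT * FROM fct_arrivals", engine)
    print(df.head())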

Step 4. Viewing dbt documentation

Optional Step!

  1. In the ./dbt directory, run the following commands to generate the documentation:

    $ cd ./dbt
    $ dbt docs generate --profiles-dir ./
  2. Once the documentation has been generated, run the following commands to host the documentation website locally:

    $ dbt docs serve
  3. The documentation website should automatically be launched in your browser.

    If it does not, just navigate to http://localhost:8080/.

  4. Alternatively, you can view the version we hosted on GitHub - dbt documentation.

Step 5. Done

:shipit:

Future Features and Improvements

As this project was put together in roughly three days, some features and improvements were not implemented.

Here are a few suggestions:

  • Use multiprocessing to scrape websites - Especially for scraping arrivals data, as we loop through multiple airports (see the sketch after this list).
  • Implement the logging module - Right now, the scripts use simple print statements; proper logging will help with debugging.
  • Data documentation using dbt - Documentation is important, but it is also time consuming. Whenever possible, we should document the tables that were generated. Your downstream consumers will thank you.
  • Use environment variables for sensitive information - For passwords, in particular.
  • Use a more versatile workflow orchestration tool - Luigi is simple and easy to get started with, but one should consider alternatives like Airflow, which can handle complex workflows and has more features.
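
To illustrate the first two suggestions, here is a sketch of how the arrivals scraping loop could combine a worker pool with the logging module. The endpoint URL and airport codes are placeholders, not the project's actual data source:

    import logging
    from concurrent.futures import ThreadPoolExecutor

    import requests

    logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
    log = logging.getLogger(__name__)

    AIRPORTS = ["KUL", "PEN", "BKI", "KCH"]  # placeholder airport codes

    def fetch_arrivals(code):
        # Placeholder endpoint; substitute the project's actual arrivals source.
        url = "https://example.com/arrivals/{}".format(code)
        log.info("Fetching arrivals for %s", code)
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return code, resp.json()

    # Scraping is I/O-bound, so a thread pool is usually a better fit than
    # multiprocessing; swap in ProcessPoolExecutor for CPU-heavy parsing.
    with ThreadPoolExecutor(max_workers=4) as pool:
        for code, data in pool.map(fetch_arrivals, AIRPORTS):
            log.info("Fetched %d records for %s", len(data), code)  # assumes a JSON list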

Author

  • Lee Boon Keong - Feb, 2020

About

A simple project to analyse Malaysian airports - an opportunity to play with tools like Luigi, Docker, and Metabase as part of an end-to-end analytics framework.
