An EL pipeline built with Apache Airflow that downloads a file from the web, uploads it to Google Cloud Storage, and creates an external table in BigQuery for data storage and analysis.

Adarsh-Hota/EL_gcs-to-bigquery


Overview

An EL data pipeline that uses Apache Airflow to download a file from the web, upload it to Google Cloud Storage (GCS), and expose it in BigQuery through an external table.

  • The DAG (Directed Acyclic Graph) first downloads the file from the web source and then uploads it to Google Cloud Storage (GCS).
  • Next, an external table is created in BigQuery that references the uploaded file in GCS. This table can then be used to analyze the data.
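
The values that connect the two steps above can be sketched in plain Python. This is only an illustration, not the repository's actual DAG code: the helper names, the example URL, the bucket, project, dataset and table names, and the `PARQUET` source format are all assumptions. The dictionary follows the BigQuery table resource layout (`tableReference` plus `externalDataConfiguration`), which is the shape an Airflow task such as `BigQueryCreateExternalTableOperator` would typically receive through its `table_resource` argument.

```python
import posixpath
from urllib.parse import urlparse


def gcs_object_name(source_url: str, prefix: str = "raw") -> str:
    """Derive a GCS object name from the file name at the end of the URL.

    The "raw" prefix is a hypothetical folder convention, not from the repo.
    """
    file_name = posixpath.basename(urlparse(source_url).path)
    if not file_name:
        raise ValueError(f"URL has no file name: {source_url}")
    return f"{prefix}/{file_name}"


def gcs_uri(bucket: str, object_name: str) -> str:
    """Build the gs:// URI that the external table will point at."""
    return f"gs://{bucket}/{object_name}"


def external_table_resource(
    project: str,
    dataset: str,
    table: str,
    source_uri: str,
    source_format: str = "PARQUET",  # assumption: the downloaded file is Parquet
) -> dict:
    """Return a BigQuery table resource describing an external table."""
    return {
        "tableReference": {
            "projectId": project,
            "datasetId": dataset,
            "tableId": table,
        },
        "externalDataConfiguration": {
            "sourceFormat": source_format,
            "sourceUris": [source_uri],
        },
    }


# Hypothetical inputs, used only to show how the pieces fit together.
url = "https://example.com/data/trips_2023-01.parquet"
object_name = gcs_object_name(url)
uri = gcs_uri("my-bucket", object_name)
resource = external_table_resource("my-project", "trips", "external_trips", uri)
print(uri)
```

Because the table is external, BigQuery reads the file in place from GCS at query time; nothing is copied into BigQuery storage, so the object must stay in the bucket for the table to remain queryable.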

Setup

  • GCP Compute Engine VM

    • Instance configuration
      • Machine type: e2-standard-4
      • Boot disk image: ubuntu-2004-focal-v20230918
      • Boot disk size: 30 GB
      • Boot disk type: Balanced persistent disk
    • Environment configuration
      • Apache Airflow: 2.7.2
      • Docker version: 24.0.6
      • Docker Compose version: 3

Data Source

NYC Taxi & Limousine Commission website - https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
