Building a Fully Automated Data Pipeline with Google Cloud Services

This project implements a fully automated data pipeline built from several Google Cloud services.

Description

Objective

The project is designed to explore various Google Cloud services. It covers each stage of the pipeline, from data ingestion through processing to storage, using a tech stack that includes Cloud Composer, Cloud Storage, Cloud Functions, Dataflow, BigQuery, and Looker. Although several of these services can individually handle loading data from Cloud Storage into BigQuery, the pipeline focuses on connecting the services together end to end.

Tools & Technologies

Data generation - Python (Faker)
Orchestration - Cloud Composer (managed Apache Airflow)
Data lake - Google Cloud Storage (GCS)
Event trigger - Cloud Functions
Data processing - Dataflow
Data warehouse - BigQuery
Visualization - Looker

Architecture

[Architecture diagram: Architecture.JPG]

The diagram above provides a detailed view of the pipeline's architecture.

Data ingestion - The Python library Faker is used to generate synthetic data for testing, development, and analysis. It provides a range of methods for producing realistic fake data, such as names, addresses, phone numbers, email addresses, and dates. A sketch of the generation step follows.
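
A minimal sketch of the ingestion step, assuming illustrative field names and row count (the actual schema lives in the repository):

```python
import csv
from faker import Faker

fake = Faker()

def generate_csv(path: str, rows: int = 1000) -> None:
    """Write `rows` synthetic customer records to a local CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "address", "phone", "email", "dob"])
        for _ in range(rows):
            writer.writerow([
                fake.name(),
                fake.address().replace("\n", ", "),  # keep one record per line
                fake.phone_number(),
                fake.email(),
                fake.date_of_birth().isoformat(),
            ])

generate_csv("customers.csv")
```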

Orchestration - Cloud Composer, Google's managed Apache Airflow service, automates the task of storing the generated data as a CSV file in Google Cloud Storage (GCS), ensuring accessibility and scalability for downstream processing. A DAG sketch follows.
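
A minimal Airflow DAG sketch, assuming a daily schedule and a hypothetical bucket name (the actual DAG, schedule, and bucket are defined in the repository):

```python
import csv
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from faker import Faker
from google.cloud import storage

BUCKET = "my-gcs-pipeline-bucket"  # hypothetical bucket name, not from the repo

def generate_and_upload(ds: str, **_):
    """Generate a small Faker CSV and upload it to GCS, keyed by run date."""
    fake = Faker()
    local_path = f"/tmp/customers_{ds}.csv"
    with open(local_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email"])
        for _ in range(100):
            writer.writerow([fake.name(), fake.email()])
    storage.Client().bucket(BUCKET).blob(
        f"input/customers_{ds}.csv"
    ).upload_from_filename(local_path)

with DAG(
    dag_id="faker_to_gcs",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="generate_and_upload",
                   python_callable=generate_and_upload)
```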

Cloud Function Trigger - A Cloud Function fires when a file is uploaded to the GCS bucket, serving as the initiator of the subsequent data processing steps. The function handles the trigger event and passes the requisite parameters to launch the Dataflow job, ensuring a smooth flow of data processing. A sketch is shown below.
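
A minimal sketch of such a function, assuming a background trigger on the bucket's object-finalize event and hypothetical project and region values; the full template parameters are sketched in the Dataflow section below:

```python
from googleapiclient.discovery import build

PROJECT = "my-project"   # hypothetical project id
REGION = "us-central1"   # hypothetical region
TEMPLATE = f"gs://dataflow-templates-{REGION}/latest/flex/GCS_CSV_to_BigQuery"

def on_file_upload(event, context):
    """Triggered by a google.storage.object.finalize event on the bucket."""
    if not event["name"].endswith(".csv"):
        return  # ignore non-CSV uploads
    dataflow = build("dataflow", "v1b3")
    body = {
        "launchParameter": {
            "jobName": "gcs-csv-to-bq",
            "containerSpecGcsPath": TEMPLATE,
            "parameters": {
                # the file that triggered this function
                "inputFilePattern": f"gs://{event['bucket']}/{event['name']}",
                # remaining template parameters: see the sketch below
            },
        }
    }
    dataflow.projects().locations().flexTemplates().launch(
        projectId=PROJECT, location=REGION, body=body
    ).execute()
```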

Dataflow Job for BigQuery - A Dataflow job, launched by the Cloud Function, transfers the data from the CSV file in GCS to BigQuery. The job settings are configured to ensure optimal performance and accurate data ingestion into BigQuery. The Google-provided Dataflow template GCS_CSV_to_BigQuery is used; its parameters are sketched below.
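
A sketch of the parameters the GCS_CSV_to_BigQuery Flex Template expects. The parameter names follow Google's published template documentation, but the bucket, dataset, and table values here are hypothetical, so verify both against the repository and the current template spec:

```python
# Passed as the "parameters" field of the flex-template launch request above.
template_parameters = {
    "inputFilePattern": "gs://my-gcs-pipeline-bucket/input/customers_*.csv",
    "schemaJSONPath": "gs://my-gcs-pipeline-bucket/schema/customers.json",
    "outputTable": "my-project:my_dataset.customers",
    "badRecordsOutputTable": "my-project:my_dataset.customers_bad_records",
    "bigQueryLoadingTemporaryDirectory": "gs://my-gcs-pipeline-bucket/tmp",
    "delimiter": ",",
    "csvFormat": "Default",     # Apache Commons CSV format name
    "containsHeaders": "true",  # skip the header row written by the generator
}
```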

Data Warehouse - BigQuery stores and persists the data loaded by the Dataflow job.
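
Once the job completes, the loaded rows can be inspected with the BigQuery client; a quick sketch, reusing the hypothetical project, dataset, and table names from above:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project id
query = """
    SELECT COUNT(*) AS row_count
    FROM `my-project.my_dataset.customers`
"""
for row in client.query(query).result():
    print(f"rows loaded: {row.row_count}")
```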