leonidee/spark-hadoop-automation-in-cloud

Code style: black | Imports: isort

About

(Project architecture schema)

The goal of this project is to implement an approach for ad-hoc scheduling of Spark jobs with minimal infrastructure costs.

A data processing pipeline has been built and is orchestrated with Airflow. The DAG runs every day on a schedule, sends a request to the Yandex Cloud REST API to start the Hadoop cluster, and waits for a response about the state of the cluster.

A REST API service built with FastAPI and Uvicorn has been deployed inside the cluster, with its process managed by the Unix utility supervisor. When the cluster is launched, supervisor starts Uvicorn so the cluster can receive requests from Airflow.

Once the infrastructure is up, Airflow sends a request to the cluster to submit the Spark job. Spark processes the data and saves the results to the S3 data warehouse. After processing is complete, the cluster API sends a summary of the results back to Airflow.

After receiving a response from the cluster, regardless of whether the job succeeded, Airflow sends a request to the Yandex Cloud API to stop the cluster.
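
The lifecycle described above can be sketched in plain Python. This is only an illustration, not the project's actual code: the endpoint paths, environment variable names, and payloads are assumptions, and the Yandex Cloud calls are reduced to bare HTTP requests (consult the Yandex Cloud Data Proc API reference for the exact paths).

import os
import time

import requests

# All names below are illustrative placeholders, not the project's real configuration.
YC_API = "https://dataproc.api.cloud.yandex.net/dataproc/v1"  # check the Yandex Cloud docs for exact paths
CLUSTER_ID = os.environ["YC_DATAPROC_CLUSTER_ID"]
IAM_TOKEN = os.environ["YC_IAM_TOKEN"]
CLUSTER_API_URL = os.environ["CLUSTER_API_URL"]  # FastAPI service running inside the cluster

HEADERS = {"Authorization": f"Bearer {IAM_TOKEN}"}

# 1. Ask Yandex Cloud to start the stopped cluster.
requests.post(f"{YC_API}/clusters/{CLUSTER_ID}:start", headers=HEADERS, timeout=30)

# 2. Wait until the cluster reports a running state.
while requests.get(f"{YC_API}/clusters/{CLUSTER_ID}", headers=HEADERS, timeout=30).json().get("status") != "RUNNING":
    time.sleep(60)

try:
    # 3. Submit the Spark job through the in-cluster API and wait for the result summary.
    summary = requests.post(f"{CLUSTER_API_URL}/submit_job", json={"job": "datamart-collector"}, timeout=3600).json()
    print(summary)
finally:
    # 4. Stop the cluster regardless of whether the job succeeded.
    requests.post(f"{YC_API}/clusters/{CLUSTER_ID}:stop", headers=HEADERS, timeout=30)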

Technology and Architecture

The process orchestrator is Airflow, deployed in Docker containers using docker compose.

A REST API service, implemented with the FastAPI framework and served by Uvicorn, receives requests inside the cluster.
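
A minimal sketch of such a service is shown below, assuming a single /submit_job endpoint that runs spark-submit on the master node. The endpoint name, request model, and paths are illustrative and do not reproduce the project's api.py.

import subprocess

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class SubmitRequest(BaseModel):
    job: str  # name of the Spark job to run

@app.post("/submit_job")
def submit_job(request: SubmitRequest):
    # Run spark-submit on the master node and wait for the job to finish.
    result = subprocess.run(
        ["spark-submit", f"/opt/jobs/{request.job}.py"],  # illustrative job location
        capture_output=True,
        text=True,
    )
    return {"job": request.job, "returncode": result.returncode}

# Served by supervisor with something like: uvicorn api:app --host 0.0.0.0 --port 8000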

Apache Spark, deployed on a lightweight Hadoop cluster consisting only of a Master node and several compute nodes (without HDFS), is the primary data processing engine.

S3 is the main data warehouse, providing flexibility and cost-effectiveness for data storage.
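
Because the cluster runs without HDFS, Spark jobs read from and write to S3 directly. The sketch below illustrates the idea; the bucket names, keys, and aggregation are placeholders, not the project's actual job logic.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datamart-collector").getOrCreate()

# Read raw data from the data lake bucket and write the datamart back to S3,
# so no HDFS storage is required on the cluster.
events = spark.read.parquet("s3a://data-lake-bucket/events/")
datamart = events.groupBy("event_type").count()
datamart.write.mode("overwrite").parquet("s3a://dwh-bucket/datamarts/events_by_type/")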

All cloud infrastructure is provided by Yandex Cloud.

Project structure

(Project tree)

The project is organized into several directories and files, including:

  • api: Contains the files required to run the REST API service in the Hadoop cluster.

    Includes api.py and run-api.sh. The latter is used by the supervisor service to auto-deploy the API on each cluster start.

  • config: Contains the main configuration files for the Airflow deployment and for submitting Spark jobs.

  • dags: This directory contains the main DAG file datamart-collector-dag.py, which defines the Airflow workflow (a sketch of its structure is shown after this list).

  • jobs: Contains all job files executed by Spark.

  • src: The main source code directory of the project, split into several subdirectories.

    Each subdirectory contains a separate business entity and its own logic.

  • supervisor: Contains the configuration file for the supervisor.

  • templates: Project templates, including .env.template for required environment variables and docstring-template.mustache for generating docstrings for all of the project's Python objects in a consistent style.

  • tests: This directory contains the unit tests for the project, organized into subdirectories that mirror the main folders.

    All modules are tested with the built-in unittest library and pytest, except for the Spark modules in src/spark/, because of how Spark applications run.

    The Spark modules were tested manually in the cluster; this process may be automated in future versions.

  • utils: This directory contains bash scripts for setting up the project's environment.
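
As referenced in the dags item above, the workflow boils down to three ordered tasks. The skeleton below is only an assumption about how such a DAG could look; the task ids, callables, and schedule are illustrative, not the contents of datamart-collector-dag.py.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative callables; in the real project these would wrap the Yandex Cloud
# and in-cluster API requests described in the About section.
def start_cluster(**context): ...
def submit_spark_job(**context): ...
def stop_cluster(**context): ...

with DAG(
    dag_id="datamart-collector-dag",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    start = PythonOperator(task_id="start_cluster", python_callable=start_cluster)
    submit = PythonOperator(task_id="submit_spark_job", python_callable=submit_spark_job)
    # all_done ensures the cluster is stopped even if the Spark job task failed.
    stop = PythonOperator(task_id="stop_cluster", python_callable=stop_cluster, trigger_rule="all_done")

    start >> submit >> stop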

Deploy and usage

./utils/setup-project.sh is the main entry point for configuring a new environment required to deploy the project.

You can use it as follows:

Usage: ./utils/setup-project.sh [dataproc|airflow|dev]

(Project setup demo)

The project requires two main environments:

  1. Dataproc cluster with at least one compute node
  2. Debian-based host to deploy Airflow with Docker

After running this script, you will be prompted to choose which environment you want to set up, and all of the required preparation will be done automatically.

It's also possible to use the dev parameter when preparing a development or testing environment. I personally used this while developing the project, and it helped automate boring tasks and save time when working on different hosts.

Note: To deploy the project you need a ready-to-use Data Proc cluster from any provider you want. For this project I used Yandex Data Proc version 2.1 with YARN as the resource manager.

Create Data Proc cluster

To create a Data Proc cluster in Yandex Cloud, you can use the yc CLI utility.

Install yc:

curl -sSL https://storage.yandexcloud.net/yandexcloud-yc/install.sh | bash

Set profile token to authorize:

yc config set token <your_token>

Create cluster:

./utils/create-dataproc.sh
