
Simple DataOps Docker

License: Apache 2.0 · Python 3.10 · Code style: black · Imports: isort · Type checking: mypy · Linting: ruff

Prerequisites

Preparation

Install Python 3.10 with pyenv or Anaconda, then run the following command:

$ make init             # setup packages (need only once)

Infra Setup

$ make compose          # create all the containers (need only once)

You can delete the containers when you are done:

$ make compose-clean    # delete the containers

1. MongoDB

You can access localhost:8081 from a web browser and log in with admin as both the ID and the password. There you can view the data that the data generator continuously adds to MongoDB.
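If you prefer a quick check from Python instead of the web UI, a minimal sketch like the following should work, assuming MongoDB is exposed on the default port 27017 without authentication; the database and collection names here are guesses, so check docker-compose for the real values.

```python
# Hedged sketch: the URI, database name, and collection name below are
# assumptions; adjust them to match the repository's docker-compose setup.
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017")
collection = client["mongo"]["wine_data"]  # assumed names

# Print the five most recently inserted documents.
for doc in collection.find().sort("_id", -1).limit(5):
    print(doc)
```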

2. Airflow

You can access localhost:8008 from a web browser and log in with admin as both the ID and the password.

Run the DAGs according to the detailed case studies below.

Case Studies

1. Simple Test

You can run a simple DAG called simple-test, which you can see on the main screen of Airflow. The DAG is defined in src/dags/1_simple_test/simple_dag.py as several tasks that use the Python and Bash operators.

The schedule interval of the DAG is @once, so the DAG runs only a single time after it is triggered. (References: DAG Runs in Airflow and Cron in Wikipedia.)
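For reference, here is a minimal sketch of what such a DAG can look like. It is illustrative only: the task names and callables are made up, so see src/dags/1_simple_test/simple_dag.py for the real definition.

```python
# Minimal sketch of an @once DAG with Python and Bash operators
# (illustrative; not the repository's actual simple_dag.py).
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def greet() -> None:
    print("hello from a python task")


with DAG(
    dag_id="simple-test",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@once",  # run exactly once after unpausing
    catchup=False,
) as dag:
    python_task = PythonOperator(task_id="python_task", python_callable=greet)
    bash_task = BashOperator(task_id="bash_task", bash_command="echo 'hello from a bash task'")
    python_task >> bash_task
```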

Now let's run the DAG. You can unpause it by clicking Pause/Unpause DAG.

After a few seconds, you can confirm on the main screen of Airflow that the DAG has finished successfully.

2. Batch Glue

You can run a batch-glue DAG that extracts, transforms, and loads (ETL) data, similar to AWS Glue. The DAG is defined in src/dags/2_batch_glue/dag.py, and the code for the task that runs on the Bash operator is in src/dags/2_batch_glue/pipeline.py.

The task in the DAG extracts the wine data from MongoDB, transforms the data types, and loads the result into MariaDB.

The DAG runs every minute because its schedule interval is specified as the cron expression */1 * * * *, so there is no need to trigger it manually. In other words, the DAG ETLs data from MongoDB to MariaDB every minute.
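Conceptually, the task does something like the sketch below. This is hedged: the hostnames, database names, and transformations are assumptions based on this README, not the actual src/dags/2_batch_glue/pipeline.py.

```python
# Hedged ETL sketch: extract from MongoDB, transform types, load to MariaDB.
# Connection details and names are assumptions; see pipeline.py for the truth.
import pandas as pd
import pymongo
from sqlalchemy import create_engine


def etl() -> None:
    # Extract: read the wine documents from MongoDB.
    client = pymongo.MongoClient("mongodb://localhost:27017")
    docs = list(client["mongo"]["wine_data"].find())

    # Transform: flatten to a DataFrame and stringify the Mongo ObjectId.
    df = pd.DataFrame(docs)
    df["mongo_id"] = df.pop("_id").astype(str)

    # Load: append the rows into MariaDB's wine_data table.
    engine = create_engine("mysql+pymysql://maria:maria@localhost:3306/maria")
    df.to_sql("wine_data", engine, if_exists="append", index=False)
```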

Now let's run the DAG. You can unpause it by clicking Pause/Unpause DAG.

After a few seconds, you can confirm on the main screen of Airflow that the DAG has finished successfully.

Finally, you can access MariaDB and see that data is added every time the DAG is executed.

$ docker exec -it mariadb bash
root@742cd8f602a7:/# mariadb -u maria -p
Enter password: maria
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MariaDB connection id is 32
Server version: 10.6.13-MariaDB-1:10.6.13+maria~ubu2004 mariadb.org binary distribution

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MariaDB [(none)]> use maria
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
MariaDB [maria]> select * from wine_data limit 5;
+----+---------------------+--------------------------+---------+------------+------+-------------------+-----------+---------------+------------+----------------------+-----------------+-----------------+------+------------------------------+---------+--------+
| id | time                | mongo_id                 | alcohol | malic_acid | ash  | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue  | od280_od315_of_diluted_wines | proline | target |
+----+---------------------+--------------------------+---------+------------+------+-------------------+-----------+---------------+------------+----------------------+-----------------+-----------------+------+------------------------------+---------+--------+
|  1 | 2023-06-17 15:41:11 | 648d5587d7f2b36f4504969c |   14.23 |       1.71 | 2.43 |              15.6 |       127 |           2.8 |       3.06 |                 0.28 |            2.29 |            5.64 | 1.04 |                         3.92 |    1065 |      0 |
|  2 | 2023-06-17 15:41:13 | 648d5589d7f2b36f4504969d |    13.2 |       1.78 | 2.14 |              11.2 |       100 |          2.65 |       2.76 |                 0.26 |            1.28 |            4.38 | 1.05 |                          3.4 |    1050 |      0 |
|  3 | 2023-06-17 15:41:15 | 648d558bd7f2b36f4504969e |   13.16 |       2.36 | 2.67 |              18.6 |       101 |           2.8 |       3.24 |                  0.3 |            2.81 |            5.68 | 1.03 |                         3.17 |    1185 |      0 |
|  4 | 2023-06-17 15:41:17 | 648d558dd7f2b36f4504969f |   14.37 |       1.95 |  2.5 |              16.8 |       113 |          3.85 |       3.49 |                 0.24 |            2.18 |             7.8 | 0.86 |                         3.45 |    1480 |      0 |
|  5 | 2023-06-17 15:41:19 | 648d558fd7f2b36f450496a0 |   13.24 |       2.59 | 2.87 |                21 |       118 |           2.8 |       2.69 |                 0.39 |            1.82 |            4.32 | 1.04 |                         2.93 |     735 |      0 |
+----+---------------------+--------------------------+---------+------------+------+-------------------+-----------+---------------+------------+----------------------+-----------------+-----------------+------+------------------------------+---------+--------+
5 rows in set (0.004 sec)

3. Batch SQS

You can run a batch-sqs DAG that imitates AWS SQS. This process is an example of storing data with a message queue (MQ).

We use RabbitMQ as the MQ and MinIO as the storage; the rest of the settings are the same as above.

The process consists of two steps:

  • First, you can check that data is generated in MongoDB every 2 seconds and that a DAG publishes the data to the MQ every minute (the same as in the second case study above) according to its schedule interval.
  • Second, you can check that the consumer consumes the data from the MQ and stores it in storage.

In detail, the DAG runs every minute because its schedule interval is specified as the cron expression */1 * * * *, and each run publishes the data to the RabbitMQ queue. So we can expect the data to be synchronized every minute between the source DB (MongoDB) and the queue (RabbitMQ).
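The publishing side boils down to something like this sketch. Hedged: the queue name, credentials, and connection details are assumptions, not the repository's actual values.

```python
# Hedged sketch of the producer task: publish one JSON message per
# extracted document. Queue name and credentials are assumptions.
import json

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(
        host="localhost",
        credentials=pika.PlainCredentials("rabbit", "rabbit"),
    )
)
channel = connection.channel()
channel.queue_declare(queue="wine_data", durable=True)

document = {"alcohol": 14.23, "malic_acid": 1.71, "target": 0}
channel.basic_publish(
    exchange="",                 # default exchange routes by queue name
    routing_key="wine_data",
    body=json.dumps(document),
)
connection.close()
```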

Let's run the DAG. You can unpause it by clicking Pause/Unpause DAG.

After a few seconds, you can confirm on the main screen of Airflow that the DAG has finished successfully.

This completes the first step, and you can see its results by accessing the RabbitMQ console (localhost:15672). Both the ID and the password are rabbit.
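Under the hood, the second step (the rabbitmq-consumer container) does roughly the following. This is a hedged sketch: MinIO speaks the S3 API, so boto3 works against it, but the queue name, bucket name, and endpoint port below are assumptions.

```python
# Hedged consumer sketch: pop messages from RabbitMQ and store each one
# as a JSON object in MinIO. Names and endpoints are assumptions.
import json

import boto3
import pika

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",  # assumed MinIO API port
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)


def on_message(channel, method, properties, body):
    record = json.loads(body)
    s3.put_object(
        Bucket="wine-data",                 # assumed bucket name
        Key=f"{method.delivery_tag}.json",
        Body=json.dumps(record).encode(),
    )
    channel.basic_ack(delivery_tag=method.delivery_tag)  # ack after storing


connection = pika.BlockingConnection(
    pika.ConnectionParameters(
        host="localhost",
        credentials=pika.PlainCredentials("rabbit", "rabbit"),
    )
)
channel = connection.channel()
channel.basic_consume(queue="wine_data", on_message_callback=on_message)
channel.start_consuming()
```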

You can then monitor how the second step is going in two ways.

3.1 Docker Logs

$ docker logs rabbitmq-consumer -f

3.2 MinIO Console

You can access localhost:9900 from a web browser and log in with the ID minio and the password minio123.
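Alternatively, since MinIO is S3-compatible, you can list the stored objects from Python. Hedged: the API port and bucket name are assumptions; adjust them to the compose settings.

```python
# Hedged sketch: list objects stored by the consumer. The endpoint port
# and bucket name are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)
for obj in s3.list_objects_v2(Bucket="wine-data").get("Contents", []):
    print(obj["Key"], obj["Size"])
```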


That completes the case studies. We hope you enjoyed the journey!

Thank you for visiting our repository!

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Dongmin Lee
📖 💻

Kim Dong Hyun (김동현)
📖 💻

This project follows the all-contributors specification. Contributions of any kind welcome!