DataPulse: Platform For Big Data & AI

Summary

DataPulse is a platform for big data and AI workloads, built on Apache Spark and Kubernetes. It is designed to be scalable and easy to use, and it provides tools for data processing, machine learning, and data visualization.

Quick Start

Docker Compose

Start the services with Docker Compose
```
docker-compose up -d
```
Access the platform UI
- http://localhost:5001
Use the notebook
- Access http://localhost:8888
- A Spark session is created automatically
  - Run `spark` in a cell to inspect the session
- Run the following code in the notebook to test the Spark session by writing a small Delta table
```
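# Uses the `spark` session that the notebook creates automatically.
# Writes the ids 0..4 to a Delta table named "test", replacing any existing data.
spark.range(0, 5) \
  .write.format("delta").mode("overwrite").saveAsTable("test")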
```
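To confirm that the write succeeded, the table can be read back in the next cell. This is a minimal sketch that assumes the notebook's pre-created `spark` session and the `test` table written above; the helper name `table_has_expected_ids` is illustrative and not part of DataPulse.

```python
def table_has_expected_ids(spark, table_name="test", expected_rows=5):
    """Return True if the table holds exactly the ids 0..expected_rows-1.

    `spark` is assumed to be the SparkSession that the notebook creates
    automatically; this helper is hypothetical, not a DataPulse API.
    """
    # spark.range(0, 5) produces a single column named "id" with values 0..4.
    rows = spark.table(table_name).collect()
    ids = sorted(row["id"] for row in rows)
    return ids == list(range(expected_rows))


# In a notebook cell, after running the write above:
# table_has_expected_ids(spark)
```

Collecting the rows is fine for a five-row test table; for real tables a `count()` or a filtered query would be a better fit.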
Check the history server
- Access http://localhost:18080
- View the history and progress of Spark applications here
Delta tables
- Use /opt/data/delta-table/ as the root directory for delta tables
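As a minimal sketch of this convention, table locations can be derived from the root directory rather than typed by hand. The helper `delta_table_path` and the table name `events` are hypothetical; only the root directory comes from the setup above. Whether `saveAsTable` also writes under this root depends on the warehouse configuration, which is not described here.

```python
import posixpath

# Root directory for Delta tables inside the containers (from the quick start).
DELTA_ROOT = "/opt/data/delta-table/"


def delta_table_path(table_name):
    """Build the storage path for a Delta table under DELTA_ROOT."""
    # Reject names that would escape the root or create nested directories.
    if not table_name or "/" in table_name or table_name in (".", ".."):
        raise ValueError(f"invalid table name: {table_name!r}")
    return posixpath.join(DELTA_ROOT, table_name)


# Assumed usage in the notebook, with the pre-created `spark` session:
# path = delta_table_path("events")
# spark.range(0, 5).write.format("delta").mode("overwrite").save(path)
# spark.read.format("delta").load(path).show()
```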
Schedule with Airflow
- Access http://localhost:8090
- Log in with the default username and password
- Create a new DAG to schedule the Spark job
- Or use the example DAGs in the ./dags folder
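After the containers start, a short script can confirm that each endpoint from this quick start is answering. This is a sketch using only the Python standard library, meant to be run on the host machine; the names `SERVICES` and `is_reachable` are illustrative, and the URLs are the ones listed above. Services may need some time after `docker-compose up -d` before they respond.

```python
import http.client
import urllib.error
import urllib.request

# Local endpoints listed in the quick start above.
SERVICES = {
    "Platform UI": "http://localhost:5001",
    "Notebook": "http://localhost:8888",
    "History server": "http://localhost:18080",
    "Airflow": "http://localhost:8090",
}


def is_reachable(url, timeout=3.0):
    """Return True if an HTTP server answers at `url`, whatever the status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server replied with an error status (for example a login
        # requirement), so it is running.
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, or a malformed response.
        return False


if __name__ == "__main__":
    for name, url in SERVICES.items():
        status = "up" if is_reachable(url) else "not reachable"
        print(f"{name:15} {url:25} {status}")
```

An HTTP error status is treated as reachable on purpose: the check is about whether the container is serving requests, not whether the page is accessible without logging in.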

MiniKube

TODO

Examples

Basic Analysis on Static Tables

Singapore Resale Flat Prices Analysis
- Notebook
- Data Source

Incremental Pipeline

TODO

Docker Images

WebApp

Dockerfile

Spark

Dockerfile
Includes
- Spark
- Python

Notebook

Dockerfile
Includes
- Jupyter Notebook
- Spark
- Google Cloud SDK
- GCS Connector
- PySpark Startup Script
- Notebook Save Hook Function

History Server

Dockerfile
Includes
- Spark
- GCS Connector

Airflow

Dockerfile
Includes
- Python
- Java
- pyspark

Versions

| Component | Version |
| --- | --- |
| Scala | 2.12 |
| Java | 17 |
| Python | 3.11 |
| Apache Spark | 3.5.0 |
| Delta Lake | 3.0.0 |
| Airflow | 2.9.1 |
| Postgres | 13 |

License

This project is licensed under the terms of the MIT license.
