Realtime Data Streaming With TCP Socket, Apache Spark, OpenAI LLM, Kafka and Elasticsearch | End-to-End Data Engineering Project

Introduction

This project is a guide to building an end-to-end data engineering pipeline with a TCP/IP socket, Apache Spark, an OpenAI LLM, Kafka, and Elasticsearch. It covers each stage: data acquisition, processing, sentiment analysis with ChatGPT, production to a Kafka topic, and connection to Elasticsearch.

System Architecture

The project is designed with the following components:

Data Source: The Yelp (yelp.com) dataset feeds the pipeline.
TCP/IP Socket: Streams data over the network in chunks.
Apache Spark: Processes the data using its master and worker nodes.
Confluent Kafka: Our Kafka cluster in the cloud.
Control Center and Schema Registry: Monitor our Kafka streams and manage their schemas.
Kafka Connect: Connects Kafka to Elasticsearch.
Elasticsearch: Indexes the data and serves queries.
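
To make the TCP/IP socket stage concrete, here is a minimal sketch of streaming records in chunks over a local socket. It is not the project's actual code: the function names, the chunk size, and the record fields (`review_id`, `text`, `stars`) are assumptions chosen for illustration. Each record is sent as one JSON object per line, since a line-oriented consumer can split the stream on newlines.

```python
import json
import socket
import threading


def serve_in_chunks(server_sock, records, chunk_size=2):
    """Accept one client and send records as newline-delimited JSON, chunk_size at a time."""
    conn, _addr = server_sock.accept()
    with conn:
        for start in range(0, len(records), chunk_size):
            chunk = records[start:start + chunk_size]
            payload = "".join(json.dumps(r) + "\n" for r in chunk)
            conn.sendall(payload.encode("utf-8"))


def read_all_lines(host, port):
    """Connect, read until the server closes, and decode one JSON object per line."""
    with socket.create_connection((host, port)) as client:
        with client.makefile("r", encoding="utf-8") as stream:
            return [json.loads(line) for line in stream if line.strip()]


def demo():
    # Hypothetical review records; the field names are illustrative only.
    records = [
        {"review_id": str(i), "text": f"review {i}", "stars": i % 5 + 1}
        for i in range(5)
    ]
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(
            target=serve_in_chunks, args=(server, records), daemon=True
        )
        worker.start()
        received = read_all_lines("127.0.0.1", port)
        worker.join(timeout=5)
    return records, received


if __name__ == "__main__":
    sent, got = demo()
    print(f"sent {len(sent)} records, received {len(got)}")
```

In the real pipeline the reader on the other end would be Spark rather than `read_all_lines`; the client here only exists so the sketch can be run on its own.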

What You'll Learn

Setting up a data pipeline with TCP/IP
Real-time data streaming with Apache Kafka
Data processing techniques with Apache Spark
Real-time sentiment analysis with OpenAI ChatGPT
Synchronising data from Kafka to Elasticsearch
Indexing and querying data on Elasticsearch
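
As a rough illustration of the sentiment analysis and Kafka production steps, the sketch below enriches a record with a sentiment label and serialises it to bytes, the form a Kafka producer sends as a message value. Everything here is an assumption rather than the project's implementation: the `feedback` field name, the three labels, the fallback to `NEUTRAL`, and the `fake_classifier` stand-in. In the real pipeline the `classify` callable would wrap a call to the OpenAI API, which is deliberately left out so the example runs offline.

```python
import json

# Assumed label set; the project may use different labels.
ALLOWED_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}


def normalize_label(raw):
    """Map a free-form model reply such as 'positive.' onto an allowed label."""
    label = raw.strip().upper().rstrip(".")
    # Falling back to NEUTRAL for unexpected replies is a design assumption.
    return label if label in ALLOWED_LABELS else "NEUTRAL"


def enrich(record, classify):
    """Return a copy of the record with a sentiment label attached."""
    enriched = dict(record)
    enriched["feedback"] = normalize_label(classify(record["text"]))
    return enriched


def to_kafka_value(record):
    """Serialise a record to UTF-8 JSON bytes for use as a message value."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


def fake_classifier(text):
    """Hypothetical stand-in for the LLM call, used only to run the sketch."""
    return "positive." if "great" in text.lower() else "Negative"


if __name__ == "__main__":
    review = {"review_id": "1", "text": "Great food and service"}
    print(to_kafka_value(enrich(review, fake_classifier)))
```

Passing the classifier in as a parameter keeps the enrichment logic testable without network access or an API key.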

Technologies

Python
TCP/IP
Confluent Kafka
Apache Spark
Docker
Elasticsearch

Getting Started

Clone the repository:

```
git clone https://github.com/airscholar/E2EDataEngineering.git
```

Navigate to the project directory:
```
cd E2EDataEngineering
```
Run Docker Compose to spin up the Spark cluster:
```
docker-compose up
```

For more detailed instructions, please check out the video tutorial linked below.

Watch the Video Tutorial

For a complete walkthrough and practical demonstration, check out the video here:
