
Large Dataset Import Microservice

This repo contains two minimum viable products that import a ~6 million record .csv file into PostgreSQL. The first method uses Hibernate Stateless Sessions to loop through the data file and insert each parsed record, while the second method uses Spring Batch processing.
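
For orientation, here is a minimal sketch of the stateless-sessions approach. The entity, column indices, and class names are illustrative assumptions, not this repo's exact code:

```java
import java.io.BufferedReader;
import java.io.FileReader;

import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

public class StatelessCsvLoader {

    private final SessionFactory sessionFactory;

    public StatelessCsvLoader(SessionFactory sessionFactory) {
        this.sessionFactory = sessionFactory;
    }

    public void load(String csvPath) throws Exception {
        // A StatelessSession bypasses the first-level cache and dirty checking,
        // so memory use stays flat even across millions of inserts.
        try (StatelessSession session = sessionFactory.openStatelessSession();
             BufferedReader reader = new BufferedReader(new FileReader(csvPath))) {
            Transaction tx = session.beginTransaction();
            reader.readLine(); // skip the header row
            String line;
            while ((line = reader.readLine()) != null) {
                // Naive split; adequate for a simple CSV with no quoted commas.
                String[] cols = line.split(",");
                FinancialRecord record = new FinancialRecord(); // hypothetical entity
                record.setType(cols[1]);
                record.setAmount(Double.parseDouble(cols[2]));
                session.insert(record); // one INSERT per row, no persistence context
            }
            tx.commit();
        }
    }
}
```

Because every row becomes its own JDBC round trip inside a single transaction, this approach is simple but slow, which is consistent with the ~40 minute runtime reported below.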

Average runtime for the batch processor with a ThreadPoolTaskExecutor is 2 minutes 33 seconds; average runtime for the stateless-sessions parser/processor is 40 minutes. Both methods will be improved in the future by incorporating a MultiResourcePartitioner in the Spring Batch configuration file and splitting the large dataset into smaller files, so that multiple threads can operate on different files at the same time.
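
The speedup on the batch side comes from running chunk processing on a thread pool. Here is a sketch of that wiring, assuming Spring Batch 4.x-style builders; the bean names, pool size, chunk size, and record type are illustrative assumptions:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.task.TaskExecutor;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
public class BatchStepConfig {

    @Bean
    public TaskExecutor taskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(8); // illustrative sizing
        executor.setMaxPoolSize(8);
        executor.setThreadNamePrefix("csv-import-");
        executor.initialize();
        return executor;
    }

    @Bean
    public Step importStep(StepBuilderFactory steps,
                           ItemReader<FinancialRecord> reader,  // hypothetical record type
                           ItemWriter<FinancialRecord> writer) {
        return steps.get("importStep")
                .<FinancialRecord, FinancialRecord>chunk(1000) // commit every 1000 records
                .reader(reader)  // note: the reader must be thread-safe in a multithreaded step
                .writer(writer)
                .taskExecutor(taskExecutor()) // chunks run concurrently on the pool
                .build();
    }
}
```

The planned MultiResourcePartitioner would go a step further, assigning each smaller CSV file to its own step execution rather than sharing one reader across threads.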

This project:

  • Is a Spring Boot service that uses Spring Batch with Spring Data JPA and Hibernate.
  • Imports data from a CSV file (about 6 million records) to a PostgreSQL database.
  • Improves batch-processing performance via a ThreadPoolTaskExecutor that enables data chunking and multithreaded execution.
  • Provides the data from which a fraud detection model is built using Python machine learning libraries.
  • Is intended to be launched through an API Gateway server (linked below).
  • Instructions to run:

      1. Clone this repository to your local machine.
      2. Download the financial data from Kaggle. Add this data to "resource/data" and be sure to include the .csv file in your .gitignore!
      3. Within main/java/com there are two distinct packages, "batch" and "session", which contain the batch processor and the sessions processor, respectively.
      4. Each package has its own main class that can be run.
      5. Once the application launches without issues, open Postman and send a request to the "/load" route on your configured port (see the controller sketch below).
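
A minimal sketch of what the "/load" route might look like, assuming the job is launched through Spring Batch's JobLauncher; the controller, bean, and job names are illustrative assumptions, not this repo's exact code:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class LoadController {

    private final JobLauncher jobLauncher;
    private final Job importJob; // hypothetical job bean

    public LoadController(JobLauncher jobLauncher, Job importJob) {
        this.jobLauncher = jobLauncher;
        this.importJob = importJob;
    }

    @GetMapping("/load")
    public String load() throws Exception {
        // A unique timestamp parameter lets the same job be re-run on repeated requests.
        JobParameters params = new JobParametersBuilder()
                .addLong("startedAt", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(importJob, params);
        return "CSV import job started";
    }
}
```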

    Technologies Used

  • Java
  • Spring Boot for REST API
  • Spring Batch Processing (Open Source Data Processing Framework)
  • Maven
  • Factory Design Pattern within Batch Processor
  • Hibernate
  • Java Persistence API (JPA)
  • PostgreSQL
  • Gateway Server communication; the Gateway Server can be found here.