
AWS-Data-Pipeline-Project-Implementing-Data-Validation-Using-Lambda-based-Gluecrawler-v1.0

Business Requirement:

A renowned automobile 🚗 firm has several franchises across a region and wants to govern the daily automobile product orders 🛒 created at its retail outlets.

Throughout business operations, the firm receives sales-order transaction files over SFTP as CSV batches of 100 records, arriving at intervals of roughly 2-3 ⌚ hours. Each time, a financial analyst has to open the file, cross-validate 📝 the correct 🔧 records, and separate out the corrupt ones to make the batch file ready for analysis. On top of that, scripts must be run manually to load the data into a SQL engine for the simple aggregation queries that generate reports 📊. This intermediate process of getting notified of a delivered batch file, cleaning it, and loading it into SQL is frequently delayed by these manual interventions.

Therefore, the firm needs to improve its business process by automating 📌 these tasks: generating the valid files and making the data available for querying.

Solution

To solve the problem stated in the above case, the ☁️ AWS Cloud Platform is introduced to migrate traditional on-premises operations to a modern data analytics stack.

  • An S3 bucket 📁 is a popular storage resource to act as the data lake for our incoming batch files.
  • A new batch can arrive roughly every 2 hours, so we need a tool that is triggered the instant a file is delivered. AWS Lambda is the required entity in such a case.
  • To automate loading of the cleaned data for SQL analytics, the best fit is AWS Glue Crawler: a serverless tool that crawls a defined data source and automatically creates a relational table with schema detection. For running analytics queries over the crawled table we can use AWS Athena, which runs on Presto SQL.
  • A Glue crawler ⛏️ can be run manually or at a scheduled interval via cron support, but since the arrival time of each batch is not predictable, it is better to trigger the crawler from a second Lambda function as soon as a clean batch file is produced.

AWS pipeline overview

AWS Services Required

| AWS Service | Alias Name | Description |
| --- | --- | --- |
| S3 Object Storage | autosalesraw-batch-2024 | Data lake for incoming batch files. |
| S3 Object Storage | auto-salesfiltered-batch2024 | Staging area; stores valid and corrupt records in separate subfolders. |
| IAM Role | data_validator_lambda_role | Lambda role for running the data-validation Python script. |
| IAM Role | glue-crawler-lambda-role | Lambda role to trigger the Glue crawler. |
| IAM Role | gluerunner | IAM role for the Glue crawler itself. |
| Lambda Function | data-validation-runner | Lambda function that runs the Python validation script. |
| Lambda Function | crawler-runner-func | Lambda function that triggers the crawler. |
| Glue Crawler | AutoSalesCrawler | Crawler that runs over valid batch files and generates the table. |

Setup of Services

S3 Buckets📁:

s3 setup: Buckets

IAM Roles 🔑:

IAM Roles

Role alias name: data_validator_lambda_role

Permissions:

IAM Roles
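The exact permissions attached to this role are shown in the screenshot above. As a rough sketch, the role would need read access to the raw bucket, write access to the staging bucket, and CloudWatch Logs access; the inline policy below is an assumption for illustration (the policy name is hypothetical), not the project's actual policy.

```python
import json
import boto3

iam = boto3.client("iam")

# Hypothetical inline policy -- the actual permissions are in the screenshot above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {   # read incoming batch files from the raw bucket
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::autosalesraw-batch-2024",
                "arn:aws:s3:::autosalesraw-batch-2024/*",
            ],
        },
        {   # write valid/corrupt records to the staging bucket
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": ["arn:aws:s3:::auto-salesfiltered-batch2024/*"],
        },
        {   # standard Lambda logging to CloudWatch
            "Effect": "Allow",
            "Action": ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
            "Resource": "*",
        },
    ],
}

iam.put_role_policy(
    RoleName="data_validator_lambda_role",
    PolicyName="data-validator-inline-policy",  # hypothetical policy name
    PolicyDocument=json.dumps(policy),
)
```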

Role alias name: glue-crawler-lambda-role

Permissions:

IAM Roles

Role alias name: gluerunner

Permissions:

IAM Role for Crawler

Lambda Functions:

List of Lambda functions

1. data-validation-runner

  • Triggers on: s3://autosalesraw-batch-2024/
  • Event types: COPY, PUT, POST

data-validation
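The repository's validation script is not reproduced here; the sketch below shows one minimal way such a handler could split an incoming CSV batch into valid and corrupt records. The validation rule (matching field count, no empty fields) is an assumption for illustration.

```python
import csv
import io
import boto3

s3 = boto3.client("s3")
STAGING_BUCKET = "auto-salesfiltered-batch2024"

def is_valid(row, header):
    # Hypothetical rule: field count matches the header and no field is empty.
    return len(row) == len(header) and all(field.strip() for field in row)

def write_csv(rows, header, key):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    s3.put_object(Bucket=STAGING_BUCKET, Key=key, Body=buf.getvalue())

def lambda_handler(event, context):
    # The S3 event carries the bucket and key of the batch file that fired it.
    record = event["Records"][0]["s3"]
    bucket, key = record["bucket"]["name"], record["object"]["key"]

    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    header, *data = csv.reader(io.StringIO(body))

    valid = [r for r in data if is_valid(r, header)]
    corrupt = [r for r in data if not is_valid(r, header)]

    write_csv(valid, header, f"valid_batch_files/{key}")
    if corrupt:
        write_csv(corrupt, header, f"corrupt_batch_files/{key}")
    return {"valid": len(valid), "corrupt": len(corrupt)}
```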

2. crawler-runner-func

  • Triggers on: s3://auto-salesfiltered-batch2024/valid_batch_files/
  • Event types: COPY, PUT, POST

crawler-lambda
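A minimal sketch of the crawler-trigger handler, assuming it simply starts AutoSalesCrawler on each object-created event (the error handling shown is an illustration):

```python
import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue")
CRAWLER_NAME = "AutoSalesCrawler"

def lambda_handler(event, context):
    try:
        # Kick off the crawler as soon as a valid batch file lands.
        glue.start_crawler(Name=CRAWLER_NAME)
        return {"status": f"{CRAWLER_NAME} started"}
    except ClientError as err:
        # A previous run may still be in progress; with batches arriving
        # ~2 hours apart this should be rare.
        if err.response["Error"]["Code"] == "CrawlerRunningException":
            return {"status": f"{CRAWLER_NAME} already running"}
        raise
```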

Glue Crawler:

AutoSalesCrawler ⛏️

  • Database: autosales_DB
  • Target: s3://auto-salesfiltered-batch2024/valid_batch_files/

crawler-DB
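The crawler was configured through the console (screenshot above). For reference, a roughly equivalent boto3 sketch, assuming default settings otherwise:

```python
import boto3

glue = boto3.client("glue")

# Programmatic equivalent of the console setup shown above.
glue.create_crawler(
    Name="AutoSalesCrawler",
    Role="gluerunner",  # IAM role for the crawler, per the services table
    DatabaseName="autosales_DB",
    Targets={"S3Targets": [{"Path": "s3://auto-salesfiltered-batch2024/valid_batch_files/"}]},
)
```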

Workflow 💠

  • The input batch file first lands in the S3 bucket autosalesraw-batch-2024, which triggers the Lambda function data-validation-runner through the defined events (PUT, COPY, POST); a sketch of this event wiring follows the list.

  • On a successful run, valid records are saved to s3://auto-salesfiltered-batch2024/valid_batch_files/ and any corrupt records found are saved to s3://auto-salesfiltered-batch2024/corrupt_batch_files/.

  • The other Lambda function then triggers immediately when a valid file appears in valid_batch_files/ and starts the crawler job.

  • On a successful run of crawler-runner-func, a table is created in autosales_DB.
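As a sketch of the event wiring for the first trigger, the boto3 call below attaches the PUT/COPY/POST object-created events of the raw bucket to the validation Lambda (the function ARN is a placeholder; the same pattern applies to the second trigger on the valid_batch_files/ prefix):

```python
import boto3

s3 = boto3.client("s3")

# Wire the raw bucket's object-created events (PUT, COPY, POST) to the
# validation Lambda. The function ARN below is a placeholder.
s3.put_bucket_notification_configuration(
    Bucket="autosalesraw-batch-2024",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:<region>:<account-id>:function:data-validation-runner",
                "Events": [
                    "s3:ObjectCreated:Put",
                    "s3:ObjectCreated:Copy",
                    "s3:ObjectCreated:Post",
                ],
            }
        ]
    },
)
```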

Unit Testing 🕹️

Unit testing is performed manually to verify the pipeline's ability to perform its intended task.

  • In one go, the first Lambda function is tested by uploading the sample file auto_sales_batch-123.csv to the autosalesraw-batch-2024 bucket. The Lambda triggered instantly, and output files were observed in the valid and corrupt batch folders of the auto-salesfiltered-batch2024 bucket (a scripted version of this check is sketched below).

  • Next, to check the crawler Lambda function, a file containing only valid records was uploaded to the valid_batch_files folder.

  • The Lambda successfully started the Glue crawler, which created a table at the end of the job.

crawler-table
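The manual upload above can also be reproduced with boto3; a minimal sketch, assuming the sample file exists locally:

```python
import boto3

s3 = boto3.client("s3")

# Upload the sample batch to the raw bucket to fire the validation Lambda.
s3.upload_file("auto_sales_batch-123.csv", "autosalesraw-batch-2024", "auto_sales_batch-123.csv")

# After the Lambda has run, the split outputs should appear in the staging bucket.
for prefix in ("valid_batch_files/", "corrupt_batch_files/"):
    resp = s3.list_objects_v2(Bucket="auto-salesfiltered-batch2024", Prefix=prefix)
    print(prefix, [obj["Key"] for obj in resp.get("Contents", [])])
```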

Integration Testing 🕹️


  • To test that the whole pipeline operation matches our expectation, integration testing was also performed in one go. Another batch file, structured like auto_sales_batch-123.csv, was uploaded as auto_sales_batch-234.csv.

  • The first Lambda started and successfully generated the valid and corrupt records.

  • At the same instant, as the valid_batch_files folder was updated, the other Lambda function was triggered to run the crawler.

  • After completion, the incremental data was loaded into the table generated during the first unit test. To verify, a SQL query was run in AWS Athena (a scripted sketch follows). With this, the pipeline successfully matched our expectations for timely response and operation.
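The Athena verification can be scripted as well; a hedged sketch (the results bucket is a placeholder, and the table name valid_batch_files comes from the crawler logs below):

```python
import boto3

athena = boto3.client("athena")

# Simple aggregation to confirm the incremental load; the query-results
# output location below is a placeholder.
resp = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM valid_batch_files",
    QueryExecutionContext={"Database": "autosales_db"},
    ResultConfiguration={"OutputLocation": "s3://<athena-results-bucket>/"},
)
print("Query execution id:", resp["QueryExecutionId"])
```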

Logs:

```
[6d223715-0d62-4a98-8d6a-4f06e39fa193] INFO : Table valid_batch_files in database autosales_db has been updated with new schema
[6d223715-0d62-4a98-8d6a-4f06e39fa193] BENCHMARK : Finished writing to Catalog
[6d223715-0d62-4a98-8d6a-4f06e39fa193] INFO : Run Summary For TABLE:
[6d223715-0d62-4a98-8d6a-4f06e39fa193] INFO : UPDATE: 1
```

About

This project demonstrates a technology shift in an automobile firm to resolve the data engineering challenge of manual data operations. The AWS services used are an S3 bucket as data lake storage for incoming batches, a Lambda Python script that automates the validation function call, and a Glue crawler that generates a relational table, all verified by successful testing.
