
This workshop builds a serverless data lake architecture using Amazon Kinesis Firehose for streaming data ingestion, AWS Glue for data integration (ETL, catalog management), Amazon S3 for data lake storage, and Amazon Athena for SQL big data analytics.

gakas14/AWS-Serverless-Data-Lake


AWS-Serverless-Data-Lake

To demonstrate the power of data lake architectures, in this workshop I ingested streaming data from the Kinesis Data Generator (KDG) into Amazon S3. I then created a big data processing pipeline without servers or clusters that is ready to process huge amounts of data. The dataset is GDELT, an open dataset from the AWS Open Data Registry; it is more than 170 GB in size and comprises thousands of uncompressed CSV files. I also created an AWS Glue transform job to perform basic transformations on the Amazon S3 source data. Finally, I used the larger public dataset, which has more tables, to observe the various AWS services working together through Amazon Athena.

  1. Create a CloudFormation stack and upload the template file (serverlessDataLakeDay.json)
  2. Create a Kinesis Firehose delivery stream to ingest data into your data lake
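
Step 2 can also be scripted instead of done in the console. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the stream name, bucket ARN, role ARN, prefix, and buffering values are hypothetical placeholders, not values taken from this workshop.

```python
def build_delivery_stream_params(stream_name, bucket_arn, role_arn, prefix="raw/"):
    """Build keyword arguments for a Direct PUT delivery stream that writes to S3.

    All argument values are placeholders chosen for illustration.
    """
    return {
        "DeliveryStreamName": stream_name,
        "DeliveryStreamType": "DirectPut",  # producers call the Firehose API directly
        "ExtendedS3DestinationConfiguration": {
            "RoleARN": role_arn,      # IAM role Firehose assumes to write to the bucket
            "BucketARN": bucket_arn,  # data lake bucket
            "Prefix": prefix,         # key prefix for delivered objects (assumed)
            "BufferingHints": {"SizeInMBs": 1, "IntervalInSeconds": 60},
            "CompressionFormat": "UNCOMPRESSED",
        },
    }


def create_delivery_stream(params):
    """Create the stream; boto3 is imported lazily so the builder works without it."""
    import boto3

    return boto3.client("firehose").create_delivery_stream(**params)
```

Keeping the parameter builder separate from the API call makes it possible to review the configuration before anything is created in the account.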


  3. Install the Kinesis Data Generator (KDG) tool

[Screenshot: monitoring for the Firehose delivery stream]

[Screenshot: Amazon Kinesis Firehose writes data to Amazon S3]
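
For readers who prefer a script to the KDG web tool, a similar stream of test records can be produced from Python. This is only a sketch of the idea, not the KDG template used in the workshop: the field names and value ranges below are invented for illustration, and the stream name passed to `send_batch` is a placeholder.

```python
import json
import random
import uuid
from datetime import datetime, timezone


def make_record(rng=random):
    """Return one synthetic sensor-style record (all fields are hypothetical)."""
    return {
        "uuid": str(uuid.uuid4()),
        "device_ts": datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S"),
        "device_id": rng.randint(1, 50),
        "device_temp": rng.choice([28, 32, 34, 40]),
        "activity_type": rng.choice(["Running", "Working", "Walking"]),
    }


def to_firehose_record(record):
    """Encode a record as newline-delimited JSON bytes, one JSON object per line."""
    return {"Data": (json.dumps(record) + "\n").encode("utf-8")}


def send_batch(stream_name, records):
    """Send records with PutRecordBatch; requires boto3 and AWS credentials."""
    import boto3

    client = boto3.client("firehose")
    return client.put_record_batch(
        DeliveryStreamName=stream_name,
        Records=[to_firehose_record(r) for r in records],
    )
```

The trailing newline matters once Firehose concatenates many records into one S3 object: without it, the JSON objects run together on a single line.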

  4. Catalog your data with AWS Glue
  • Create a crawler to auto-discover the schema of your data in S3


  • Create a database and a table, then edit the metadata schema
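
The crawler setup above can be sketched with boto3 as well. The crawler name, role ARN, database name, and S3 path are hypothetical; substitute the resources created by your own stack.

```python
def build_crawler_params(name, role_arn, database, s3_path):
    """Build keyword arguments for Glue create_crawler with a single S3 target.

    All argument values are placeholders chosen for illustration.
    """
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the crawler assumes
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(params):
    """Create the crawler and run it once; requires boto3 and AWS credentials."""
    import boto3

    glue = boto3.client("glue")
    glue.create_crawler(**params)
    glue.start_crawler(Name=params["Name"])
```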
  5. Create a transformation job with Glue Studio


  6. SQL analytics on a large-scale open dataset using Amazon Athena
  • Create a database:

        CREATE DATABASE gdelt;

  • Create a metadata table for the GDELT events data:

        CREATE EXTERNAL TABLE IF NOT EXISTS gdelt.events (
            globaleventid INT,
            day INT,
            monthyear INT,
            year INT,
            fractiondate FLOAT,
            actor1code string,
            actor1name string,
            actor1countrycode string,
            actor1knowngroupcode string,
            actor1ethniccode string,
            actor1religion1code string,
            actor1religion2code string,
            actor1type1code string,
            actor1type2code string,
            actor1type3code string,
            actor2code string,
            actor2name string,
            actor2countrycode string,
            actor2knowngroupcode string,
            actor2ethniccode string,
            actor2religion1code string,
            actor2religion2code string,
            actor2type1code string,
            actor2type2code string,
            actor2type3code string,
            isrootevent BOOLEAN,
            eventcode string,
            eventbasecode string,
            eventrootcode string,
            quadclass INT,
            goldsteinscale FLOAT,
            nummentions INT,
            numsources INT,
            numarticles INT,
            avgtone FLOAT,
            actor1geo_type INT,
            actor1geo_fullname string,
            actor1geo_countrycode string,
            actor1geo_adm1code string,
            actor1geo_lat FLOAT,
            actor1geo_long FLOAT,
            actor1geo_featureid INT,
            actor2geo_type INT,
            actor2geo_fullname string,
            actor2geo_countrycode string,
            actor2geo_adm1code string,
            actor2geo_lat FLOAT,
            actor2geo_long FLOAT,
            actor2geo_featureid INT,
            actiongeo_type INT,
            actiongeo_fullname string,
            actiongeo_countrycode string,
            actiongeo_adm1code string,
            actiongeo_lat FLOAT,
            actiongeo_long FLOAT,
            actiongeo_featureid INT,
            dateadded INT,
            sourceurl string
        )
        ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
        WITH SERDEPROPERTIES (
            'serialization.format' = '\t',
            'field.delim' = '\t'
        )
        LOCATION 's3://gdelt-open-data/events/';
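
Once the `gdelt.events` table exists, it can be queried from a script as well as from the Athena console. The sketch below is one possible way to do that with boto3; the events-per-year query is an illustrative example rather than a query taken from the workshop, and the S3 output location is a placeholder you must supply.

```python
def events_per_year_query(min_year):
    """Return SQL that counts GDELT events per year from min_year onward."""
    year = int(min_year)  # coerce to int so only a number is interpolated into the SQL
    return (
        "SELECT year, COUNT(globaleventid) AS nb_events "
        "FROM gdelt.events "
        f"WHERE year >= {year} "
        "GROUP BY year ORDER BY year"
    )


def start_query(sql, output_location):
    """Submit the query to Athena and return its execution id.

    Requires boto3 and AWS credentials; output_location is an S3 URI
    (hypothetical here) where Athena writes the result files.
    """
    import boto3

    athena = boto3.client("athena")
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "gdelt"},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Athena runs queries asynchronously, so a caller would normally poll `get_query_execution` with the returned id before reading results.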

  • Create Metadata Table for GDELT Lookup Tables

[Screenshots: creating the GDELT lookup tables]

  • Example output:

[Screenshots: Athena query results]

This workshop is based on an AWS Workshop Studio workshop: https://catalog.us-east-1.prod.workshops.aws/workshops/ea7ddf16-5e0a-4ec7-b54e-5cadf3028b78/en-US
