Data Dump Utility

This Spark-based, command-line-driven utility can be used to fetch and store data across various source and destination file systems, including S3, GCS, HDFS, and the local file system. It can also create a table in Amazon Athena when the destination is S3.

Available Command Line Options

In addition to the regular Spark command line options, this utility provides switches that supply the information needed to retrieve and store data on the specified file systems. These are:

Generic Options

  • s : Source location
  • d : Destination location. Defaults to s + f
  • f : Destination data format. Defaults to ORC
  • e : External schema location. If not provided, the schema is created from the source file headers

S3 Related Options

  • s3ak : Access Key for the AWS System
  • s3sk : Secret Key for the AWS System

Google Cloud Related Options

  • gsi : Google Project Id
  • gss : Service Account for the GCS System
  • gsp : Path to the P12 file

Athena Related Options

  • adb : Athena Database
  • at : Athena Table Name
  • as : Athena Staging Directory
  • act : Create Table - true or false. Defaults to false
  • acs : Athena Connection String
  • p : Create Partitioned Data

How to

build the application

Unzip the project and perform a Maven build in its root directory:

mvn clean package
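
If the build succeeds, the assembled jar used by the commands below should be produced under target/ (a quick sanity check, assuming the standard Maven output layout):

ls target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar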

use with the generic options

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv -f parquet -d destinationDirectory -e hdfs://<<pathToExternalSchema>>

or it can be used without the destination, format, or external schema options:

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv

For the S3 file system

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>>
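
To confirm the write, the destination bucket can be listed with the AWS CLI (assuming it is installed and configured with the same credentials):

aws s3 ls s3://<<bucketPath>> --recursive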

With S3 as the destination, an Athena table can also be created by passing the Athena-related options. The Athena JDBC driver jar (AthenaJDBC41-1.0.0.jar) can be downloaded from AWS.

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at <<table_name>> -adb <<Existing_db_name>> -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>>
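
For reference, the legacy AthenaJDBC41 driver uses region-specific endpoints, so in us-east-1 the connection string would typically look like the following (an illustrative value; confirm the endpoint for your region in the driver documentation):

-acs jdbc:awsathena://athena.us-east-1.amazonaws.com:443/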

The table can also be created with partitions by setting the "p" switch to true:

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at finalTest -adb sampledb -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>> -p true

If the created table is partitioned, execute the following command in the Athena console before querying the data:

MSCK REPAIR TABLE <<dbname>>.<<tablename>>
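
Once the partitions are registered, a quick query from the Athena console can confirm the data is visible (database and table names taken from the partitioned example above):

SELECT * FROM sampledb.finalTest LIMIT 10;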

For the Google Cloud Storage file system

spark-submit --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d gs://<<destination>> -gsi <<google project id>> -gss <<google service account>> -gsp <<path to .p12 file>> -f parquet
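
To verify the output, the destination bucket can be listed with the gsutil CLI (assuming it is installed and authenticated against the same project):

gsutil ls -r gs://<<destination>>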
