Data Dump Utility

This Spark-based, command-line-driven utility can be used to fetch and store data across various source and destination file systems, including S3, GCS, HDFS, and the local file system. It can also create a table in Amazon Athena when the destination is S3.

Available Command Line Options

In addition to the regular Spark command line options, this utility provides switches that supply the information needed to retrieve and store data on the specified file systems. These are:

Generic Options

  • s : Source location
  • d : Destination location. Defaults to s + f
  • f : Destination data format. Defaults to ORC
  • e : External schema location. If not provided, the schema is created from the source file headers

S3 Related Options

  • s3ak : Access Key for the AWS System
  • s3sk : Secret Key for the AWS System

Google Cloud Related Options

  • gsi : Google Project Id
  • gss : Service Account for the GCS System
  • gsp : Path to the P12 file

Athena Related Options

  • adb : Athena Database
  • at : Athena Table Name
  • as : Athena Staging Directory
  • act : Create Table - true or false. Defaults to false
  • acs : Athena Connection String
  • p : Create Partitioned Data

How to

build the application

Unzip the project and perform a Maven build in its root directory:

mvn clean package
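
If the build succeeds, the assembled jar used by the commands below should be produced under target/ (a quick sanity check, assuming the standard Maven output layout):

ls target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar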

use with the generic options

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv -f parquet -d destinationDirectory -e hdfs://<<pathToExternalSchema>>

or it can be used without the destination, format, or external schema options:

 spark-submit --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s test.csv

For the S3 file system

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>>
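
To confirm the write, the destination bucket can be listed with the AWS CLI (assuming it is installed and configured with the same credentials):

aws s3 ls s3://<<bucketPath>> --recursive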

With S3 as the destination, an Athena table can also be created by passing the Athena-related options. The Athena JDBC driver jar (AthenaJDBC41-1.0.0.jar) can be downloaded from AWS.

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at <<table_name>> -adb <<Existing_db_name>> -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>>
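
For reference, the legacy AthenaJDBC41 driver uses region-specific endpoints, so in us-east-1 the connection string would typically look like the following (an illustrative value; confirm the endpoint for your region in the driver documentation):

-acs jdbc:awsathena://athena.us-east-1.amazonaws.com:443/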

The table can also be created with partitions by setting the "p" switch to true:

spark-submit --jars=AthenaJDBC41-1.0.0.jar --master yarn --class com.xavient.datadump.StoreData DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d s3://<<bucketPath>> -s3ak <<AccessKey>> -s3sk <<SecretKey>> -act true -at finalTest -adb sampledb -acs jdbc:awsathena://<<Athena URL>>:443/ -as s3://<<temp_bucketPath>> -p true

If the created table is partitioned, execute the following command in the Athena console before querying the data:

MSCK REPAIR TABLE <<dbname>>.<<tablename>>
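
Once the partitions are registered, a quick query from the Athena console can confirm the data is visible (database and table names taken from the partitioned example above):

SELECT * FROM sampledb.finalTest LIMIT 10;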

For the Google Cloud Storage file system

spark-submit --master yarn --class com.xavient.datadump.StoreData target/DataDumpUtility-0.0.1-SNAPSHOT-jar-with-dependencies.jar -s clientdata -d gs://<<destination>> -gsi <<google project id>> -gss <<google service account>> -gsp <<path to .p12 file>> -f parquet
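
To verify the output, the destination bucket can be listed with the gsutil CLI (assuming it is installed and authenticated against the same project):

gsutil ls -r gs://<<destination>>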
