rparrapy/irs-revenue

IRS Form 909 Aggregation with Apache Spark

Simple Apache Spark jobs for aggregating the IRS Form 909 dataset available here. The jobs compute averages and medians, both nationwide and per state.
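As a rough illustration of this kind of aggregation, here is a minimal sketch using the Spark 2.1 DataFrame API. It is not the repository's actual code: the object name, the column names `state` and `revenue`, and the in-memory sample rows are all hypothetical, and the median is the approximate one returned by Spark SQL's `percentile_approx` rather than an exact value.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, expr}

// Hypothetical sketch; column names `state` and `revenue` are assumptions.
object RevenueAggregationSketch {

  // Average and approximate median of `revenue` for each state.
  def perState(filings: DataFrame): DataFrame =
    filings.groupBy("state").agg(
      avg("revenue").as("avg_revenue"),
      expr("percentile_approx(revenue, 0.5)").as("median_revenue"))

  // The same two aggregates over the whole dataset (nationwide).
  def nationwide(filings: DataFrame): DataFrame =
    filings.agg(
      avg("revenue").as("avg_revenue"),
      expr("percentile_approx(revenue, 0.5)").as("median_revenue"))

  def main(args: Array[String]): Unit = {
    // local[*] keeps the sketch runnable without a cluster; a job meant for
    // spark-submit would normally leave the master unset here.
    val spark = SparkSession.builder()
      .appName("RevenueAggregationSketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for the real dataset.
    val filings = Seq(
      ("CA", 100.0), ("CA", 300.0), ("CA", 200.0),
      ("NY", 50.0), ("NY", 150.0), ("NY", 100.0)
    ).toDF("state", "revenue")

    perState(filings).show()
    nationwide(filings).show()

    spark.stop()
  }
}
```

Using `percentile_approx` trades exactness for a single distributed pass over the data; an exact median would require sorting or collecting each group.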

Dependencies

  • Scala 2.11
  • sbt 0.13.8
  • Apache Spark 2.1 for Hadoop 2.4

More recent Hadoop versions are not supported because of a bug related to S3 support. See more about it here.
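For orientation, a build definition matching these dependencies might look roughly like the sketch below. The project name and version are inferred from the jar name used in the run instructions; the patch-level versions, the use of `spark-sql`, and the sbt-assembly plugin version are assumptions, not taken from the repository.

```scala
// build.sbt (sketch; patch-level versions are assumptions)
name := "irs-revenue"

version := "1.0"

scalaVersion := "2.11.8"

// Spark is supplied by the cluster at runtime, so it is marked "provided"
// to keep it out of the fat jar.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.1.0" % "provided",
  "org.apache.spark" %% "spark-sql"  % "2.1.0" % "provided"
)

// The `assembly` task comes from the sbt-assembly plugin, which would be
// declared in project/plugins.sbt, for example:
// addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

With these settings, sbt-assembly's default output path would be target/scala-2.11/irs-revenue-assembly-1.0.jar.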

Instructions to run

  1. Start the Spark master: $SPARK_HOME/sbin/start-master.sh
  2. Start a Spark slave: $SPARK_HOME/sbin/start-slave.sh MASTER_URL
  3. Build the fat jar: sbt clean assembly
  4. Deploy the jar: $SPARK_HOME/bin/spark-submit --class "IrsRevenueApp" --master "MASTER_URL" --executor-memory MAX_MEMORY target/scala-2.11/irs-revenue-assembly-1.0.jar

Here MASTER_URL is the spark://HOST:PORT address reported by the master when it starts, and MAX_MEMORY is the memory to allocate per executor (for example, 2g).

About

Playing around with Spark for dataset aggregation
