
HW1_Stat250_Winter14

Benchmarking different approaches to computing statistics of 22 GB of csv files.

To see the overall results, type the following commands in an R session:

load("${PATH TO GIT DIRECTORY}/results.rda")

RESULTS1

For details see: http://nbviewer.ipython.org/github/karenyng/HW1_Stat250_Winter14/blob/master/writeup/hw1.ipynb?create=1

Dependencies:

  • Put the data (.csv files, NOT .csv.bz files) in the directory ${PATH TO GIT DIRECTORY}/data
  • R version 3.0.2 (2013-09-25) -- "Frisbee Sailing"
  • Python v. 2.7.4, numpy v. 1.7.1, pandas v. 0.10.1
  • R package included in this repository -- NotSoFastCSVSample (an install sketch follows this list)
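One way the bundled package might be installed from the repository (a minimal sketch; the exact install step is an assumption, and the package source is taken to live in the NotSoFastCSVSample directory at the top of the git tree):

# install the NotSoFastCSVSample package shipped with this repository
# (assumes the package source directory sits at ${PATH TO GIT DIRECTORY}/NotSoFastCSVSample)
install.packages("${PATH TO GIT DIRECTORY}/NotSoFastCSVSample", repos = NULL, type = "source")
library(NotSoFastCSVSample)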

To run:

  • Method 1: $ Rscript ${PATH TO GIT DIRECTORY}/method1.R
  • Method 2: $ Rscript ${PATH TO GIT DIRECTORY}/method2.R
  • Method 3: $ Rscript ${PATH TO GIT DIRECTORY}/method3.R

Outputs:

Method 1:

Method 2:

  • results2.rda
  • results2.txt

Method 3:

  • result3.rda

Machine specification:

Wall-clock time (a timing sketch follows this list):

  • Method 1: ~5.4 mins
  • Method 2: ~3.1 mins
  • Method 3: ~4.7 mins
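These timings could be reproduced from inside R roughly as follows (an illustrative sketch, not necessarily how the numbers above were measured):

# time one method from an interactive R session
elapsed <- system.time(source("${PATH TO GIT DIRECTORY}/method2.R"))
elapsed["elapsed"] / 60   # wall-clock time in minutes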

Results

Method 1 agrees with Method 2 up to 6 decimal places
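A quick way to verify that level of agreement in R (the object names below are hypothetical placeholders for the two saved estimates):

# check that the two estimates agree to roughly 6 decimal places
abs(mean_method1 - mean_method2) < 1e-6   # hypothetical object names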

Method 2:

  • mean = 6.56650421703
  • median = 0.0
  • std. dev. = 31.5563262623

Method 3:

Agrees with Methods 1 and 2 to two significant figures, sampling only 1% of all the lines.
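For intuition, a minimal sketch of that kind of 1% line-sampling estimate (illustrative only; the actual Method 3 implementation in method3.R may work differently, and the file name and column choice below are hypothetical):

# sample roughly 1% of the lines of one large csv without loading it all
con <- file("${PATH TO GIT DIRECTORY}/data/example.csv", open = "r")    # hypothetical file
header <- readLines(con, n = 1)                          # assumes the first line is a header
kept <- character(0)
repeat {
  chunk <- readLines(con, n = 100000)                    # read the file in blocks of lines
  if (length(chunk) == 0) break
  kept <- c(kept, chunk[runif(length(chunk)) < 0.01])    # keep each line with prob. 0.01
}
close(con)
df <- read.csv(text = c(header, kept))
x <- df[[1]]                                             # first column, assumed numeric, as an example
c(mean = mean(x), median = median(x), sd = sd(x))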
