navdeep-G/sdss-h2o-automl


H2O AutoML Short Course at the 2018 Symposium for Data Science and Statistics

AutoML is a function in H2O that automates the process of building a large number of models, with the goal of finding the "best" model without any prior knowledge or effort by the Data Scientist.

The current version of AutoML (in H2O 3.18.*) trains and cross-validates a default Random Forest, an Extremely-Randomized Forest, a random grid of Gradient Boosting Machines (GBMs), a random grid of Deep Neural Nets, a fixed grid of GLMs, and then trains two Stacked Ensemble models at the end. One ensemble contains all the models (optimized for model performance), and the second ensemble contains just the best performing model from each algorithm class/family (optimized for production use).

  • More information and code examples are available in the AutoML User Guide.

  • New features and improvements planned for AutoML are listed here.

Setting Up Environment for AutoML Demos:

Prerequisites for H2O

H2O-3 Requirements

Install H2O in R

# The following two commands remove any previously installed H2O packages for R.
if ("package:h2o" %in% search()) { detach("package:h2o", unload=TRUE) }
if ("h2o" %in% rownames(installed.packages())) { remove.packages("h2o") }

# Next, we download packages that H2O depends on.
pkgs <- c("RCurl","jsonlite")
for (pkg in pkgs) {
  if (! (pkg %in% rownames(installed.packages()))) { install.packages(pkg) }
}

# Now we download and install the H2O package for R.
install.packages("h2o", type="source", repos="http://h2o-release.s3.amazonaws.com/h2o/rel-wolpert/9/R")

# Finally, load the package and start a local H2O cluster.
library(h2o)
h2o.init()

Install H2O in Python

pip install requests
pip install tabulate
pip install scikit-learn
pip install colorama
pip install future
# The following command removes any previously installed H2O module for Python.
pip uninstall h2o

# Next, use pip to install this version of the H2O Python module.
pip install http://h2o-release.s3.amazonaws.com/h2o/rel-wolpert/9/Python/h2o-3.18.0.9-py2.py3-none-any.whl
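
After the install, a quick check along these lines can confirm that Python sees the expected release. This is a minimal sketch: the helper names `installed_h2o_version` and `matches_course_version` are illustrative and not part of H2O, and the expected version string is taken from the wheel filename above.

```python
import importlib

# Version taken from the wheel filename in the pip command above.
EXPECTED_VERSION = "3.18.0.9"


def installed_h2o_version():
    """Return the installed h2o version string, or None if h2o cannot be imported."""
    try:
        h2o = importlib.import_module("h2o")
    except ImportError:
        return None
    return getattr(h2o, "__version__", None)


def matches_course_version(version, expected=EXPECTED_VERSION):
    """True only when the installed version equals the one used in this course."""
    return version == expected


if __name__ == "__main__":
    version = installed_h2o_version()
    if version is None:
        print("h2o is not installed in this Python environment")
    elif matches_course_version(version):
        print("h2o", version, "matches the course version")
    else:
        print("h2o", version, "is installed; the course used", EXPECTED_VERSION)
```

If the check passes, calling `h2o.init()` starts (or connects to) a local H2O cluster; this step needs a Java runtime, as covered by the H2O-3 requirements.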

Part 1: Binary Classification

For the AutoML binary classification demo, we use a subset of the Product Backorders dataset. The goal here is to predict whether or not a product will be put on backorder status, given a number of product metrics such as current inventory, transit time, demand forecasts and prior sales.

In this tutorial, you will:

  • Specify a training frame.
  • Specify the response variable and predictor variables.
  • Run AutoML where stopping is based on max number of models.
  • View the leaderboard (based on cross-validation metrics).
  • Explore the ensemble composition.
  • Save the leader model (binary format & MOJO format).
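
The steps above can be sketched in Python roughly as follows. This is a hedged outline, not the notebook's code: the file path, the response column `went_on_backorder`, and the ID column `sku` are assumptions about the dataset, and `predictor_columns` and `find_ensemble_id` are illustrative helpers rather than H2O functions. The H2O imports live inside the function so the helpers can be used without a running cluster.

```python
def predictor_columns(columns, response, ignored=()):
    """All column names except the response and any explicitly ignored ones."""
    skip = set([response]) | set(ignored)
    return [c for c in columns if c not in skip]


def find_ensemble_id(model_ids, kind):
    """Return the first leaderboard ID for the given ensemble kind
    ("AllModels" or "BestOfFamily"), or None if there is none."""
    prefix = "StackedEnsemble_" + kind
    for model_id in model_ids:
        if model_id.startswith(prefix):
            return model_id
    return None


def run_backorders_automl(csv_path, response="went_on_backorder",
                          id_column="sku", max_models=10, out_dir="."):
    """Train AutoML on the backorders data (column names are assumptions)."""
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    train = h2o.import_file(csv_path)

    # Binary classification needs a categorical (factor) response.
    train[response] = train[response].asfactor()
    x = predictor_columns(train.columns, response, ignored=[id_column])

    # Stop after a fixed number of models rather than a time budget.
    aml = H2OAutoML(max_models=max_models, seed=1)
    aml.train(x=x, y=response, training_frame=train)

    # With no leaderboard frame, ranking uses cross-validation metrics.
    print(aml.leaderboard.head())

    # Locate the two ensembles on the leaderboard by their ID prefix.
    rows = aml.leaderboard["model_id"].as_data_frame(use_pandas=False,
                                                     header=False)
    model_ids = [row[0] for row in rows]
    print("All-models ensemble:", find_ensemble_id(model_ids, "AllModels"))
    print("Best-of-family ensemble:",
          find_ensemble_id(model_ids, "BestOfFamily"))

    # Save the leader in H2O binary format and as a MOJO. MOJO export may
    # not be supported for every model type in every H2O version.
    print("Binary model:", h2o.save_model(aml.leader, path=out_dir))
    print("MOJO:", aml.leader.download_mojo(path=out_dir))
    return aml
```

Calling `run_backorders_automl("product_backorders.csv")` (a hypothetical local path) would run the whole sequence; an ensemble found this way can be fetched with `h2o.get_model` for closer inspection.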

Demo Notebooks:

Part 2: Regression

For the AutoML regression demo, we use the Combined Cycle Power Plant dataset. The goal here is to predict the energy output (in megawatts), given the temperature, ambient pressure, relative humidity and exhaust vacuum values. In this demo, you will use H2O's AutoML to outperform the state-of-the-art results on this task.

In this tutorial, you will:

  • Split the data into train/test sets.
  • Specify a training frame and leaderboard (test) frame.
  • Specify the response variable.
  • Run AutoML where stopping is based on max runtime, using training frame (80%).
  • Run AutoML where stopping is based on max runtime, using original frame (100%).
  • View leaderboard (based on test set metrics).
  • Compare the leaderboards of the two AutoML runs.
  • Predict using the AutoML leader model.
  • Compute performance of the AutoML leader model on a test set.
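
A rough Python sketch of this workflow is given below. It is written under stated assumptions: the CSV path and the response column name `HourlyEnergyOutputMW` are guesses about the dataset, the project names are arbitrary labels, and the second run is scored by cross-validation because its training data already contains the test rows. The `rmse` helper is plain Python and only illustrates what the reported RMSE measures.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / float(len(squared)))


def run_powerplant_automl(csv_path, response="HourlyEnergyOutputMW",
                          max_runtime_secs=60):
    """Two AutoML runs on the power plant data (column name is an assumption)."""
    import h2o
    from h2o.automl import H2OAutoML

    h2o.init()
    df = h2o.import_file(csv_path)

    # 80/20 split; the 20% part serves as the leaderboard (test) frame.
    train, test = df.split_frame(ratios=[0.8], seed=1)

    # Run 1: train on 80%, rank the leaderboard on the held-out test frame.
    aml_80 = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1,
                       project_name="powerplant_80")
    aml_80.train(y=response, training_frame=train, leaderboard_frame=test)

    # Run 2: train on 100% of the rows. No leaderboard frame is passed, so
    # this leaderboard is ranked by cross-validation metrics instead.
    aml_100 = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1,
                        project_name="powerplant_100")
    aml_100.train(y=response, training_frame=df)

    # The two leaderboards are scored on different data, so compare with care.
    print(aml_80.leaderboard.head())
    print(aml_100.leaderboard.head())

    # Predict and score with the leader that never saw the test rows.
    predictions = aml_80.leader.predict(test)
    performance = aml_80.leader.model_performance(test)
    print("Test RMSE of run 1 leader:", performance.rmse())
    return aml_80, aml_100, predictions
```

Passing distinct `project_name` values keeps the two runs on separate leaderboards. The time budget of 60 seconds is only a placeholder; longer budgets generally let AutoML train more models.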

Demo Notebooks:

Part 3: Lending Club

For the AutoML Lending Club demo, we use the Lending Club dataset (Lending Club is a peer-to-peer lending platform). The goal here is to predict whether a borrower will default, given various features of their financial history.

In this tutorial, you will:

  • Perform basic dataframe manipulations.
  • Create a target (response column).
  • Preprocess and engineer features.
  • Split the data into training and validation sets.
  • Build models (GLM & GBM).
  • Evaluate model performance.
  • Run a grid search.
  • Run AutoML where stopping is based on max runtime.
  • View the leaderboard (based on cross-validation metrics).
  • Predict using the AutoML leader model.
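
The outline above might look roughly like this in Python. It is a sketch under several assumptions: the column `loan_status`, the rule that anything other than "Fully Paid" counts as a bad loan, the target name `bad_loan`, and the small hyperparameter grid are all illustrative and may differ from the course notebooks. The `predictors` list must be supplied by the caller, since choosing and engineering features is dataset-specific.

```python
def label_bad_loan(status, good_status="Fully Paid"):
    """Plain-Python form of the assumed target rule: 1 = bad loan, 0 = repaid."""
    return 0 if status == good_status else 1


def run_lending_club(csv_path, predictors, status_col="loan_status",
                     max_runtime_secs=60):
    """GLM, GBM, grid search and AutoML on Lending Club data (names assumed)."""
    import h2o
    from h2o.automl import H2OAutoML
    from h2o.estimators.gbm import H2OGradientBoostingEstimator
    from h2o.estimators.glm import H2OGeneralizedLinearEstimator
    from h2o.grid.grid_search import H2OGridSearch

    h2o.init()
    loans = h2o.import_file(csv_path)

    # Create the target column: same rule as label_bad_loan, applied to
    # the whole column, then converted to a factor for classification.
    y = "bad_loan"
    loans[y] = (loans[status_col] != "Fully Paid").asfactor()

    # Hold out 20% of the rows for validation.
    train, valid = loans.split_frame(ratios=[0.8], seed=1)

    # Baseline models: a binomial GLM and a default-ish GBM.
    glm = H2OGeneralizedLinearEstimator(family="binomial")
    glm.train(x=predictors, y=y, training_frame=train, validation_frame=valid)
    gbm = H2OGradientBoostingEstimator(ntrees=50, seed=1)
    gbm.train(x=predictors, y=y, training_frame=train, validation_frame=valid)
    print("GLM validation AUC:", glm.auc(valid=True))
    print("GBM validation AUC:", gbm.auc(valid=True))

    # Small Cartesian grid over two GBM hyperparameters (values are arbitrary).
    grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         hyper_params={"max_depth": [3, 5],
                                       "learn_rate": [0.05, 0.1]})
    grid.train(x=predictors, y=y, training_frame=train,
               validation_frame=valid, ntrees=50, seed=1)
    print(grid.get_grid(sort_by="auc", decreasing=True))

    # AutoML with a time budget; the leaderboard uses cross-validation metrics.
    aml = H2OAutoML(max_runtime_secs=max_runtime_secs, seed=1)
    aml.train(x=predictors, y=y, training_frame=train)
    print(aml.leaderboard.head())

    predictions = aml.leader.predict(valid)
    return aml, predictions
```

Columns that are only known after a loan ends would leak the outcome into the model, which is one reason the predictor list is left to the caller here.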

Demo Notebooks:

About

Code & presentation for the 'H2O AutoML' short course at SDSS 2018 in Reston, VA
