Skip to content

josephwinston/spark-kernel

Repository files navigation

Spark Kernel

A simple Scala application to connect to a Spark cluster and provide a generic, robust API to tap into various Spark APIs. Furthermore, this project intends to provide the ability to send both packaged jars (standard jobs) and code snippets (with revision capability) for scenarios like IPython for dynamic updates. Finally, the kernel is written with the future plan to allow multiple applications to connect to a single kernel to take advantage of the same Spark context.

Vagrant Dev Environment

A Vagrantfile is provided to easily setup a development environment. You will need to install Virtualbox 4.3.12+ and Vagrant 1.6.2+. Once Vagrant and Virtualbox are installed, the environment can be created by running vagrant up. When the VM is running, you can get to the source code by executing:

vagrant ssh
cd /src/spark-kernel

Building from Source

To build the kernel from source, you need to have sbt installed on your machine. Once it is on your path, you can compile the Spark Kernel by running the following from the root of the Spark Kernel directory:

sbt compile

The recommended configuration options for sbt are as follows:

-Xms1024M
-Xmx2048M
-Xss1M
-XX:+CMSClassUnloadingEnabled
-XX:MaxPermSize=1024M

Library Dependencies

The Spark Kernel uses ZeroMQ as the medium for communication. In order for the Spark Kernel to be able to use this protocol, the ZeroMQ library needs to be installed on the system where the kernel will be run. The Vagrant and Docker environments set this up for you.

For Mac OS X, you can use Homebrew to install the library:

brew install zeromq22

For aptitude-based systems (such as Ubuntu 14.04), you can install via apt-get:

apt-get install libzmq-dev

Usage Instructions

The Spark Kernel is provided as a series of jars, which can be executed on their own or as part of the launch process of an IPython notebook.

The following command line options are available:

  • --profile - the file to load containing the ZeroMQ port information
  • --help - displays the help menu detailing usage instructions
  • --master - location of the Spark master (defaults to local[*])

Additionally, ZeroMQ configurations can be passed as command line arguments

  • --ip
  • --stdin-port
  • --shell-port
  • --iopub-port
  • --control-port
  • --heartbeat-port

Ports can also be specified as Environment variables:

  • IP
  • STDIN_PORT
  • SHELL_PORT
  • IOPUB_PORT
  • CONTROL_PORT
  • HB_PORT

Packing the kernel

We are utilizing xerial/sbt-pack to package and distribute the Spark Kernel.

You can run the following to package the kernel, which creates a directory in kernel/target/pack contains a Makefile that can be used to install the kernel on the current machine:

sbt kernel/pack

To install the kernel, run the following:

cd kernel/target/pack
make install

This will place the necessary jars in ~/local/kernel/current/lib and provides a convient script to start the kernel located at ~/local/kernel/current/bin/sparkkernel.

Running A Docker Container

The Spark Kernel can be run in a docker container using Docker 1.0.0+. There is a Dockerfile included in the root of the project. You will need to compile and pack the Spark Kernel before the docker image can be built.

sbt compile
sbt pack
docker build -t spark-kernel .

After the image has been successfully created, you can run your container by executing the command:

docker run -d -e IP=0.0.0.0 spark-kernel 

You must always include -e IP=0.0.0.0 to allow the kernel to bind to the docker container's IP. The environment variables listed in the getting started section can be used in the docker run command. This allows you to explicitly set the ports for the kernel.

Development Instructions

You must have SBT 0.13.5+ installed. From the command line, you can attempt to run the project by executing sbt kernel/run <args> from the root directory of the project. You can run all tests using sbt test (see instructions below for more testing details).

For IntelliJ developers, you can attempt to create an IntelliJ project structure using sbt gen-idea.

Running tests

There are four levels of test in this project:

  1. Unit - tests that isolate a specific class/object/etc for its functionality

  2. Integration - tests that illustrate functionality between multiple components

  3. System - tests that demonstrate correctness across the entire system

  4. Scratch - tests isolated in a local branch, used for quick sanity checks, not for actual inclusion into testing solution

To execute specific tests, run sbt with the following:

  1. Unit - sbt unit:test

  2. Integration - sbt integration:test

  3. System - sbt system:test

  4. Scratch - sbt scratch:test

To run all tests, use sbt test!

The naming convention for tests is as follows:

  1. Unit - test classes end with Spec e.g. CompleteRequestSpec

    • Placed under com.ibm.spark
  2. Integration - test classes end with SpecForIntegration e.g. InterpreterWithActorSpecForIntegration

    • Placed under integration
  3. System - test classes end with SpecForSystem e.g. InputToAddJarSpecForSystem

    • Placed under system
  4. Scratch

    • Placed under scratch

It is also possible to run tests for a specific project by using the following syntax in sbt:

sbt <PROJECT>/test

About

An implementation of the IPython kernel (3.0) using Scala and providing specific support for Apache Spark.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages