Bayesian Neural Network (BNN) Distributed Training

This repository contains code for performing distributed training of Bayesian Neural Network (BNN) models at scale on High Performance Computing clusters such as Theta at ALCF. The main purpose of the code is to serve as a tutorial for getting started with distributed training of BNNs on High Performance Computing clusters. The BNNs are also used in one of my works on Gravitational Wave parameter estimation (link), where neural networks were combined with Bayesian Neural Network layers. The dataset and the code are available on Theta and restricted to mmadsp users only. For further details about ADSP, contact Argonne ALCF support.

The BNN models are implemented using the TensorFlow Probability library. Data-distributed training is performed using Horovod.
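
To illustrate what data-distributed training means here, the following minimal sketch simulates the idea in plain Python, without Horovod: each worker computes a gradient on its own shard of the data, and the per-worker gradients are averaged, which is the role an allreduce plays. The function names, the toy linear model, and the data are hypothetical and are not taken from this repository.

```python
# Toy simulation of data-parallel training (no Horovod required).
# All names here are illustrative; they are not part of this repo.

def shard(data, rank, size):
    """Give worker `rank` every `size`-th sample, starting at index `rank`."""
    return data[rank::size]

def gradient(w, samples):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in samples) / len(samples)

def allreduce_mean(values):
    """Average one value per worker, as an averaging allreduce would."""
    return sum(values) / len(values)

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 samples, true w = 3
size = 4   # number of simulated workers
w = 0.0    # every worker starts from the same weight

for step in range(200):
    # Each worker sees only its own shard of the data.
    local_grads = [gradient(w, shard(data, r, size)) for r in range(size)]
    # All workers apply the same averaged gradient, so weights stay in sync.
    w -= 0.01 * allreduce_mean(local_grads)

print(round(w, 3))
```

Because the shards have equal size, the averaged gradient equals the gradient over the full data set, so the simulated workers follow the same trajectory as single-process training.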

Brief Background on BNN:

Bayesian Neural Networks are one of the approaches used to capture network uncertainty. The uncertainties in Bayesian modeling can be classified into two categories:

Aleatoric uncertainty
Epistemic uncertainty

Aleatoric uncertainty captures the noise inherent in the observations/data, such as sensor measurement noise. Epistemic uncertainty is associated with the model parameters and can be reduced by increasing the amount of data. Aleatoric uncertainty is further divided into homoscedastic and heteroscedastic uncertainty.

Homoscedastic uncertainty: stays constant for different inputs.
Heteroscedastic uncertainty: depends on the inputs to the model, with some inputs potentially having noisier outputs than others. This is particularly important for avoiding over-confident model predictions.

Epistemic uncertainty is modelled by placing a prior distribution over the model parameters/weights and computing how these weights vary and converge, which is what Bayesian Neural Networks do. Aleatoric uncertainty, in contrast, is modelled by placing a distribution on the output of the model. Further details about Bayesian Neural Networks and Variational Inference for training can be found in the Jupyter notebook.
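
A minimal sketch of how the two uncertainties can be separated at prediction time, assuming a toy one-weight model whose weight posterior is Gaussian and whose output is a Gaussian with an input-dependent standard deviation. The posterior parameters, the noise function, and all names are hypothetical; the split uses the law of total variance and is not code from this repository.

```python
import random
import statistics

random.seed(0)

# Hypothetical learned posterior over a single weight: N(mu, sigma^2).
POSTERIOR_MU, POSTERIOR_SIGMA = 2.0, 0.1

def predict(x, w):
    """One stochastic forward pass: returns (mean, variance) of the output."""
    mean = w * x
    noise_std = 0.05 + 0.1 * abs(x)  # heteroscedastic: noise grows with |x|
    return mean, noise_std ** 2

def decompose(x, n_samples=2000):
    """Split the predictive variance into epistemic and aleatoric parts."""
    means, variances = [], []
    for _ in range(n_samples):
        w = random.gauss(POSTERIOR_MU, POSTERIOR_SIGMA)  # sample a weight
        m, v = predict(x, w)
        means.append(m)
        variances.append(v)
    epistemic = statistics.pvariance(means)  # spread caused by the weights
    aleatoric = statistics.mean(variances)   # average predicted data noise
    return epistemic, aleatoric

epistemic, aleatoric = decompose(x=1.0)
print("epistemic={:.4f} aleatoric={:.4f}".format(epistemic, aleatoric))
```

In this toy setup, narrowing the weight posterior shrinks only the epistemic term, while the aleatoric term is set by the noise the model predicts for the given input.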

Code Dependencies:

python >= 3.5
requirements.txt

Dataset:

MNIST: hand-written digit dataset; sample images below.
CIFAR-10: consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

Models:

Bayesian Neural Network with Flipout Fully Connected Layer ('BNN_conv_flip')
Bayesian Neural Network with Non-Flipout Fully Connected Layer ('BNN_conv_nonflip')
Bayesian Neural Network with Flipout Convolutional Layers ('BNN_FC_flip')
Bayesian Neural Network with Non-Flipout Convolutional Layers ('BNN_FC_nonflip')
Bayesian Neural Network with Flipout Convolutional Layers (3 VGG blocks) for CIFAR-10 data ('CIFAR10_BNN_model')
Convolutional Neural Network ('CNN_Conv')
Fully Connected Neural Network ('CNN_FC')

How to run the code:

Running on a local machine:
- horovodrun -n 2 -H localhost:2 python CNN_BNN_Model.py --flagfile=config_file.cfg

ALCF High Performance Computing cluster (Theta):

PPN=1 # MPI ranks (processes) per node; other options: 32, 16, 8
NUM_THDS=128

aprun -n $((${COBALT_PARTSIZE} * ${PPN})) -N ${PPN} -cc depth -j 2 -d ${NUM_THDS} \
    -e OMP_NUM_THREADS=${NUM_THDS} -b python <path to the code>/CNN_BNN_Model.py \
    --flagfile=config_file.cfg

The submission script is provided in the repository.
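
The relationship between `PPN`, `NUM_THDS`, and the total rank count in the `aprun` command above can be sketched as a small Python helper. It assumes each Theta node exposes 64 cores and that `-j 2` enables two hardware threads per core, so 128 logical CPUs are shared among the ranks on a node. The function name and these defaults are illustrative assumptions and are not part of the submission script.

```python
def aprun_layout(num_nodes, ppn, cores_per_node=64, threads_per_core=2):
    """Return (total_ranks, depth) for `aprun -n total_ranks -N ppn -d depth`."""
    logical_cpus = cores_per_node * threads_per_core
    if ppn <= 0 or logical_cpus % ppn != 0:
        raise ValueError("ppn must evenly divide the logical CPUs per node")
    total_ranks = num_nodes * ppn   # corresponds to COBALT_PARTSIZE * PPN
    depth = logical_cpus // ppn     # logical CPUs available to each rank
    return total_ranks, depth

# Assumed 4-node job with the PPN values mentioned in the snippet above.
for ppn in (1, 8, 16, 32):
    ranks, depth = aprun_layout(num_nodes=4, ppn=ppn)
    print("PPN={:2d} -> -n {:3d} -d {:3d}".format(ppn, ranks, depth))
```

With `PPN=1` this reproduces the `NUM_THDS=128` setting shown above; under the same assumptions, raising `PPN` would call for a proportionally smaller thread count per rank.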
Running the job with Balsam (Theta):
- See the detailed README.md
For other information about configuring and running the code, use the help option as follows:
python CNN_BNN_Model.py --help
Example Results:
- The time to train for the BNN and CNN with an increasing number of nodes is compared in Fig-1 below.
- The speed-up of the BNN and CNN, computed from the training time, is compared in Fig-2 below.
- Training the Bayesian network means finding the optimal distribution of the trainable parameters, which is done using Variational Inference (VI). As the training iterations progress, the weight posteriors converge. An example with the weights initialized from a Gaussian prior is shown in Fig-3 below.
- Once the model is trained and the weight posteriors have converged, the model is used for inference. Inference is performed by running the model repeatedly (Monte Carlo iterations). The model output is a prediction distribution, shown below for 300 MC iterations with a BNN Fully Connected model.
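
The speed-up comparison in the results above can be computed from measured training times as follows. This is a minimal sketch: the timing values are made-up placeholders rather than measurements from this repository, and the function names are hypothetical.

```python
def speedup(times_by_nodes):
    """Speed-up relative to the smallest node count: T(base) / T(n)."""
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    return {n: base_time / t for n, t in times_by_nodes.items()}

def efficiency(times_by_nodes):
    """Parallel efficiency: speed-up divided by the relative node increase."""
    base_nodes = min(times_by_nodes)
    return {n: s * base_nodes / n for n, s in speedup(times_by_nodes).items()}

# Placeholder wall-clock training times in seconds, keyed by node count.
example_times = {1: 1000.0, 2: 520.0, 4: 280.0, 8: 160.0}

for n in sorted(example_times):
    print(n, round(speedup(example_times)[n], 2),
          round(efficiency(example_times)[n], 2))
```

An efficiency close to 1.0 indicates near-linear scaling; values well below 1.0 suggest that communication or other overheads dominate at that node count.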
Research Articles:
- Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study
Papers related to Bayesian Neural Networks:
Papers for Gravitational Bayesian Model:
- Deterministic and Bayesian Neural Networks for Low-latency Gravitational Wave Parameter Estimation of Binary Black Hole Mergers
Additional Resources:
- TensorFlow Probability Examples
Contact
- Himanshu Sharma (himanshu90sharma@gmail.com)
- Elise Jennings
Acknowledgment

This research was funded in part by, and used resources of, the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This work describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the work do not necessarily represent the views of the U.S. DOE or the United States Government. Declaration of Interests: None.
