
Kurtosis-based Projection Pursuit

Kurtosis-based projection pursuit analysis (PPA) is an exploratory data analysis algorithm originally developed by Siyuan Hou and Peter Wentzell in 2011, and it remains an active area of research for the Wentzell Research Group at Dalhousie University. Instead of using variance- and distance-based metrics to explore high-dimensional data (e.g., PCA, HCA), PPA searches for interesting projections by optimizing kurtosis. This repository contains MATLAB and Python code to perform PPA, published literature on the ongoing development of PPA, and examples of how to apply PPA to uncover interesting projections in high-dimensional data. Below is a figure from our recent paper that I think demonstrates the value of searching for distributions with low kurtosis values.

[Figure: kurtosis]
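For reference, the quantity PPA optimizes for a candidate projection vector v is the sample kurtosis of the projected scores (a standard definition given here for orientation; the papers in the Literature section state the exact objectives used):

k(v) = m * sum_i (t_i - tbar)^4 / [ sum_i (t_i - tbar)^2 ]^2,   where t = X*v

Low kurtosis corresponds to flat or multimodal score distributions (often a sign of clustering), while high kurtosis corresponds to heavy-tailed distributions dominated by a few extreme samples.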

MATLAB and Python Functions

  • projpursuit.m is a MATLAB function that performs kurtosis-based projection pursuit. It has the following call format:
[T, V, ppout] = projpursuit(X,varargin)

where X is the m (samples) x n (response variables) data matrix, T is the m x p matrix of scores for the samples in each of the p dimensions (default p = 2), V is the corresponding n x p matrix of projection vectors, and ppout (1 x p) contains the final kurtosis value for each dimension.

  • projpursuit.py is a Python function that is more or less a line-by-line port of the MATLAB function, and kurtosis.py is a Python implementation of the MATLAB function kurtosis.m. A list of dependencies for running projpursuit.py is found in dependencies.txt. The Python PPA function has the following call format:
projpursuit(X, **kwargs)

which returns T, V, and ppout. A minimal usage sketch is shown below.
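As a hedged sketch of how the Python function might be called (assuming projpursuit.py is importable from the working directory; the simulated data are illustrative only):

import numpy as np
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# Simulated example: two well-separated clusters in 10 variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 10)),
               rng.normal(4, 1, (30, 10))])
Xm = X - X.mean(axis=0)  # column mean-center, as in the examples below

# Default call (p = 2 dimensions, mirroring the MATLAB defaults)
T, V, ppout = projpursuit(Xm)
print(T.shape, V.shape, ppout)  # (60, 2) scores, (10, 2) projection vectors, kurtosis per dimension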

Literature

Examples

Wood Identification using Near-Infrared (NIR) Spectroscopy and Univariate PPA

PPA was originally developed for searching high-dimensional chemical data for informative projections, so this example fittingly employs a data set designed for the identification of different Brazilian wood species using NIR spectroscopy. The original paper and the data for this example can be found here: Implications of measurement error structure on the visualization of multivariate chemical data: hazards and alternatives (2018).

The NIR wood data set contains 4 replicate scans of the following wood samples: 26 of crabwood, 28 of cedar, 29 of curupixa, and 25 of mahogany. This results in 432 samples across 100 NIR channels. Let's apply PCA and PPA and plot the corresponding scores:

% Column mean-center the data
Xm = X - mean(X);

% PCA of mean-centered data via singular value decomposition
[U, S, V] = svds(Xm, 2);
T_PCA = U*S;

% PPA of mean-centered data with dim = 2 and 1000 initial guesses
[T_PPA] = projpursuit(Xm, 2, 1000);

% Class vector (class_rep is the label vector supplied with the data set)
class = class_rep;

% Colour vector for plotting
cvec = ['b' 'k' 'r' 'g'];

% Plot the PCA scores
figure
for i = 1:length(T_PCA)
    scatter(T_PCA(i,1), T_PCA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PCA Score 1')
ylabel('PCA Score 2')

% Plot the PPA scores
figure
for i = 1:length(T_PPA)
    scatter(T_PPA(i,1), T_PPA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PP Score 1')
ylabel('PP Score 2')

[Figure: PCA vs PPA]
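For Python users, a rough equivalent of the wood-data workflow above might look like the following (a sketch only: X and class_rep are assumed to be loaded from the data set, and the PPA call uses the Python function's defaults rather than the 1000 initial guesses passed in the MATLAB call; consult projpursuit.py for the corresponding options):

import numpy as np
import matplotlib.pyplot as plt
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# X (432 x 100 NIR matrix) and class_rep (432 labels, 1-4) are assumed loaded
Xm = X - X.mean(axis=0)

# PCA scores via SVD of the mean-centered data
U, s, Vt = np.linalg.svd(Xm, full_matrices=False)
T_pca = U[:, :2] * s[:2]

# PPA scores (default p = 2)
T_ppa, V_ppa, ppout = projpursuit(Xm)

colors = np.array(['b', 'k', 'r', 'g'])
for T, name in [(T_pca, 'PCA'), (T_ppa, 'PP')]:
    plt.figure()
    plt.scatter(T[:, 0], T[:, 1], c=list(colors[class_rep - 1]), edgecolors='black')
    plt.xlabel(f'{name} Score 1')
    plt.ylabel(f'{name} Score 2')
plt.show()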

Unsupervised Facial Recognition using Univariate PPA

Of course, the data being explored does not have to be chemical in nature; the PPA framework can be applied to any multivariate data set. In this example, we apply it to a subset of the AT&T face data set. This subset consists of 4 classes (people), each with 10 different grayscale images of their face (112 x 92 pixels). All images were vectorized along the row direction (112 x 92 --> 1 x 10304), producing a 40 x 10304 data set X, which was then column mean-centered. Let's apply PCA and PPA and plot the first two score vectors:

% Column mean-center the data
Xm = X - mean(X);

% PCA of mean-centered data via singular value decomposition
[U, S, V] = svds(Xm, 2);
T_PCA = U*S;

% PPA of mean-centered data after PCA compression (15 components) to avoid spurious low-kurtosis projections
[U, S, V] = svds(Xm, 15);
[T_PPA] = projpursuit(U*S, 2);

% Class vector (classes were vertically stacked in groups of 10)
class = [ones(1,10), 2*ones(1,10), 3*ones(1,10), 4*ones(1,10)];

% Colour vector for plotting
cvec = ['b' 'k' 'r' 'g'];

% Plot the PCA scores
figure
for i = 1:length(T_PCA)
    scatter(T_PCA(i,1), T_PCA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PCA Score 1')
ylabel('PCA Score 2')

% Plot the PPA scores
figure
for i = 1:length(T_PPA)
    scatter(T_PPA(i,1), T_PPA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PP Score 1')
ylabel('PP Score 2')

[Figure: PCA vs PPA]

While PCA reveals only 3 clusters (2 distinct classes plus 2 overlapping classes), PPA reveals 4 distinct clusters corresponding to the 4 different classes.
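A note on the compression step above: when the variables vastly outnumber the samples (10304 >> 40), almost any direction can produce a spuriously low kurtosis, so PPA is run on the leading PCA scores rather than the raw pixels. A hedged Python sketch of the same preprocessing (X is assumed loaded as the 40 x 10304 matrix; the 15-component choice follows the MATLAB example):

import numpy as np
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# X is the 40 x 10304 vectorized-face matrix described above (assumed loaded)
Xm = X - X.mean(axis=0)

# Compress to 15 PCA components before PPA to avoid spurious low-kurtosis projections
U, s, Vt = np.linalg.svd(Xm, full_matrices=False)
scores15 = U[:, :15] * s[:15]

T_ppa, V_ppa, ppout = projpursuit(scores15)  # default p = 2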

PPA can also be used to optimize the multivariate kurtosis and the recentered kurtosis. For more information on these options, the reader is encouraged to explore the literature linked in the Literature section above.