
Kurtosis-based Projection Pursuit

Kurtosis-based projection pursuit analysis (PPA) is an exploratory data analysis algorithm originally developed by Siyuan Hou and Peter Wentzell in 2011, and it remains an active area of research for the Wentzell Research Group at Dalhousie University. Instead of using variance- and distance-based metrics to explore high-dimensional data (e.g., PCA, HCA), PPA searches for interesting projections by optimizing kurtosis. This repository contains MATLAB and Python code to perform PPA, published literature on the ongoing development of PPA, and examples of how to apply PPA to uncover interesting projections in high-dimensional data. Below is a figure from our recent paper that I think demonstrates the value of searching for distributions with low kurtosis values.

[Figure: kurtosis]
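For reference, the quantity PPA optimizes for a candidate projection vector v is the sample kurtosis of the projected scores (a standard definition given here for orientation; the papers in the Literature section state the exact objectives used):

k(v) = m * sum_i (t_i - tbar)^4 / [ sum_i (t_i - tbar)^2 ]^2,   where t = X*v

Low kurtosis corresponds to flat or multimodal score distributions (often a sign of clustering), while high kurtosis corresponds to heavy-tailed distributions dominated by a few extreme samples.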

MATLAB and Python Functions

  • projpursuit.m is a MATLAB function that performs kurtosis-based projection pursuit. It has the following call format:
[T, V, ppout] = projpursuit(X,varargin)

where X is the m (samples) x n (response variables) data matrix, T is the m x p matrix of scores for the samples in each of the p dimensions (default p = 2), V is the corresponding n x p matrix of projection vectors, and ppout (1 x p) contains the final kurtosis value for each dimension.

  • projpursuit.py is a Python function that is more or less a line-by-line port of the MATLAB function, and kurtosis.py is a Python implementation of the MATLAB function kurtosis.m. A list of dependencies for running projpursuit.py is found in dependencies.txt. The Python PPA function has the following call format:
projpursuit(X, **kwargs)

which returns T, V, and ppout. A minimal usage sketch is shown below.
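As a hedged sketch of how the Python function might be called (assuming projpursuit.py is importable from the working directory; the simulated data are illustrative only):

import numpy as np
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# Simulated example: two well-separated clusters in 10 variables
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 10)),
               rng.normal(4, 1, (30, 10))])
Xm = X - X.mean(axis=0)  # column mean-center, as in the examples below

# Default call (p = 2 dimensions, mirroring the MATLAB defaults)
T, V, ppout = projpursuit(Xm)
print(T.shape, V.shape, ppout)  # (60, 2) scores, (10, 2) projection vectors, kurtosis per dimension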

Literature

Examples

Wood Identification using Near-Infrared (NIR) Spectroscopy and Univariate PPA

PPA was originally developed for searching high-dimensional chemical data for informative projections, so this example fittingly employs a data set designed for the identification of different Brazilian wood species using NIR spectroscopy. The original paper and the data for this example can be found here: Implications of measurement error structure on the visualization of multivariate chemical data: hazards and alternatives (2018).

The NIR wood data set contains 4 replicate scans of the following wood samples: 26 of crabwood, 28 of cedar, 29 of curupixa, and 25 of mahogany. This results in 432 samples across 100 NIR channels. Let's apply PCA and PPA and plot the corresponding scores:

% Column mean-center the data
Xm = X - mean(X);

% PCA of mean-centered data via singular value decomposition
[U, S, V] = svds(Xm, 2);
T_PCA = U*S;

% PPA of mean-centered data with dim = 2 and 1000 initial guesses
[T_PPA] = projpursuit(Xm, 2, 1000);

% Class vector (class_rep is the label vector supplied with the data set)
class = class_rep;

% Colour vector for plotting
cvec = ['b' 'k' 'r' 'g'];

% Plot the PCA scores
figure
for i = 1:length(T_PCA)
    scatter(T_PCA(i,1), T_PCA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PCA Score 1')
ylabel('PCA Score 2')

% Plot the PPA scores
figure
for i = 1:length(T_PPA)
    scatter(T_PPA(i,1), T_PPA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PP Score 1')
ylabel('PP Score 2')

[Figure: PCA vs PPA]
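For Python users, a rough equivalent of the wood-data workflow above might look like the following (a sketch only: X and class_rep are assumed to be loaded from the data set, and the PPA call uses the Python function's defaults rather than the 1000 initial guesses passed in the MATLAB call; consult projpursuit.py for the corresponding options):

import numpy as np
import matplotlib.pyplot as plt
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# X (432 x 100 NIR matrix) and class_rep (432 labels, 1-4) are assumed loaded
Xm = X - X.mean(axis=0)

# PCA scores via SVD of the mean-centered data
U, s, Vt = np.linalg.svd(Xm, full_matrices=False)
T_pca = U[:, :2] * s[:2]

# PPA scores (default p = 2)
T_ppa, V_ppa, ppout = projpursuit(Xm)

colors = np.array(['b', 'k', 'r', 'g'])
for T, name in [(T_pca, 'PCA'), (T_ppa, 'PP')]:
    plt.figure()
    plt.scatter(T[:, 0], T[:, 1], c=list(colors[class_rep - 1]), edgecolors='black')
    plt.xlabel(f'{name} Score 1')
    plt.ylabel(f'{name} Score 2')
plt.show()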

Unsupervised Facial Recognition using Univariate PPA

Of course, the data being explored does not have to be chemical in nature; the PPA framework can be applied to any multivariate data set. In this example, we apply it to a subset of the AT&T face data set. This subset consists of 4 classes (people), each with 10 different grayscale images of their face (112 x 92 pixels). All images were vectorized along the row direction (112 x 92 --> 1 x 10304), producing a 40 x 10304 data set X, which was then column mean-centered. Let's apply PCA and PPA and plot the first two score vectors:

% Column mean-center the data
Xm = X - mean(X);

% PCA of mean-centered data via singular value decomposition
[U, S, V] = svds(Xm, 2);
T_PCA = U*S;

% PPA of mean-centered data after PCA compression (15 components) to avoid spurious low-kurtosis projections
[U, S, V] = svds(Xm, 15);
[T_PPA] = projpursuit(U*S, 2);

% Class vector (classes were vertically stacked in groups of 10)
class = [ones(1,10), 2*ones(1,10), 3*ones(1,10), 4*ones(1,10)];

% Colour vector for plotting
cvec = ['b' 'k' 'r' 'g'];

% Plot the PCA scores
figure
for i = 1:length(T_PCA)
    scatter(T_PCA(i,1), T_PCA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PCA Score 1')
ylabel('PCA Score 2')

% Plot the PPA scores
figure
for i = 1:length(T_PPA)
    scatter(T_PPA(i,1), T_PPA(i,2), 300, cvec(class(i)), 'filled', 'MarkerEdgeColor', 'black')
    hold on
end
set(gca,'linewidth',2,'FontSize',14)
xlabel('PP Score 1')
ylabel('PP Score 2')

[Figure: PCA vs PPA]

While PCA reveals only 3 clusters (2 distinct classes plus 2 overlapping classes), PPA reveals 4 distinct clusters corresponding to the 4 different classes.
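A note on the compression step above: when the variables vastly outnumber the samples (10304 >> 40), almost any direction can produce a spuriously low kurtosis, so PPA is run on the leading PCA scores rather than the raw pixels. A hedged Python sketch of the same preprocessing (X is assumed loaded as the 40 x 10304 matrix; the 15-component choice follows the MATLAB example):

import numpy as np
from projpursuit import projpursuit  # assumes the repo's projpursuit.py is on the path

# X is the 40 x 10304 vectorized-face matrix described above (assumed loaded)
Xm = X - X.mean(axis=0)

# Compress to 15 PCA components before PPA to avoid spurious low-kurtosis projections
U, s, Vt = np.linalg.svd(Xm, full_matrices=False)
scores15 = U[:, :15] * s[:15]

T_ppa, V_ppa, ppout = projpursuit(scores15)  # default p = 2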

PPA can also be used to optimize the multivariate kurtosis and the recentered kurtosis. For more information on these options, the reader is encouraged to explore the literature linked in the Literature section above.