Principal component analysis

Easy to read, simple paper highlighting the different aspects of PCA and related diagnostic tools. Good reference for the future.


H. Abdi and L. J. Williams, Principal Component Analysis, Wiley Interdisciplinary Reviews: Computational Statistics, 2, 2010.


This is a little package that I have been using for a long time to visually explore results of PCA on grouped data. The main purpose was to have one simple command that would visualise a result of a PCA in R in 3D and color the data points by group and type.

For example, take the data set provided with the package, called “metabo” (it stems from my paper on metabolic profiling in tuberculosis):

library( pca3d )
data( metabo )

# top left bit of the metabo data frame
head( metabo )[,1:10] 

This last command shows the following output:

  group   X1   X2   X3   X4   X5   X6   X7   X8   X9
1   POS 0.78 1.10 1.26 0.87 0.68 0.65 0.72 0.77 0.88
2   POS 0.68 0.51 0.30 0.21 1.64 2.42 1.19 1.19 1.58
3   POS 1.00 1.31 1.68 1.08 2.46 1.19 1.02 1.82 1.60
4   POS 1.08 0.75 0.65 2.33 0.81 0.72 0.94 0.93 0.31
5    TB 0.87 0.81 0.99 0.85 0.92 0.69 1.12 1.50 0.70
6    TB 1.29 0.89 0.46 0.49 0.50 1.03 1.10 0.48 0.31

Each row corresponds to one serum sample, either from a TB patient or from a healthy control. The first column of the data frame metabo contains the group assignments; the remaining 423 columns hold the relative levels of different small molecules (such as sugars or amino acids) in the given serum sample. Running a PCA is straightforward:

pca <- prcomp( metabo[,-1], scale.= TRUE )
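To get a feel for what prcomp returns before plotting it, here is a minimal sketch on a synthetic stand-in data set. The random matrix x and its group labels are assumptions made up for illustration; they are not the metabo data, and the numbers have no biological meaning.

```r
# Synthetic stand-in (NOT the metabo data): 30 samples, 3 groups, 20 variables.
# The "TB" group is shifted in the first five variables so there is something to find.
set.seed(1)
group <- factor(rep(c("NEG", "POS", "TB"), each = 10))
x <- matrix(rnorm(30 * 20), nrow = 30)
x[group == "TB", 1:5] <- x[group == "TB", 1:5] + 3

# Same kind of call as in the text: variables are centered and scaled to unit variance
pca <- prcomp(x, scale. = TRUE)

# Proportion of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(head(var_explained, 3), 2)

# Sample coordinates (scores) on the first three components;
# these are the coordinates a 3D score plot would show
head(pca$x[, 1:3])
```

The scores live in pca$x and the variable loadings in pca$rotation; both are standard components of a prcomp result.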

And visualisation with pca3d is straightforward as well:

pca3d( pca, group= metabo[,1] )

A 3D plot is produced using the rgl package; you can interactively rotate it, zoom, and change the perspective. With the rgl.snapshot( filename ) command you can also export the graphics as a PNG file.

Visualisation of the metabo PCA using pca3d.

You can very clearly see that the blue balls stand apart from the rest in the first two components. What are they? It is not easy to create a reasonable legend directly on an RGL canvas, but pca3d produces a text-only legend in the main text interface:

       group:        color,        shape
         NEG:          red, tetrahaedron
         POS:       green3,         cube
          TB:         blue,       sphere

Oh, so the TB patients are really different from the rest! Neat. The really elegant thing about PCA is that it does not use any information about the group classification. Therefore, whatever grouping we see cannot be an artifact of the labels; the visualisation amounts to an independent check on the whole data set. This is very much unlike PLS, where the score plots nearly always show a clear separation; PLS is “eager to please”, as one author put it.
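The contrast between an unsupervised and a supervised projection can be sketched on pure noise. The example below is an illustration under stated assumptions, not a full PLS fit: it uses only the direction proportional to X'y (for a centered two-level y this is the difference between the two group mean vectors), which is the first weight vector of a PLS regression. The data and the labels "A" and "B" are random and hypothetical.

```r
set.seed(42)
n <- 20
p <- 500
x <- matrix(rnorm(n * p), nrow = n)               # pure noise: no real group structure
group <- factor(rep(c("A", "B"), each = n / 2))   # arbitrary, meaningless labels

xc <- scale(x, center = TRUE, scale = FALSE)

# Unsupervised: PCA never sees `group`
pc1 <- prcomp(xc)$x[, 1]

# Supervised: project onto the difference of the group means
# (proportional to X'y, the first PLS weight vector; not a complete PLS model)
w <- colMeans(xc[group == "A", ]) - colMeans(xc[group == "B", ])
w <- w / sqrt(sum(w^2))
sup <- as.vector(xc %*% w)

# Range of scores per group for each projection
tapply(pc1, group, range)
tapply(sup, group, range)
```

With many more variables than samples, the supervised scores split the two label sets cleanly even though the labels carry no information, whereas the first principal component has no reason to follow them.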

Unfortunately, the 3D plot can only be saved as a PNG. For a publication, however, a 2D PDF might be more suitable. Another command in this package, pca2d, takes exactly the same options as the pca3d command and produces a plot on the standard R graphics device:

2D version of the previous plot.
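Because a 2D score plot is drawn on the standard graphics device, it can be written to a PDF with the usual pdf() and dev.off() pair. The sketch below uses only base R graphics on a synthetic stand-in data set rather than pca2d itself; the data, the file name pca_2d.pdf, and the plotting symbols are assumptions chosen for illustration (the colors mirror the legend shown earlier).

```r
# Synthetic stand-in (NOT the metabo data)
set.seed(1)
group <- factor(rep(c("NEG", "POS", "TB"), each = 10))
x <- matrix(rnorm(30 * 20), nrow = 30)
x[group == "TB", 1:5] <- x[group == "TB", 1:5] + 3
pca <- prcomp(x, scale. = TRUE)

# One color and one plotting symbol per group
cols <- c(NEG = "red", POS = "green3", TB = "blue")
pchs <- c(NEG = 17, POS = 15, TB = 16)

# Open a PDF device, draw PC1 against PC2, then close the device
out_file <- file.path(tempdir(), "pca_2d.pdf")
pdf(out_file, width = 6, height = 6)
plot(pca$x[, 1], pca$x[, 2],
     col = cols[as.character(group)],
     pch = pchs[as.character(group)],
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = names(cols), col = cols, pch = pchs)
dev.off()
```

The same wrapping should work around any command that draws on the standard device.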

There are plenty of other options to pca3d; for example, show.labels can take a character vector as an argument and show a little text label floating above every data point.

Furthermore, it is possible to create biplots. Unlike the standard biplot function, by default only a few variables are selected from each component (those with the largest absolute loadings in that component); if too many variables are shown, the figure becomes cluttered and useless.

The red arrows show selected variables

In the above figure, several variables with high loadings can be seen.
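Selecting variables by their absolute loadings can be sketched directly from the prcomp result. The helper top_loadings below is hypothetical and written for this illustration; it is not a function from pca3d, and the package may choose its variables differently. The data are again a synthetic stand-in.

```r
# Synthetic stand-in (NOT the metabo data), with named variables
set.seed(1)
x <- matrix(rnorm(30 * 20), nrow = 30,
            dimnames = list(NULL, paste0("X", 1:20)))
pca <- prcomp(x, scale. = TRUE)

# Hypothetical helper: names of the n_top variables with the largest
# absolute loading on one component
top_loadings <- function(pca, component, n_top = 5) {
  load <- pca$rotation[, component]
  names(sort(abs(load), decreasing = TRUE))[seq_len(n_top)]
}

# Union of the top variables from the first three components
selected <- unique(unlist(lapply(1:3, function(k) top_loadings(pca, k))))
selected

# Loadings of the selected variables; these would be the arrows of a biplot
pca$rotation[selected, 1:3]
```

Restricting the arrows to such a subset keeps a biplot readable when there are hundreds of variables.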

Here is another plot, in which the cluster centroids are shown for all three groups of samples:

The large symbols indicate cluster centroids. Each sample is connected to the corresponding centroid.
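A group centroid in this sense is simply the mean of the group's scores on each component. The following minimal sketch computes centroids and each sample's distance to its own centroid with base R; it assumes a synthetic stand-in data set and does not reproduce how pca3d computes or draws them internally.

```r
# Synthetic stand-in (NOT the metabo data)
set.seed(1)
group <- factor(rep(c("NEG", "POS", "TB"), each = 10))
x <- matrix(rnorm(30 * 20), nrow = 30)
x[group == "TB", 1:5] <- x[group == "TB", 1:5] + 3
pca <- prcomp(x, scale. = TRUE)

# Scores on the first three components
scores <- pca$x[, 1:3]

# Centroid of each group: per-group sum of scores divided by group size
centroids <- rowsum(scores, group) / as.vector(table(group))
centroids

# Euclidean distance of every sample to the centroid of its own group
# (the length of the connecting line segments in the plot)
dist_to_centroid <- sqrt(rowSums((scores - centroids[as.character(group), ])^2))
tapply(dist_to_centroid, group, mean)
```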

pca3d on CRAN: