Using external data from within another package

If you repeat my mistake, you will try to use data (say, “pckgdata”) from another package (say, “pckg”) naively, like this:

someFunc <- function() {
  data(pckgdata)
  foo <- pckgdata$whatever
}

This will result in a note from R CMD check:

someFunc: no visible binding for global variable ‘pckgdata’
someFunc : <anonymous>: no visible binding for global variable
  ‘pckgdata’
Undefined global functions or variables:
  pckgdata

Here is the solution (thanks to the comments on Stack Exchange):

.myenv <- new.env(parent=emptyenv())

someFunc <- function() {
  # envir takes the environment object itself, not its name as a string
  data("pckgdata", package="pckg", envir=.myenv)
  foo <- .myenv$pckgdata$whatever
}

Actually, let us load the object as soon as our package is loaded:

.myenv <- new.env(parent=emptyenv())

.onLoad <- function(libname, pkgname){
  # load the data once into the private environment
  data("pckgdata", package="pckg", envir=.myenv)
}

someFunc <- function() {
  foo <- .myenv$pckgdata$whatever
}

Now any of the functions in our package can use pckgdata at any time. Note that we want to use .onLoad() and not .onAttach(): the latter is meant for things like startup messages when the package is attached by the user.

Alternatively, you can create your environment within the function itself:

someFunc <- function() {
  myenv <- new.env(parent=emptyenv())
  data("pckgdata", package="pckg", envir=myenv)
  foo <- myenv$pckgdata$whatever
}
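
The pattern can be tried outside of a package with a data set that ships with R. The following is only a sketch: mtcars from the datasets package stands in for pckgdata, and get_cars is a hypothetical helper name that is not part of any package.

```r
# Private environment; emptyenv() as parent keeps lookups from leaking out
.myenv <- new.env(parent = emptyenv())

# Hypothetical helper: load the data set once, then reuse the cached copy
get_cars <- function() {
  if (!exists("mtcars", envir = .myenv, inherits = FALSE)) {
    # envir must be an environment object, not its name as a string
    data("mtcars", package = "datasets", envir = .myenv)
  }
  .myenv$mtcars
}

cars <- get_cars()
dim(cars)
```

Because the data set lands in .myenv and not in the global environment, the user's workspace stays untouched.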

R-devel in parallel to regular R installation

Unfortunately, you need both: R-devel (development version of R) if you want to submit your packages to CRAN, and regular R for your research (you don’t want the unstable release for that).

Fortunately, installing R-devel in parallel is less trouble than one might think.

Say, we want to install R-devel into a directory called ~/R-devel/, and we will download the sources to ~/src/. We will first set up two environment variables to hold these two directories:

export RSOURCES=~/src
export RDEVEL=~/R-devel

Then we get the sources with SVN. In Ubuntu, you need package subversion for that:

mkdir -p $RSOURCES
cd $RSOURCES
svn co https://svn.r-project.org/R/trunk R-devel
R-devel/tools/rsync-recommended

Then we compile R-devel. The configure script might complain about missing development packages with header files. In that case, you have to work out which package is needed and install it (e.g. libcurl4-openssl-dev on Ubuntu when configure complains about missing curl):

mkdir -p $RDEVEL
cd $RDEVEL
$RSOURCES/R-devel/configure && make -j

That's it. Now we just need to set up a script to launch the development version of R:

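#!/bin/bash
# fall back to ~/R-devel if RDEVEL is not set in the calling shell
RDEVEL="${RDEVEL:-$HOME/R-devel}"
export PATH="$RDEVEL/bin/:$PATH"
export R_LIBS="$RDEVEL/library"
R "$@"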

You need to save the script in an executable file somewhere in your $PATH, e.g. ~/bin might be a good idea.

Here are commands that make this script automatically in ~/bin/Rdev:

cat <<EOF>~/bin/Rdev;
#!/bin/bash

export R_LIBS=$RDEVEL/library
export PATH="$RDEVEL/bin/:\$PATH"
R "\$@"
EOF
chmod a+x ~/bin/Rdev

The last remaining step is to populate the library with the packages that R-devel needs to build and check my packages, in my case c("knitr", "devtools", "ellipse", "Rcpp", "extrafont", "RColorBrewer", "beeswarm", "testthat", "XML", "rmarkdown", "roxygen2" ) and others (I keep expanding this list while checking my packages). I also need the Bioconductor packages limma and org.Hs.eg.db for a package that I build.
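
A small sketch of how this could be done from within a session started with Rdev. The helper name missing_pkgs is hypothetical, and the interactive() guard is my own addition so that the snippet does not start downloading when sourced in batch mode.

```r
# Packages needed for checking; the list is taken from the text above
pkgs <- c("knitr", "devtools", "ellipse", "Rcpp", "extrafont", "RColorBrewer",
          "beeswarm", "testthat", "XML", "rmarkdown", "roxygen2")

# Hypothetical helper: which of the wanted packages are not installed yet?
missing_pkgs <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

todo <- missing_pkgs(pkgs)
if (interactive() && length(todo) > 0) {
  # install.packages() installs into the first entry of .libPaths()
  install.packages(todo)
}
```

Running it again after extending pkgs installs only what is still missing.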

Now I can build and check my packages with Rdev CMD build xyz followed by Rdev CMD check xyz_xyz.tar.gz

pca3d

This is a little package that I have been using for a long time to visually explore results of PCA on grouped data. The main purpose was to have one simple command that would visualise a result of a PCA in R in 3D and color the data points by group and type.

For example, take the data set provided with the package, called “metabo” (it stems from my paper on metabolic profiling in tuberculosis):

library( pca3d )
data( metabo )

# top left bit of the metabo data frame
head( metabo )[,1:10] 

This last command shows the following output:

  group   X1   X2   X3   X4   X5   X6   X7   X8   X9
1   POS 0.78 1.10 1.26 0.87 0.68 0.65 0.72 0.77 0.88
2   POS 0.68 0.51 0.30 0.21 1.64 2.42 1.19 1.19 1.58
3   POS 1.00 1.31 1.68 1.08 2.46 1.19 1.02 1.82 1.60
4   POS 1.08 0.75 0.65 2.33 0.81 0.72 0.94 0.93 0.31
5    TB 0.87 0.81 0.99 0.85 0.92 0.69 1.12 1.50 0.70
6    TB 1.29 0.89 0.46 0.49 0.50 1.03 1.10 0.48 0.31

Each row corresponds to one serum sample, either from a TB patient or from a healthy control. The first column of the data frame metabo contains the group assignments; the remaining 423 columns correspond to relative levels of different small molecules (like sugars or amino acids) in the given serum sample. Running a PCA is straightforward:

pca <- prcomp( metabo[,-1], scale.= TRUE )
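
Before looking at a 3D plot, it can help to check how much of the total variance the first three components cover. The sketch below uses the iris data set that ships with R as a stand-in (metabo requires pca3d to be installed); pca_iris and var_explained are names chosen here for illustration.

```r
# iris ships with R; column 5 (Species) plays the role of the group column
pca_iris <- prcomp(iris[, -5], scale. = TRUE)

# proportion of total variance captured by each component
var_explained <- pca_iris$sdev^2 / sum(pca_iris$sdev^2)

# cumulative share covered by the three axes of a 3D plot
round(cumsum(var_explained)[1:3], 3)
```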

And visualisation with pca3d is straightforward as well:

pca3d( pca, group= metabo[,1] )

A 3D output (using the rgl package) is produced — you can interactively turn, zoom and change the perspective of the plot. Also, with the rgl.snapshot( filename ) command you can export the graphics as a PNG file.

Visualisation of the metabo PCA using pca3d.

You can very clearly see that the blue balls stand apart from the rest in the first two components. What are they? It is not easy to create a reasonable legend directly on an RGL canvas, but pca3d produces a text-only legend in the main text interface:

Legend:
----------------------------------------
       group:        color,        shape
----------------------------------------
         NEG:          red, tetrahaedron
         POS:       green3,         cube
          TB:         blue,       sphere

Oh, so the TB patients are really different from the rest! Neat. The really elegant thing about PCA is that it does not use any information about the group classification. Therefore, whatever groups we see, they are real: the visualisation corresponds to an independent validation on the whole data set. This is very much unlike PLS, where the score plots always show a clear separation; PLS is eager to please, as one author put it.

Unfortunately, the 3D plot can only be saved as a PNG. For a publication, however, a 2D PDF might be more suitable. Another command in this package, pca2d, takes exactly the same options as the pca3d command and produces a plot on the standard R graphics device:

2D -version of the previous plot.
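
Since pca2d draws on the standard device, wrapping the call in pdf() and dev.off() should be enough to get a PDF. This is a minimal sketch: save_pdf is a hypothetical helper, the file name is made up, and the pca2d call only runs if pca3d is installed.

```r
# Hypothetical helper: draw a plot into a PDF file and close the device again
save_pdf <- function(file, draw, width = 7, height = 7) {
  pdf(file, width = width, height = height)
  on.exit(dev.off())
  draw()
  invisible(file)
}

if (requireNamespace("pca3d", quietly = TRUE)) {
  data("metabo", package = "pca3d")
  pca <- prcomp(metabo[, -1], scale. = TRUE)
  save_pdf("metabo_pca2d.pdf",
           function() pca3d::pca2d(pca, group = metabo[, 1]))
}
```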

There are plenty of other options to pca3d, for example show.labels can take a character vector as an argument and show a little text floating above every data point.

Furthermore, it is possible to create biplots. Unlike the normal biplot function, by default only a few variables are selected from each component (by their absolute loadings in that component) — if there are too many variables visualised, the figure is cluttered and useless.

The red arrows show selected variables

In the above figure, several variables with high loadings can be seen.
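
The idea of picking a few variables per component by absolute loading can be sketched in base R. This is not necessarily how pca3d selects the variables internally; top_loadings is a hypothetical helper, and iris again stands in for metabo.

```r
pca_iris <- prcomp(iris[, -5], scale. = TRUE)

# Hypothetical helper: names of the n variables with the largest absolute
# loadings in each of the requested components
top_loadings <- function(pca, components = 1:3, n = 2) {
  picked <- lapply(components, function(i) {
    loadings <- pca$rotation[, i]
    names(sort(abs(loadings), decreasing = TRUE))[seq_len(n)]
  })
  names(picked) <- colnames(pca$rotation)[components]
  picked
}

top_loadings(pca_iris)
```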

Another plot, in which the cluster centroids are shown for all three groups of samples:

The large symbols indicate cluster centroids. Each sample is connected to the corresponding centroid.
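
What such a centroid is can be shown with a few lines of base R: the mean of the scores of each group in the first three components. This is only a sketch on the iris data set and makes no claim about how pca3d computes the centroids; the variable names are chosen for illustration.

```r
pca_iris <- prcomp(iris[, -5], scale. = TRUE)
scores <- pca_iris$x[, 1:3]

# mean score per group in each of the first three components
centroids <- aggregate(scores, by = list(group = iris$Species), FUN = mean)

# Euclidean distance of each sample to the centroid of its own group
own <- as.matrix(centroids[match(iris$Species, centroids$group), -1])
dist_to_centroid <- sqrt(rowSums((scores - own)^2))
```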

pca3d on CRAN: http://cran.r-project.org/web/packages/pca3d/