Custom comparison function for sorting data in R

Many languages allow you to use a custom comparison function in sorting. R is not an exception, but it is not entirely straightforward – it requires you to define a new class and overload certain operators. Here is how to do it.

Consider the following example. You have a certain number of paired values, for example

v <- list(a=c(2,1), b=c(1,3), d=c(1,1), e=c(2,3))

The job is to order these pairs in the following way. Given two pairs, p1=(x1, y1) and p2=(x2, y2), p1 < p2 iff one of the following conditions is fulfilled: either x1 < x2 and y1 <= y2, or x1 <= x2 and y1 < y2. The point is that if we draw lines, where one end of the line is at the height x1, and the other end is at the height y1, we want to sort these lines only if they do not cross — at most, only if one of their ends overlaps (but not both, because then the lines would be identical):

On the figure above, left panel, p1 < p2, because one of the ends is below the end of the other line (x1 < x2 and y1=y2). Of course, if y1 < y2 the inequality still holds. On the other hand, the right panel shows a case where we cannot resolve the comparison; the lines cross, so we should treat them as equal.

If now we have a list of such pairs and want to order it, we will have a problem. Here is the thing: the desired order is {d, a, b, e}. The element d=(1,1) is clearly smaller (as defined above) than all the others. However, b=(1,3) is not smaller than a=(2,1), and a is not smaller than b; that means, that a is equal to b, and their order should not be modified.

There is no way to do that with regular tools such as order, especially since x and y may not only be on different scales — they might be even completely different data types! One might be a numeric vector, the other a character string. Or, possibly, a type of requisite from Monty Python (with a defined relation stating that a banana is less than a gun). We must use a custom comparator.

For this, we need to notice that the R functions sort and order rely on the function xtfrm. This in turns relies on the methods ==, &gt; and [, defined for a given class. For numeric vectors, for example, these give what you would expect.

Our v vector is a list with elements which are pairs of numbers. For this type of data, there is no comparison defined; and comparing two pairs of numbers results with a vector of two logical numbers, which is not what we want.

> v[1] < v[2]
Error in v[1] < v[2] : comparison of these types is not implemented
> v[[1]] < v[[2]]

R, however, is an object oriented language (even if it does not always feel like that). Comparisons (“, ==) are generic functions and it is possible to define (or redefine) them for any class of objects. So here is the plan: we invent a new class for the object v, and define custom comparisons for the elements of this class of objects. Remember that if we define a function which name consists of a generic (like "plot" or "["), a dot, and a name of the class, we are defining the method for the given class:

## make v an object of class "foo"
class(v) <- "foo"

## to use the "extract" ([) method, 
## we need to momentarily change the class of x, because 
## otherwise we will end up in an endless loop
'[.foo' <- function(x, i) {
    class(x) <- "list"
    x <- x[i]
    class(x) <- "foo"

## define ">" as stated above
## the weird syntax results from the fact that a and b
## are lists with one element, this element being a vector 
## of a pair of numbers
'>.foo' <- function(a,b) {
a <- a[[1]]
b <- b[[1]]
ifelse( (a[1] > b[1] && a[2] >= b[2])
        (a[1] >= b[1] && a[2] > b[2]), TRUE, FALSE)

## if we can't find a difference, then there is no difference
'' <- function(a, b) 
    ifelse(a > b || b > a, FALSE, TRUE)

## we don't need that, but for the sake of completeness...
'<.foo' <- function(a, b) b > a

This will now do exactly what we want:

> v["a"] == v["b"]
[1] TRUE
> v["a"] > v["d"]
[1] TRUE
> sort(v)
[1] 1 1

[1] 2 1

[1] 1 3

[1] 2 3

[1] "foo"

R, shiny and source()

This one cost me more time to figure out than it should have. The reason being, it turns out that I never properly understood what the source() function does.

So here is the story: I was setting up a shiny server for a student based on her code. She was running the shiny app from within RDesktop, and so before starting the app with runApp() she would load all necessary object and source() a file called helpers.R with some common calculations.

In order to put the app on a server, I have moved these pre-runApp() initializations into ui.R and server.R. Suddenly, weird errors appeared. The functions in the helpers.R no longer seemed to be able to find anything in the parent environment — object X not found! Even though I called source() immediately after loading the necessary objects into the environment:

# file server.R

The solution was, as usual, to read documentation. Specifically, documentation on source():

local   TRUE, FALSE or an environment, determining where the 
        parsed expressions are evaluated. FALSE (the default) 
        corresponds to the user's workspace (the global 
        environment) and TRUE to the environment from which 
        source is called.

The objects which I have load()-ed before were not in the global environment, but instead in another environment created by shiny. However, the expressions from helpers.R were evaluated in the global environment. Thus, a new function defined in helpers.R could be seen from inside server.R, but an object loaded from server.R could not be seen by helpers.R.

It is the first time that I have noticed this. Normally, I would use a file such as helpers.R only to define helper functions, and actually call them from server.R or ui.R. However, I was thinking that source() is something like #include in C, simply calling the commands in the given file as if they were inserted at this position into the code — or called from the environment from which source() was called.

This is not so.

Adding figure labels (A, B, C, …) in the top left corner of the plotting region

I decided to submit a manuscript using only R with knitr, pandoc and make. Actually, it went quite well. Certainly, revisions of manuscript with complex figures did not require much of manual work once the R code for the figures has been created. The manuscript ended up as a Word file (for the sake of co-authors), looking no different than any other manuscript. However, you can look up precisely how all the figures have been generated and, with a single command, re-create the manuscript (with all figures and supplementary data) after you changed a parameter.

One of the small problems I faced was adding labels to pictures. You know — like A, B, C… in the top right corner of each panel of a composite figure. Here is the output I was striving at:

Doing it proved to be more tedious than I thought at first. By default, you can only plot things in the plotting region, everything else gets clipped — you cannot put arbitrary text anywhere outside the rectangle containing the actual plot:

text(-20, 0, "one two three four", cex=2)

This is because the plotting are is the red rectangle on the figure below, and everything outside will not be shown by default:

One can use the function mtext to put text on the margins. However, there is no simple way to say “put the text in the top left corner of the figure”, and the results I was able to get were never perfect. Anyway, to push the label really to the very left of the figure region using mtext, you first need to have the user coordinate of that region (to be able to use option ‘at’). However, if you know these coordinates, it is much easier to achieve the desired effect using text.

First, we need to figure out a few things. To avoid clipping of the region, one needs to change the parameter xpd:


Then, we need to know where to draw the label. We can get the coordinates of the device (in inches), and then we can translate these to user coordinates with appropriate functions:

di <- dev.size("in")
x <- grconvertX(c(0, di[1]), from="in", to="user")
y <- grconvertY(c(0, di[2]), from="in", to="user")

x[1] and y[2] are the coordinates of the top left corner of the device… but not of the figure. Since we might have manipulated the layout, for example, by calling par(mfrow=...) or layout to put multiple plots on the device, and we would like to always label the current plot only (i.e. put the label in the corner of the current figure, not of the whole device), we have to take this into account as well:

fig <- par("fig")
x <- x[1] + (x[2] - x[1]) * fig[1:2]
y <- y[1] + (y[2] - y[1]) * fig[3:4]

Before plotting, we have to adjust this position by half of the text string width and height, respectively:

txt <- "A"
x <- x[1] + strwidth(txt, cex=3) / 2
y <- y[2] - strheight(txt, cex=3) / 2
text(x, y, txt, cex=3)

Looks good! That is exactly what I wanted:

Below you will find an R function that draws a label in one of the three regions — figure (default), plot or device. You specify the position of the label using the labels also used by legend: “topleft”, “bottomright” etc.

First, a few examples how to use it:

Basic use:

sapply(LETTERS[1:4], function(x) { 
    fig_label(x, cex=2) 


Plotting at different positions and in different regions:

for(i in c("topleft", "topright", "top", 
           "left", "center", "right", 
           "bottomleft", "bottom", "bottomright")) {
    fig_label(i, pos=i, cex=2, col="blue")
    fig_label(i, pos=i, cex=1.5, col="red", region="plot")


All the different regions:

sapply(LETTERS[1:4], function(x) { 
    fig_label("figure region", cex=2, col="red") 
    fig_label("plot region", region="plot", cex=2, col="blue")
fig_label("device region", cex=2, pos="bottomright", 
  col="darkgreen", region="device")


And here is the function:

fig_label <- function(text, region="figure", pos="topleft", cex=NULL, ...) {

  region <- match.arg(region, c("figure", "plot", "device"))
  pos <- match.arg(pos, c("topleft", "top", "topright", 
                          "left", "center", "right", 
                          "bottomleft", "bottom", "bottomright"))

  if(region %in% c("figure", "device")) {
    ds <- dev.size("in")
    # xy coordinates of device corners in user coordinates
    x <- grconvertX(c(0, ds[1]), from="in", to="user")
    y <- grconvertY(c(0, ds[2]), from="in", to="user")

    # fragment of the device we use to plot
    if(region == "figure") {
      # account for the fragment of the device that 
      # the figure is using
      fig <- par("fig")
      dx <- (x[2] - x[1])
      dy <- (y[2] - y[1])
      x <- x[1] + dx * fig[1:2]
      y <- y[1] + dy * fig[3:4]

  # much simpler if in plotting region
  if(region == "plot") {
    u <- par("usr")
    x <- u[1:2]
    y <- u[3:4]

  sw <- strwidth(text, cex=cex) * 60/100
  sh <- strheight(text, cex=cex) * 60/100

  x1 <- switch(pos,
    topleft     =x[1] + sw, 
    left        =x[1] + sw,
    bottomleft  =x[1] + sw,
    top         =(x[1] + x[2])/2,
    center      =(x[1] + x[2])/2,
    bottom      =(x[1] + x[2])/2,
    topright    =x[2] - sw,
    right       =x[2] - sw,
    bottomright =x[2] - sw)

  y1 <- switch(pos,
    topleft     =y[2] - sh,
    top         =y[2] - sh,
    topright    =y[2] - sh,
    left        =(y[1] + y[2])/2,
    center      =(y[1] + y[2])/2,
    right       =(y[1] + y[2])/2,
    bottomleft  =y[1] + sh,
    bottom      =y[1] + sh,
    bottomright =y[1] + sh)

  old.par <- par(xpd=NA)

  text(x1, y1, text, cex=cex, ...)

Using external data from within another package

If you make the error which I did, you will try to use the data (say, “pckgdata”) from another package (say, “pckg”) naively like this:

someFunc <- function() {
  foo <- pckgdata$whatever

This will result in an error:

someFunc: no visible binding for global variable ‘pckgdata’
someFunc : <anonymous>: no visible binding for global variable
Undefined global functions or variables:

Here is the solution (thanks to the comments from stackexchange:

.myenv <- new.env(parent=emptyenv())

someFunc <- function() {
  data("pckgdata", package="pckg", envir=".myenv")
  foo <- .myenv$pckgdata$whatever

Actually, let us load the object as soon as our package is loaded:

.myenv <- new.env(parent=emptyenv())

.onLoad <- function(libname, pkgname){
  data("pckgdata", package="pckg", envir=".myenv") 

someFunc <- function() {

  foo <- .myenv$pckgdata$whatever

Now any of the functions in our package can use the pckgdata, whenever. Note that we want to use .onLoad(), and not .onAttach() — the latter one is for such things as startup messages when the package is manually attached by the user.

Alternatively, you can create your environment within the function itself:

<br />someFunc <- function() {
  myenv <- new.env(parent=emptyenv())
  data("pckgdata", package="pckg", envir="myenv")
  foo <- .myenv$pckgdata$whatever

All my life

Here is a little script to show you your life. In weeks. Each point is a week. Each black point is a week that you have already spent. The number of weeks corresponds to 90 years, which is higher than the current life expectancy anywhere in the world.

Have fun.

birthdate <- "1973-05-25"
seq1 <- 1:(as.numeric((Sys.Date() - as.Date(birthdate)))/7) - 1
seq2 <- 1:(90*52) - 1
plot(NULL, xlim=c(0,53), ylim=c(91, 0), bty="n", xlab="Week of the year", ylab="Age")
points(seq2 %% 52, floor(seq2 / 52), pch=15, col="grey")
points(seq1 %% 52, floor(seq1 / 52), pch=15)


Scientific Editor

I have a dream: a scientific editor that would be suitable for editing scientific papers. Currently, Word is king. There is no way around it: everybody has it, everybody uses it, sooner or later you will have a co-author who knows how to edit text only when writing an e-mail or when writing in Word. Chances are, that collaborator will be one of the big fishes on your authors list, maybe even your boss. Word has some basic collaborative options (like tracking changes and comments), bibliography (via Endnote or similar tools), and is accepted by most journals.

However, Word sucks. Big time. I know you know it.

I envision an open source solution that is based on an updated markdown syntax and the pandoc system. Here is, point by point, an informal specification of the system.

  1. Markdown as the primary text format. A user should be able to edit markdown directly without compromising any information contained in the document or write the document in rmarkdown and directly pass it down to the system.
  2. Zip file as the primary format: documents would be exchanged using a zip file containing
    • manuscript file in markdown
    • bibliography file (I would recommend BibTeX)
    • figure files
  3. Bibliography as in pandoc — bibliography in a format that is acceptable by pandoc, with CLS for reference formatting.
  4. Figures: figures are a major pain in the neck. Publishers require usually a vector graphic format or a high resolution image, but you want low-res previews in your print files or documents. The manuscript zip file should therefore either contain both, or only previews, with original files to be contained later.
  5. Markdown extensions:
    • extensions that would allow a “rich” export to Word’s docx: marking reviews, comments etc.
    • (better) figure and label captioning and cross-referencing
    • special bibliography sections (currently, you can only place the references at the end of the file)
  6. A visual UI with editor:
    • Java or similar that allows a painless installation process for even the least computer-savvy users, and allows them to edit the manuscript in a way that they are used to
    • GUI operations for an easy update of bibliography (I mean like really easy, just paste+copy of whatever: pubmed ids, google scholar links etc)
    • Equation editor, table editor etc. suitable for saving in markdown format
    • Version tracking
  7. Version tracking and managing revisions. Still pondering how to do this best, but this should be one of the major points for the system.
  8. Misc operations. The system should be able to quickly and painlessly accomplish following tasks:
    • split the manuscript into submission files by using logical definitions in the markdown (e.g. in the main manuscript file, separate figure files, separate supplementary data files)
    • provide detailed statistics on the document (word count)
    • possibly the visual UI could provide a plug-in to facilitate submission specifically in some of the most common manuscript submission systems (e.g. manuscript central).

Fold-change bar plots with “0” on y axis

I see it more and more frequently: bar plots which are supposed to illustrate the regulation of a gene in terms of “fold change”, which include a “0” on the y axis.

It is subtle, but it irks me a lot. Also, the last time I tried to argue with my experimentally working colleagues, I heard that “everybody does it like this” and that I am nit-picking.

What is the fold change? Suppose that you have a before and after measurements, a_0 and a_1. Now, the fold change is


Could you replace a_0 by a_1 and vice versa? Yes, you could define it as \frac{a_0}{a_1}, right? Fold change decrease (how many times smaller) rather than fold change increase (how many times larger).

OK, so what does that mean if the fold change is equal to 0?

First, think what it means that the fold change is equal to 0.5. That means that a_1 is half of a_0, or that a_0 is two times that of a_1.

What about 0.1? That means that a_1 is ten times smaller than a_0.

0.01? Hundred times.

0.001? Thousand times.

You see where this is going. As we approach zero, the relation \frac{a_0}{a_1} approaches infinity; you could say (incorrectly) that when fold change is equal to zero, a_1 is infinitely smaller than a_0.

Of course, this is outside of regular statistics. In other words, a fold change of 0 is meaningless and cannot be computed. If you measured a_1 and it was zero, you cannot meaningfully compute the fold change. Putting a zero on the y axis is therefore as meaningfull as putting “infinity”.

For that and other reasons, in many applications one calculates the log-fold change rather than fold change:

log_2{FC} = \log_2\frac{a_1}{a_0} = \log_2{a_1} - \log_2{a_0}

That makes the measure nice and symmetric around 0. If a_1 is twice higher than a_0, then log_2{FC}=1. If it is half of a_0, then log_2{FC}=-1. Also, it follows that a_0 and a_1 cannot be equal to 0 — because you cannot logarithmize zero.

Moreover, in most applications, logFC is (more or less) normally distributed. Fold change not only isn’t, it is not even possible for it to be. That means that not only putting a zero on the y axis is meaningless; but calculating parametric statistics such as mean and standard deviation of fold change is equally misleading. You simply shouldn’t do that.

But people nonetheless do, and they are happy with that. That is why we cannot have nice things.