R, rmarkdown, cache and objects

If your R Markdown document takes hours to render and you want to generate different output types on the fly, the output_format option of rmarkdown::render is extremely annoying: every time you change the output format, the cache is reset, so you have to wait hours again to get the other format.

I found no clean solution to this problem, but here is an ugly hack. We create a copy of the document and render the copy. The first time it will take hours, but from then on the copy has its own cache, separate from that of your original document:

file.copy("test.rmd", "test_html.rmd", overwrite=TRUE)
rmarkdown::render("test_html.rmd")

Of course, this is annoying, so we can wrap these two commands in a function. But beware! By default, render evaluates the document in the environment it was called from (parent.frame()), which inside a wrapper is the function's own environment. To make sure the document is evaluated in the global environment, you need to set the envir argument. Here is a wrapper function which, by default, also opens the resulting document in google-chrome:

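myrender <- function(fn, open=TRUE) {

  fb <- gsub("\\.rmd$", "", fn, ignore.case=TRUE)
  fn2 <- paste0(fb, "_html.rmd")
  file.copy(fn, fn2, overwrite=TRUE)
  ## evaluate in the global environment, not in the function's own
  res <- rmarkdown::render(fn2, output_format="html_document", envir=globalenv())
  ## open the result only if requested
  if(open) system(sprintf("google-chrome %s", shQuote(res)))
  res
}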
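The same copy-per-format idea extends to more than one output format. Below is a minimal sketch, assuming one copy of the source per format is acceptable. The helper name format_copy and the naming scheme are my own, made up for illustration; they are not part of rmarkdown. The render call is left commented out so that the snippet runs without a real document:

```r
# Hypothetical helper: make one copy of the source per output format,
# e.g. test.rmd -> test_html.rmd, test_pdf.rmd
format_copy <- function(fn, format = "html_document") {
  base   <- sub("\\.rmd$", "", fn, ignore.case = TRUE)
  suffix <- sub("_document$", "", format)   # "html_document" -> "html"
  fn2    <- paste0(base, "_", suffix, ".rmd")
  file.copy(fn, fn2, overwrite = TRUE)
  fn2
}

# Demonstration on a throwaway file in the temporary directory
src <- file.path(tempdir(), "test.rmd")
writeLines(c("---", "title: test", "---", "", "Hello"), src)

copies <- vapply(c("html_document", "pdf_document"),
                 function(f) format_copy(src, f), character(1))
basename(copies)

# Each copy would then be rendered separately, for example:
# rmarkdown::render(copies[["pdf_document"]],
#                   output_format = "pdf_document", envir = globalenv())
```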

Accessing object from another package by variable

Say the package name is stored in the variable x, and the name of the object you would like to access from that package (without attaching the package with library()) is stored in y. Of course, x::y will not work, because :: takes its arguments literally. Fortunately, :: is just a wrapper around the function getExportedValue, so the following will work:

getExportedValue(x, y)
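A quick check with a package that ships with R (the choice of stats and median is only for illustration):

```r
x <- "stats"    # package name, held in a variable
y <- "median"   # name of the exported object, held in a variable

f <- getExportedValue(x, y)

# f is the very same function that stats::median refers to
identical(f, stats::median)
# [1] TRUE

f(c(1, 3, 10))
# [1] 3
```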

Add a file to scan when completing in vim

OK, so this was an easy thing to find out, but enormously useful.

In vim, Ctrl-P and Ctrl-N let you complete a word in insert mode. By default (see the complete option), vim scans the current buffer, buffers in other windows, included files etc. However, to add a specific file (in my case, a bibliography which I like to have open in another terminal), you need to add a k to the complete option, followed by the location of the file:

:set complete+=k./bibliography.bib

Invert a list / map

Often we use lists to map keywords onto values, for example

foo <- list(a=c("quark", "fark"), 
            b=c("quark", "foo", "bark"), 
            c=c("fark", "bark"))

To invert this list (such that “fark”, “bark” etc. become keywords, and “a”, “b” and “c” the values), do

foo.rev <- split(rep(names(foo), lengths(foo)), unlist(foo))

split splits a vector or data frame along a factor. In this case, we expand the names of foo using rep such that we get two vectors, as can be seen with the following command:

cbind(rep(names(foo), lengths(foo)), unlist(foo))

with the result

   [,1] [,2]   
a1 "a"  "quark"
a2 "a"  "fark" 
b1 "b"  "quark"
b2 "b"  "foo"  
b3 "b"  "bark" 
c1 "c"  "fark" 
c2 "c"  "bark" 

When we apply split() to the first vector with the second to guide the split, we will get

$bark
[1] "b" "c"

$fark
[1] "a" "c"

$foo
[1] "b"

$quark
[1] "a" "b"
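Wrapped in a function, the one-liner can be checked by inverting twice. This is just a sketch; the name invert_list is my own:

```r
# Hypothetical wrapper around the split/rep one-liner
invert_list <- function(l) {
  split(rep(names(l), lengths(l)), unlist(l, use.names = FALSE))
}

foo <- list(a = c("quark", "fark"),
            b = c("quark", "foo", "bark"),
            c = c("fark", "bark"))

foo.rev <- invert_list(foo)
foo.rev$quark
# [1] "a" "b"

# Inverting again recovers the original mapping; only the order of the
# values within each element may differ, hence the setequal comparison
back <- invert_list(foo.rev)
all(mapply(setequal, back[names(foo)], foo))
# [1] TRUE
```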

Using caret

A caret call I frequently use. Given that x is the training data and y the response:

library(caret)
library(doMC)
registerDoMC(cores=6)

tc <- trainControl(method="repeatedcv", number=10, repeats=1, 
  returnData=TRUE, savePredictions="all", verboseIter=TRUE, classProbs=TRUE)
mod <- train(x=x, y=y, trControl=tc, method="rf",
  tuneGrid=data.frame(mtry=500))
  • library(doMC) and registerDoMC allow me to use more than one processor
  • repeatedcv: repeated k-fold cross-validation. If more than one repeat is requested, change the repeats= parameter; for this, repeatedcv must be used instead of cv
  • savePredictions: if we want to evaluate predictions on our own
  • verboseIter: to see the progress
  • classProbs: to report class probabilities, so we can use them to calculate ROC post factum
  • tuneGrid: if not specified, caret will tune parameters. Normally, we don’t want that
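With savePredictions and classProbs set, the held-out predictions end up in mod$pred, with the observed class in the column obs and one probability column per class. As a rough sketch of the post factum ROC calculation, here is the area under the curve computed in base R with the rank-sum (Mann-Whitney) formula. The data frame pred below is a made-up stand-in for mod$pred; the class labels "yes" and "no" and the helper name auc are assumptions for the sake of the example:

```r
# Made-up stand-in for mod$pred: observed class and probability of "yes"
pred <- data.frame(
  obs = factor(c("yes", "yes", "no", "no"), levels = c("no", "yes")),
  yes = c(0.9, 0.4, 0.6, 0.1)
)

# AUC via the rank-sum formula: the probability that a randomly chosen
# positive gets a higher score than a randomly chosen negative
auc <- function(score, is_pos) {
  r  <- rank(score)          # average ranks take care of ties
  n1 <- sum(is_pos)          # number of positives
  n0 <- sum(!is_pos)         # number of negatives
  (sum(r[is_pos]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc(pred$yes, pred$obs == "yes")
# [1] 0.75
```

In a real mod$pred there is one row per held-out sample, resample and tuning value, so it may be necessary to subset it first.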

Custom comparison function for sorting data in R

Many languages allow you to use a custom comparison function in sorting. R is not an exception, but it is not entirely straightforward – it requires you to define a new class and overload certain operators. Here is how to do it.

Consider the following example. You have a certain number of paired values, for example

v <- list(a=c(2,1), b=c(1,3), d=c(1,1), e=c(2,3))

The job is to order these pairs in the following way. Given two pairs, p1=(x1, y1) and p2=(x2, y2), p1 < p2 iff one of the following conditions is fulfilled: either x1 < x2 and y1 <= y2, or x1 <= x2 and y1 < y2. The point is that if we draw lines, where one end of the line is at the height x1, and the other end is at the height y1, we want to sort these lines only if they do not cross — at most, only if one of their ends overlaps (but not both, because then the lines would be identical):

On the figure above, left panel, p1 < p2, because one of the ends is below the end of the other line (x1 < x2 and y1=y2). Of course, if y1 < y2 the inequality still holds. On the other hand, the right panel shows a case where we cannot resolve the comparison; the lines cross, so we should treat them as equal.

If we now have a list of such pairs and want to order it, we have a problem. Here is the thing: the desired order is {d, a, b, e}. The element d=(1,1) is clearly smaller (as defined above) than all the others. However, b=(1,3) is not smaller than a=(2,1), and a is not smaller than b; that means that a is treated as equal to b, and their relative order should not be changed.

There is no way to do that with regular tools such as order, especially since x and y may not only be on different scales — they might be even completely different data types! One might be a numeric vector, the other a character string. Or, possibly, a type of requisite from Monty Python (with a defined relation stating that a banana is less than a gun). We must use a custom comparator.

For this, we need to notice that the R functions sort and order rely on the function xtfrm. This in turn relies on the methods ==, > and [, defined for a given class. For numeric vectors, for example, these give what you would expect.

Our v is a list whose elements are pairs of numbers. For this type of data, there is no comparison defined, and comparing two pairs of numbers results in a vector of two logical values, which is not what we want.

> v[1] < v[2]
Error in v[1] < v[2] : comparison of these types is not implemented
> v[[1]] < v[[2]]
[1] FALSE  TRUE

R, however, is an object oriented language (even if it does not always feel like that). Comparisons (<, >, ==) are generic functions, and it is possible to define (or redefine) them for any class of objects. So here is the plan: we invent a new class for the object v, and define custom comparisons for the elements of this class of objects. Remember that if we define a function whose name consists of a generic (like "plot" or "["), a dot, and the name of the class, we are defining the method for the given class:

## make v an object of class "foo"
class(v) <- "foo"

## to use the "extract" ([) method, 
## we need to momentarily change the class of x, because 
## otherwise we will end up in an endless loop
'[.foo' <- function(x, i) {
    class(x) <- "list"
    x <- x[i]
    class(x) <- "foo"
    x
}

## define ">" as stated above
## the weird syntax results from the fact that a and b
## are lists with one element, this element being a vector 
## of a pair of numbers
'>.foo' <- function(a,b) {
    a <- a[[1]]
    b <- b[[1]]
    (a[1] > b[1] && a[2] >= b[2]) ||
    (a[1] >= b[1] && a[2] > b[2])
}

## if we can't find a difference, then there is no difference
'==.foo' <- function(a, b) 
    !(a > b || b > a)

## we don't need that, but for the sake of completeness...
'<.foo' <- function(a, b) b > a

This will now do exactly what we want:

> v["a"] == v["b"]
[1] TRUE
> v["a"] > v["d"]
[1] TRUE
> sort(v)
$d
[1] 1 1

$a
[1] 2 1

$b
[1] 1 3

$e
[1] 2 3

attr(,"class")
[1] "foo"
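The same recipe works when the values are not numbers at all. Here is a small sketch for the Monty Python case mentioned earlier. The class name prop, the weight table and the "pointed stick" item are made up for illustration; only the relation "a banana is less than a gun" comes from the text above:

```r
# Hypothetical ordering of props: a banana is less than a gun
weight <- c(banana = 1, "pointed stick" = 2, gun = 3)

x <- c("gun", "banana", "pointed stick", "banana")
class(x) <- "prop"

# subsetting must keep the class, otherwise the comparison
# methods below would no longer be dispatched
`[.prop` <- function(x, i) {
    structure(unclass(x)[i], class = "prop")
}

# compare two props by looking up their weights
`>.prop` <- function(a, b)
    unname(weight[unclass(a)] > weight[unclass(b)])

`==.prop` <- function(a, b)
    unname(weight[unclass(a)] == weight[unclass(b)])

as.character(sort(x))
# [1] "banana"        "banana"        "pointed stick" "gun"
```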