rmarkdown and terminal colors

The R output in rmarkdown is stripped of all terminal control sequences: colors and formatting (e.g., bold or italics). However, they are relatively easy to restore. For this, one needs to install the fansi package and include the following chunk in the rmarkdown document, hooking a custom function to the output:

    ```{r echo=FALSE}
    options(crayon.enabled = TRUE)
    knitr::knit_hooks$set(output = function(x, options){
      paste0(
        '<pre class="r-output"><code>',
        fansi::sgr_to_html(x = htmltools::htmlEscape(x), warn = FALSE),
        '</code></pre>'
      )
    })
    ```
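For orientation, here is a minimal base-R sketch of what such a control sequence looks like before any hook touches it. The example string and the regular expression are my own illustration and not part of fansi or knitr:

```r
# An SGR ("select graphic rendition") sequence has the form ESC [ <codes> m
red_text <- "\033[31msignificant\033[0m"

# In a color-capable terminal this prints the word in red
cat(red_text, "\n")

# Removing the sequences with a regular expression leaves the plain text
plain_text <- gsub("\033\\[[0-9;]*m", "", red_text)
print(plain_text)
```

The hook above does the opposite of the last step: instead of dropping the sequences, it asks fansi to translate them into HTML.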

If you are using the crayon package, however, you might run into the following problem: in some situations crayon “thinks” that the terminal has a limited capability of displaying colors, and will use only the 16 base colors, even if more colors are available. One such situation is when one includes colored output in a vignette that is processed automatically: there is no way to convince the num_colors function from crayon that it should report 256 colors.

Therefore, we need to replace the num_colors function with a dumber version:

num_colors <- function(forget=TRUE) 256
library(crayon)
assignInNamespace("num_colors", num_colors, pos="package:crayon") 

Colorful tables in a terminal

It all started when I wanted to have significant p-values shown on the terminal colored in red. The R terminal is capable of showing colors, simple formatting (like italics or bold) and Unicode characters, thanks to the actual terminal that does the job of displaying R output, whether it is the console of RStudio or a terminal window. You can see that when you use tibbles from tidyverse: they use some very limited formatting (like showing “NA” in red).

I ended up writing a new package, colorDF. The package defines a new class of data frames, but it really does not change their behavior – just the way they are shown (specifically, it modifies some attributes and introduces a print.colorDF function for printing). If you change a tibble to a colorDF, it will still behave exactly like a tibble, but it will be shown in color:

# Color data frame 6 x 87:
# (Showing rows 1 - 20 out of 87)
  │name                 │height│mass│birth_year│gender       │probability
 1 Luke Skywalker           172   77         19 male          0.0083
 2 C-3PO                    167   75        112 NA            0.0680
 3 R2-D2                     96   32         33 NA            0.0596
 4 Darth Vader              202  136         42 male          0.0182
 5 Leia Organa              150   49         19 female        0.0138
 6 Owen Lars                178  120         52 male          0.0115
 7 Beru Whitesun lars       165   75         47 female        0.0489
 8 R5-D4                     97   32         NA NA            0.0040
 9 Biggs Darklighter        183   84         24 male          0.0954
10 Obi-Wan Kenobi           182   77         57 male          0.0242
11 Anakin Skywalker         188   84         42 male          0.0066
12 Wilhuff Tarkin           180   NA         64 male          0.0605
13 Chewbacca                228  112        200 male          0.0587
14 Han Solo                 180   80         29 male          0.0519
15 Greedo                   173   74         44 male          0.0204
16 Jabba Desilijic Tiure    175 1358        600 hermaphrodite 0.0929
17 Wedge Antilles           170   77         21 male          0.0457
18 Jek Tono Porkins         180  110         NA male          0.0331
19 Yoda                      66   17        896 male          0.0931
20 Palpatine                170   75         82 male          0.0012

Yes, it looks like that in the terminal window!

You can read all about it in the package vignette (please use the package from GitHub, the CRAN version is lagging behind). Apart from the print function, I also implemented a summary function which is more informative than the default summary function for data frames.

starwars %>% as.colorDF %>% summary
# Color data frame 5 x 13:
  │Col       │Class│NAs│unique│Summary
 1 name       <chr>   0     87 All values unique
 2 height     <int>   6     45 66 [167 <180> 191] 264
 3 mass       <dbl>  28     38 15.0 [ 55.6 < 79.0> 84.5] 1358.0
 4 hair_color <chr>   5     12 none: 37, brown: 18, black: 13, white: 4, blond: 3, auburn: 1, …
 5 skin_color <chr>   0     31 fair: 17, light: 11, dark: 6, green: 6, grey: 6, pale: 5, brown…
 6 eye_color  <chr>   0     15 brown: 21, blue: 19, yellow: 11, black: 10, orange: 8, red: 5, …
 7 birth_year <dbl>  44     36 8 [ 35 < 52> 72] 896
 8 gender     <chr>   3      4 male: 62, female: 19, none: 2, hermaphrodite: 1
 9 homeworld  <chr>  10     48 Naboo: 11, Tatooine: 10, Alderaan: 3, Coruscant: 3, Kamino: 3, …
10 species    <chr>   5     37 Human: 35, Droid: 5, Gungan: 3, Kaminoan: 2, Mirialan: 2, Twi'l…
11 films      <lst>   0     24 Attack of the Clones: 40, Revenge of the Sith: 34, The Phantom …
12 vehicles   <lst>   0     11 Imperial Speeder Bike: 2, Snowspeeder: 2, Tribubble bongo: 2, A…
13 starships  <lst>   0     17 Millennium Falcon: 4, X-wing: 4, Imperial shuttle: 3, Naboo fig…

For numeric vectors, by default the function shows the minimum, quartiles and median, but it can also produce a boxplot-like graphical summary. Since the function works also on lists, implementing a text terminal based boxplot function was super easy:

term_boxplot(Sepal.Length ~ Species, data=iris, width=90)
# Color data frame 5 x 4:
  │Col       │Class│NAs│unique│Summary
 1 setosa     <dbl>   0     15 ╾──────┤ + ├────────╼
 2 versicolor <dbl>   0     21  ╾─────────┤ + ├──────────╼
 3 virginica  <dbl>   0     21  ╾──────────────────┤ + ├──────────────╼
 4 Range      <chr>   0      1 Only one value: Range: 4.3 - 7.9

Cool, isn’t it?

R, rmarkdown, cache and objects

If your rmarkdown takes hours to generate, and you want to be able to generate different document output types on the fly, using the output_format option from rmarkdown::render is extremely annoying: every time you change the output format, the cache is reset, so you need to wait hours to get the other format.

I found no clean solution to this problem, but here is an ugly hack. We create a copy of the document and render it. The first time it will take hours, but from then on the copy has its own cache, separate from that of your original document:

file.copy("test.rmd", "test_html.rmd", overwrite=TRUE)
rmarkdown::render("test_html.rmd")

Of course, this is annoying, and we can wrap these two commands in a function. But beware! By default, render() evaluates the document in the environment it was called from (here, the inside of the wrapper function), so to make sure it is evaluated in the global environment, you need to set the envir argument. Here is a wrapper function which also, by default, opens the resulting document in google-chrome:

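myrender <- function(fn, open=TRUE) {

  # "test.rmd" -> "test_html.rmd"; the copy gets its own cache
  fb <- gsub("\\.rmd$", "", fn, ignore.case=TRUE)
  fn2 <- paste0(fb, "_html.rmd")
  file.copy(fn, fn2, overwrite=TRUE)
  res <- rmarkdown::render(fn2, output_format="html_document", envir=globalenv())
  # open the result in the browser only if requested
  if(open) system(sprintf("google-chrome %s", shQuote(res)))
  res
}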
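The same trick extends to several output formats. Below is a small sketch of the copying step only; the helper name copy_per_format and the suffixes "pdf" and "word" are hypothetical, and the actual rendering is left as a comment:

```r
# Hypothetical helper: one copy of the source file per output format,
# so that each copy keeps its own cache.
copy_per_format <- function(fn, formats = c("html", "pdf", "word")) {
  fb <- sub("\\.rmd$", "", fn, ignore.case = TRUE)
  copies <- paste0(fb, "_", formats, ".rmd")
  file.copy(rep(fn, length(copies)), copies, overwrite = TRUE)
  setNames(copies, formats)
}

# Try it on a throwaway file in the temporary directory
src <- file.path(tempdir(), "test.rmd")
writeLines("# demo", src)
copies <- copy_per_format(src)
print(basename(copies))

# Each copy would then be rendered on its own, for example:
# rmarkdown::render(copies[["html"]], output_format = "html_document",
#                   envir = globalenv())
```

Note that every copy still pays the full price the first time it is rendered; the gain is only that switching formats afterwards no longer resets the cache.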

Add a file to scan when completing in vim

OK, so this was an easy thing to find out, but enormously useful.

In vim, Ctrl-P and Ctrl-N let you complete a word in insert mode. By default (see the complete option) vim scans the current buffer, buffers in other windows, included files etc. However, to add a specific file (in my case, a bibliography which I like to have open in another terminal) you need to add a k to the complete option, plus the location of the file:

:set complete+=k./bibliography.bib

Invert a list / map

Often we use lists to map keywords onto values, for example

foo <- list(a=c("quark", "fark"), 
            b=c("quark", "foo", "bark"), 
            c=c("fark", "bark"))

To invert this list (such that “fark”, “bark” etc. become keywords, and “a”, “b” and “c” the values), do

foo.rev <- split(rep(names(foo), lengths(foo)), unlist(foo))

split splits a vector or data frame along a factor. In this case, we expand the names of foo using rep such that we get two vectors, as can be seen with the following command:

cbind(rep(names(foo), lengths(foo)), unlist(foo))

with the result

   [,1] [,2]   
a1 "a"  "quark"
a2 "a"  "fark" 
b1 "b"  "quark"
b2 "b"  "foo"  
b3 "b"  "bark" 
c1 "c"  "fark" 
c2 "c"  "bark" 

When we apply split() to the first vector with the second to guide the split, we will get

$bark
[1] "b" "c"

$fark
[1] "a" "c"

$foo
[1] "b"

$quark
[1] "a" "b"
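If you need this more than once, the one-liner can be wrapped in a function. The sketch below uses a hypothetical name, invert_map, and also shows that inverting twice gets you back to the original mapping (only the order of the elements within each entry may differ):

```r
foo <- list(a = c("quark", "fark"),
            b = c("quark", "foo", "bark"),
            c = c("fark", "bark"))

# Hypothetical wrapper around the one-liner above
invert_map <- function(m) {
  split(rep(names(m), lengths(m)), unlist(m, use.names = FALSE))
}

foo.rev <- invert_map(foo)
print(foo.rev[["fark"]])   # the keys under which "fark" was filed

# Inverting twice recovers the original mapping (up to element order)
foo.back <- invert_map(foo.rev)
print(setequal(foo.back$b, foo$b))
```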

Using caret

Here is a caret call I use frequently. Given that x is the training data and y the response,

library(caret)
library(doMC)
registerDoMC(cores=6)

tc <- trainControl(method="repeatedcv", number=10, repeats=1, 
  returnData=TRUE, savePredictions="all", verboseIter=TRUE, classProbs=TRUE)
mod <- train(x=x, y=y, trControl=tc, method="rf",
  tuneGrid=data.frame(mtry=500))
  • library(doMC) and registerDoMC allow me to use more than one processor
  • repeatedcv: if more than one repeat of k-fold cross-validation is requested, the repeats= parameter should be modified. repeatedcv must be used instead of cv
  • savePredictions: if we want to evaluate predictions on our own
  • verboseIter: to see the progress
  • classProbs: to report class probabilities, so we can use them to calculate ROC post factum
  • tuneGrid: if not specified, caret will tune parameters. Normally, we don’t want that
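As a rough illustration of the "post factum" ROC calculation: with savePredictions and classProbs set, the hold-out predictions in mod$pred carry the observed class (obs) and one probability column per class. The sketch below uses a small made-up data frame in place of mod$pred, with hypothetical class labels "case" and "ctrl", and computes the area under the ROC curve with the rank-sum formula in base R. In practice you would more likely hand the same two columns to a dedicated ROC package.

```r
# Made-up stand-in for mod$pred: observed class and probability of "case"
pred <- data.frame(
  obs  = factor(c("case", "case", "ctrl", "ctrl")),
  case = c(0.9, 0.4, 0.6, 0.1)
)

# AUC via the rank-sum (Mann-Whitney) formula
auc_rank <- function(obs, prob, positive) {
  is_pos <- obs == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(rank(prob)[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc <- auc_rank(pred$obs, pred$case, positive = "case")
print(auc)
```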

Custom comparison function for sorting data in R

Many languages allow you to use a custom comparison function in sorting. R is not an exception, but it is not entirely straightforward – it requires you to define a new class and overload certain operators. Here is how to do it.

Consider the following example. You have a certain number of paired values, for example

v <- list(a=c(2,1), b=c(1,3), d=c(1,1), e=c(2,3))

The job is to order these pairs in the following way. Given two pairs, p1=(x1, y1) and p2=(x2, y2), p1 < p2 iff one of the following conditions is fulfilled: either x1 < x2 and y1 <= y2, or x1 <= x2 and y1 < y2. The point is that if we draw lines, where one end of the line is at the height x1, and the other end is at the height y1, we want to sort these lines only if they do not cross — at most, only if one of their ends overlaps (but not both, because then the lines would be identical):

In the figure above (left panel), p1 < p2, because one of the ends is below the end of the other line (x1 < x2 and y1 = y2). Of course, if y1 < y2 the inequality still holds. On the other hand, the right panel shows a case where we cannot resolve the comparison; the lines cross, so we should treat them as equal.

If we now have a list of such pairs and want to order it, we will have a problem. Here is the thing: the desired order is {d, a, b, e}. The element d=(1,1) is clearly smaller (as defined above) than all the others. However, b=(1,3) is not smaller than a=(2,1), and a is not smaller than b; that means that a is equal to b, and their order should not be modified.
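To make the definition concrete, here it is written as a plain function, before any classes come into play. The name pair_less is just something I made up for this sketch:

```r
v <- list(a = c(2, 1), b = c(1, 3), d = c(1, 1), e = c(2, 3))

# TRUE if pair p1 is smaller than pair p2 in the sense defined above
pair_less <- function(p1, p2) {
  (p1[1] <  p2[1] && p1[2] <= p2[2]) ||
  (p1[1] <= p2[1] && p1[2] <  p2[2])
}

# d is smaller than every other pair
d_smallest <- sapply(v[c("a", "b", "e")], function(p) pair_less(v$d, p))
print(d_smallest)

# a and b cross: neither is smaller, so they count as equal
a_b_tied <- !pair_less(v$a, v$b) && !pair_less(v$b, v$a)
print(a_b_tied)
```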

There is no way to do that with regular tools such as order, especially since x and y may not only be on different scales: they might even be completely different data types! One might be a numeric vector, the other a character string. Or, possibly, a type of prop from Monty Python (with a defined relation stating that a banana is less than a gun). We must use a custom comparator.

For this, we need to notice that the R functions sort and order rely on the function xtfrm. This in turn relies on the methods ==, > and [, defined for a given class. For numeric vectors, for example, these give what you would expect.

Our v vector is a list with elements which are pairs of numbers. For this type of data, there is no comparison defined; and comparing two pairs of numbers results in a vector of two logical values, which is not what we want.
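A quick way to get a feeling for xtfrm is to call it on something simple. For a plain character vector (my own toy example) it returns numeric keys that sort the same way as the original values:

```r
x <- c("pear", "apple", "quince")

# numeric keys that sort in the same way as x
keys <- xtfrm(x)
print(keys)

# ordering the keys gives the same permutation as ordering x itself
print(identical(order(x), order(keys)))
```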

> v[1] < v[2]
Error in v[1] < v[2] : comparison of these types is not implemented
> v[[1]] < v[[2]]
[1] FALSE  TRUE

R, however, is an object oriented language (even if it does not always feel like that). Comparisons (<, >, ==) are generic functions and it is possible to define (or redefine) them for any class of objects. So here is the plan: we invent a new class for the object v, and define custom comparisons for the elements of this class of objects. Remember that if we define a function whose name consists of a generic (like "plot" or "["), a dot, and the name of the class, we are defining the method for the given class:

## make v an object of class "foo"
class(v) <- "foo"

## to use the "extract" ([) method, 
## we need to momentarily change the class of x, because 
## otherwise we will end up in an endless loop
'[.foo' <- function(x, i) {
    class(x) <- "list"
    x <- x[i]
    class(x) <- "foo"
    x
}

## define ">" as stated above
## the weird syntax results from the fact that a and b
## are lists with one element, this element being a vector 
## of a pair of numbers
'>.foo' <- function(a,b) {
    a <- a[[1]]
    b <- b[[1]]
    (a[1] > b[1] && a[2] >= b[2]) ||
        (a[1] >= b[1] && a[2] > b[2])
}

## if we can't find a difference, then there is no difference
'==.foo' <- function(a, b) 
    !(a > b || b > a)

## we don't need that, but for the sake of completeness...
'<.foo' <- function(a, b) b > a

This will now do exactly what we want:

> v["a"] == v["b"]
[1] TRUE
> v["a"] > v["d"]
[1] TRUE
> sort(v)
$d
[1] 1 1

$a
[1] 2 1

$b
[1] 1 3

$e
[1] 2 3

attr(,"class")
[1] "foo"

R, shiny and source()

This one took me longer to figure out than it should have, because it turns out that I never properly understood what the source() function does.

So here is the story: I was setting up a shiny server for a student based on her code. She was running the shiny app from within RDesktop, and so before starting the app with runApp() she would load all necessary objects and source() a file called helpers.R with some common calculations.

In order to put the app on a server, I moved these pre-runApp() initializations into ui.R and server.R. Suddenly, weird errors appeared. The functions in helpers.R no longer seemed to be able to find anything in the parent environment: object X not found! Even though I called source() immediately after loading the necessary objects into the environment:

# file server.R
load("myobjects.rda")
source("helpers.R")

The solution was, as usual, to read documentation. Specifically, documentation on source():

local   TRUE, FALSE or an environment, determining where the 
        parsed expressions are evaluated. FALSE (the default) 
        corresponds to the user's workspace (the global 
        environment) and TRUE to the environment from which 
        source is called.

The objects which I had load()-ed before were not in the global environment, but instead in another environment created by shiny. However, the expressions from helpers.R were evaluated in the global environment. Thus, a new function defined in helpers.R could be seen from inside server.R, but an object loaded from server.R could not be seen by helpers.R.

It is the first time that I have noticed this. Normally, I would use a file such as helpers.R only to define helper functions, and actually call them from server.R or ui.R. However, I was thinking that source() is something like #include in C, simply calling the commands in the given file as if they were inserted at this position into the code — or called from the environment from which source() was called.

This is not so.
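A small base-R experiment makes the difference visible without any shiny involved. Everything here is made up for the demonstration: the temporary helper file, the object name X_demo and the two wrapper functions. It also assumes that no object called X_demo exists in your global environment:

```r
# A throwaway "helpers" file that expects an object called X_demo
helper <- tempfile(fileext = ".R")
writeLines("helper_result <- X_demo * 2", helper)

# Default (local = FALSE): the file is evaluated in the global
# environment, where X_demo does not exist, so source() fails
run_global <- function() {
  X_demo <- 21
  res <- try(source(helper), silent = TRUE)
  inherits(res, "try-error")
}

# local = TRUE: the file is evaluated where source() was called,
# so it sees X_demo and leaves helper_result in that environment
run_local <- function() {
  X_demo <- 21
  source(helper, local = TRUE)
  helper_result
}

failed_globally <- run_global()
local_value <- run_local()
print(failed_globally)
print(local_value)
```

So in a situation like the one above, calling source("helpers.R", local=TRUE) from server.R should let the helper code see the objects loaded there.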