All my life

Here is a little script to show you your life. In weeks. Each point is a week. Each black point is a week that you have already spent. The number of weeks corresponds to 90 years, which is higher than the current life expectancy anywhere in the world.

Have fun.

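birthdate <- "1973-05-25"
# weeks already lived, numbered from 0
seq1 <- 1:(as.numeric((Sys.Date() - as.Date(birthdate)))/7) - 1
# all weeks in 90 years of 52 weeks each, numbered from 0
seq2 <- 1:(90*52) - 1
# empty plot: one column per week of the year, one row per year of age (y axis reversed)
plot(NULL, xlim=c(0,53), ylim=c(91, 0), bty="n", xlab="Week of the year", ylab="Age")
# grey squares for all weeks, then black squares on top for the weeks already spent
points(seq2 %% 52, floor(seq2 / 52), pch=15, col="grey")
points(seq1 %% 52, floor(seq1 / 52), pch=15)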
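
The mapping from a week number to a point in the plot is plain integer arithmetic. Below is a minimal sketch of it in shell; the function name `week_position` is made up for this illustration. Note that a row of 52 weeks covers only 364 days, so the "age" on the y axis runs slightly ahead of calendar years.

```shell
# Map a 0-based week index to its plot position:
# first number = week within the 52-week row (x axis),
# second number = row, i.e. age in 52-week "years" (y axis).
week_position() {
    week=$1
    echo "$(( week % 52 )) $(( week / 52 ))"
}

week_position 0      # first week of life
week_position 52     # first week of the second row
week_position 4679   # last week of the 90 years (90*52 - 1)
```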


Scientific Editor

I have a dream: a scientific editor that would be suitable for editing scientific papers. Currently, Word is king. There is no way around it: everybody has it, everybody uses it, and sooner or later you will have a co-author who knows how to edit text only when writing an e-mail or when writing in Word. Chances are that this collaborator will be one of the big fish on your author list, maybe even your boss. Word has some basic collaborative options (like tracking changes and comments), bibliography support (via Endnote or similar tools), and is accepted by most journals.

However, Word sucks. Big time. I know you know it.

I envision an open source solution that is based on an updated markdown syntax and the pandoc system. Here is, point by point, an informal specification of the system.

  1. Markdown as the primary text format. A user should be able to edit markdown directly without compromising any information contained in the document or write the document in rmarkdown and directly pass it down to the system.
  2. Zip file as the primary format: documents would be exchanged using a zip file containing
    • manuscript file in markdown
    • bibliography file (I would recommend BibTeX)
    • figure files
  3. Bibliography as in pandoc: bibliography in a format that is accepted by pandoc, with CSL for reference formatting.
  4. Figures: figures are a major pain in the neck. Publishers usually require a vector graphic format or a high resolution image, but you want low-res previews in your print files or documents. The manuscript zip file should therefore contain either both, or only previews, with the original files to be added later.
  5. Markdown extensions:
    • extensions that would allow a “rich” export to Word’s docx: marking reviews, comments etc.
    • (better) figure and label captioning and cross-referencing
    • special bibliography sections (currently, you can only place the references at the end of the file)
  6. A visual UI with editor:
    • Java or similar that allows a painless installation process for even the least computer-savvy users, and allows them to edit the manuscript in a way that they are used to
    • GUI operations for an easy update of the bibliography (I mean like really easy, just copy+paste of whatever: pubmed ids, google scholar links etc.)
    • Equation editor, table editor etc. suitable for saving in markdown format
    • Version tracking
  7. Version tracking and managing revisions. Still pondering how to do this best, but this should be one of the major points for the system.
  8. Misc operations. The system should be able to quickly and painlessly accomplish the following tasks:
    • split the manuscript into submission files by using logical definitions in the markdown (e.g. in the main manuscript file, separate figure files, separate supplementary data files)
    • provide detailed statistics on the document (word count)
    • possibly the visual UI could provide a plug-in to facilitate submission specifically in some of the most common manuscript submission systems (e.g. manuscript central).
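
Parts of this wish list can already be approximated with standard command line tools. The following is only a rough sketch of the bundle layout from point 2 and the word count from point 8; every file and directory name in it is an assumption made for the illustration, not part of any existing system.

```shell
# Hypothetical manuscript bundle: manuscript, bibliography, figures.
bundle=$(mktemp -d)
mkdir -p "$bundle/figures"
printf '%s\n' '# Title' '' 'Some text citing [@smith2020].' > "$bundle/manuscript.md"
: > "$bundle/references.bib"   # empty placeholder bibliography

# Crude word count: markup tokens such as "#" are counted as words too.
word_count=$(wc -w < "$bundle/manuscript.md" | tr -d ' ')
echo "words: $word_count"

# Pack everything into one zip file, if zip happens to be installed.
if command -v zip >/dev/null 2>&1; then
    (cd "$bundle" && zip -qr "$bundle.zip" .)
fi
```

A real tool would have to strip the markdown syntax before counting, which is exactly the kind of detail the system should take care of.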

Fold-change bar plots with “0” on y axis

I see it more and more frequently: bar plots which are supposed to illustrate the regulation of a gene in terms of “fold change”, which include a “0” on the y axis.

It is subtle, but it irks me a lot. Also, the last time I tried to argue with my experimentally working colleagues, I heard that “everybody does it like this” and that I am nit-picking.

What is the fold change? Suppose that you have before and after measurements, a_0 and a_1. Now, the fold change is

FC = \frac{a_1}{a_0}

Could you replace a_0 by a_1 and vice versa? Yes, you could define it as \frac{a_0}{a_1}, right? Fold change decrease (how many times smaller) rather than fold change increase (how many times larger).

OK, so what does that mean if the fold change is equal to 0?

First, think about what it means when the fold change is equal to 0.5. It means that a_1 is half of a_0, or that a_0 is two times a_1.

What about 0.1? That means that a_1 is ten times smaller than a_0.

0.01? Hundred times.

0.001? Thousand times.

You see where this is going. As we approach zero, the ratio \frac{a_0}{a_1} approaches infinity; you could say (incorrectly) that when the fold change is equal to zero, a_1 is infinitely smaller than a_0.

Of course, this is outside of regular statistics. In other words, a fold change of 0 is meaningless and cannot be computed. If you measured a_1 and it was zero, you cannot meaningfully compute the fold change. Putting a zero on the y axis is therefore as meaningful as putting “infinity” there.

For that and other reasons, in many applications one calculates the log-fold change rather than fold change:

\log_2{FC} = \log_2\frac{a_1}{a_0} = \log_2{a_1} - \log_2{a_0}

That makes the measure nice and symmetric around 0. If a_1 is twice as high as a_0, then \log_2{FC} = 1. If it is half of a_0, then \log_2{FC} = -1. It also follows that neither a_0 nor a_1 can be equal to 0, because the logarithm of zero is undefined.
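
A small worked example may make the symmetry concrete. The numbers are made up for illustration: take a_0 = 8 and let a_1 be either 16 (a doubling) or 4 (a halving).

```latex
\begin{align*}
a_1 = 16: \quad & FC = \frac{16}{8} = 2,   & \log_2{FC} &= \log_2{16} - \log_2{8} = 4 - 3 = 1 \\
a_1 = 4:  \quad & FC = \frac{4}{8}  = 0.5, & \log_2{FC} &= \log_2{4} - \log_2{8} = 2 - 3 = -1
\end{align*}
```

On the fold change scale, the doubling sits 1 away from "no change" (FC = 1) while the halving sits only 0.5 away; on the log scale, both are exactly one unit away from 0.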

Moreover, in most applications, logFC is (more or less) normally distributed. Fold change is not, and it cannot be: it is bounded below by 0 and has no upper bound. That means that putting a zero on the y axis is not the only problem; calculating parametric statistics such as the mean and standard deviation of fold changes is equally misleading. You simply shouldn’t do that.
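
To see how the mean of fold changes can mislead, consider two made-up measurements that are exact opposites: one doubling (FC = 2) and one halving (FC = 0.5).

```latex
\begin{align*}
\text{mean of fold changes:} \quad & \frac{2 + 0.5}{2} = 1.25 \\
\text{mean of log fold changes:} \quad & \frac{\log_2{2} + \log_2{0.5}}{2} = \frac{1 + (-1)}{2} = 0, \qquad 2^{0} = 1
\end{align*}
```

The arithmetic mean of the fold changes suggests a 25% increase where the two effects cancel each other out; the mean of the log fold changes, transformed back, gives 1, that is, no change.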

But people nonetheless do, and they are happy with that. That is why we cannot have nice things.

Testing variance before ANOVA


“To make the preliminary test on variances [before running a t-test or ANOVA] is rather like putting to sea in a rowing boat to find out whether conditions are sufficiently calm for an ocean liner to leave port!”

  • George Box, Biometrika 1953;40:318–35.

More on reveal.js and pandoc

One of the problems I had with reveal.js was the interactive PDF export mode: not only do you need google-chrome for that, there is also no easy way of automating the task.

It turns out that decktape.js is a good command line solution. The only drawback is that it actually creates screenshots from a browser, so the slides do not contain any text: they are just a bunch of screenshots! This makes the PDF huge and not searchable. Moreover, you really want the script to wait between the screenshots (by default one second, which makes the whole process slow), otherwise it creates screenshots of the transitions, and the result does not look good.

On the up side, it looks exactly like the presentation.

There were two issues with installing it on Ubuntu 14.04, though. First, it was necessary to install the libjpeg62 package, and second, it was necessary to install the gcc 4.9 compiler, which I did by using the toolchain ppa:

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install gcc-4.9 g++-4.9

Everything else went smoothly.

Then I put phantomjs into ~/bin/, the decktape/ directory into ~/.local/share/, and wrote a little bash script to be able to call it easily from anywhere:



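#!/bin/bash
# Arguments, in the order given by the usage message (assumed):
# input file, output PDF, then decktape options
FILE="$1"
PDF="$2"

if [ -z "${FILE}" ] ; then
    cat <<EOF

    ${0##*/} input_file [output file [options]]

decktape options:
EOF
    exit 0
fi

# Default output name: the input file name with its extension replaced by .pdf
if [ -z "$PDF" ] ; then PDF=${FILE%.*}.pdf ; fi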
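
The default output name relies on the `${FILE%.*}` parameter expansion, which strips the shortest trailing match of `.*`, that is, the last extension. A minimal sketch of how it behaves; the helper name `default_pdf_name` and the file names are made up for this illustration:

```shell
# Replace the last extension of a file name with .pdf
default_pdf_name() {
    file=$1
    printf '%s\n' "${file%.*}.pdf"
}

default_pdf_name slides.html        # slides.pdf
default_pdf_name talk.final.html    # talk.final.pdf
default_pdf_name slides             # slides.pdf (no extension to strip)
```

One thing to keep in mind: the pattern matches the last dot anywhere in the string, so a path whose directory contains a dot while the file itself has no extension would be cut in the wrong place.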


Two bar plots

What is the difference between the two bar plots below?


I am sitting at a conference, and this type of plot is relatively frequent in the presentations. Complete with a log scale.

The answer is, of course, that there is no difference between these two — the data is exactly the same, the only thing different is the vertical scale. These two plots explain why you should never, ever use a bar plot to represent log-scaled data: the position of the y axis is completely arbitrary, yet it influences greatly our perception of which plot shows a larger difference.

(See also “Kick the bar chart habit”)

R-devel in parallel to regular R installation

Unfortunately, you need both: R-devel (development version of R) if you want to submit your packages to CRAN, and regular R for your research (you don’t want the unstable release for that).

Fortunately, installing R-devel in parallel is less trouble than one might think.

Say, we want to install R-devel into a directory called ~/R-devel/, and we will download the sources to ~/src/. We will first set up two environment variables to hold these two directories:

export RSOURCES=~/src
export RDEVEL=~/R-devel

Then we get the sources with SVN. In Ubuntu, you need the subversion package for that:

mkdir -p $RSOURCES
cd $RSOURCES
svn co R-devel

Then, we compile R-devel. The configure script might complain about missing developer packages with header files. In such a case, the name of the necessary package must be guessed and the package installed (e.g. libcurl4-openssl-dev for Ubuntu when configure is complaining about missing curl):

mkdir -p $RDEVEL
# build in $RDEVEL, outside of the source tree
cd $RDEVEL
$RSOURCES/R-devel/configure && make -j

That's it. Now we just need to set up a script to launch the development version of R:

#!/bin/bash
# assumes RDEVEL is set in the environment, e.g. export RDEVEL=~/R-devel
export PATH="$RDEVEL/bin/:$PATH"
export R_LIBS=$RDEVEL/library
R "$@"

You need to save the script in an executable file somewhere in your $PATH, e.g. ~/bin might be a good idea.

Here are commands that make this script automatically in ~/bin/Rdev:

cat <<EOF > ~/bin/Rdev
#!/bin/bash
export R_LIBS=$RDEVEL/library
export PATH="$RDEVEL/bin/:\$PATH"
R "\$@"
EOF
chmod a+x ~/bin/Rdev
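
The backslashes in these commands matter. In a here-document with an unquoted delimiter, the shell expands variables at the moment the file is written, so `$RDEVEL` is replaced by its current value, while `\$PATH` and `\$@` end up in the file as literal `$PATH` and `$@` and are only expanded when the launcher runs. A minimal sketch that shows the effect without writing any file; the variable `RDEVEL_DEMO` and its value are made up for this illustration:

```shell
RDEVEL_DEMO=/opt/R-devel

# Capture what the here-document would write, instead of creating a file
launcher=$(cat <<EOF
export PATH="$RDEVEL_DEMO/bin/:\$PATH"
R "\$@"
EOF
)

# Prints the two lines with RDEVEL_DEMO expanded,
# but with \$PATH and \$@ left as literal $PATH and $@
printf '%s\n' "$launcher"
```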

One last thing remaining is to populate the library with packages necessary for the R-devel to run and check the packages, in my case c("knitr", "devtools", "ellipse", "Rcpp", "extrafont", "RColorBrewer", "beeswarm", "testthat", "XML", "rmarkdown", "roxygen2" ) and others (I keep expanding this list while checking my packages). Also, bioconductor packages limma and, which I need for a package which I build.

Now I can check my packages with Rdev CMD build xyz / Rdev CMD check xyz_xyz.tar.gz