Adding authentication to a shiny server

Umph, that was a tough one. I spent ages figuring out how to do it correctly. I have a server running apache (on port 80) and shiny on port (say) 11111. Shiny has its own document root, and within this root, we have a shiny app, say, “example”. So to view this app you need to type http://server:1111/example/. So far, so good. What I wanted, though, was (i) some kind of password protection for the app, and (ii) calling the app from the URL http://server/example/. Turns out it can be done, but it was not trvial.

First, I modified configuration of the shiny server to listen only to the specified port and only to the localhost; this prevents anyone from any other machine to connect to shiny:

server {
  listen 11111;
  location / {
    site_dir /srv/shiny-server;
    log_dir /var/log/shiny-server;
    directory_index off;

Now to apache. In the httpd.conf file, I have added The following:

         <VirtualHost *:80>

            Redirect /example /example/
            ProxyPass /example/
            ProxyPassReverse /example/

            <Location /example>
                AuthType Basic
                AuthName "Enter your login name and password"
                AuthUserFile /etc/httpd/htpasswd.users
                Require valid-user


This makes apache work as a proxy to the shiny server; however, with the added benefit of a simple authentication for the shiny contents.

It took me quite some time to figure out that without the Redirect directive above, http://server/example/ works, but http://server/example (without the slash) doesn’t.

Finally, I created new users with htpasswd.

Update: Interestingly, shiny cannot handle HEAD requests. HEAD request is when a program asks whether a page is there rather than downloading the whole page. Apparently, this is how CRAN checks whether a site is available. In any case, just after the Virtualhosts directive, I have added the following:

    RewriteEngine on
    RewriteCond %{REQUEST_METHOD} ^HEAD
    RewriteRule ^/example(.*) /foo/index.html

This rewrites the requested URL to example only if the request method is HEAD, and instead asking the shiny server, it asks itself — and since the file /foo/index.html exists, we get 200 OK.

Update 2: I found why shiny is returning 404 to HEAD requests. In the file shiny/R/server.R, in line 177, you have the statement

  if (!identical(req$REQUEST_METHOD, 'GET'))

Obviously, any request other than “GET” gets turned down, as NULL results in a 404.

As a workaround, I have changed it to

  if(identical(req$REQUEST_METHOD, 'HEAD'))
    return(httpResponse(200, content="OK"))

  if (!identical(req$REQUEST_METHOD, 'GET'))

Of course, the down side is that it replies with “OK” even if the given resource does not exist. I am testing a proper way of handling these requests, but at the moment this will have to do.

tagcloud: creating tag / word clouds

May I present my new package: tagcloud. Tag clouds is for creating, um, tag clouds (aka word clouds). It is based on the code from the wordcloud package, but (i) has no tools for analysing word frequencies, instead (ii) focuses on doing better tag clouds. As to (ii), it adapts to the geometry of the window better and can produce different layouts. Also, with the extrafont package, it can use just any fonts you can think of:


The general syntax is simple. The function tagcloud takes one mandatory argument, the tags to display — a character vector. In the example below, I use the font names available through the extrafont packageยน, but it can be anything.

library( extrafont )
tags <- sample( fonts(), 50 )
tagcloud( tags )
dev.copy2pdf( file= "sample1.pdf", out.type= "cairo" )

Note the use of cairo in the above, otherwise the PDF does not include the fonts and the result is not impressive. Here is the result:


OK, let’s add some weights and colors. Also, wouldn’t it be cool if each font name was displayed in the actual font?

weights <- rgamma( 50, 1 )
colors <- colorRampPalette( brewer.pal( 12, "Paired" ) )( 50 )
tagcloud( tags, weights= weights, col= colors, family= tags )


How about mixed vertical and horizontal tags?

tagcloud( tags, weights= weights, col= colors, family= tags, 
fvert= 0.5 )


The fvert parameter specifies the proportion of tags that should be displayed vertically rather than horizontally.

Or a different layout?

tagcloud( tags, weights= weights, col= colors, 
family= tags, algorithm= "fill" )


Tagclouds comes with additional tools. Firstly, you have editor.tagcloud — a very minimalistic interactive editor. You need to store the object invisibly returned from tagcloud:

tc <- tagcloud( tags, weights= weights, col= colors, family= tags )
tc2 <- editor.tagcloud( tc )
plot( tc2 )

With strmultline you can break up long, multi-word tags into multi-line tags:

tagcloud( strmultline( tags ), weights= weights, col= colors, family= tags )

The result is as follows (notice “Andale Mono” or “DejaVu Sans Light”):


Finally, smoothPalette makes it easy to generate a gradient from numbers. Imagine that we want to code some other numeric information (this could be a p-value, for example) with a smooth gradient from light grey (low value) to black (high value):

newvar <- runif( 50 )
colors2 <- smoothPalette( newvar )
tagcloud( tags, weights=weights, col=colors2, family= tags )

By default, smoothPalette uses a grey-white gradient, but it can actually use any kind of color palette.


1: In order to use the fonts installed on your system, you need to import them — preferably as root — using the extrafont package. At least in my Ubuntu installation you should provide the paths to where your TTF fonts are installed, for example:

library( extrafont )
font_import( paths= "/usr/share/fonts/truetype/" )

Troubles with sapply

Say your function in sapply returns a matrix with a variable number of rows. For example

ff <- function( i ) matrix( i, ncol= 3, nrow= i )
pipa <- sapply( 1:3, , simplify= "array" )

sapply is too stupid to see the pattern here (or maybe I don’t know how to cast the return value into an appropriate shape…). The result of the above is a list:

     [,1] [,2] [,3]
[1,]    1    1    1

     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    2    2    2

     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    3    3    3
[3,]    3    3    3

However, we can turn this list of matrices into a simple matrix using Reduce:

pipa <- Reduce( function( ... ) rbind( ... ), pipa )


The biomaRt package allows to query the gigantic ensembl data base directly from R. Since it happens only once in a while, I find myself reading the biomaRt vignette every time I’m using it. Here are the crucial points for the most common application.

ensembl <- useMart( "ensembl", dataset="hsapiens_gene_ensembl" )
f <- listFilters( ensembl )
a <- listAttributes( ensembl )

The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify by what key we are searching (filters), what keys do we want to retrieve from the database (attributes), what are the values of our search keys (values) and from which mart we want to retrieve the information.

g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" )
d <- getBM( 
        attributes= c( "entrezgene", "description", "uniprot_genename" ), 
        filters= "hgnc_symbol", 
        values= g, 
        mart= ensembl )

This returns the following:

  entrezgene                                                      description uniprot_genename
1       6347    chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618]             CCL2
2       1634                            decorin [Source:HGNC Symbol;Acc:2705]              DCN
3       3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018]              IL6
4       3976         leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596]              LIF
5       5328   plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052]             PLAU

OK, one problem that presents itself is that if there are multiple matching values for a given key, you are getting multiple lines for that key. It may be useful to aggregate this information.

g <- c( "BTC" )
d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )

will return two lines:

  uniprot_genename                                description ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530
2              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808

We can use aggregate to collapse the ensembl IDs:

ff <- function( x ) { x <- x[ ! x ) & x != "" ] ; paste( unique( x ), collapse= ";" ) }
d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]

The ff function ignores empty strings and NAs. Also, we remove the first column as it is added by the aggregate function. The result:

  Group.1 uniprot_genename                                description                 ensembl_gene_id
1     BTC              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808

Pathway analysis in R and BioConductor.

There are many options to do pathway analysis with R and BioConductor.

First, it is useful to get the KEGG pathways:

library( gage )
kg.hsa <- kegg.gsets( "hsa" )
kegg.gs2 <- kg.hsa$kg.sets[ kg.hsa$sigmet.idx ]

Of course, “hsa” stands for Homo sapiens, “mmu” would stand for Mus musuculus etc.

Incidentally, we can immediately make an analysis using gage. However, gage is tricky; note that by default, it makes a pairwise comparison between samples in the reference and treatment group. Also, you just have the two groups — no complex contrasts like in limma.

res <- gage( E, gsets= kegg.gs2, 
       ref= which( group == "Control" ), 
       samp= which( group == "Treatment" ),
       compare= "unpaired", same.dir= FALSE )

Now, some filthy details about the parameters for gage.

  • E is the matrix with expression data: columns are arrays and rows are genes. If you use a limma EList object to store your data, this is just the E member of the object (rg$E for example). However, and this is important, gage (and KEGG and others) are driven by the Entrez gene identifiers, and this is not what you usually have when you start the analysis. To get the correct array, you need to
    • select only the genes with ENTREZ IDs,
    • make sure that there are no duplicates
    • change the row names of E to ENTREZ IDs
  • gsets is just a list of character vectors; the list names are the pathways / gene sets, and the character vectors must correspond to the column names of E.
  • ref and samp are the indices for the “reference” and “sample” (treatment) groups. This cannot be logical vectors. Only two groups can be compared at the same time (so for example, you cannot test for interaction).
  • compare — by default, gage makes a paired comparison between the “reference” and “sample” sets, which requires of course to have exactly the same number of samples in both sets. Use “unpaired” for most of your needs.
  • same.dir — if FALSE, then absolute fold changes are considered; if TRUE, then up- and down-regulated genes are considered separately

To visualise the changes on the pathway diagram from KEGG, one can use the package pathview. However, there are a few quirks when working with this package. First, the package requires a vector or a matrix with, respectively, names or rownames that are ENTREZ IDs. By the way, if I want to visualise say the logFC from topTable, I can create a named numeric vector in one go:

setNames( tt$logFC, tt$EID )

Another useful package is SPIA; SPIA only uses fold changes and predefined sets of differentially expressed genes, but it also takes the pathway topology into account.

Rotating movies with RGL

Learned another thing today: it is very simple to create animated GIFs in the rgl package with the build-in functions spin3d, play3d and movie3d.

library( pca3d )
pca <- prcomp( metabo[,-1], scale.= TRUE )
pca3d( pca, group= metabo[,1] )
rot <- spin3d( axis= c( 0, 1, 0 ) )
movie3d( rot, duration= 12 )

Here is the result:


Check credit card numbers using the Luhn algorithm

You can find a nice infographic explaining how to understand the credit card number here.

Credit card numbers include, as the last digit, a simple checksum calculated using the Luhn algorithm. In essence, we add up the digits of the card number in a certain way, and if the resulting sum divided by 10 is not an integer (i.e., the reminder of dividing by 10 is zero), then the number is invalid (the reverse is not true; a number divisible by 10 can still be invalid).

This can be easily computed in R, although I suspect that there might be an easier way:

checkLuhn <- function( ccnum ) {

  # helper function
  sumdigits <- function( x ) 
    sapply( x, function( xx ) 
          sum( as.numeric( unlist( strsplit( as.character( xx ), "" ))))
  # split the digits
  ccnum <- as.character( ccnum )
  v <- as.numeric( unlist( strsplit( ccnum, "" ) ) )
  v <- rev( v )

  # indices of every second digit
  i2 <- seq( 2, length( v ), by= 2 )
  v[i2] <- v[i2] * 2
  v[ v > 9 ] <- sumdigits( v[ v > 9 ] )

  if( sum( v ) %% 10 == 0 ) return( TRUE )
  else return( FALSE )


Sample size / power calculations for Kaplan-Meier survival curves

The problem is simple: we have two groups of animals, treated and controls. Around 20% of the untreated animals will die during the course of the experiment, and we would like to be able to detect effect such that instead of 20%, 80% of animals will die in the treated group, with power 0.8 and alpha=0.05. Group sizes are equal and no other parameters are given.

What is the necessary group size?

I used the `ssizeCT.default` function from the `powerSurvEpi` R package. Based on the explanation in the package manual, this calculates (in my simple case) the required sample size in a group as follows:

n = \frac{m}{p_E + p_C}

where p_E and p_C are, respectively, probabilities of failure in the E(xpermiental) and C(ontrol) groups. I assume that in my case I should use 0.8 and 0.2, respectively, so n=m. The formulas here are simplified in comparison with the manual page of ssizeCT.default, simply because the group sizes are identical.

m is calculated as


RR is the minimal effect size that we would like to be able to observe with power 0.8 and at alpha 0.05. That means, if the real effect size is RR or greater, we have 80% chance of getting a p-value smaller than 0.05 if the group sizes are equal to m. To calculate RR, I first calculate \theta, the hazard ratio, and for this I use the same approximate, expected mortality rates (20% and 80%):

\theta = \log(\frac{\log(0.8)}{\log(0.2)}) = -1.98

Since RR=exp(\theta)=0.139; thus m=18.3. This seems reasonable (based on previous experience).

Copy-paste and R interactive sessions

I wanted to have a way of easily copying R commands from my interactive sessions. I often work interactively, while creating a pipeline or lab-book like readme in a separate file.

Of course, you can just copy anything that is on the screen and paste it. However, I want to have my readme file to be directly executable, so generally I want only to copy the commands. The solution was to write a small function, called xclip, which copies an arbitrary number of lines from history into the X window system primary clipboard. Naturally, it is not compatible with Windows or MacOS. However, I can copy now a fragment of recent history with one command and one mouse click (since the primary X window system clipboard runs without shortcuts).

The function actually is only a wrapper around the little xclip command utility, available for Linux.

xclip <- function (n = 5) 
    t <- tempfile()
    savehistory(file = t)
    system(paste0("tail -n ", n, " ", t, "|xclip"))


Prompted by this cross-validated discussion, I have created the riverplot package. Here is a minimal gallery of the graphics produced by the package:


Here is an example which recreates the famous Minard plot:


So, how to do these figures:

First, you need to create a specific riverplot object that can be directly plotted. (Use riverplot.example to generate an example object). Here, I show how to recreate the Minard plot using the provided data. makeRiver, the function that will create the object necessary for plotting, will use data frames to input information about nodes and edges, but we must use specific naming of the columns:

library( riverplot )
data( minard )
nodes <- minard$nodes
edges <- minard$edges
colnames( nodes ) <- c( "ID", "x", "y" )
colnames( edges ) <- c( "N1", "N2", "Value", "direction" )

Now we can add some style information to the “edge” columns, to mimick the orignal Minard plot:

# color the edges by troop movement direction
edges$col <- c( "#e5cbaa", "black" )[ factor( edges$direction ) ]
# color edges by their color rather than by gradient between the nodes
edges$edgecol <- "col"

# generate the riverplot object
river <- makeRiver( nodes, edges )

The makeRiver function reads any columns that match the style information (like colors of the nodes) and uses it to create the river object. The river object is just a simple list, you can easily view and manipulate it — or create it with your own functions. The point about makeRiver is to make sure that the data is consistent.

Once you have created a riverplot object with one of the above methods (or manually), you can plot it either with plot(x) or riverplot(x). I enforce the use of lines, and I also tell the plotting function to use a particular style. The default edges look curvy.

style <- list( edgestyle= "straight", nodestyle= "invisible" )
# plot the generated object
plot( river, lty= 1, style )
# Add cities
with( minard$cities, points( Longitude, Latitude, pch= 19 ) )
with( minard$cities, text( Longitude, Latitude, Name, adj= c( 0, 0 ) ) )