Adding authentication to a shiny server

Umph, that was a tough one. I spent ages figuring out how to do it correctly. I have a server running apache (on port 80) and shiny on another port (say, 11111). Shiny has its own document root, and within this root we have a shiny app, say, “example”. So to view this app you need to type http://server:11111/example/. So far, so good. What I wanted, though, was (i) some kind of password protection for the app, and (ii) calling the app from the URL http://server/example/. Turns out it can be done, but it was not trivial.

First, I modified the configuration of the shiny server to listen only on the specified port and only on localhost; this prevents anyone on another machine from connecting to shiny directly:

server {
  listen 11111 127.0.0.1;
  location / {
    site_dir /srv/shiny-server;
    log_dir /var/log/shiny-server;
    directory_index off;
  }
}

Now to apache. In the httpd.conf file, I added the following:

<VirtualHost *:80>

    Redirect /example /example/
    ProxyPass /example/ http://127.0.0.1:11111/example/
    ProxyPassReverse /example/ http://127.0.0.1:11111/example/

    <Location /example>
        AuthType Basic
        AuthName "Enter your login name and password"
        AuthUserFile /etc/httpd/htpasswd.users
        Require valid-user
    </Location>

</VirtualHost>

This makes apache work as a proxy to the shiny server, with the added benefit of simple authentication for the shiny contents.
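
Note that the ProxyPass directives require the proxy modules to be loaded; if they are not already enabled on your system, httpd.conf needs something like the following (module paths vary between distributions):

LoadModule proxy_module modules/mod_proxy.so
LoadModule proxy_http_module modules/mod_proxy_http.so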

It took me quite some time to figure out that without the Redirect directive above, http://server/example/ works, but http://server/example (without the slash) doesn’t.

Finally, I created new users with htpasswd.
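
For example (the -c flag creates the password file and should be used only for the first user; alice and bob are placeholder user names, and the path matches the AuthUserFile above):

htpasswd -c /etc/httpd/htpasswd.users alice
htpasswd /etc/httpd/htpasswd.users bob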

tagcloud: creating tag / word clouds

May I present my new package: tagcloud. tagcloud is for creating, um, tag clouds (aka word clouds). It is based on code from the wordcloud package, but (i) it has no tools for analysing word frequencies, and instead (ii) focuses on doing better tag clouds. As to (ii), it adapts better to the geometry of the window and can produce different layouts. Also, with the extrafont package, it can use just about any font you can think of:

[figure: sample8]

The general syntax is simple. The function tagcloud takes one mandatory argument, the tags to display — a character vector. In the example below, I use the font names available through the extrafont package¹, but it can be anything.

library( tagcloud )
library( extrafont )
tags <- sample( fonts(), 50 )
tagcloud( tags )
dev.copy2pdf( file= "sample1.pdf", out.type= "cairo" )

Note the use of cairo in the above; otherwise the PDF does not include the fonts and the result is not impressive. Here is the result:

[figure: sample1]

OK, let’s add some weights and colors. Also, wouldn’t it be cool if each font name was displayed in the actual font?

library( RColorBrewer )
weights <- rgamma( 50, 1 )
colors <- colorRampPalette( brewer.pal( 12, "Paired" ) )( 50 )
tagcloud( tags, weights= weights, col= colors, family= tags )

[figure: sample2]

How about mixed vertical and horizontal tags?

tagcloud( tags, weights= weights, col= colors, family= tags,
          fvert= 0.5 )

[figure: sample3]

The fvert parameter specifies the proportion of tags that should be displayed vertically rather than horizontally.

Or a different layout?

tagcloud( tags, weights= weights, col= colors,
          family= tags, algorithm= "fill" )

[figure: sample4]

tagcloud comes with additional tools. Firstly, there is editor.tagcloud — a very minimalistic interactive editor. You need to store the object invisibly returned by tagcloud:

tc <- tagcloud( tags, weights= weights, col= colors, family= tags )
tc2 <- editor.tagcloud( tc )
plot( tc2 )

With strmultline you can break up long, multi-word tags into multi-line tags:

tagcloud( strmultline( tags ), weights= weights, col= colors, family= tags )

The result is as follows (notice “Andale Mono” or “DejaVu Sans Light”):

[figure: sample5]

Finally, smoothPalette makes it easy to generate a gradient from numbers. Imagine that we want to encode some other numeric information (this could be a p-value, for example) with a smooth gradient from light grey (low values) to black (high values):

newvar <- runif( 50 )
colors2 <- smoothPalette( newvar )
tagcloud( tags, weights=weights, col=colors2, family= tags )

By default, smoothPalette uses a grey-white gradient, but it can actually use any kind of color palette.

[figure: sample6]
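
The gradient does not have to be grey. Here is a sketch with a custom palette, assuming that smoothPalette accepts a vector of colours through its pal argument (see ?smoothPalette):

colors3 <- smoothPalette( newvar, pal= colorRampPalette( c( "lightblue", "darkblue" ) )( 9 ) )
tagcloud( tags, weights= weights, col= colors3, family= tags )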


1: In order to use the fonts installed on your system, you need to import them first — preferably as root — using the extrafont package. On my Ubuntu installation, at least, I had to provide the paths to where the TTF fonts are installed, for example:

library( extrafont )
font_import( paths= "/usr/share/fonts/truetype/" )

Troubles with sapply

Say your function in sapply returns a matrix with a variable number of rows. For example:

ff <- function( i ) matrix( i, ncol= 3, nrow= i )
pipa <- sapply( 1:3, ff, simplify= "array" )

sapply is too stupid to see the pattern here (or maybe I don’t know how to cast the return value into an appropriate shape…). The result of the above is a list:

[[1]]
     [,1] [,2] [,3]
[1,]    1    1    1

[[2]]
     [,1] [,2] [,3]
[1,]    2    2    2
[2,]    2    2    2

[[3]]
     [,1] [,2] [,3]
[1,]    3    3    3
[2,]    3    3    3
[3,]    3    3    3

However, we can turn this list of matrices into a simple matrix using Reduce:

pipa <- Reduce( rbind, pipa )
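
Equivalently, instead of Reduce, do.call can bind all the list elements in one step:

pipa <- do.call( rbind, pipa )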

biomaRt

The biomaRt package allows you to query the gigantic Ensembl database directly from R. Since this happens only once in a while, I find myself reading the biomaRt vignette every time I use it. Here are the crucial points for the most common application.

library( biomaRt )
ensembl <- useMart( "ensembl", dataset= "hsapiens_gene_ensembl" )
f <- listFilters( ensembl )
a <- listAttributes( ensembl )

The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify the key by which we search (filters), the keys we want to retrieve from the database (attributes), the values of our search keys (values), and the mart from which we want to retrieve the information.

g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" )
d <- getBM( 
        attributes= c( "entrezgene", "description", "uniprot_genename" ), 
        filters= "hgnc_symbol", 
        values= g, 
        mart= ensembl )

This returns the following:

  entrezgene                                                      description uniprot_genename
1       6347    chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618]             CCL2
2       1634                            decorin [Source:HGNC Symbol;Acc:2705]              DCN
3       3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018]              IL6
4       3976         leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596]              LIF
5       5328   plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052]             PLAU

OK, one problem that presents itself is that if there are multiple matching values for a given key, you get multiple rows for that key. It may be useful to aggregate this information.

g <- c( "BTC" )
d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )

will return two lines:

  uniprot_genename                                description ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530
2              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808

We can use aggregate to collapse the Ensembl IDs:

ff <- function( x ) {
  x <- x[ !is.na( x ) & x != "" ]
  paste( unique( x ), collapse= ";" )
}
d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]

The ff function ignores empty strings and NAs. Also, we remove the first column (the grouping column added by the aggregate function). The result:

  uniprot_genename                                description                 ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808

Pathway analysis in R and Bioconductor

There are many options for doing pathway analysis with R and Bioconductor.

First, it is useful to get the KEGG pathways:

library( gage )
kg.hsa <- kegg.gsets( "hsa" )
# keep only the signalling and metabolic pathways
kegg.gs2 <- kg.hsa$kg.sets[ kg.hsa$sigmet.idx ]

Of course, “hsa” stands for Homo sapiens, “mmu” would stand for Mus musculus, etc.

Incidentally, we can immediately run an analysis using gage. However, gage is tricky; note that by default, it makes a pairwise comparison between the samples in the reference and treatment groups. Also, you only have the two groups — no complex contrasts like in limma.

res <- gage( E, gsets= kegg.gs2, 
       ref= which( group == "Control" ), 
       samp= which( group == "Treatment" ),
       compare= "unpaired", same.dir= FALSE )

Now, some filthy details about the parameters for gage.

  • E is the matrix with expression data: columns are arrays and rows are genes. If you use a limma EList object to store your data, this is just the E member of the object (rg$E, for example). However, and this is important, gage (and KEGG and others) are driven by Entrez gene identifiers, and this is not what you usually have when you start the analysis. To get the correct matrix (see the sketch after this list), you need to

    • select only the genes with Entrez IDs,
    • make sure that there are no duplicates,
    • change the row names of E to the Entrez IDs.
  • gsets is just a list of character vectors; the list names are the pathways / gene sets, and the character vectors must correspond to the row names of E.
  • ref and samp are the indices of the “reference” and “sample” (treatment) groups. These cannot be logical vectors. Only two groups can be compared at a time (so, for example, you cannot test for an interaction).
  • compare — by default, gage makes a paired comparison between the “reference” and “sample” sets, which of course requires exactly the same number of samples in both sets. Use “unpaired” for most of your needs.
  • same.dir — if FALSE, then absolute fold changes are considered; if TRUE, then up- and down-regulated genes are considered separately.
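
Here is the sketch mentioned above: a minimal example of converting the row names of E to Entrez IDs, assuming that the row names are gene symbols (the keytype argument must match whatever identifiers you actually have):

library( org.Hs.eg.db )
eids <- mapIds( org.Hs.eg.db, keys= rownames( E ),
                column= "ENTREZID", keytype= "SYMBOL" )
keep <- !is.na( eids ) & !duplicated( eids )  # genes with a unique Entrez ID
E <- E[ keep, ]
rownames( E ) <- eids[ keep ]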

To visualise the changes on the pathway diagram from KEGG, one can use the package pathview. However, there are a few quirks when working with this package. First, the package requires a vector or a matrix whose names or rownames, respectively, are Entrez IDs. By the way, if I want to visualise, say, the logFC column from topTable, I can create a named numeric vector in one go:

setNames( tt$logFC, tt$EID )
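
Putting this together, a minimal call might look as follows (hsa04110, the cell cycle pathway, is just an arbitrary example; pathview writes a PNG file into the current working directory):

library( pathview )
fc <- setNames( tt$logFC, tt$EID )
pathview( gene.data= fc, pathway.id= "hsa04110", species= "hsa" )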

Another useful package is SPIA; SPIA uses only fold changes and a predefined set of differentially expressed genes, but it also takes the pathway topology into account.
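
A minimal SPIA call might look like this, assuming de is a named vector of log fold changes of the differentially expressed genes (names being Entrez IDs) and all is the vector of Entrez IDs of all genes measured (both hypothetical names):

library( SPIA )
res.spia <- spia( de= de, all= all, organism= "hsa" )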

New editor in WordPress and how to circumvent it

EDIT: No longer an issue — there is a button on the right-hand side of the new editor that allows you to switch to the old version. However, I’ll leave this post here, as the trickery might come in handy in another situation.


Like everyone else, I have problems with the new editor, except that my main issue is editing existing posts.

1. Why I don’t want to use the new editor, in order of importance:

– it doesn’t take up all of the screen estate, only a narrow column in the middle of one of the two 24″ monitors I use to work with text.
– it meddles with my HTML code. I have many “tl;dr” (very long) posts (not here, on another blog) which I like to format with empty lines in order to facilitate editing.
– it is buggy (read: non-functional) in my browser of choice (experimental Opera running on Ubuntu). Now, I don’t ask WordPress to adapt to my needs; but if everything else works fine with that browser, why would I want to switch to something I don’t like just because one beta feature on one site doesn’t work?

The first two problems make the editor unusable; the third can be circumvented (I could work with WordPress and WordPress only in Firefox, for example).

2. How to circumvent the new editor

Of course, clicking on the little pencil icon on your blog post takes you to the new editor. To use the old editor, the only way is to go to “Dashboard” -> “Posts”, search for the post you would like to edit, and click on “Edit”.

3. A better solution

Another, better solution would be to add a link rewrite plugin. The link to the old style editor looks like this: https://yourblogname.wordpress.com/wp-admin/post.php?post=1234&action=edit

Since the link to the new editor looks like this:

https://wordpress.com/post/BlogIDNumber/1234

I think that a regular expression rewriting your links (for example, a simple bookmarklet) should do the trick. Essentially, you would catch the regular expression “https://wordpress.com/post/BlogIDNumber/([0-9]*)” and replace the link with “https://yourblogname.wordpress.com/wp-admin/post.php?post=$1&action=edit”.

The code below is untested and you should use it at your own risk.

There is actually quite an easy piece of javascript code that you can add as a bookmark. You click the bookmark, and all your new-editor links are converted into old-editor links. You may start with the following code:

javascript:(function(){
  // pattern matching the new-editor links; replace YourBlogID with your blog ID
  var m = /wordpress\.com\/post\/YourBlogID\/([0-9]*)/;
  // replacement pointing to the old editor; replace yourblogname accordingly
  var r = 'yourblogname.wordpress.com/wp-admin/post.php?post=$1&action=edit';
  var links = window.document.getElementsByTagName('a');
  for(var i = 0, l; l = links[i]; i++) {
    if (l.href)
      l.href = l.href.replace(m, r);
  }
})();

Replace “yourblogname” by your blog name (e.g. “logfc” in the case of my blog); replace “YourBlogID” by your blog ID. Add a bookmark that contains the above code as its URL. Clicking on the bookmark should then replace all “new style” links by “old style” links.

Here is an example screenshot from Google Chrome. You see that in the field where you would normally put your URL (http…), you enter the above code (except that you replace “YourBlogID” by the actual ID and “yourblogname” by the actual name):

[screenshot: logfc]

The downside is that you will need a bookmark for each blog that you are writing.

Sloppy Science

Last week, Science published a paper by Rodriguez and Laio on a density-based clustering algorithm. As a non-expert, I found the results actually quite good compared to the standard tools that I use in my everyday work. I even implemented the algorithm as an R package (soon to be published on CRAN, look out for “fsf”).

However, there are problems with the paper. More than one.

1. The authors claim that the density of each sample is determined with a simple formula, which is effectively the number of other samples within a certain distance. This does not add up, since the density would then always be a whole number. It is obvious from the figures that this is not the case. When you look up the original Matlab code in the supplementary material, you see that the authors actually use a Gaussian kernel function for the density calculation.

2. If you use the simple density count as described in the paper, the algorithm will not and cannot work. Imagine a relatively simple case with two distinct clusters. Imagine that in one cluster there is a sample A with density 25, and in the other cluster there are two samples, B and C, with identical densities of 24. This is actually quite likely to happen. The algorithm now determines, for each sample, \delta, that is, the distance to the nearest sample with a higher density. The whole idea of the algorithm is that for putative cluster centres this distance will be very high, because it will point to the centre of another cluster.
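
For concreteness, here is a minimal R sketch of the two quantities, assuming a precomputed distance matrix d and a cutoff dc (both hypothetical names):

# rho: Gaussian kernel density, as in the authors' Matlab code
# (not the integer count that the formula in the paper describes)
rho <- rowSums( exp( -( d / dc )^2 ) ) - 1   # subtract the self-contribution
# delta: distance to the nearest sample with strictly higher density;
# the strict inequality is exactly where ties cause trouble
delta <- sapply( seq_along( rho ), function( i ) {
  higher <- which( rho > rho[ i ] )
  if( length( higher ) == 0 ) max( d[ i, ] ) else min( d[ i, higher ] )
} )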

However, with ties, we have the following problem. If we choose the approach described by the authors, then both samples B and C (which have identical density 24) will be assigned a large \delta value and will become cluster centre candidates. If we choose to use a weak inequality instead, then B will point to C, and C to B, and both of them will have a small \delta.

Therefore, we either have both B and C as equivalent cluster centre candidates, or neither of them. No wonder that the authors never used this approach!

3. The authors explicitly claim that their algorithm can “automatically find the correct number of clusters.” This does not seem to be true; at least, there is nothing in the original paper that warrants this statement. If you study their Matlab code, you will find that the selection of cluster centres is done manually, by the user selecting a rectangle on the screen. Frankly, I cannot even comment on that; this is outrageous.

I think that Science might have done a great disservice to the authors — everyone will hate them for getting a sloppy, half-baked paper published in Science when others would have it rejected from PLoS ONE. I know I do :-)