biomaRt

The biomaRt package allows to query the gigantic ensembl data base directly from R. Since it happens only once in a while, I find myself reading the biomaRt vignette every time I’m using it. Here are the crucial points for the most common application.

ensembl <- useMart( "ensembl", dataset="hsapiens_gene_ensembl" )
f <- listFilters( ensembl )
a <- listAttributes( ensembl )

The data frames f and a hold the available filters and attributes, respectively. To retrieve information, we need to specify by what key we are searching (filters), what keys do we want to retrieve from the database (attributes), what are the values of our search keys (values) and from which mart we want to retrieve the information.

g <- c( "CCL2", "DCN", "LIF", "PLAU", "IL6" )
d <- getBM( 
        attributes= c( "entrezgene", "description", "uniprot_genename" ), 
        filters= "hgnc_symbol", 
        values= g, 
        mart= ensembl )

This returns the following:

  entrezgene                                                      description uniprot_genename
1       6347    chemokine (C-C motif) ligand 2 [Source:HGNC Symbol;Acc:10618]             CCL2
2       1634                            decorin [Source:HGNC Symbol;Acc:2705]              DCN
3       3569 interleukin 6 (interferon, beta 2) [Source:HGNC Symbol;Acc:6018]              IL6
4       3976         leukemia inhibitory factor [Source:HGNC Symbol;Acc:6596]              LIF
5       5328   plasminogen activator, urokinase [Source:HGNC Symbol;Acc:9052]             PLAU

OK, one problem that presents itself is that if there are multiple matching values for a given key, you are getting multiple lines for that key. It may be useful to aggregate this information.

g <- c( "BTC" )
d <- getBM( attributes= c( "uniprot_genename", "description", "ensembl_gene_id" ), filters= "hgnc_symbol", values= g, mart= ensembl )

will return two lines:

  uniprot_genename                                description ensembl_gene_id
1              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530
2              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000174808

We can use aggregate to collapse the ensembl IDs:

ff <- function( x ) { x <- x[ !is.na( x ) & x != "" ] ; paste( unique( x ), collapse= ";" ) }
d <- aggregate( d, by= list( d$uniprot_genename ), FUN=ff )[,-1]

The ff function ignores empty strings and NAs. Also, we remove the first column as it is added by the aggregate function. The result:

  Group.1 uniprot_genename                                description                 ensembl_gene_id
1     BTC              BTC betacellulin [Source:HGNC Symbol;Acc:1121] ENSG00000261530;ENSG00000174808
Advertisements