Pathway analysis in R and BioConductor.

There are many options to do pathway analysis with R and BioConductor.

First, it is useful to get the KEGG pathways:

library( gage )
kg.hsa <- kegg.gsets( "hsa" )
kegg.gs2 <- kg.hsa$kg.sets[ kg.hsa$sigmet.idx ]

Of course, “hsa” stands for Homo sapiens, “mmu” would stand for Mus musuculus etc.

Incidentally, we can immediately make an analysis using gage. However, gage is tricky; note that by default, it makes a pairwise comparison between samples in the reference and treatment group. Also, you just have the two groups — no complex contrasts like in limma.

res <- gage( E, gsets= kegg.gs2, 
       ref= which( group == "Control" ), 
       samp= which( group == "Treatment" ),
       compare= "unpaired", same.dir= FALSE )

Now, some filthy details about the parameters for gage.

  • E is the matrix with expression data: columns are arrays and rows are genes. If you use a limma EList object to store your data, this is just the E member of the object (rg$E for example). However, and this is important, gage (and KEGG and others) are driven by the Entrez gene identifiers, and this is not what you usually have when you start the analysis. To get the correct array, you need to
    • select only the genes with ENTREZ IDs,
    • make sure that there are no duplicates
    • change the row names of E to ENTREZ IDs
  • gsets is just a list of character vectors; the list names are the pathways / gene sets, and the character vectors must correspond to the column names of E.
  • ref and samp are the indices for the “reference” and “sample” (treatment) groups. This cannot be logical vectors. Only two groups can be compared at the same time (so for example, you cannot test for interaction).
  • compare — by default, gage makes a paired comparison between the “reference” and “sample” sets, which requires of course to have exactly the same number of samples in both sets. Use “unpaired” for most of your needs.
  • same.dir — if FALSE, then absolute fold changes are considered; if TRUE, then up- and down-regulated genes are considered separately

To visualise the changes on the pathway diagram from KEGG, one can use the package pathview. However, there are a few quirks when working with this package. First, the package requires a vector or a matrix with, respectively, names or rownames that are ENTREZ IDs. By the way, if I want to visualise say the logFC from topTable, I can create a named numeric vector in one go:

setNames( tt$logFC, tt$EID )

Another useful package is SPIA; SPIA only uses fold changes and predefined sets of differentially expressed genes, but it also takes the pathway topology into account.