Retrieving Expression Levels of all Members of a Gene Family (GO) from an Oncomine DataSet

This page describes a quick way to extract gene expression information for a specific group of genes (as defined in one or more GO terms) from a Oncomine DataSet. This is useful, for example, if we want to study a a general process (in this case: protein acetylation) and we want to know which are the most dysregulated genes belonging to that gene group in a DataSet of interest (in this case, I am using the Crabtree Uterus DataSet, from Oncomine).

The first part of the program deals with the preparation of the DataSet. The second half of the script performs the following operations:

  1. retrieve all members of a GO family of genes
  2. retrieve expression info for each of these genes (if included in the DataSet)
  3. discard non-differentially expressed genes

We start by loading the required libraries, defining the working directory and reading the csv files. Oncomine provides two files for each DataSet (one file comes as result of a up-regulation analysis, the other is obtained after a down-regulation analysis).

source("http://bioconductor.org/biocLite.R")
library(GO.db)
library(org.Hs.eg.db)

setwd("../Documents/R/homemade_tools/leyo/")
dataset_UP <- read.csv("Crabtree_Uterus_Gene_List_up.csv", header = FALSE, as.is=TRUE)
colnames(dataset_UP) <- dataset_UP[3,]
dataset_UP<- dataset_UP[-c(1,2,3),]
dataset_DOWN <- read.csv("Crabtree_Uterus_Gene_List_down.csv", header = FALSE, as.is=TRUE)
colnames(dataset_DOWN) <- dataset_DOWN[3,]
dataset_DOWN<- dataset_DOWN[-c(1,2,3),]

The Data Prep step proceeds through the following points:

  1. merging two csv files in one matrix
  2. remove non-specific (duplicated) probes (probes that match multiple genes)
  3. deal with duplicated SYMBOLS (some genes match multiple probes. It is advisable to check these genes first and decide what to keep or discard (here, I remove all duplicated genes = 1)
  4. deal with the Oncomine way of expressing FoldChange and transform it in Log2FoldChange
dataset_UP<- dataset_UP[as.numeric(dataset_UP$`Fold Change`)>0,]
dataset_DOWN<- dataset_DOWN[as.numeric(dataset_DOWN$`Fold Change`)<0,]
dataSet <- rbind(dataset_UP, dataset_DOWN)
dupli_genes <- dataSet[duplicated(dataSet$`Gene Symbol`),'Gene Symbol']
dataSet <- dataSet[!(dataSet$`Gene Symbol` %in% dupli_genes),]
dataSet <- dataSet[order(dataSet$`Gene Symbol`),]
log2FC<- sapply(as.numeric(dataSet$`Fold Change`), (function(fc){
  if ( fc >0) {
    log2(fc)
  } else {
    log2(1/(fc*(-1)))
  }
}))
dataSet <- cbind(dataSet,log2FC)
zScores <- scale(log2FC)
dataSet <- cbind(dataSet, zScores)
# DataSet is now ready to be analysed. The following line will plot a volcano plot
plot(-log10(as.numeric(dataSet$`Q-value`))~((dataSet$log2FC)))

We can start working with the GO.db now. Let’s start by extracting keys and searching GO Terms for one (or more) keywords of interest.

my_keyword<-c("acetylase", "acetylation","deacetylase", "deacetylation")
GOdb_keys <- keys(GO.db, keytype="GOID")
GOdb_list <- select(GO.db, keys=GOdb_keys, columns=c("GOID","TERM"), keytype = "GOID")
# We use sapply as grep takes only one "patter" per time
my_goIDs <- sapply(my_keyword,(function(k_word){
  my_GOlist <- GOdb_list[grep(k_word, GOdb_list$TERM, value=FALSE),]
  my_GOlist$GOID
}))
# We need a vector of GO IDs, but now my_goIDs is a list
my_goIDs <- unique(as.vector(unlist(my_goIDs)))
# Retireve human genes belonging to those terms
gene_list<-select(org.Hs.eg.db, keys=my_goIDs, 
                  columns=c("ENTREZID","SYMBOL","GOALL"), keytype = "GOALL")
# Let's remove NA
gene_list<-gene_list[!is.na(gene_list$ENTREZID),]
# And now let's save a txt file that includes all genes (SYMBOLS) we are looking for
write(as.character(unique(gene_list$SYMBOL)), "gene_shortlist.txt", sep="\n")

We now have a  DataSet ready to be queried (dataSet) and a list of interesting gene Symbols (gene_list). Let’s slice the DataSet according to gene_list and then save a csv file. Then, we can remove every gene which is not statistically DE expressed (q-val > 0.05 or absolute Fold Change <40%) and save a second csv file.

GO_Set <- dataSet[dataSet$`Gene Symbol` %in% gene_list$SYMBOL,]
GO_Set <- GO_Set[order(as.numeric(GO_Set$log2FC)),]
rownames(GO_Set) <- GO_Set$`Gene Symbol`
GO_Set <- GO_Set[,4:ncol(GO_Set)]
# Now Go_Set contains all genes of interest (DE and non-DE genes)
# Let's save a csv copy
write.csv(GO_Set, "GOgenes_in_Dataset.csv", sep= ",")
# Let's remove what is non DE and write
GO_Set <- GO_Set[as.numeric(GO_Set$`Q-value`)<0.05 & abs(as.numeric(GO_Set$log2FC))>0.5,]
write.csv(GO_Set, "DE_GOgenes_in_Dataset.csv", sep= ",")

We can wrap up by drawing a Volcano plot where DE genes of interest are depicted in colors (print to PDF here).

# adjust offset based on the plot
setx_offset = 0.04
y_offset = 0.25
pdf("volcano.pdf", 6,8)
plot(-log10(as.numeric(dataSet$`Q-value`))~((dataSet$log2FC)),
     pch=19, col="darkgrey", cex=0.5, main="Volcano Plot")
for (i in 1:nrow(GO_Set)) {
  my_x <- GO_Set[i,"log2FC"]
  my_y <- -log10(as.numeric(GO_Set[i,"Q-value"]))
  if (my_x > 0) {
    my_col = "red"
    my_adj = 0
    x_offset = setx_offset
  } else {
    my_col = "darkgreen"
    my_adj = 1
    x_offset = setx_offset * (-1)
  }
  points(my_x,my_y,pch=19, col=my_col)
  text(my_x + x_offset, my_y + y_offset, rownames(GO_Set)[i],cex=0.86, col=my_col, adj= my_adj)
}
dev.off()

Done. That’s it, as simple as that. For questions or comments, don’t hesitate to contact me. For more info, please visit the corresponding GitHub page: https://github.com/dami82/bioinfo/blob/master/GO.in.Dset.R

About Author

Damiano
Postdoc Research Fellow at Northwestern University (Chicago)

Leave a Comment

Your email address will not be published. Required fields are marked *