Gene function prediction

Gene function prediction is complicated by three main issues. First, it is not clear that a "gene" is an appropriate unit for the purpose of predicting function. For example, 90% of human genes are alternatively spliced, and most of those are spliced in a tissue-dependent manner, so that in different tissues a single gene may yield different isoforms with different functions (see our project on alternative splicing). Second, close examination of the term "function" in the context of realistic biology and biochemistry reveals that genes act in complicated, interdependent ways to enable cellular processes. Categorizations of gene function, such as those used in the Gene Ontology Project, slot each gene into one or more categories in a hierarchy of categories. They are helpful, but they ignore the fact that an individual gene has no function in isolation. Third, the goal of making a "prediction" makes sense only when there is a need for a prediction and when the system has sufficient regularity that making a prediction is possible. Possible reasons for wanting to make gene function predictions include: directing focused experimental work to elucidate the activities of previously uncharacterized genes; uncovering hidden variables that explain similarity in gene function but may also explain other, more important regularities (e.g., two genes with similar function may be activated when a third, "control" gene is activated); and publishing papers.

Our view is that gene function prediction is a stepping stone to the more important problem of inferring regulatory networks that explain how genes interact so as to enable cellular processes (see our project on genetic networks).

In 2001, Frey's group started collaborating with Tim Hughes on the problem of elucidating the functions of yeast genes. Predictive analysis using publicly available yeast functional genomics and proteomics data suggested that many more proteins may be involved in the biogenesis of ribonucleoproteins (RNPs) than were known at the time (Peng et al, Cell 2003). Using a microarray that monitors the abundance and processing of noncoding RNAs, we analyzed 468 yeast strains carrying mutations in protein-coding genes, most of which had not previously been associated with RNA or RNP synthesis. Many strains mutated in uncharacterized genes displayed aberrant noncoding RNA profiles. Ten factors involved in noncoding RNA biogenesis were verified by further experimentation, including a protein required for 20S pre-rRNA processing (Tsr2p), a protein associated with the nuclear exosome (Lrp1p), and a factor required for box C/D snoRNA accumulation (Bcd1p). These data presented a global view of yeast noncoding RNA processing and confirmed that many previously uncharacterized yeast proteins are involved in the biogenesis of noncoding RNA.

In 2003, we turned our attention to mammals, and in particular the mouse. Large-scale quantitative analysis of transcriptional co-expression had been used to dissect regulatory networks and to predict the functions of new genes discovered by genome sequencing in model organisms such as yeast. Although the idea that tissue-specific expression is indicative of gene function in mammals was widely accepted, it had not been objectively tested or compared with the related but distinct strategy of using gene co-expression to predict gene function. We generated microarray expression data for nearly 40,000 known and predicted mRNAs in 55 mouse tissues, using custom-built oligonucleotide arrays (Zhang et al, J Biology 2004). We showed that quantitative transcriptional co-expression is a powerful predictor of gene function. Hundreds of functional categories, as defined by Gene Ontology 'Biological Processes', were associated with characteristic expression patterns across all tissues, including categories that bear no overt relationship to the tissue of origin. In contrast, simple tissue-specific restriction of expression was found to be a poor predictor of which genes are in which functional categories. As an example, the highly conserved mouse gene PWP1 is widely expressed across different tissues but is co-expressed with many RNA-processing genes; we showed that the uncharacterized yeast homolog of PWP1 is required for rRNA biogenesis. We concluded that functional genomics strategies based on quantitative transcriptional co-expression could be as fruitful in mammals as they had been in simpler organisms, and that transcriptional control of mammalian physiology is more modular than was generally appreciated at the time.
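To make the co-expression idea concrete, here is a minimal "guilt-by-association" sketch in Python. It is not the method used in Zhang et al; it only illustrates the general principle that a gene can be assigned to the functional category whose annotated members have the most similar expression profile across tissues. The function name coexpression_scores, the gene names, the category names, and the expression values are all hypothetical and made up for illustration.

```python
import numpy as np

def coexpression_scores(expr, categories, query):
    """Score each category by the mean Pearson correlation between the
    query gene's expression profile and the profiles of the category's
    annotated member genes."""
    q = np.asarray(expr[query], dtype=float)
    scores = {}
    for cat, members in categories.items():
        # Correlate the query with every annotated member we have data for.
        corrs = [
            np.corrcoef(q, np.asarray(expr[g], dtype=float))[0, 1]
            for g in members
            if g != query and g in expr
        ]
        if corrs:
            scores[cat] = float(np.mean(corrs))
    return scores

# Toy expression profiles across four tissues (made-up numbers).
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.1],
    "query": [1.1, 2.0, 2.9, 4.2],
}

# Hypothetical annotations: category name -> annotated genes.
categories = {
    "RNA processing": ["geneA", "geneB"],
    "other process": ["geneC", "geneD"],
}

scores = coexpression_scores(expr, categories, "query")
best = max(scores, key=scores.get)
print(best, scores)
```

In this toy example the query gene is expressed in every tissue, so tissue restriction alone says little about it, yet its profile tracks the "RNA processing" genes closely. A real analysis would need held-out evaluation, handling of genes with constant profiles (for which the correlation is undefined), and a properly trained classifier rather than a mean correlation.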

In 2001, Frey's group began a project whose purpose was to elucidate the extent of alternative splicing, its regulatory processes, and its functional implications. The website describing our alternative splicing project contains more details; here we describe work that explored functional implications (Fagnani et al, Genome Biology 2007). Alternative splicing (AS) expands proteomic complexity and plays numerous important roles in gene regulation. However, the extent to which AS coordinates functions in a cell- and tissue-type-specific manner was not known when we began this project. Using quantitative AS profiling in diverse tissues, we identified a large number of widely expressed mouse genes that contain single or coordinated pairs of alternative exons that are spliced in a tissue-regulated fashion. The majority of these AS events display differential regulation in central nervous system (CNS) tissues. Approximately half of the corresponding genes have neural-specific functions and operate in common processes and interconnected pathways. Differential regulation of AS in the CNS tissues correlates strongly with a set of mostly new motifs that are predominantly located in the intron and constitutive exon sequences neighboring CNS-regulated alternative exons. Different subsets of these motifs are correlated with either increased inclusion or increased exclusion of alternative exons in CNS tissues, relative to the other profiled tissues. Our findings provided new evidence that specific cellular processes in the mammalian CNS are coordinated at the level of AS, and that a complex splicing code underlies CNS-specific AS regulation. These data provided a basis for understanding the molecular mechanisms by which the tissue-specific functions of widely expressed genes are coordinated at the level of AS. We later assembled a comprehensive splicing code using these data (Barash et al, Nature 2010; see the alternative splicing project web page).


  • WT Peng, MD Robinson, S Mnaimneh, NJ Krogan, G Cagney, Q Morris, AP Davierwala, J Grigull, X Yang, W Zhang, N Mitsakakis, OW Ryan, N Datta, V Jojic, C Pal, V Canadien, D Richards, B Beattie, LF Wu, SJ Altschuler, S Roweis, BJ Frey*, A Emili*, JF Greenblatt*, TR Hughes*. A panoramic view of yeast non-coding RNA processing, Cell 113:919-33, 2003.

  • W Zhang, QD Morris, R Chang, O Shai, MA Bakowski, N Mitsakakis, N Mohammad, MD Robinson, R Zirngibl, E Somogyi, N Laurin, E Eftekharpour, E Sat, J Grigull, Q Pan, WT Peng, N Krogan, J Greenblatt, M Fehlings, D van der Kooy, J Aubin, BG Bruneau, J Rossant, BJ Blencowe, BJ Frey*, TR Hughes*. The functional landscape of mouse gene expression, Journal of Biology 3(5):21, 2004.

  • M Fagnani, Y Barash, JY Ip, C Misquitta, Q Pan, A Saltzman, O Shai, L Lee, A Rozenhek, N Mohammad, S Willaime-Morawek, T Babak, W Zhang, TR Hughes, D van der Kooy, BJ Frey*, BJ Blencowe*. Functional coordination of alternative splicing in the mammalian central nervous system, Genome Biology 8, June 2007.