The transcriptome: A universe beyond the genome

The transcriptome can be described as the complete set of RNA molecules of all classes, whether coding or non-coding, expressed in a particular cell, tissue, or whole organism. RNA was once thought to be merely an intermediate that carries information from DNA to protein, but it has become increasingly evident that RNA molecules are highly complex and play a central role in cellular processes and gene regulation. Using high-throughput technologies, we have developed novel machine learning tools and performed sophisticated computational analyses to explore this fascinating universe beyond the genome.

Human alternative splicing complexity (Nature Genetics, Nov 2008)

We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in ~20% of multiexon genes, and many of these junctions are tissue specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from ~95% of multiexon genes undergo alternative splicing and that there are ~100,000 intermediate- to high-abundance alternative splicing events in major human tissues. In other words, about one in every two exons could be alternatively spliced. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements of exon inclusion levels.
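To make "exon inclusion level" concrete, the sketch below computes the fraction of transcripts that include a cassette exon from junction read counts. It uses a commonly used ratio of inclusion to total junction reads and is not necessarily the estimator used in the paper; the function name and arguments are hypothetical.

```python
def exon_inclusion_level(upstream_incl, downstream_incl, skipping):
    """Estimate the inclusion level (0..1) of a cassette exon.

    upstream_incl, downstream_incl: reads spanning the two junctions that
        join the cassette exon to its flanking exons (inclusion evidence).
    skipping: reads spanning the junction that joins the flanking exons
        directly (exclusion evidence).
    Assumption: inclusion evidence is the mean of the two inclusion
    junctions, so that it is comparable to the single skipping junction.
    """
    inclusion = (upstream_incl + downstream_incl) / 2.0
    total = inclusion + skipping
    if total == 0:
        return None  # no junction reads, so the level is undefined
    return inclusion / total


# Toy counts (illustrative only, not data from the study).
level = exon_inclusion_level(30, 50, 10)
print(level)
```

With these toy counts the inclusion evidence is 40 reads against 10 skipping reads. In practice a minimum read count would normally be required before trusting the estimate.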

Human AS complexity

Above: assessing human alternative splicing complexity using mRNA-Seq data. (a) Diagram showing a gene with ‘known’ splice junctions (blue lines) supported by cDNA-EST evidence. Dashed pink lines represent all hypothetically possible ‘new’ splice junctions, and the solid pink lines indicate a new junction detected using Illumina mRNA-Seq data. Alternative exons are indicated in red. (b) Numbers of known and new splice junctions detected using mRNA-Seq data from human tissues. Each point in the four plots indicates the mean number of junctions detected when comparing data from all possible combinations of the specified numbers of tissues. The light blue and dark blue plots show the numbers of detected known junctions when junction sequences that are repeated elsewhere in the surveyed genes are either included or excluded, respectively. The pink and purple plots show the numbers of new junctions detected when including or excluding repeated sequences. (c) Histograms of the tissue distribution of known and new alternatively spliced junctions. We detected 7,917 known and 2,368 new splice junctions representing evidence for skipping of one or more alternative cassette exons in mRNA-Seq read alignments. The tissue distribution of these junction reads was plotted as the percentage of junctions that appear in one to all six tissues. (d) Tissue distributions of new splice junctions detected in pairs of tissues. The size of each blue box indicates the number of junctions shared between a given pair of tissues, with the highest number of shared junctions corresponding to the largest box. Br, whole brain; Ce, cerebral cortex; He, heart; Li, liver; Lu, lung; Sk, skeletal muscle.
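The tissue distribution in panel (c) can be tabulated along the following lines. This is a minimal sketch with made-up junction identifiers and detection calls; only the tissue abbreviations come from the caption, and how a junction is called "detected" in a tissue is left as an assumption.

```python
from collections import Counter


def tissue_distribution(junction_tissues, n_tissues):
    """Percentage of junctions detected in exactly 1, 2, ..., n_tissues tissues.

    junction_tissues: dict mapping a junction id to the set of tissues in
        which that junction was detected (hypothetical input format).
    Returns a list where index k-1 holds the percentage for exactly k tissues.
    Junctions detected in no tissue are ignored.
    """
    counts = Counter(len(t) for t in junction_tissues.values() if t)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * n_tissues
    return [100.0 * counts.get(k, 0) / total for k in range(1, n_tissues + 1)]


# Toy detection calls (illustrative only).
detected = {
    "j1": {"Br"},
    "j2": {"Br", "Ce"},
    "j3": {"Li"},
    "j4": {"Br", "Ce", "He", "Li", "Lu", "Sk"},
}
dist = tissue_distribution(detected, n_tissues=6)
print(dist)
```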


  • Qun Pan, Ofer Shai, Leo J. Lee, Brendan J. Frey, and Benjamin J. Blencowe. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing, Nature Genetics, 40:1413-1415, Nov 2, 2008. [link to the paper]

  • Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes, Nature, 456:470–476, Nov 2, 2008. [link to the paper]

  • Blencowe BJ, Ahmad S, Lee LJ. Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes, Genes & Development, 23:1379-1386, 2009. [link to the paper]

Bayesian learning of microRNA targets (Nature Methods, Nov 2007)

MicroRNAs (miRNAs) have recently been discovered as an important class of non-coding RNA genes that play a major role in regulating gene expression, providing a means to control the relative amounts of mRNA transcripts and their protein products. Although much work has been done on the genome-wide computational prediction of miRNA genes and their target mRNAs, an open question is how to efficiently detect bona fide miRNA targets among the large number of candidates predicted by existing computational algorithms. We propose a novel probabilistic model that accounts for gene expression using miRNA expression data and a set of candidate miRNA targets. A set of underlying miRNA targets is learned from the data using our algorithm, GenMiR (Generative model for miRNA regulation). Our high-confidence miRNA targets represent a dramatic increase in the number of known miRNA targets and provide a wide range of novel testable hypotheses, offering a starting point for understanding miRNA regulation on a global scale.

Bayesian network used for detecting miRNA targets: given a set of candidate miRNA-target interactions generated using a target-finding program, the GenMiR probability model uses expression data for mRNAs and miRNAs to find a subset of the candidates which are well-supported by the data.
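The idea of using expression data to narrow down candidate targets can be illustrated with a much simpler stand-in. The sketch below is not GenMiR and performs no Bayesian inference: it only ranks candidate miRNA-target pairs by the Pearson correlation between miRNA and mRNA expression across samples, assuming that a miRNA tends to lower the level of its targets. All names and expression values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def rank_candidates(candidates, mirna_expr, mrna_expr):
    """Sort candidate (miRNA, mRNA) pairs, most anti-correlated first.

    candidates: list of (mirna_id, mrna_id) pairs from a target-finding program.
    mirna_expr, mrna_expr: dicts mapping ids to expression profiles measured
        over the same samples, in the same order.
    """
    scored = [(m, t, pearson(mirna_expr[m], mrna_expr[t])) for m, t in candidates]
    return sorted(scored, key=lambda s: s[2])


# Toy profiles over four samples (illustrative only).
mirna_expr = {"miR-a": [1.0, 2.0, 3.0, 4.0]}
mrna_expr = {"geneX": [8.0, 6.0, 4.0, 2.0], "geneY": [1.0, 2.0, 3.0, 4.0]}
candidates = [("miR-a", "geneY"), ("miR-a", "geneX")]

ranked = rank_candidates(candidates, mirna_expr, mrna_expr)
for mirna, gene, r in ranked:
    print(mirna, gene, round(r, 3))
```

Here geneX, whose expression falls as miR-a rises, ranks ahead of geneY. A probabilistic model such as GenMiR goes further than this pairwise score, since it evaluates the candidates for a given mRNA jointly rather than one pair at a time.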

Project website:


  • J.C. Huang, T. Babak, T.W. Corson, G. Chua, S. Khan, B.L. Gallie, T.R. Hughes, B.J. Blencowe, B.J. Frey and Q.D. Morris. (2007) Using expression profiling data to identify human microRNA targets, Nature Methods 4(12), 1045-1049. [Click here to access this paper.]

  • J.C. Huang, Q.D. Morris and B.J. Frey. (2007) Bayesian inference of microRNA targets from sequence and expression data, J. Comp. Bio. 14(5): 550-563. [Click here to access this paper.]

  • J.C. Huang, Q.D. Morris and B.J. Frey. (2006) Detecting microRNA targets by linking sequence, microRNA and gene expression data, Proceedings of the Tenth Annual International Conference on Research in Computational Molecular Biology (RECOMB), Venice, Italy, April 2-5, 2006.

How many new genes are there? (Science, Mar 2006)

Toronto claim

In a recent (Sep 2005) Science report, the FANTOM Consortium claimed to have found 5,154 new proteins in the mouse genome not encoded by previously known mRNA sequences. This claim contrasts dramatically with the view of the International Human Genome Sequencing Consortium and with our previous study using exon microarrays. By downloading FANTOM's protein sequence data and performing a careful, independent computational analysis, we concluded that the number of new protein-coding genes discovered by the FANTOM Consortium is at most in the hundreds, with the remainder being either splice isoforms of known proteins or false positives arising randomly from noncoding transcripts.

FANTOM claim

This has fueled a heated but friendly debate featured in Science ...

Project website:


  • Leo J. Lee, Timothy R. Hughes, and Brendan J. Frey. How Many New Genes Are There? Science, vol. 311, no. 5768, pp. 1709-1711, Mar 26, 2006. [link to the paper]

  • The FANTOM Consortium: P. Carninci et al. The Transcriptional Landscape of the Mammalian Genome, Science, vol. 309, no. 5740, pp. 1559-1563, Sep 2, 2005. [link to the paper]

  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics, Aug 28, 2005. [PDF]

A revised view of the mammalian library of genes (Nature Genetics, Aug 2005)

Recent mammalian microarray experiments have detected widespread transcription and raised the possibility that there may be a large number of undiscovered multi-exon protein-coding genes. To explore this possibility, we hybridized unamplified, polyadenylation-selected samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false detection rate of 2.7%, GenRate detects 12,145 gene-length transcripts and confirms 81% of the 10,000 most highly expressed known genes. Our analysis shows that most of the 155,839 exons detected by GenRate are associated with known genes, and it leads to two conclusions. First, these results provide, for the first time, microarray-based evidence that the vast majority of multi-exon genes have already been discovered. Second, GenRate nevertheless detects tens of thousands of potential new exons and reconciles discrepancies in current cDNA databases by stitching novel transcribed regions into previously annotated genes. Importantly, these results also highlight tissue-dependent variability in exon expression, introduced by tissue-dependent alternative splicing. (See our alternative splicing project and our paper featured on the cover of the May 6 edition of Nature.)

GenRate detected approximately 30,000 novel putative exons, that is, exons not appearing in any of the following databases: human Ensembl, human Ensembl novel, mouse RefSeq, mouse Ensembl, mouse FANTOM2, and mouse UniGene. The following figure shows how the 10% most highly expressed putative exons compare to well-annotated RefSeq genes.

GenRate Novel Exons

Project website (with data matrix):


  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics 37:9, Aug 28, 2005. [PDF]

  • BJ Frey, TR Hughes and QD Morris. GenRate: A generative model that reveals novel transcripts in genome-tiling microarray data, Journal of Computational Biology 13:2, 200-214, March 2006.

  • BJ Frey, QD Morris, MD Robinson and TR Hughes. Finding novel transcripts in high-resolution genome-wide microarray data using the GenRate model, RECOMB 2005, MIT, June 2005.