Alternative splicing

Alternative splicing is a major source of transcriptional diversity, and its regulation is critical, since numerous diseases arise from errors in this process. Splicing is controlled by cis-acting RNA features in introns and exons, including protein binding sites and transcript structure characteristics such as exon length and secondary structure. While many such features have been identified, a comprehensive map is needed to understand how these and as-yet-unidentified features work together to regulate alternative splicing. For the past decade, Frey's group has worked on developing new biotechnologies and computational techniques for elucidating a "splicing code", i.e., a set of biological rules that govern how RNA features work together to regulate tissue-dependent alternative splicing.

High-throughput sequencing of human tissues reveals widespread alternative splicing

We carried out the first analysis of alternative splicing complexity in human tissues using mRNA-Seq data. New splice junctions were detected in ~20% of multiexon genes, and many of these junctions are tissue-specific. By combining mRNA-Seq and EST-cDNA sequence data, we estimate that transcripts from ~95% of multiexon genes undergo alternative splicing and that there are ~100,000 intermediate- to high-abundance alternative splicing events in major human tissues. In other words, about one in every two exons could be alternatively spliced. From a comparison with quantitative alternative splicing microarray profiling data, we also show that mRNA-Seq data provide reliable measurements of exon inclusion levels.
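As a rough illustration of what an exon inclusion level measured from mRNA-Seq reads can look like, the sketch below computes a percent-inclusion value for a single cassette exon from junction read counts. It uses a commonly used estimator (average of the two inclusion junctions against the skipping junction); this is an assumption for illustration and is not necessarily the exact procedure used in the study. The function name and arguments are hypothetical.

```python
def percent_inclusion(c1_a_reads: int, a_c2_reads: int, c1_c2_reads: int) -> float:
    """Estimate the inclusion level (0-100) of cassette exon A flanked by C1 and C2.

    c1_a_reads and a_c2_reads count reads spanning the two inclusion junctions;
    c1_c2_reads counts reads spanning the skipping junction.
    Assumed estimator, not taken from the paper.
    """
    # Inclusion is supported by two junctions, so average them to keep the
    # comparison with the single skipping junction balanced.
    inclusion = (c1_a_reads + a_c2_reads) / 2.0
    exclusion = float(c1_c2_reads)
    total = inclusion + exclusion
    if total == 0:
        raise ValueError("no junction reads; inclusion level is undefined")
    return 100.0 * inclusion / total


# Toy counts: 30 reads on each inclusion junction, 10 on the skipping junction.
example_level = percent_inclusion(30, 30, 10)
```

With very few junction reads such an estimate is noisy, so in practice a minimum read-count threshold would normally be applied before an event is reported.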

Human AS complexity

Above: assessing human alternative splicing complexity using mRNA-Seq data. (a) Diagram showing a gene with ‘known’ splice junctions (blue lines) supported by cDNA-EST evidence. Dashed pink lines represent all hypothetically possible ‘new’ splice junctions, and the solid pink lines indicate a new junction detected using Illumina mRNA-Seq data. Alternative exons are indicated in red. (b) Numbers of known and new splice junctions detected using mRNA-Seq data from human tissues. Each point in the four plots indicates the mean number of junctions detected when comparing data from all possible combinations of the specified numbers of tissues. The light blue and dark blue plots show the numbers of detected known junctions when junction sequences that are repeated elsewhere in the surveyed genes are either included or excluded, respectively. The pink and purple plots show the numbers of new junctions detected when including or excluding repeated sequences. (c) Histograms of the tissue distribution of known and new alternatively spliced junctions. We detected 7,917 known and 2,368 new splice junctions representing evidence for skipping of one or more alternative cassette exons in mRNA-Seq read alignments. The tissue distribution of these junction reads was plotted as the percentage of junctions that appear in one to all six tissues. (d) Tissue distributions of new splice junctions detected in pairs of tissues. The size of each blue box indicates the number of junctions shared between a given pair of tissues, with the highest number of shared junctions corresponding to the largest box. Br, whole brain; Ce, cerebral cortex; He, heart; Li, liver; Lu, lung; Sk, skeletal muscle.
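The averaging described for panel (b) can be sketched as follows, under the assumption that each point counts the distinct junctions seen when the reads of a given set of tissues are pooled, averaged over all tissue sets of that size. The tissue labels match the caption, but the junction identifiers and the function name are made up for illustration.

```python
from itertools import combinations
from statistics import mean


def mean_junctions_by_tissue_count(junctions_by_tissue):
    """Map k to the mean number of distinct junctions over all k-tissue combinations."""
    tissues = sorted(junctions_by_tissue)
    result = {}
    for k in range(1, len(tissues) + 1):
        counts = [
            # Pool the junction sets of the tissues in this combination.
            len(set().union(*(junctions_by_tissue[t] for t in combo)))
            for combo in combinations(tissues, k)
        ]
        result[k] = mean(counts)
    return result


# Toy input: three tissues with hypothetical junction identifiers.
toy = {
    "Br": {"j1", "j2", "j3"},
    "He": {"j2", "j4"},
    "Li": {"j1", "j5"},
}
curve = mean_junctions_by_tissue_count(toy)
```

Plotting `curve` against k gives a saturation-style curve: it flattens when adding tissues stops revealing new junctions.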

References:

  • Qun Pan, Ofer Shai, Leo J. Lee, Brendan J. Frey, and Benjamin J. Blencowe. Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing, Nature Genetics, 40:1413-1415, Nov 2, 2008. [link to the paper]

  • Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB. Alternative isoform regulation in human tissue transcriptomes, Nature, 456:470–476, Nov 2, 2008. [link to the paper]

  • Blencowe BJ, Ahmad S, Lee LJ. Current-generation high-throughput sequencing: deepening insights into mammalian transcriptomes, Genes & Development, 23:1379-1386, 2009. [link to the paper]

Baby steps toward a splicing code: Functional analysis of coordinated alternative splicing

Alternative splicing (AS) expands proteomic complexity and plays numerous important roles in gene regulation. However, the extent to which AS coordinates functions in a cell- and tissue-type-specific manner is not known. Moreover, the sequence code that underlies cell- and tissue-type-specific regulation of AS is poorly understood.

Using quantitative AS microarray profiling, we have identified a large number of widely expressed mouse genes that contain single or coordinated pairs of alternative exons that are spliced in a tissue-regulated fashion. The majority of these AS events display differential regulation in central nervous system (CNS) tissues. Approximately half of the corresponding genes have neural-specific functions and operate in common processes and interconnected pathways. Differential regulation of AS in the CNS tissues correlates strongly with a set of mostly new motifs that are predominantly located in the intron and constitutive exon sequences neighboring CNS-regulated alternative exons. Different subsets of these motifs are correlated with either increased inclusion or increased exclusion of alternative exons in CNS tissues, relative to the other profiled tissues.

Our findings provide new evidence that specific cellular processes in the mammalian CNS are coordinated at the level of AS, and that a complex splicing code underlies CNS-specific AS regulation. This code appears to comprise many new motifs, some of which are located in the constitutive exons neighboring regulated alternative exons. These data provide a basis for understanding the molecular mechanisms by which the tissue-specific functions of widely expressed genes are coordinated at the level of AS.

References:

  • M. Fagnani, Y. Barash, J. Y. Ip, C. Misquitta, Q. Pan, A. Saltzman, O. Shai, L. Lee, A. Rozenhek, N. Mohammad, S. Willaime-Morawek, T. Babak, W. Zhang, T. R. Hughes, D. Kooy, B. J. Frey and B. J. Blencowe. Functional coordination of alternative splicing in the mammalian central nervous system, Genome Biology, 8, June 2007. [link to the paper]

Quantitative profiling of alternative splicing using GenASAP


We present GenASAP (Generative model for Alternative Splicing Array Platform), a new algorithm coupled with a microarray platform for the study of alternative splicing (AS). We designed a new microarray targeted towards studying single cassette exon inclusion/exclusion (see figure below).

AS events
AS probes
On the left, the most common alternative splicing events are shown: (a) single cassette exon inclusion/exclusion, (b) alternative 3'/5' splice site, (c) mutually exclusive exons, and (d) intron inclusion/exclusion. On the right, the six microarray probes designed to study each AS event are shown. The microarray has three body probes (C1, C2, and A) and three junction probes (C1:A, A:C2, and C1:C2) for each AS event.

The model explains the observed values, which consist of measured expression levels for exon body and junction probes, as a weighted linear combination of the abundances of the alternative isoforms, with scale-dependent noise and an outlier process. Learning in the generative model is carried out using a variational approximation of the Expectation-Maximization (EM) algorithm.
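To make the "weighted linear combination" idea concrete, here is a deliberately simplified sketch: it fits the abundances of the inclusion and exclusion isoforms to six probe values by ordinary least squares. This is not GenASAP itself. It omits the scale-dependent noise, the outlier process, and variational EM, and it assumes unit probe weights, whereas the model described above learns its weights. All names are hypothetical.

```python
# Rows: probes C1, C2, A, C1:A, A:C2, C1:C2.
# Columns: (inclusion isoform C1-A-C2, exclusion isoform C1-C2).
# Unit weights are an illustrative assumption, not learned values.
DESIGN = [(1, 1), (1, 1), (1, 0), (1, 0), (1, 0), (0, 1)]


def estimate_isoform_abundances(probe_values):
    """Least-squares fit of two isoform abundances to six probe values."""
    if len(probe_values) != len(DESIGN):
        raise ValueError("expected one value per probe (six in total)")
    # Build the 2x2 normal equations (A^T A) s = A^T x.
    g11 = sum(a * a for a, _ in DESIGN)
    g12 = sum(a * b for a, b in DESIGN)
    g22 = sum(b * b for _, b in DESIGN)
    r1 = sum(a * x for (a, _), x in zip(DESIGN, probe_values))
    r2 = sum(b * x for (_, b), x in zip(DESIGN, probe_values))
    # Solve the 2x2 system with Cramer's rule.
    det = g11 * g22 - g12 * g12
    inclusion = (g22 * r1 - g12 * r2) / det
    exclusion = (g11 * r2 - g12 * r1) / det
    return inclusion, exclusion


def percent_inclusion_from_probes(probe_values):
    """Convert fitted abundances into a percent-inclusion value."""
    inclusion, exclusion = estimate_isoform_abundances(probe_values)
    # Clip negative fits, which can occur with noisy probes.
    inclusion, exclusion = max(inclusion, 0.0), max(exclusion, 0.0)
    total = inclusion + exclusion
    if total == 0:
        raise ValueError("both isoform abundances are zero")
    return 100.0 * inclusion / total


# Noise-free toy data generated from inclusion = 3 and exclusion = 1.
toy_probes = [4.0, 4.0, 3.0, 3.0, 3.0, 1.0]
fitted = estimate_isoform_abundances(toy_probes)
```

A plain least-squares fit like this is sensitive to a single misbehaving probe, which is one motivation for the explicit noise and outlier components in the full generative model.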

We carried out the learning on a new data set consisting of 3,126 "cassette" AS events across 10 tissues, where 3 exon body probes and 3 exon junction probes were used to study each event. A small subset (~200) of the events was closely examined using semi-quantitative RT-PCR, and the results were used as the ground truth for evaluating the performance of GenASAP. The probabilistic nature of the algorithm suggests an approach to evaluating the confidence in the inferred values, which proved an important factor in evaluating the algorithm. The relative abundances of isoforms obtained from GenASAP's unsupervised learning were found to closely match the RT-PCR measurements, and to outperform supervised methods such as k-nearest neighbors (KNN), logistic regression, and linear regression.

AS Bayes Net

References:

  • O. Shai, Q. D. Morris, B. J. Blencowe and B. J. Frey 2006, Inferring global levels of alternative splicing isoforms using a generative model of microarray data, Bioinformatics 22(5):606-613 (pdf).

  • O. Shai, B. J. Frey, Q. D. Morris, Q. Pan, C. Misquitta, and B. J. Blencowe 2004, Probabilistic inference of alternative splicing events in microarray data, accepted to Neural Information Processing Systems 17 (NIPS 04).

  • Q. Pan [1], O. Shai [1], C. Misquitta, W. Zhang, N. Mohammad, T. Babak, H. Siu, T. R. Hughes, Q. D. Morris [2], B. J. Frey [2], and B. J. Blencowe [2] 2004, Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform, Molecular Cell 16(6):929-941 [PubMed]
    [1] Joint first authors
    [2] Joint senior authors

Where does biological complexity come from, when the number of genes is only ~22,000?

Toronto claim

In a recent (Sep 2005) Science report, the FANTOM Consortium claimed to have found 5,154 new proteins in the mouse genome not encoded by previously known mRNA sequences. This claim contrasts dramatically with the view of the International Human Genome Sequencing Consortium and with our previous study using exon microarrays. By downloading FANTOM's protein sequence data and performing a careful, independent computational analysis, we concluded that the number of new protein-coding genes discovered by the FANTOM Consortium is at most in the hundreds, with the remainder being either splice isoforms of known proteins or false positives arising randomly from noncoding transcripts.

FANTOM claim

This has fueled a heated but friendly debate featured in Science ...

Project website:

http://www.psi.toronto.edu/TransLand/

References:

  • Leo J. Lee, Timothy R. Hughes, and Brendan J. Frey. How Many New Genes Are There? Science, vol. 311, no. 5768, pp. 1709-1711, Mar 26, 2006. [link to the paper]

  • The FANTOM Consortium: P. Carninci et al. The Transcriptional Landscape of the Mammalian Genome, Science, vol. 309, no. 5740, pp. 1559-1563, Sep 2, 2005. [link to the paper]

  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics, Aug 28, 2005. [PDF]

Inferring a revised library of tissue-dependent transcript structures using exon tiling microarray data

Recent mammalian microarray experiments have detected widespread transcription and raised the possibility that there may be a large number of undiscovered multi-exon protein-coding genes. To explore this possibility, we hybridized unamplified, polyadenylation-selected samples from 37 mouse tissues to microarrays encompassing 1.14 million exon probes. We analyzed these data using GenRate, a Bayesian algorithm that uses a genome-wide scoring function in a factor graph to infer genes. At a stringent exon false detection rate of 2.7%, GenRate detects 12,145 gene-length transcripts and confirms 81% of the 10,000 most highly expressed known genes. Our analysis shows that most of the 155,839 exons detected by GenRate are associated with known genes, and leads to two conclusions. First, these results provide, for the first time, microarray-based evidence that the vast majority of multi-exon genes have already been discovered. Second, GenRate nevertheless detects tens of thousands of potential new exons and reconciles discrepancies in current cDNA databases by stitching novel transcribed regions into previously annotated genes. Importantly, these results also highlight tissue-dependent variability in exon expression, introduced by tissue-dependent alternative splicing. (See our alternative splicing project and our paper featured on the cover of the May 6 edition of Nature.)

GenRate detected approximately 30,000 novel putative exons (not appearing in the following databases: human Ensembl, human Ensembl novel, mouse RefSeq, mouse Ensembl, mouse FANTOM2, mouse UniGene). The following figure shows how the 10% most highly expressed putative exons compare to well-annotated RefSeq genes.

GenRate Novel Exons

Project website (with data matrix):

http://www.psi.toronto.edu/genrate

References:

  • BJ Frey, N Mohammad, QD Morris, W Zhang, MD Robinson, S Mnaimneh, R Chang, Q Pan, E Sat, J Rossant, BG Bruneau, JE Aubin, BJ Blencowe, TR Hughes. Genome-wide analysis of mouse transcripts using exon microarrays and factor graphs, Nature Genetics 37:9, Aug 28, 2005. [PDF]

  • BJ Frey, TR Hughes and QD Morris. GenRate: A generative model that reveals novel transcripts in genome-tiling microarray data, Journal of Computational Biology 13:2, 200-214, March 2006.

  • B. J. Frey, Q. D. Morris, M. D. Robinson and T. R. Hughes. Finding novel transcripts in high-resolution genome-wide microarray data using the GenRate model, RECOMB 2005, MIT, June 2005.