Computational Biology

PTMClust - A Post-translational Modification Refinement Algorithm

A post-translational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function. Henceforth, PTM perturbations have been linked to diverse diseases like Parkinson’s, Alzheimer’s, diabetes and cancer. To discover PTMs on a genomewide scale, there is a recent surge of interest in analyzing tandem mass spectrometry data, and several unrestrictive (so-called 'blind') PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments.

To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality, by suppressing noise in the data and clustering peptides with the same underlying modification to form PTM groups. We showed that our technique outperforms two standard clustering algorithms on a simulated dataset. Additionally, we showed that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines, SIMS, InsPecT and MODmap. Additionally, PTMClust markedly outforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is able to reduce false PTM assignments, improve overall detection coverage and facilitate novel PTM discovery, including terminus modifications. We applied our technique to a large-scale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis.



  • C. Chung, J. Liu, A. Emili and B.J. Frey 2011 Computational Refinement of Post-translational Modifications Predicted from Tandem Mass Spectrometry, Bioinformatics [Epub ahead of print] [Link to Epub]

Untangling biological networks (NIPS 2003)

Networks describing biological processes (like metabolism, transcriptional regulation, and protein-protein interactions) are invaluable for understanding and manipulating those processes. However, experimental methods that can generate these networks on a large-scale are prone to error, producing many false-positive and false-negative measurements. We have developed a method to identify measurement errors using knowledge about the connectivity structure of the noise-free network and the dependency structure of the measurement noise.

Many types of biological networks share a very similar connectivity structure: most nodes have a small degree (i.e., are connected to very few other nodes) and a few nodes, "hubs", have a very large degree. This structure is well-described by a degree distribution. Similarly, the dependency structure of the noise can also be well-represented by a degree distribution. These regularities are important because they can be used to untangle the real network (the signal) from the network of false positive measurements (the noise).

We developed a probabilistic generative model of tangled, experimentally measured networks. Probabilistic inference in our model untangles the signal and noise networks, however, exact inference in the model is intractable. To address these, we have developed an accurate, linear time sum-product algorithm for approximate inference.

Untangling 1  Untangling 2

We applied our algorithm to the problem of denoising yeast protein-protein interaction networks. After using a small set of trusted interactions to fit noise and signal degree distributions, our method detects 40% to 80% more true interactions, for the same false-positive rate, as a method which ignores graph structure.

Untangling 3

Our current work involves applying our method to untangle other types of biological networks and extending our generative model to incorporate other descriptors of connectivity structure, like degree correlations and clustering coefficients.


  • Q. D. Morris, B. J. Frey, and C. J. Paige 2003 Denoising and untangling graphs using degree priors, in Proceedings of Neural Information Processing Systems 16 (NIPS 03) [PDF]

  • Q. D. Morris and B. J. Frey 2003 Denoising and untangling graphs using degree priors, invited talk in International Conference on Machine Learning, Bioinformatics Workshop (ICML 03), Washington DC

GenXHC - Generative model for cross-hybridization compensation in high-density microarray data (Bioinformatics 2005)

Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Probes in such high-density arrays have been shown to measure signal from both their target transcript and many other non-specific transcripts due to short oligonucleotide lengths. This problem of cross-hybridization noise will become increasingly significant in the upcoming era of genome-wide exon-tiling microarray experiments and algorithms which can accurately compensate for cross-hybridization will play an important role in future large-scale microarray assays. We have developed GenXHC (Generative model for X-Hybridization Compensation), a probabilistic model for cross-hybridization compensation which estimates transcript abundances from probe intensities by accounting for potential cross-hybridization in high-density genomewide microarray data.

The generative process for cross-hybridization using measured expression profiles. Expression profiles consist of measurements across 12 tissue pools, where high intensity indicates high expression. Each probe can hybridize to multiple transcripts and thus measures components of expression from transcripts other than its target. The measured expression levels can be used in tandem with knowledge of probe-transcript hybridization constraints to infer expression levels for the transcripts.

We perform a pairwise alignment between microarray probes and known transcripts to identify likely cross-hybridizidation interactions for each microarray probe. Using this information, we fit a parameterized generative model of the observed probe intensities as a noisy weighted sum of unobserved transcript abundances. The unobserved transcript abundances and cross-hybridization weights are estimated through variational learning. The algorithm was applied to a subset of a genome-wide M.musculus exon-tiling microarray dataset: GO-BP clustering using the denoised profiles produced clsuters that were statistically enriched for many categories compared to clustering using noisy data.


  • J.C. Huang, Q.D. Morris, T.R. Hughes and B.J. Frey 2005. GenXHC: A probabilistic generative model for cross-hybridization compensation in high-density, genome-wide microarray data. Proceedings of the Thirteenth Annual Conference on Intelligent Systems for Molecular Biology, Detroit, MI, June 25-29, 2005. Bioinformatics 21 Suppl 1:i222-i231. [Bioinformatics]

GenASAP - Generative model for alternative splicing predictions from microarray data (Bioinformatics 2006)


We present GenASAP (Generative model for Alternative Splicing Array Platform), a new algorithm coupled with a microarray platform for the study of alternative splicing (AS). A new microarray, targeted towards studying single cassette exon inclusion/exclusiong (see figure below) was designed.

AS events
AS probes
On the left, the most common alternative splicing events are shown: (a) single cassette exon inclusion/exclusion, (b) alternative 3'/5' splice site, (c) mutually exluded exons, and (d) intron inclusion/exclusion. On the right, the six microarray probes desined to study each AS event are shown. The microarray has three body probes (C1, C2, and A) and three junction probes (C1:A, A:C2, and C1:C2) for each AS event.

The model explains the observed values, consisting of measured transcription for exon body and junction probes, as a weighted linear combination of the abundance of the alternative isoforms with scale dependent noise and an outlier process. Learning in the generative model is carried out using a variational approximation of the Expectation-Maximization (EM) algorithm.

We carried out the learning on a new data set, consisting of 3126 "cassette" AS events across 10 tissues, where 3 exon body probes and 3 exon junction probes were used to study each event. A small subset (~200) of the events were closely examined using semi-quantitative RT-PCR, the results of which were used as the ground truth for evaluating the performance of GenASAP. The probabilistic nature of the algorithm suggests an approach to evaluating the confidence in the inferred values, which proved an important factor in evaluating the algorithm. The relative abundances of isoforms obtained from GenASAP's unsupervised learning were found to closely match the RT-PCR measurements, and to outperform supervised methods, such as KNN, logistic regression, and linear regression.

AS Bayes Net


  • O. Shai, Q. D. Morris, B. J. Blencowe and B. J. Frey 2006, Inferring global levels of alternative splicing isoforms using a generative model of microarray data, Bioinformatics 22(5):606-613 (pdf).

  • O. Shai, B. J. Frey, Q. D. Morris, Q. Pan, C. Misquitta, and B. J. Blencowe 2004, Probabilistic inference of alternative splicing events in microarray data. accepted to Neural Information Processing Systems 17 (NIPS 04)

  • Q. Pan1, O. Shai1, C. Misquitta, W. Zhang, N. Mohammad, T. Babak, H. Siu, T. R. Hughes, Q. D. Morris2, B. J. Frey2, and B. J. Blencowe2 2004 Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform, Molecular Cell 16:6, 929-941 [PubMed]
    1 Joint-first authors
    2 Joint-senior authors

Spatial trend removal (STR) - denoising microarray data

Microarrays allow biologists to measure the expression levels of thousands of mRNA transcripts simultaneously. To improve data quality, we attempt to remove spatial systematic noise, arising from slide imperfection, scanner artifacts, uneven washing, etc.

STR relies on the assumption that the probe placement on the array is random, and there should be no correlation between nearby probes. This assumtion leads to a view of the data as high frequency data added to a low frequency spatial trend. We have formulated an algorithm, STR, to remove spatial trends in the data, while preserving high data fidelity, accounting for outliers, and efficiently optimizing for slide specific trends.

filter steps

STR first removes outlying measurements (arising from differentially expressed genes, faulty probes, etc.) from the original data (shown in false colors in (a)). The "non-outliers" data (b) is then filtered with a low-pass filter (d) to obtain an estimate of the spatial trend (c). The final, detrended data is obtained by subtracting the spatial trend from the original data (e). STR used gradient descent to optimize for the parameters of the low-pass filter.

Matlab code for STR is available by contacting one of the following:

  • Ofer Shai - ofer[at]psi. utoronto. ca

  • Quaid Morris - quaid[at]psi. utoronto. ca

  • Brendan Frey - frey[at]psi. utoronto. ca


  • Q. D. Morris1, O. Shai1, B. J. Frey, W. Zhang, and T. R. Hughes 2004 Spatial trend removal - reducing systematic noise in microarrays, in preparation
    1 Joint-first authors