Computational Biology
PTMClust  A Posttranslational Modification Refinement Algorithm
A posttranslational modification (PTM) is a chemical modification of a protein that occurs naturally. Many of these modifications, such as phosphorylation, are known to play pivotal roles in the regulation of protein function. Henceforth, PTM perturbations have been linked to diverse diseases like Parkinson’s, Alzheimer’s, diabetes and cancer. To discover PTMs on a genomewide scale, there is a recent surge of interest in analyzing tandem mass spectrometry data, and several unrestrictive (socalled 'blind') PTM search methods have been reported. However, these approaches are subject to noise in mass measurements and in the predicted modification site (amino acid position) within peptides, which can result in false PTM assignments.
To address these issues, we devised a machine learning algorithm, PTMClust, that can be applied to the output of blind PTM search methods to improve prediction quality, by suppressing noise in the data and clustering peptides with the same underlying modification to form PTM groups. We showed that our technique outperforms two standard clustering algorithms on a simulated dataset. Additionally, we showed that our algorithm significantly improves sensitivity and specificity when applied to the output of three different blind PTM search engines, SIMS, InsPecT and MODmap. Additionally, PTMClust markedly outforms another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is able to reduce false PTM assignments, improve overall detection coverage and facilitate novel PTM discovery, including terminus modifications. We applied our technique to a largescale yeast MS/MS proteome profiling dataset and found numerous known and novel PTMs. Accurately identifying modifications in protein sequences is a critical first step for PTM profiling, and thus our approach may benefit routine proteomic analysis.
Website:
http://www.psi.toronto.edu/PTMClust/References:

C. Chung, J. Liu, A. Emili and B.J. Frey 2011 Computational Refinement of Posttranslational Modifications Predicted from Tandem Mass Spectrometry, Bioinformatics [Epub ahead of print] [Link to Epub]
Untangling biological networks (NIPS 2003)
Networks describing biological processes (like metabolism, transcriptional regulation, and proteinprotein interactions) are invaluable for understanding and manipulating those processes. However, experimental methods that can generate these networks on a largescale are prone to error, producing many falsepositive and falsenegative measurements. We have developed a method to identify measurement errors using knowledge about the connectivity structure of the noisefree network and the dependency structure of the measurement noise.
Many types of biological networks share a very similar connectivity structure: most nodes have a small degree (i.e., are connected to very few other nodes) and a few nodes, "hubs", have a very large degree. This structure is welldescribed by a degree distribution. Similarly, the dependency structure of the noise can also be wellrepresented by a degree distribution. These regularities are important because they can be used to untangle the real network (the signal) from the network of false positive measurements (the noise).
We developed a probabilistic generative model of tangled, experimentally measured networks. Probabilistic inference in our model untangles the signal and noise networks, however, exact inference in the model is intractable. To address these, we have developed an accurate, linear time sumproduct algorithm for approximate inference.
We applied our algorithm to the problem of denoising yeast proteinprotein interaction networks. After using a small set of trusted interactions to fit noise and signal degree distributions, our method detects 40% to 80% more true interactions, for the same falsepositive rate, as a method which ignores graph structure.
Our current work involves applying our method to untangle other types of biological networks and extending our generative model to incorporate other descriptors of connectivity structure, like degree correlations and clustering coefficients.
References:

Q. D. Morris, B. J. Frey, and C. J. Paige 2003 Denoising and untangling graphs using degree priors, in Proceedings of Neural Information Processing Systems 16 (NIPS 03) [PDF]
Q. D. Morris and B. J. Frey 2003 Denoising and untangling graphs using degree priors, invited talk in International Conference on Machine Learning, Bioinformatics Workshop (ICML 03), Washington DC
GenXHC  Generative model for crosshybridization compensation in highdensity microarray data (Bioinformatics 2005)
Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are currently being released. Probes in such highdensity arrays have been shown to measure signal from both their target transcript and many other nonspecific transcripts due to short oligonucleotide lengths. This problem of crosshybridization noise will become increasingly significant in the upcoming era of genomewide exontiling microarray experiments and algorithms which can accurately compensate for crosshybridization will play an important role in future largescale microarray assays. We have developed GenXHC (Generative model for XHybridization Compensation), a probabilistic model for crosshybridization compensation which estimates transcript abundances from probe intensities by accounting for potential crosshybridization in highdensity genomewide microarray data.
The generative process for crosshybridization using measured expression profiles. Expression profiles consist of measurements across 12 tissue pools, where high intensity indicates
high expression. Each probe can hybridize to multiple transcripts and thus measures components of expression from transcripts
other than its target. The measured expression levels can be used in tandem with knowledge of probetranscript hybridization
constraints to infer expression levels for the transcripts.
We perform a pairwise alignment between microarray probes and known transcripts to identify likely crosshybridizidation interactions for each microarray probe. Using this information, we fit a parameterized generative model of the observed probe intensities as a noisy weighted sum of unobserved transcript abundances. The unobserved transcript abundances and crosshybridization weights are estimated through variational learning. The algorithm was applied to a subset of a genomewide M.musculus exontiling microarray dataset: GOBP clustering using the denoised profiles produced clsuters that were statistically enriched for many categories compared to clustering using noisy data.
References:

J.C. Huang, Q.D. Morris, T.R. Hughes and B.J. Frey 2005. GenXHC: A probabilistic generative model for crosshybridization compensation in highdensity, genomewide microarray data. Proceedings of the Thirteenth Annual Conference on Intelligent Systems for Molecular Biology, Detroit, MI, June 2529, 2005. Bioinformatics 21 Suppl 1:i222i231. [Bioinformatics]
GenASAP  Generative model for alternative splicing predictions from microarray data (Bioinformatics 2006)
CLICK HERE FOR DATA AND SOFTWAREWe present GenASAP (Generative model for Alternative Splicing Array Platform), a new algorithm coupled with a microarray platform for the study of alternative splicing (AS). A new microarray, targeted towards studying single cassette exon inclusion/exclusiong (see figure below) was designed.
On the left, the most common alternative splicing events are shown: (a) single cassette exon
inclusion/exclusion, (b) alternative 3'/5' splice site, (c) mutually exluded exons, and (d) intron
inclusion/exclusion. On the right, the six microarray probes desined to study each AS event are shown.
The microarray has three body probes (C_{1}, C_{2}, and A) and three junction probes
(C_{1}:A, A:C_{2}, and C_{1}:C_{2}) for each AS event.
The model explains the observed values, consisting of measured transcription for exon body and junction probes, as a weighted linear combination of the abundance of the alternative isoforms with scale dependent noise and an outlier process. Learning in the generative model is carried out using a variational approximation of the ExpectationMaximization (EM) algorithm.
We carried out the learning on a new data set, consisting of 3126 "cassette" AS events across 10 tissues, where 3 exon body probes and 3 exon junction probes were used to study each event. A small subset (~200) of the events were closely examined using semiquantitative RTPCR, the results of which were used as the ground truth for evaluating the performance of GenASAP. The probabilistic nature of the algorithm suggests an approach to evaluating the confidence in the inferred values, which proved an important factor in evaluating the algorithm. The relative abundances of isoforms obtained from GenASAP's unsupervised learning were found to closely match the RTPCR measurements, and to outperform supervised methods, such as KNN, logistic regression, and linear regression.
References:

O. Shai, Q. D. Morris, B. J. Blencowe and B. J. Frey 2006, Inferring global levels of alternative splicing isoforms using a generative model of microarray data, Bioinformatics 22(5):606613 (pdf).

O. Shai, B. J. Frey, Q. D. Morris, Q. Pan, C. Misquitta, and B. J. Blencowe 2004, Probabilistic inference of alternative splicing events in microarray data. accepted to Neural Information Processing Systems 17 (NIPS 04)

Q. Pan^{1}, O. Shai^{1}, C. Misquitta, W. Zhang, N. Mohammad, T. Babak, H. Siu, T. R. Hughes, Q. D. Morris^{2}, B. J. Frey^{2}, and B. J. Blencowe^{2} 2004 Revealing global regulatory features of mammalian alternative splicing using a quantitative microarray platform, Molecular Cell 16:6, 929941 [PubMed]
^{1} Jointfirst authors
^{2} Jointsenior authors
Spatial trend removal (STR)  denoising microarray data
Microarrays allow biologists to measure the expression levels of thousands of mRNA transcripts simultaneously. To improve data quality, we attempt to remove spatial systematic noise, arising from slide imperfection, scanner artifacts, uneven washing, etc.
STR relies on the assumption that the probe placement on the array is random, and there should be no correlation between nearby probes. This assumtion leads to a view of the data as high frequency data added to a low frequency spatial trend. We have formulated an algorithm, STR, to remove spatial trends in the data, while preserving high data fidelity, accounting for outliers, and efficiently optimizing for slide specific trends.
STR first removes outlying measurements (arising from differentially expressed genes, faulty probes, etc.) from the original data (shown in false colors in (a)). The "nonoutliers" data (b) is then filtered with a lowpass filter (d) to obtain an estimate of the spatial trend (c). The final, detrended data is obtained by subtracting the spatial trend from the original data (e). STR used gradient descent to optimize for the parameters of the lowpass filter.
Matlab code for STR is available by contacting one of the following:
Ofer Shai  ofer[at]psi. utoronto. ca
Quaid Morris  quaid[at]psi. utoronto. ca
Brendan Frey  frey[at]psi. utoronto. ca
References:

Q. D. Morris^{1}, O. Shai^{1}, B. J. Frey, W. Zhang, and T. R. Hughes 2004 Spatial trend removal  reducing systematic noise in microarrays, in preparation
^{1} Jointfirst authors