PTMClust - A Post-translational Modification Refinement Algorithm
A post-translational modification (PTM) is a chemical modification of a protein
that occurs naturally. Many of these modifications, such as phosphorylation, are known
to play pivotal roles in the regulation of protein function. Henceforth, PTM perturbations
have been linked to diverse diseases like Parkinson’s, Alzheimer’s, diabetes and cancer.
To discover PTMs on a genomewide scale, there is a recent surge of interest in analyzing
tandem mass spectrometry data, and several unrestrictive (so-called 'blind') PTM search
methods have been reported. However, these approaches are subject to noise in mass
measurements and in the predicted modification site (amino acid position) within peptides,
which can result in false PTM assignments.
To address these issues, we devised a machine learning algorithm, PTMClust, that
can be applied to the output of blind PTM search methods to improve prediction quality,
by suppressing noise in the data and clustering peptides with the same underlying modification
to form PTM groups. We showed that our technique outperforms two standard clustering
algorithms on a simulated dataset. Additionally, we showed that our algorithm significantly
improves sensitivity and specificity when applied to the output of three different blind PTM
search engines, SIMS, InsPecT and MODmap. Additionally, PTMClust markedly outforms
another PTM refinement algorithm, PTMFinder. We demonstrate that our technique is
able to reduce false PTM assignments, improve overall detection coverage and facilitate
novel PTM discovery, including terminus modifications. We applied our technique to a
large-scale yeast MS/MS proteome profiling dataset and found numerous known and
novel PTMs. Accurately identifying modifications in protein sequences is a critical first
step for PTM profiling, and thus our approach may benefit routine proteomic analysis.
C. Chung, J. Liu, A. Emili and B.J. Frey 2011
Computational Refinement of Post-translational Modifications Predicted from Tandem Mass Spectrometry,
Bioinformatics [Epub ahead of print]
[Link to Epub]
Untangling biological networks (NIPS 2003)
Networks describing biological processes (like metabolism, transcriptional
regulation, and protein-protein interactions) are invaluable for understanding
and manipulating those processes. However, experimental methods that can
generate these networks on a large-scale are prone to error, producing many
false-positive and false-negative measurements. We have developed a method
to identify measurement errors using knowledge about the connectivity structure
of the noise-free network and the dependency structure of the measurement noise.
Many types of biological networks share a very similar connectivity
structure: most nodes have a small degree (i.e., are connected to very
few other nodes) and a few nodes, "hubs", have a very large degree.
This structure is well-described by a degree distribution.
Similarly, the dependency structure of the noise can also be well-represented
by a degree distribution. These regularities are important because they
can be used to untangle the real network (the signal) from the network
of false positive measurements (the noise).
We developed a probabilistic generative model of tangled, experimentally
measured networks. Probabilistic inference in our model untangles the
signal and noise networks, however, exact inference in the model is intractable.
To address these, we have developed an accurate, linear time sum-product
algorithm for approximate inference.
We applied our algorithm to the problem of denoising yeast protein-protein
interaction networks. After using a small set of trusted interactions to
fit noise and signal degree distributions, our method detects 40% to 80%
more true interactions, for the same false-positive rate, as a method which
ignores graph structure.
Our current work involves applying our method to untangle other types of
biological networks and extending our generative model to incorporate other
descriptors of connectivity structure, like degree correlations and
Q. D. Morris, B. J. Frey, and C. J. Paige 2003
Denoising and untangling graphs using degree priors,
in Proceedings of Neural Information Processing Systems 16 (NIPS 03)
Q. D. Morris and B. J. Frey 2003
Denoising and untangling graphs using degree priors,
invited talk in International Conference on Machine Learning,
Bioinformatics Workshop (ICML 03), Washington DC
GenXHC - Generative model for cross-hybridization compensation in high-density microarray data (Bioinformatics 2005)
Microarray designs containing millions to hundreds of millions of probes that tile entire genomes are
currently being released. Probes in such high-density arrays have been shown to measure signal from both
their target transcript and many other non-specific transcripts due to short oligonucleotide lengths. This
problem of cross-hybridization noise will become increasingly significant in the upcoming era of genome-wide
exon-tiling microarray experiments and algorithms which can accurately compensate for cross-hybridization
will play an important role in future large-scale microarray assays. We have developed
GenXHC (Generative model for X-Hybridization Compensation), a probabilistic
model for cross-hybridization compensation which estimates transcript abundances from probe intensities by
accounting for potential cross-hybridization in high-density genomewide
The generative process for cross-hybridization using measured expression profiles. Expression profiles consist of measurements across 12 tissue pools, where high intensity indicates
high expression. Each probe can hybridize to multiple transcripts and thus measures components of expression from transcripts
other than its target. The measured expression levels can be used in tandem with knowledge of probe-transcript hybridization
constraints to infer expression levels for the transcripts.
We perform a pairwise alignment between microarray probes and known
transcripts to identify likely cross-hybridizidation interactions for each microarray probe.
Using this information, we fit a parameterized generative
model of the observed probe intensities as a noisy weighted sum of unobserved
transcript abundances. The unobserved transcript abundances and
cross-hybridization weights are estimated through variational learning.
The algorithm was applied to a subset of a genome-wide M.musculus exon-tiling microarray
dataset: GO-BP clustering using the denoised profiles produced clsuters
that were statistically enriched for many categories compared to clustering using noisy data.
J.C. Huang, Q.D. Morris, T.R. Hughes and B.J. Frey 2005.
GenXHC: A probabilistic generative model for cross-hybridization compensation in high-density, genome-wide microarray data. Proceedings of the Thirteenth Annual Conference on Intelligent Systems for Molecular Biology, Detroit, MI, June 25-29, 2005. Bioinformatics 21 Suppl 1:i222-i231.
GenASAP - Generative model for alternative splicing predictions from microarray data (Bioinformatics 2006)
We present GenASAP (Generative model for Alternative Splicing Array
Platform), a new algorithm coupled with a microarray platform for the study of
alternative splicing (AS). A new microarray, targeted towards studying single cassette exon
inclusion/exclusiong (see figure below) was designed.
On the left, the most common alternative splicing events are shown: (a) single cassette exon
inclusion/exclusion, (b) alternative 3'/5' splice site, (c) mutually exluded exons, and (d) intron
inclusion/exclusion. On the right, the six microarray probes desined to study each AS event are shown.
The microarray has three body probes (C1, C2, and A) and three junction probes
(C1:A, A:C2, and C1:C2) for each AS event.
The model explains the observed values, consisting of measured
transcription for exon body and junction probes, as a weighted linear combination of
the abundance of the alternative isoforms with scale dependent noise
and an outlier process.
Learning in the generative model is carried out using a variational approximation
of the Expectation-Maximization (EM) algorithm.
We carried out the learning on a new data set, consisting of 3126 "cassette" AS events across 10 tissues,
where 3 exon body probes and 3 exon junction probes were used to study each event. A small subset (~200)
of the events were closely examined using semi-quantitative RT-PCR, the results of which were used
as the ground truth for evaluating the performance of GenASAP. The probabilistic nature of the algorithm suggests
an approach to evaluating the confidence in the inferred values, which proved an important factor in evaluating
the algorithm. The relative abundances of isoforms obtained from GenASAP's unsupervised learning were found
to closely match the RT-PCR measurements, and to outperform supervised methods, such as KNN, logistic regression,
and linear regression.
O. Shai, Q. D. Morris, B. J. Blencowe and B. J. Frey 2006,
Inferring global levels of alternative splicing isoforms using a generative model of microarray data,
Bioinformatics 22(5):606-613 (pdf).
O. Shai, B. J. Frey, Q. D. Morris, Q. Pan, C. Misquitta, and B. J. Blencowe 2004,
Probabilistic inference of alternative splicing events in microarray data.accepted
to Neural Information Processing Systems 17 (NIPS 04)
Q. Pan1, O. Shai1, C. Misquitta, W. Zhang,
N. Mohammad, T. Babak, H. Siu, T. R. Hughes, Q. D. Morris2,
B. J. Frey2, and B. J. Blencowe2 2004
Revealing global regulatory features of mammalian alternative splicing
using a quantitative microarray platform,
Molecular Cell 16:6, 929-941
[PubMed] 1 Joint-first authors 2 Joint-senior authors
Spatial trend removal (STR) - denoising microarray data
Microarrays allow biologists to measure the expression levels of thousands of mRNA
transcripts simultaneously. To improve data quality, we attempt to remove spatial systematic noise,
arising from slide imperfection, scanner artifacts, uneven washing, etc.
STR relies on the assumption that the probe placement on the array is random, and there should be no
correlation between nearby probes. This assumtion leads to a view of the data as high frequency data added
to a low frequency spatial trend. We have formulated an algorithm, STR, to remove spatial trends in
the data, while preserving high data fidelity, accounting for outliers, and efficiently optimizing for
slide specific trends.
STR first removes outlying measurements (arising from differentially expressed genes, faulty
probes, etc.) from the original data (shown in false colors in (a)). The "non-outliers" data (b) is then
filtered with a low-pass filter (d) to obtain an estimate of the spatial trend (c). The final,
detrended data is obtained by subtracting the spatial trend from the original data (e). STR used
gradient descent to optimize for the parameters of the low-pass filter.
Matlab code for STR is available by contacting one of the following:
Ofer Shai - ofer[at]psi. utoronto. ca
Quaid Morris - quaid[at]psi. utoronto. ca
Brendan Frey - frey[at]psi. utoronto. ca
Q. D. Morris1, O. Shai1, B. J. Frey, W. Zhang, and T. R. Hughes 2004
Spatial trend removal - reducing systematic noise in microarrays,
in preparation 1 Joint-first authors