The computational task of organizing data based on a pre-defined or learned measure of similarity is a critical step in scientific data analysis and in engineering systems
A common approach is to use the data to learn a set of data centers such that the sum of squared errors between data points and their nearest centers is small
When the centers are selected from actual data points they are called exemplars
The most frequently used technique for learning exemplars is k-centers clustering, which starts with an initial set of exemplars, usually a randomly selected subset of the data points, and iteratively refines this set so as to decrease the sum of squared errors in each iteration.
k-centers clustering is quite sensitive to the initial set so the method is usually run many times with different random initializations in an attempt to find the best possible solution
We take a quite different approach and introduce a method that simultaneously considers all data points as potential exemplars
By viewing each data point as a node in a network we devised a method that recursively transmits real valued messages along edges of the network until a good set of exemplars and corresponding clusters emerges
As described later the simple formulas used to update the messages search for minima of an appropriately chosen energy function
At any point in time the magnitude of each message reflects the current affinity that one data point feels for choosing another data point as its exemplar so we call our method affinity propagation
Affinity propagation takes as input a collection of real-valued similarities between data points, where the similarity s(i,k) indicates how well the data point with index k is suited to be the exemplar for the data point with index i. When the goal is to minimize squared error, the appropriate measure of similarity is negative squared error: for points x_i and x_k, s(i,k) = -||x_i - x_k||^2.
In fact the method described here can be applied when the optimization criterion is much more general
Later we describe tasks where similarities are derived for pairs of images, pairs of microarray measurements, pairs of English sentences, and pairs of cities.
When an exemplar-dependent probability model is available, s(i,k) can be set to the log-likelihood of data point i given that its exemplar is point k. Alternatively, the similarities may be set by hand.
Rather than requiring that the number of clusters be specified in advance, affinity propagation takes as input a real number s(k,k) for each data point k, so that data points with larger values of s(k,k) are more likely to be chosen as exemplars. These values are referred to as preferences.
The preferences of all data points can be set to a shared value, such as the median of the input similarities (resulting in a moderate number of clusters) or the 0.1 percentile (resulting in a small number of clusters).
Varying the value of this shared preference will produce different numbers of clusters
It is often the case that a natural choice for the value of the preference of each data point is available; we describe such a task later.
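As a concrete illustration, the shared-preference choices just described can be sketched in NumPy; the toy 1-D points below are hypothetical:

```python
import numpy as np

# Toy 1-D data; s(i,k) = negative squared distance, the similarity used
# for squared-error clustering.
X = np.array([0.0, 0.1, 5.0, 5.2])
S = -(X[:, None] - X[None, :]) ** 2

# Shared preference: the median of the off-diagonal similarities gives a
# moderate number of clusters; a very low percentile gives fewer clusters.
off_diag = S[~np.eye(len(X), dtype=bool)]
pref_moderate = np.median(off_diag)
pref_few = np.percentile(off_diag, 0.1)
np.fill_diagonal(S, pref_moderate)   # s(k,k) stores the preference for point k
```

Because similarities are negative here, the low-percentile preference is more negative than the median, which is exactly what discourages points from becoming exemplars.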
Affinity propagation identifies exemplars by recursively sending real valued messages between pairs of data points
Unlike k-centers clustering and related methods which keep track of a fixed size set of candidate exemplars affinity propagation considers all points as potential exemplars and by passing messages between data points gradually decides on a good set of exemplars
The number of detected exemplars number of clusters is influenced by the values of the input preferences but also emerges from the message passing procedure
There are two kinds of message exchanged between data points each of which takes into account a different kind of competition
They can be combined at any point during the procedure to decide which points are exemplars and for every other point which exemplar it belongs to
The responsibility r(i,k), sent from data point i to candidate exemplar point k, reflects the accumulated evidence for how well suited point k is to serve as the exemplar for point i, taking into account other potential exemplars for point i. The availability a(i,k), sent from candidate exemplar point k to point i, reflects the accumulated evidence for how appropriate it would be for point i to choose point k as its exemplar, taking into account the support from other points for point k being an exemplar.
Before affinity propagation begins the availabilities are initialized to zero
The responsibilities are computed using the rule

    r(i,k) <- s(i,k) - max_{k' != k} { a(i,k') + s(i,k') }

In the first iteration, because the availabilities are zero, r(i,k) is set to the input similarity between point i and point k as its exemplar, minus the largest of the similarities between point i and the other candidate exemplars.
This competitive update is data driven and does not take into account how many other points favor each candidate exemplar
In later iterations, when some points are effectively assigned to other exemplars, their availabilities will drop below zero, as prescribed by the availability update rule described below. These negative availabilities decrease the effective values of some of the input similarities s(i,k') in the above rule, removing the corresponding candidate exemplars from competition.
For k = i, the responsibility r(k,k) is set to the input preference s(k,k) that point k be chosen as an exemplar, minus the largest of the similarities between point i and all other candidate exemplars. This self-responsibility reflects accumulated evidence that point k is an exemplar, based on its input preference tempered by how ill-suited it is to be assigned to another exemplar.
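Assuming the similarities and availabilities are stored as dense NumPy matrices S and A (preferences on the diagonal of S), the responsibility update can be sketched as follows; note that the same formula covers the self-responsibility case k = i, because the maximum excludes k itself:

```python
import numpy as np

def update_responsibilities(S, A):
    """r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')] for all i, k."""
    n = S.shape[0]
    AS = A + S
    idx = np.argmax(AS, axis=1)                  # strongest competitor per row
    first = AS[np.arange(n), idx]
    AS[np.arange(n), idx] = -np.inf              # mask it to find the runner-up
    second = AS.max(axis=1)
    R = S - first[:, None]                       # subtract the max over k' != k ...
    R[np.arange(n), idx] = S[np.arange(n), idx] - second  # ... also where k is the argmax
    return R

# First iteration: availabilities are zero, so r(i,k) is the similarity
# minus the strongest competing similarity (toy 3-point matrix, with
# preferences of -1 on the diagonal).
S = np.array([[-1.0, -4.0, -9.0],
              [-4.0, -1.0, -2.0],
              [-9.0, -2.0, -1.0]])
R = update_responsibilities(S, np.zeros_like(S))
```

The two-pass max/runner-up trick avoids an explicit loop over excluded candidates k'.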
Whereas the above responsibility update lets all candidate exemplars compete for ownership of a data point the availability update given below gathers evidence from data points as to whether each candidate exemplar would make a good exemplar
For i != k, the availability a(i,k) is set to the self-responsibility r(k,k) plus the sum of the positive responsibilities that candidate exemplar k receives from other points:

    a(i,k) <- min{ 0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)) }

Only the positive portions of incoming responsibilities are added, because it is only necessary for a good exemplar to explain some data points well (positive responsibilities), regardless of how poorly it explains other data points (negative responsibilities).
If the self-responsibility r(k,k) is negative (indicating that point k is currently better suited as belonging to another exemplar than as being an exemplar itself), the availability of point k as an exemplar can still be increased if some other points have positive responsibilities for point k being their exemplar. To limit the influence of strong incoming positive responsibilities, the total sum is rectified so that it cannot rise above zero.
The self-availability a(k,k) is updated differently:

    a(k,k) <- sum_{i' != k} max(0, r(i',k))

This message reflects accumulated evidence that point k is an exemplar, based on the positive responsibilities sent to candidate exemplar k from other points.
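A minimal sketch of the two availability updates just described, again assuming a dense NumPy responsibility matrix R with self-responsibilities on its diagonal:

```python
import numpy as np

def update_availabilities(R):
    """a(i,k) = min(0, r(k,k) + sum over i' not in {i,k} of max(0, r(i',k)))
    for i != k; the self-availability a(k,k) is the unclipped sum of
    positive responsibilities from the other points."""
    Rp = np.maximum(R, 0.0)
    np.fill_diagonal(Rp, 0.0)            # r(k,k) enters separately, not in the sum
    cols = Rp.sum(axis=0)                # sum_{i' != k} max(0, r(i',k))
    A = np.minimum(0.0, np.diag(R)[None, :] + cols[None, :] - Rp)
    np.fill_diagonal(A, cols)            # a(k,k) is not rectified to be <= 0
    return A

# Toy responsibilities: column 0 collects strong positive support,
# so candidate exemplar 0 acquires a large self-availability.
R = np.array([[ 2.0, -1.0, -2.0],
              [ 1.0, -3.0,  0.5],
              [ 3.0, -2.0, -4.0]])
A = update_availabilities(R)
```

Subtracting Rp inside the column sum removes the recipient's own contribution, implementing the "i' not in {i,k}" exclusion without a loop.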
At any point during affinity propagation a data point can combine its availabilities and responsibilities to decide whether or not it is an exemplar and if not which other point is its exemplar
For point i, the value of k that maximizes a(i,k) + r(i,k) either identifies point i as an exemplar (if k = i) or identifies the data point that is the exemplar for point i.
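This decision rule is a one-liner; the message matrices below are hypothetical values, used only to show the argmax and the exemplar test k = i:

```python
import numpy as np

# Hypothetical availability and responsibility matrices for 3 points.
A = np.array([[ 0.5, -1.0, -2.0],
              [-0.5,  0.0, -3.0],
              [-0.1, -2.0,  0.2]])
R = np.array([[ 1.0, -2.0, -1.0],
              [ 2.0, -1.0, -2.0],
              [ 3.0, -1.0, -0.5]])
labels = np.argmax(A + R, axis=1)                  # exemplar choice for each point
exemplars = np.flatnonzero(labels == np.arange(len(labels)))  # points choosing themselves
```

Here all three points choose point 0, which also chooses itself, so point 0 is the single exemplar.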
The rules for updating responsibilities and availabilities and deciding on label values require only simple local computations that are easily implemented
The messages may be updated in parallel or sequentially according to a prescribed schedule and messages only need to be exchanged between pairs of points with known input similarities
The message passing procedure is terminated after a fixed number of iterations after changes in the messages fall below a threshold or after the local decisions stay constant for some number of iterations
Numerical oscillations that arise in some circumstances can be avoided, and good solutions found, by damping the responsibilities and availabilities, i.e., setting each message to λ times its value from the previous iteration plus (1 − λ) times its prescribed updated value, where the damping factor λ is between 0 and 1.
In all of our experiments, each iteration of affinity propagation consisted of (i) updating all responsibilities given the availabilities, (ii) updating all availabilities given the responsibilities, and (iii) combining availabilities and responsibilities to monitor the exemplar decisions, terminating the algorithm when these decisions did not change for 10 iterations. Unless otherwise noted, we used a damping factor of λ = 0.5.
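Putting the pieces together, the iteration schedule just described (responsibility sweep, availability sweep, damped updates, and termination once decisions are stable for 10 iterations) can be sketched as a minimal dense-matrix implementation; the 1-D toy data at the bottom are hypothetical:

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200, conv_iter=10):
    """Dense-matrix affinity propagation sketch following the schedule in
    the text; stops when decisions are unchanged for conv_iter iterations."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    labels, stable = None, 0
    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf          # mask best entry to get runner-up
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew   # damped update
        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0.0)
        np.fill_diagonal(Rp, 0.0)
        cols = Rp.sum(axis=0)
        Anew = np.minimum(0.0, np.diag(R)[None, :] + cols[None, :] - Rp)
        np.fill_diagonal(Anew, cols)             # a(k,k): unclipped positive sum
        A = damping * A + (1 - damping) * Anew
        # decisions: each point i picks the k maximizing a(i,k) + r(i,k)
        new_labels = np.argmax(A + R, axis=1)
        stable = stable + 1 if labels is not None and np.array_equal(new_labels, labels) else 0
        labels = new_labels
        if stable >= conv_iter:
            break
    exemplars = np.flatnonzero(labels == np.arange(n))
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]  # final assignment
    labels[exemplars] = exemplars
    return exemplars, labels

# Two well-separated 1-D clusters; similarity = negative squared distance,
# shared preference = median of the off-diagonal similarities.
x = np.array([0.0, 0.1, 0.3, 5.0, 5.1, 5.3])
S = -(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S[~np.eye(len(x), dtype=bool)]))
exemplars, labels = affinity_propagation(S)
```

On this toy problem the median preference yields two exemplars, one per cluster, without the number of clusters being specified anywhere.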
Fig. 1A shows the dynamics of affinity propagation when it is applied to 25 two-dimensional data points, using negative squared error as the similarity.
One advantage of affinity propagation is that the number of exemplars need not be specified beforehand
Instead the appropriate number of exemplars emerges from the message passing method and depends on the input exemplar preferences
This enables automatic model selection based on a prior specification of how preferable each point is as an exemplar
Fig. 1D shows the effect of the value of the input preference, common to all data points, on the number of identified exemplars.
This relationship is quite similar to the relationship found by exactly minimizing the squared error which is computationally tractable in this case because the number of possible solutions is only 33 million
We next studied the problem of clustering images of faces using the standard optimization criterion of squared error
We used both affinity propagation and k-centers clustering to identify exemplars among 900 grayscale images containing different faces with varying expressions and orientation extracted from the Olivetti face database
Each 50 × 50 image was normalized so that its pixel intensities had a mean of 0 and a variance of 0.1.
The similarity between two images was set to the negative sum of squared pixel differences
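The preprocessing and similarity computation described here can be sketched as follows; random arrays stand in for the actual Olivetti images:

```python
import numpy as np

# Random arrays stand in for the face images; each image is normalized
# to zero mean and variance 0.1, as in the text.
rng = np.random.default_rng(0)
images = rng.random((5, 50 * 50))     # 5 hypothetical 50x50 images, flattened
images -= images.mean(axis=1, keepdims=True)
images *= np.sqrt(0.1) / images.std(axis=1, keepdims=True)

# s(i,k) = negative sum of squared pixel differences
diff = images[:, None, :] - images[None, :, :]
S = -np.sum(diff ** 2, axis=2)
```

The resulting matrix is symmetric with a zero diagonal; the diagonal would then be overwritten by the chosen preference before running affinity propagation.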
For a preference of −600, affinity propagation detected 62 exemplars, which were much more representative (had lower squared error) than the 62 exemplars obtained in the best of 100 runs of k-centers clustering with random restarts.
Although a single run of our implementation of affinity propagation took only a few seconds we next asked whether a huge number of random restarts of k-centers clustering could achieve the same squared error
Fig. 2B shows the average squared error achieved by a single run of affinity propagation, along with the distribution of squared errors achieved by 10,000 runs of k-centers clustering, plotted against the number of detected exemplars. Affinity propagation uniformly achieved much lower squared error, in over two orders of magnitude less time.
Another popular optimization criterion is the sum of absolute pixel differences which better tolerates outlying pixel intensities so we repeated the above procedure using this error measure
As shown in Fig. 2C, affinity propagation again uniformly achieved lower error.
Many tasks require the identification of exemplars among sparsely related data, i.e., where most similarities are either unknown or large and negative.
To examine affinity propagation in this context we next addressed the task of clustering exons to find genes in mammalian DNA using the sparse similarity matrix derived from microarray data and reported in (3)
In that work, 75,000 segments of DNA, each 60 bases long and corresponding to a putative exon, were mined from the genome of mouse Chromosome 1. Their transcription levels were measured across 12 tissue samples, and the similarity between every pair of putative exons (considered as data points) was computed.
Squared error was not an appropriate optimization criterion in this application
Instead the measure of similarity between putative exons was based on a cost function measuring their proximity in the genome and the degree of coordination of their transcription levels across the 12 tissues
To account for putative exons that are not in fact exons (e.g., introns), an additional artificial exemplar was included, and the similarity of each other data point to this non-exon exemplar was determined using statistics taken over the entire data set. In the resulting 75,000 × 75,000 similarity matrix, 99% of the similarities had a value of −∞, because most DNA segments were so far apart in the genome that they could not be part of the same gene.
In (3) a computational tool engineered to take into account additional biological knowledge was used to detect genes
For results reported here we applied affinity propagation directly to the similarity matrix
Because messages need not be exchanged between points i and k when s(i,k) = −∞, each iteration of affinity propagation required exchanging messages between only a tiny subset of data point pairs.
For a small portion of the data, Fig. 3A shows the identification of exemplars representing putative genes and the assignment of the other data points either to one of these exemplars or to the exemplar representing non-transcribed regions.
For different numbers of clusters, the reconstruction errors achieved by affinity propagation and k-centers clustering are compared in Fig. 3B. For each number of clusters, affinity propagation was run once and took six minutes, whereas k-centers clustering was run 10,000 times and took 280 hours.
To address the question of how well these methods perform at detecting bona fide gene segments, Fig. 3C plots the true-positive rate against the false-positive rate, using the labels provided in the RefSeq database as ground truth.
Affinity propagation achieved significantly higher true-positive rates, especially at the low false-positive rates that are most important to biologists. At a false-positive rate of 3%, affinity propagation achieved a true-positive rate of 39%, whereas the best k-centers clustering result was 17%. Thus affinity propagation performed better than k-centers clustering both in terms of the criterion used for optimization and in terms of biological accuracy. For comparison, at the same false-positive rate, the best of several versions of hierarchical agglomerative clustering achieved a true-positive rate of 19%, and the method of (3), which accounts for additional biological knowledge, achieved a true-positive rate of 43%.
Affinity propagation's ability to operate with non-standard optimization criteria makes it suitable for exploratory data analysis using unusual measures of similarity.
Unlike metric space clustering techniques such as k-means clustering affinity propagation can be applied to problems where the data does not lie in a continuous space
In fact it can be applied to problems where the similarities are not symmetric and even to problems where the similarities do not satisfy the triangle inequality
To identify a small number of sentences in this manuscript that reflect all other sentences well we treated each sentence as a bag of words and computed the similarity of sentence i to sentence k based on the information theoretic cost of encoding the words in sentence i using the words in sentence k
In the resulting matrix of similarities 97% of the similarities were not symmetric
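The exact information-theoretic encoding cost used for the sentence experiment is not spelled out here, so the sketch below substitutes a simple stand-in cost (1 unit for a word the candidate sentence shares, a fixed hypothetical dictionary cost for a word it lacks); it illustrates how such a measure is naturally asymmetric:

```python
# Hypothetical stand-in for the encoding-cost similarity: a word of
# sentence i costs 1 unit if sentence k contains it, and a fixed
# dictionary cost (miss_cost, an assumed value) otherwise, so s(i,k)
# and s(k,i) generally differ.
def encoding_similarity(sent_i, sent_k, miss_cost=10.0):
    vocab_k = set(sent_k.lower().split())
    return -sum(1.0 if w in vocab_k else miss_cost
                for w in sent_i.lower().split())

a = "affinity propagation passes messages between points"
b = "messages between points"
s_ab = encoding_similarity(a, b)   # a's extra words are expensive to encode with b
s_ba = encoding_similarity(b, a)   # all of b's words appear in a
```

Because affinity propagation never requires s(i,k) = s(k,i), such a matrix can be used directly.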
The input preference was adjusted so that three exemplar sentences were detected; these are shown in Fig. 4A.
We also applied affinity propagation to explore the problem of detecting a restricted number of Canadian and American cities that are most easily accessible by large subsets of other cities in terms of commercial airline travel time
Each data point was a city in Canada or the United States, and the similarity s(i,k) was set to the negative time it takes to travel from city i to city k by airline, including estimated stopover delays, headwinds, and so on. Because of headwinds, the transit time was in many cases different depending on the direction of travel, so 36% of the similarities were asymmetric. Further, for 97% of city pairs i and k, there was a third city j such that the triangle inequality was violated: the trip from i to k included a long stopover delay in city j, so it took longer than the sum of the durations of the trips from i to j and from j to k.
When the number of most accessible cities was constrained to seven by adjusting the input preference appropriately, the cities shown in Fig. 4B-E were detected. It is interesting that several major cities were not selected, either because heavy international travel makes them inappropriate as easily accessible domestic destinations (e.g., New York City, Los Angeles) or because their neighborhoods can more efficiently access other destinations (e.g., Atlanta, Philadelphia, and Minneapolis-St. Paul account for Chicago's destinations while avoiding huge potential airport delays).
Affinity propagation can be viewed as a method that minimizes a cost function or energy function
The goal of identifying exemplars and assigning every other data point to an exemplar can be formulated as minimizing an energy function that depends on a set of N hidden labels c_1 c_N corresponding to the N data points
Each label c_i indicates the exemplar to which data point i belongs, so that s(i, c_i) is the similarity of data point i to its exemplar. The assignment c_i = i is a special case indicating that point i is itself an exemplar, in which case s(i, c_i) = s(i,i) is the input preference for point i being an exemplar.
Not all configurations of the labels are valid
A configuration of the labels c is valid when, for every data point i, if some other point i' has chosen i as its exemplar (c_{i'} = i), then i must itself be labeled as an exemplar (c_i = i). The energy of a valid configuration c = (c_1, ..., c_N) is

    E(c) = - sum_{i=1}^{N} s(i, c_i)
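The energy E(c) = -sum_i s(i, c_i) and the validity constraint can be sketched directly; the small similarity matrix below is hypothetical, with preferences on the diagonal:

```python
import numpy as np

def energy(S, labels):
    """E(c) = -sum_i s(i, c_i); a configuration is valid only when the
    exemplar of every point is itself an exemplar (c_{c_i} = c_i)."""
    labels = np.asarray(labels)
    assert np.all(labels[labels] == labels), "invalid label configuration"
    return -S[np.arange(len(labels)), labels].sum()

# Hypothetical similarities, with preferences on the diagonal.
S = np.array([[-5.0, -1.0, -9.0],
              [-1.0, -5.0, -4.0],
              [-9.0, -4.0, -5.0]])
E_one = energy(S, [1, 1, 1])   # point 1 as the single exemplar
E_two = energy(S, [0, 0, 2])   # points 0 and 2 as exemplars
```

In this toy matrix the single-exemplar configuration has lower energy, showing how the preferences on the diagonal penalize extra exemplars.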
Exactly minimizing the energy is computationally intractable since a special case of this minimization problem is the NP-hard metric k-median problem
However, the update rules for affinity propagation correspond to fixed-point recursions for minimizing a free-energy approximation of the above energy.
The update rules are most easily derived as an instance of the min-sum algorithm in a factor graph describing the constraints on the labels and the energy function
Affinity propagation has several advantages over related techniques
Iterative methods such as k-centers clustering k-means clustering and the expectation maximization algorithm store a relatively small set of current estimates of the cluster centers at each step
In contrast by simultaneously considering all data points as candidate cluster centers affinity propagation is able to avoid many of the poor solutions caused by unlucky initialization
While Markov chain Monte Carlo sampling techniques can be used to search over multiple models the state of the Markov chain at any point in time is meant to specify a plausible and relatively small set of cluster centers
So these sampling techniques do not share affinity propagations advantage of simultaneously considering a much larger number of relevant putative cluster centers
Hierarchical agglomerative clustering and spectral clustering solve the quite different problem of recursively comparing pairs of data points to find partitions of the data
These techniques do not require all points within a cluster to be similar to a single exemplar and are thus not well suited to many tasks
In particular, two data points that should not be in the same cluster may be grouped together by an unfortunate sequence of pairwise groupings.
Also because these methods do not search for a good set of representative exemplars a new test case would be compared with all points before the appropriate cluster can be identified
Earlier work showed that the related metric k-median problem can be relaxed to form a linear program with a constant-factor approximation guarantee. There, the input was a matrix of costs (negative similarities), but the costs were assumed to be metric, i.e., nonnegative, symmetric, and satisfying the triangle inequality. In contrast, affinity propagation can take general non-metric similarities as input, and although we do not prove performance guarantees, we demonstrated here that affinity propagation can find good solutions when the similarities are non-metric.
Importantly affinity propagation also provides a conceptually novel approach in light of the linear programming relaxation
Whereas the linear programming relaxation is quite difficult to solve, requiring sophisticated software packages, affinity propagation makes use of intuitive message updates that can be implemented in a few lines of code.
Affinity propagation is related in spirit to techniques recently used to obtain record-breaking results in quite different disciplines. The approach of recursively propagating messages in a loopy graph has been used to approach Shannon's limit in error-correcting decoding, to solve random satisfiability problems with an order-of-magnitude increase in size, and to efficiently estimate depth from pairs of stereo images. Yet, to our knowledge, affinity propagation is the first method to make use of this idea to solve the age-old, fundamental problem of clustering data.
Because of its simplicity general applicability and performance we believe affinity propagation will prove to be of broad value in science and engineering