AFFINITY PROPAGATION FAQ

DEFINITIONS

affinity propagation: An algorithm that identifies exemplars among data points and forms clusters of data points around these exemplars. It operates by simultaneously considering all data points as potential exemplars and exchanging messages between data points until a good set of exemplars and clusters emerges. (See BJ Frey and D Dueck, Science 315, Feb 16, 2007.)

k-centers clustering (a.k.a. k-medoids clustering and k-medians clustering): An exemplar learning algorithm that takes as input an initial set of exemplars (often randomly-selected) and then iteratively refines that set while changing the clusters to match the set of exemplars. It is similar to k-means clustering or the EM algorithm, except that the centers must lie on data points.

data point (a.k.a. training case): One item of data; affinity propagation clusters data points.

exemplar: A data point that is nicely representative of itself and some other data points.

similarity: The similarity of data point i to data point j, called s(i,j), is the suitability of point j to serve as the exemplar for data point i. If the data are real-valued, a common choice for similarity is negative squared Euclidean distance: s(i,j) = -(x(i)-x(j))·(x(i)-x(j)). Affinity propagation can be applied using more general notions of similarity, and the similarities may be positive or negative. In fact, the output of the algorithm is unchanged if the similarities are scaled and/or offset by a constant (as long as the preferences are scaled and/or offset by the same constant).

preference: The preference of point i, called p(i) or s(i,i), is the a priori suitability of point i to serve as an exemplar. Preferences can be set to a global (shared) value, or customized for particular data points. High values of the preferences will cause affinity propagation to find many exemplars (clusters), while low values will lead to a small number of exemplars (clusters). A good initial choice for the preference is the minimum similarity or the median similarity.
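As a concrete sketch of the two definitions above (this is our illustration, not the posted software), here is how a similarity matrix and a shared median preference might be computed for a handful of made-up 1-D points:

```python
# Sketch: negative squared Euclidean distance as similarity, with every
# preference s(i,i) set to the median of the off-diagonal similarities.
points = [0.0, 0.1, 5.0, 5.1]  # hypothetical 1-D data

n = len(points)
s = [[-(points[i] - points[j]) ** 2 for j in range(n)] for i in range(n)]

# Collect the off-diagonal similarities and take their median.
offdiag = sorted(s[i][j] for i in range(n) for j in range(n) if i != j)
m = len(offdiag)
median = offdiag[m // 2] if m % 2 else (offdiag[m // 2 - 1] + offdiag[m // 2]) / 2
for i in range(n):
    s[i][i] = median  # shared preference; lower it to find fewer clusters
```

Lowering the shared diagonal value would steer affinity propagation toward fewer clusters, as described above.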

availability: Messages sent from candidate exemplars (data points) to potential cluster members (data points), indicating how appropriate that candidate would be as an exemplar.

responsibility: Messages sent from potential cluster members (data points) to candidate exemplars (data points), indicating how well-suited the candidate would be to serve as the exemplar for the sending data point, taking other candidate exemplars into account.

net similarity: A score of how appropriate the exemplars are for explaining the data. This is the objective function that affinity propagation tries to maximize.

belief propagation: An iterative algorithm (used in affinity propagation) for searching over and scoring configurations of variables in a graphical model. It is also referred to as the "sum-product algorithm". When it is applied to a graph with cycles, it is often referred to as "loopy belief propagation".

sum-product and max-product algorithms: Whereas the sum-product algorithm combines evidence by summing probabilities (scores) over alternative configurations, the max-product algorithm combines evidence by keeping only the largest probability (score).

iteration: In affinity propagation, a single iteration involves computing all responsibility messages based on the current availability messages, the input similarities and the input preferences, and then computing all availability messages based on the responsibility messages, which were just updated.

damping / dampfact: Computing responsibilities and availabilities according to simple update rules will often lead to oscillations caused by "overshooting" the solution, so the responsibility and availability messages are "damped" like this:

msgnew = (dampfact)(msgold) + (1-dampfact)(msgnew)
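In code, this damping step is just a convex combination of the old message and the newly computed one (a sketch; the function and variable names are ours):

```python
# Blend the freshly computed message with the previous one.
# dampfact between 0.5 and 1 is typical (see the damping questions below).
def damp(msg_old, msg_new, dampfact=0.9):
    return dampfact * msg_old + (1 - dampfact) * msg_new
```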

clustering: An unsupervised machine learning task that involves partitioning data into groups of similar data points.

cluster: One of the data point groups arising from clustering.

convits and maxits: Affinity propagation iteratively computes responsibilities and availabilities. The algorithm terminates if decisions for the exemplars and the cluster boundaries are unchanged for convits iterations, or if maxits iterations are reached.
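Putting the definitions above together, here is a minimal pure-Python sketch of the whole update loop (ours, not the posted MATLAB/C software, and far less efficient). It applies the damped responsibility and availability updates and stops when the exemplar decisions are unchanged for convits iterations or maxits is reached:

```python
def affinity_propagation(s, dampfact=0.9, maxits=1000, convits=50):
    # s is an n x n list of lists of similarities; s[i][i] holds the preferences.
    n = len(s)
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    last, stable = None, 0
    for _ in range(maxits):
        # Responsibilities: r(i,k) = s(i,k) - max over k' != k of [a(i,k') + s(i,k')]
        for i in range(n):
            for k in range(n):
                competitor = max(a[i][kp] + s[i][kp] for kp in range(n) if kp != k)
                r[i][k] = dampfact * r[i][k] + (1 - dampfact) * (s[i][k] - competitor)
        # Availabilities: a(k,k) sums the positive responsibilities sent to k;
        # a(i,k) = min(0, r(k,k) + positive responsibilities from points other than i, k)
        for k in range(n):
            pos = [max(0.0, r[ip][k]) for ip in range(n)]
            tot = sum(pos)
            for i in range(n):
                if i == k:
                    new = tot - pos[k]
                else:
                    new = min(0.0, r[k][k] + tot - pos[k] - pos[i])
                a[i][k] = dampfact * a[i][k] + (1 - dampfact) * new
        # Decision: point k is an exemplar when r(k,k) + a(k,k) > 0.
        exemplars = tuple(k for k in range(n) if r[k][k] + a[k][k] > 0)
        stable = stable + 1 if exemplars == last else 0
        last = exemplars
        if exemplars and stable >= convits:
            break
    if not last:
        return (), []  # no exemplars identified (the algorithm did not converge)
    # Assign each non-exemplar to its most similar exemplar.
    labels = [i if i in last else max(last, key=lambda k: s[i][k]) for i in range(n)]
    return last, labels
```

On small, well-separated data this sketch recovers the expected clusters; the posted implementations are much more efficient and numerically careful.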

GENERAL QUESTIONS

What applications can affinity propagation be used for? It can be applied whenever there is a way to measure or pre-compute a number for each pair of data points, which indicates how similar they are (high values indicate high similarity, while low values indicate low similarity). We have been emailed by people who have used affinity propagation to analyze network traffic, group monkey and human skulls, detect interest points in images, group motion features in videos, identify common trading patterns, analyze basketball statistics, identify patterns in audio streams, group together similarly-acting biomolecules and cluster astrophysical objects. Note that similarities need only be provided for data points that can potentially be linked together (see below).

Does affinity propagation work on sparse data? If by "sparse", you mean that you know beforehand that each data point has the potential to be assigned to only a small fraction of the other data points, then the answer is "yes". Running affinity propagation on such data is equivalent to setting the similarity between two data points that shouldn't be linked together to negative infinity. However, there is a version of affinity propagation that only exchanges messages between pairs of points when the similarity is not negative infinity. This algorithm's time and memory requirements scale linearly with the number of similarities, which would be NxN if a full set of pairwise similarities is input, but much, much less if the set of similarities is sparse.

For my problem, computing all possible NxN similarities would take too long or require too much memory because N (the number of data points) is too large. I've heard that I can still use affinity propagation. How? There are two options that we know of. First, if you know beforehand that your similarities are sparse, you can use a much more efficient version of affinity propagation (see above). Second, if you have a way of computing similarities, but you don't know beforehand that the similarities are sparse, you can use what we call "leveraged affinity propagation" (available at the main affinity propagation web site). Leveraged affinity propagation samples from the full set of potential similarities and performs several rounds of sparse affinity propagation, iteratively refining the samples.

Does affinity propagation work with parallel computing? Interesting question. We have not explored a parallel version of affinity propagation yet, though there may be some benefit to running it on a large number of cores or machines. Note that it is a deterministic algorithm, so the trivial idea of running it with different initializations on different cores is not particularly interesting.

I heard that you ran affinity propagation on the Netflix data. How well did it work? Very well, in terms of net similarity :^) We haven't been able to find another algorithm that can achieve the same or better value for the objective function. As for how well it works for predicting user ratings, we don't know yet.

PREFERENCES AND THE NUMBER OF CLUSTERS

Why isn't the net similarity (objective function) maximized when every data point is an exemplar? Because it will usually be the case that the gain in similarity a data point achieves by assigning itself to an existing exemplar is higher than the preference value. If the shared preference value is larger than all similarities, then every data point will become an exemplar. This can happen unwittingly if all similarities are negative (e.g. using negative squared Euclidean distance) and the preference is set to zero. Lower preference values will lead the algorithm to find better configurations (higher net similarity) with data points assigned to clusters with other exemplars.

I ran affinity propagation and it found more clusters than I wanted to find. What should I do? If you're using a preference value shared among all data points, treat it like a control knob for the number of clusters and lower it. Keep in mind that it has the same "units" as the similarities, so if your similarities range from smin to smax, try values like smin, smin-(smax-smin), and so on. The program preferenceRange.m (available here) computes the range of preference values that are sensible for your problem. If your preferences differ across data points, you can subtract a constant from all of them to find fewer clusters.

How can I find exactly K clusters with affinity propagation? The number of clusters is close to being monotonically related to the preference, so you can run affinity propagation several times with different preference values, searching for exactly K clusters. We implemented a bisection method called apclusterK.m (available here), which does this automatically.
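The bisection idea behind apclusterK.m can be sketched as follows. Here clusters_for stands in for a full affinity propagation run at a given shared preference; the function names are ours, and the mock in the usage example is purely illustrative:

```python
# Binary-search a shared preference in [lo, hi] that yields exactly k clusters,
# assuming the cluster count is (approximately) monotone in the preference.
def find_preference_for_k(clusters_for, k, lo, hi, max_steps=50):
    for _ in range(max_steps):
        mid = (lo + hi) / 2.0
        c = clusters_for(mid)
        if c == k:
            return mid
        if c < k:
            lo = mid   # too few clusters: raise the preference
        else:
            hi = mid   # too many clusters: lower the preference
    return None  # no preference in [lo, hi] produced exactly k clusters
```

In practice each call to clusters_for is a complete affinity propagation run, so the number of bisection steps should be kept small.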

Is affinity propagation only good at finding a large number of quite small clusters? It depends on what you mean by "large" and "small". For example, it beats other methods at finding 100 clusters in 17,770 Netflix movies. While "100" may seem like a large number of clusters, it is not unreasonable to think that there may be 100 distinct types of movies. Also, on average there are 178 points per cluster, which is not "quite small". However, if you're looking for just a few clusters (eg, 1 to 5), you'd probably be better off using a simple method (see below).

How well does affinity propagation perform when finding two clusters? Partitioning N data points into two clusters can be done exactly in roughly N(N-1)(N-2)/2 operations, by trying all N(N-1)/2 pairs of possible exemplars and, for each pair, assigning the other (N-2) data points to their more similar exemplar. Affinity propagation takes NxN time per iteration, and since it is an approximate algorithm, you may be better off using the exact method for two clusters. Alternatively, taking the best of a large number of randomly-initialized k-centers clustering runs should also lead to a good result. For more than just a few clusters, affinity propagation is more appropriate.
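The exact two-cluster procedure just described is easy to sketch (the function name is ours; the diagonal of s holds the preferences):

```python
# Try every pair of candidate exemplars; each remaining point joins whichever
# exemplar it is more similar to. Keep the pair with the highest net similarity.
def best_two_exemplars(s):
    n = len(s)
    best, best_pair = float("-inf"), None
    for j in range(n):
        for k in range(j + 1, n):
            net = s[j][j] + s[k][k] + sum(
                max(s[i][j], s[i][k]) for i in range(n) if i not in (j, k))
            if net > best:
                best, best_pair = net, (j, k)
    return best_pair, best
```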

Affinity propagation seems to outperform methods that randomly resample potential centers, especially for non-trivial numbers of clusters. Why? Finding k clusters within n data points involves a search space of size n choose k, or [n(n-1)(n-2)...(n-k+1)]/[k(k-1)(k-2)...(2)], which is polynomial in n if k is within a fixed distance of 1 or n, but exponential in n if k grows linearly with n. Algorithms that depend on random initialization (e.g. k-centers clustering) perform better with smaller search spaces, because the initialization is more likely to be sampled from a region near the optimal solution. So, those methods work well when k is close to 1 or n. Affinity propagation does not rely on random sampling, but instead exchanges soft information between data points while considering all of them to be potential exemplars. It works better in the large-search-space situation, where k is not close to 1 or n.

CONCEPTUAL ISSUES

Why does affinity propagation work so well? Most methods that cluster by finding a set of prototypes work by keeping track of a fixed set of prototypes and iteratively refining that set. Sometimes, they make small changes to the size of the set during learning, by merging clusters or splitting clusters, but all that is kept track of is a particular configuration of the set of prototypes. In contrast, affinity propagation simultaneously considers all data points as potential prototypes and passes soft information around until a subset of data points "win" and become the exemplars. We think this gives affinity propagation an advantage over other methods.

How do you derive/justify affinity propagation? We derived affinity propagation by setting up a factor graph that describes the objective function used to identify exemplars and cluster data. Affinity propagation follows by applying the max-product algorithm (a variant of the sum-product algorithm or belief propagation algorithm) in this factor graph and then simplifying the resulting algorithm by noticing that certain numbers need not be stored. The full derivation is in the supporting online material of the Science paper.

How can the results of affinity propagation be evaluated for my problem? You can use the resulting net similarity (objective function) to evaluate the quality of the clustering solution; it is the sum of all similarities between non-exemplar data points and their exemplars plus the sum of all exemplar preferences. This is an indicator only for how well the objective function has been maximized given the similarities and may or may not correspond closely to your evaluation criteria. If you measure or compute the input similarities so that maximizing the net similarity corresponds closely to your true objective, then affinity propagation should do a good job. (You can compare it to other algorithms, such as k-centers clustering, which try to maximize the same objective function.)
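The net similarity just described can be computed directly from a clustering solution. A sketch (ours), where labels[i] gives the index of the exemplar that point i is assigned to and exemplars are points with labels[i] == i:

```python
# Net similarity = sum of each non-exemplar's similarity to its exemplar
# plus the preferences (diagonal entries) of the exemplars themselves.
def net_similarity(s, labels):
    return sum(s[i][i] if labels[i] == i else s[i][labels[i]]
               for i in range(len(s)))
```

This is the quantity to compare across algorithms that maximize the same objective, such as k-centers clustering.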

How do I know that affinity propagation is the right algorithm for my problem? If you can formulate your problem in terms of identifying a subset of data points as exemplars and using them to best account for all other data points, then affinity propagation would be a good algorithm to try. This way of viewing a problem is quite general and is relevant to problems in vision, text analysis, planning, finance, business and politics!

What can I compare affinity propagation's solution with? We have compared affinity propagation with thousands or even millions of random restarts of k-centers clustering (which takes a long time!). If your problem is small enough (several hundred data points) you can probably compare with the optimal solution found with linear programming (e.g. CPLEX software).

How do you decide if a data point is an exemplar? The responsibilities and availabilities are messages that provide evidence for whether or not each data point should be an exemplar and, if not, to what exemplar that data point should be assigned. After the messages have converged, there are two ways you can identify exemplars: 1) For data point i, if r(i,i)+a(i,i) > 0, then data point i is an exemplar; 2) For data point i, if r(i,i)+a(i,i) > r(i,j)+a(i,j) for all j not equal to i, then data point i is an exemplar. Both methods work well.
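Decision rule 1 above is a one-liner in code. A sketch (ours), assuming r and a are the converged responsibility and availability matrices:

```python
# Point i is declared an exemplar when r(i,i) + a(i,i) > 0.
def exemplars_from_messages(r, a):
    return [i for i in range(len(r)) if r[i][i] + a[i][i] > 0]
```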

Is there a connection between affinity propagation and other methods, like hierarchical agglomerative clustering and spectral clustering (normalized cuts)? Spectral clustering (normalized cuts) identifies clusters in a way that can be viewed as passing messages between data points, but that method does not identify an exemplar for each cluster. Affinity propagation can be viewed as a spectral clustering algorithm that requires each cluster to vote for a good exemplar from within its data points. Hierarchical agglomerative clustering starts with every data point as its own cluster and then recursively merges pairs of clusters, but that method makes hard decisions that can cause it to get stuck in poor solutions. Affinity propagation can be viewed as a version of hierarchical clustering that makes soft decisions so that it is free to hedge its bets when forming clusters.

Affinity propagation is an amazing concept! Could the same methodology be used to analyze affinities in computer networks, i.e. the Internet? Yes, in many different ways. For example, if each network node were considered as a "data point", and the similarity of node i to node k were set to unused bandwidth from node k to node i, then affinity propagation could be used to identify a set of nodes that would have maximal total bandwidth to all other nodes, eg, be useful as file servers. That's overly-simplistic, but it exemplifies how affinity propagation could be used in computer networks.

Is there a probabilistic interpretation of affinity propagation? Yes. Affinity propagation can be viewed as performing model selection and for the selected model, performing MAP estimation of the cluster centers and the assignments of data points to centers. Unlike parametric methods, the centers are forced to be on top of data points (which has advantages and disadvantages). The similarity of point x(i) to point x(k) can be thought of as -log P(data point x(i) | center x(k)) and the preference of point x(i) can be thought of as -log P(center x(i)), which corresponds to the Bayesian prior density over the center.

Is there a relationship between Affinity Propagation and the Markov Clustering Algorithm (MCL)? We don't know -- we're looking into this, but if you have an opinion we'd like to hear it!

How close to perfect (optimal) is affinity propagation? Finding exemplars is called the "facility location problem" in theoretical computer science and it is known to be NP-hard. So, finding the optimal solution to compare with affinity propagation is computationally infeasible in general. We have used linear programming software (CPLEX) to study the Olivetti faces dataset and found (for 63 clusters and after 16 hours of computation) that affinity propagation's solution gets at least ten times closer to optimal than the best of a million runs of k-centers clustering. We have found this to be generally the case for similar-sized problems with over 10 clusters; smaller problems (or those with less than, say, 5 clusters) have small search spaces so k-centers techniques can usually reach optimal after a few thousand restarts while affinity propagation may be slightly suboptimal. We have no exact solutions to compare with affinity propagation, for problems with thousands of data points (and moderate to large numbers of clusters).

To what extent can affinity propagation be carried out in real-time, i.e., when the similarities are changing in time? The current implementation does not take this into account, but similarities are only used for responsibility updates so you could modify the code by updating the similarities after each iteration. Let us know how it goes!

Could messages be passed asynchronously in affinity propagation? We haven't explored this question extensively for affinity propagation. However, we described an algorithm in NIPS 2006 which is similar, but where messages are computed asynchronously, such that each data point took in responsibilities and then sent out availabilities, in sequence. The algorithm worked, but had the disadvantage of requiring that an order be specified for the data points. Also, the result depended on that order.

Could the sum-product algorithm be used instead of the max-product algorithm? Yes, the sum-product algorithm can be used instead; however, affinity propagation would then no longer be invariant to constant multiples of the similarities, so the multiplier would need to be tuned. In addition, it needs to be implemented in the log-domain (as is the case with the max-product algorithm) to mitigate numerical precision issues, but requires computationally-expensive log-sum operations (unlike log-domain max-product, a.k.a. min-sum, which only needs maxes and sums). Finally, the O(NxN) implementation of sum-product affinity propagation has numerical precision issues arising from subtracting relatively large numbers from each other, meaning that an O(NxNxN) implementation may be required. We have performed limited experiments with sum-product affinity propagation and have not found the results to be better than the current max-product affinity propagation.

Could affinity propagation be derived using a different graphical model? It could be derived using an MRF or a Bayesian network, but we found that the factor graph was the most elegant and straightforward graphical model to use in this case.

Is there a convexified version of affinity propagation? Yes. We have tried "convexified free energy" belief propagation techniques (c.f. Weiss et al, Wainwright et al) to facilitate convergence (instead of the current damping). While the algorithm converged, it converged to poor solutions.

Could affinity propagation be used to obtain a hierarchical clustering of the data? There are two ways we can think of for doing this. The first is to trace the messages over time and extract a hierarchy of graphs. The second is to run affinity propagation using a low preference (so that few clusters are found), and then successively increase the preference while re-running affinity propagation. This will lead to a sequence of clustering solutions. Interestingly, the resulting sequence of partitions will not form a tree, ie, finer-resolution clusters may overlap the boundaries of coarser-resolution clusters. In fact, this is often a desirable property of hierarchical clustering, which standard methods such as agglomerative clustering don't have.

Most of the time, I found that affinity propagation achieves net similarities that are substantially higher than those I could obtain using any other method. But once I saw it find a lower net similarity. Why? Affinity propagation is an approximate method for maximizing the net similarity, a problem that is NP-hard in general. It is possible for algorithms - especially those given many restarts with random initializations - to achieve a better result than affinity propagation, but we have found that possibility to be remote if the number of data points is large and the number of desired exemplars is non-trivial. (For example, it is very easy to find the single best exemplar without running affinity propagation, so affinity propagation isn't appropriate for that task.)

When the clusters found by affinity propagation are fed into the standard k-centers clustering method, the net similarity sometimes increases (albeit only slightly). Why? Affinity propagation seems to be good at resolving the "major battles" between many subsets of data points competing to form good clusters, but when it's finished leaves a few players strewn about the battle-field without tidying them up properly. Sometimes, putting the output of affinity propagation through a couple iterations of k-centers clustering or another greedy algorithm will "polish up" the solution and since doing this is computationally cheap, we recommend it.

Can affinity propagation be viewed as just a good way to initialize standard methods, such as k-centers clustering? Tricky question. If you plan on using k-centers clustering no matter what, an exact clustering algorithm would be a terrific way to initialize k-centers clustering! However, if the question is whether k-centers clustering or affinity propagation would be doing the bulk of the work, the answer is affinity propagation. One thing we've tried doing is using the output of affinity propagation after each iteration to initialize k-centers clustering. While k-centers clustering always increased the objective function a little bit for the data we looked at, the final objective achieved by affinity propagation was significantly higher than what was achieved by most of the k-centers clustering runs. This indicates that affinity propagation is solving the problem in a quite different way than k-centers clustering.

Is it necessary to provide as input both s(i,j) and s(j,i) if it is always the case that s(i,j)=s(j,i), i.e. the similarities are symmetric? (Here, s(i,j) is the similarity of point i to point j and s(j,i) is the similarity of point j to point i.) Yes, all our implementations require you to input both s(i,j) and s(j,i) even if they are equal (e.g. arising from Euclidean distance) - any missing similarities are assumed to be negative infinity. The reason for this is that affinity propagation doesn't require that the similarities be symmetric (or satisfy the triangle inequality). This makes it applicable to more general definitions of similarity.

Should I use negative Euclidean distance as similarity? The best choice for the similarity depends on the application and there's no computational/statistical/optimality reason forcing you to choose Euclidean distance. Try using a similarity measure that matches your goal. (If it's easiest to initially experiment using negative Euclidean distance, that's a good starting point.) It is interesting that affinity propagation has been applied to a variety of problems where the data points are not points in a vector space, so Euclidean distance can't be used.

In your Science paper, you say that affinity propagation can be applied even when the similarities are non-metric (i.e., they don't satisfy the triangle inequality). This can make the search space highly non-convex. Do you have any results showing that affinity propagation actually does well in such cases? Yes. The Science paper describes applications to exon clustering, text clustering, and clustering of cities based on inter-city flight time and all of these applications involve non-metric similarities. We found that affinity propagation performed better than k-centers clustering in these applications.

TECHNICAL ISSUES

How many data points can affinity propagation handle on a modern, single-core computer? For exact affinity propagation, the number of scalar computations is equal to a constant times the number of input similarities, where in practice the constant is approximately 1000. The number of real numbers that need to be stored is equal to 8 times the number of input similarities. So, the number of data points is usually limited by memory. On our 16GB 2.4GHz machine, exact affinity propagation was able to cluster 23,000 data points based on all pairwise similarities (23000x23000) in a few hours. When the number of input similarities is too large, leveraged affinity propagation can be used, which exchanges messages between randomly selected points in each iteration - see here for software. On the same machine, leveraged affinity propagation was able to cluster 120,000 data points in under 1 day.

For many clustering algorithms, the computational and memory requirements scale with the dimensionality of the data. Is this true for affinity propagation? No, at least not directly. The input to affinity propagation is pairwise measures of similarity, so the size of the input to affinity propagation is independent of the dimensionality of the data. However, this dimensionality will probably affect the computation of similarities. Also, leveraged affinity propagation requests similarity values as needed, so if the function you provide computes the similarity using high-dimensional data points, the speed of leveraged affinity propagation may depend on the dimensionality of the data.

I ran affinity propagation with the default parameter settings and the plot of the net similarity continued to oscillate until the maximum number of iterations was reached. What should I do? In cases where affinity propagation fails to converge, you should first increase the damping factor and the parameters affecting the number of iterations, eg, set dampfact=0.9, maxits=2000, convits=200, and if that fails set dampfact=0.95, maxits=4000, convits=400. You may also want to increase the preference value(s), because affinity propagation converges better for higher preferences.

How does the affinity propagation software that is posted decide when to stop iterating? It monitors the exemplar decisions that would be made after each iteration and if these don't change over convits iterations, the procedure terminates. In any case, if maxits iterations are reached, the procedure terminates.

I've run affinity propagation many times on several data sets and my intuition is that it always converges when there exists a "stable" or "good" solution (in some sense) but may oscillate when a good solution doesn't exist. Can you prove anything about this scenario? Yes. It turns out that under certain conditions, if a stable solution exists (ie, a solution that is robust to perturbations in the similarities and preferences), affinity propagation will find it. Let us know if you're interested in the details.

Is damping needed? How does it affect the final result? Usually, yes. Loopy belief propagation (on which affinity propagation is based) can be viewed as a particular kind of over-relaxation. Damping is commonly needed in over-relaxation methods, and here it prevents the availability and responsibility updates from overshooting the solution and leading to oscillations. As long as affinity propagation converges, the exact damping level should not have a significant effect on the resulting net similarity. Higher damping factors will lead to slower convergence.

What values should I try for damping? The damping factor dampfact should be at least 0.5 and less than 1. We recommend setting it to 0.9. If the algorithm does not converge, damping can be increased but numerical precision issues can arise if it goes beyond 0.99. Be mindful that large damping values slow convergence, so convits and maxits should be adjusted accordingly.

I noticed that one version of the affinity propagation software adds a tiny amount of noise (at the level of machine precision) to the input similarities. I read that this is *not* used for random initialization. So, what is it used for? We add a small amount of noise to the input similarities (i.e. randomly flipping a few of the least significant bits) to address possible degeneracies in the set of similarities provided by the user. For example, this can happen if similarities are symmetric and two data points are isolated from the rest - then, there may be some difficulty deciding which of the two points should be the exemplar and this can lead to oscillations. As another example, this can happen if the similarities are integer-valued and multiple solutions have the same, maximum, net similarity. However, affinity propagation is itself a deterministic algorithm and is robust to tiny changes in the input similarities. So, because the amount of noise we add is so tiny, you should find that multiple runs of affinity propagation always give the same answer, even though different, random noise patterns are being added.
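A sketch of this kind of degeneracy-breaking jitter (ours, not the posted code): perturb each similarity by a relative amount far below any meaningful difference, using a fixed seed so that repeated runs give identical results:

```python
import random

# Multiply each similarity by (1 + epsilon), with epsilon near machine precision.
# The fixed seed makes the perturbation deterministic across runs.
def jitter(s, seed=0, scale=1e-12):
    rng = random.Random(seed)
    return [[v * (1 + scale * rng.random()) for v in row] for row in s]
```

The perturbed matrix is numerically indistinguishable from the original for clustering purposes, but exact ties between candidate exemplars are broken.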

I ran affinity propagation and I found that the list of exemplar indices contained NaN. What should I do? This is most likely caused by one or more data points not "finding" exemplars by the time the algorithm is stopped. This may be caused by the algorithm not converging, in which case the maximum number of iterations (maxits) and damping factor (dampfact) can be increased.

Affinity propagation is initialized by setting the "availabilities" to zero. What happens if they are set to random values and the algorithm is run multiple times using different random values? Neat idea. In fact, slightly better solutions can be obtained using this method, so if you really, really, really want to fine-tune the net similarity, you could try this method.

Should I use double precision or single when I call the software? Some of our implementations allow both single and double precision inputs; use single-precision (float in C/C++) if you find memory to be a concern.

SOFTWARE ISSUES

I had trouble compiling apcluster.c on the Mac OS X operating system. What can I do to successfully compile it on that OS? To compile on Mac OS X, try replacing "#include <values.h>" with "#include <float.h>" and add the following defines: "#define MAXDOUBLE DBL_MAX" and "#define MINDOUBLE DBL_MIN". (Thanks go to Robert Mark for pointing this out.)

Your APClustering does quite a good job, and it's fast! Any chances you'll be releasing code for the R-Project's, R or S+ languages? I know that many people in the various fields are using R (econometrics, biology, thermodynamics, physics). I know that other clustering algorithms are implemented already for R and S+, but not yours. We typically release implementations that we use in our research, which is primarily done in MATLAB or C/C++. Developing an R or S+ version would be a good project. Let us know if you find such an implementation.

I'm trying to apply your method to some of my data. I think I'm performing the operations correctly, but the output of the clustering algorithm (apcluster from your website, compiled fine) doesn't seem to return meaningful results. Any ideas? The most common problem is that input files aren't in the right format. We suggest you download the similarities file and the preferences file for the toy problem (25 data points) available here and make sure your input files follow the same format. (Note that the data point indices must be provided in the similarities file, but must not be provided in the preferences file.) Another reason the results may not be meaningful is that the algorithm hasn't converged. Try setting the preferences equal to the maximum similarity for your data, plus a small number. This should cause the algorithm to converge quickly and all data points to become exemplars. If this doesn't happen, check that the input files are correct and that your similarities are in proper format.

LEGAL ISSUES

Can affinity propagation be used without paying a fee? The software is copyrighted, but you can freely use it for non-commercial purposes. Please cite the Science paper if you do use it. For commercial purposes, please contact us for a site license.

Can I use affinity propagation for a commercial application? If you want to use it to obtain results for commercial purposes, you should contact us for a site license. If you want to incorporate it into software that is to be used for commercial purposes, you should contact us for licensing.