com.aliasi.cluster
Class SingleLinkClusterer<E>

java.lang.Object
  extended by com.aliasi.cluster.AbstractHierarchicalClusterer<E>
      extended by com.aliasi.cluster.SingleLinkClusterer<E>
Type Parameters:
E - the type of object being clustered
All Implemented Interfaces:
Clusterer<E>, HierarchicalClusterer<E>

public class SingleLinkClusterer<E>
extends AbstractHierarchicalClusterer<E>

A SingleLinkClusterer implements standard single-link agglomerative clustering. Single link clustering is a greedy algorithm in which the two closest clusters are always merged up to a specified distance threshold. Distance between clustes for single link clustering is defined to be the minimum of the distances between the members of the clusters. See CompleteLinkClusterer for a clusterer that takes the maximum rather than the minimum in making clustering decisions.

For example, consider the following proximity matrix representing distances between pairs of elements (the same example is used in CompleteLinkClusterer):

 ABCDE
A01275
B 0386
C  059
D   04
E    0
The result of single-link clustering is the following dendrogram (pardon the ASCII art):
         A   B   C   D   E
         |   |   |   |   |
      1  -----   |   |   |
      2    -------   |   |
      3       |      |   |
      4       |      -----
      5       ----------
 

First, the objects A and B are merged at distance one. Then object C is merged into the cluster at distance 2, because that is its distance from A. Objects D and E are merged next at distance 4, and finally, the two big clusters are merged at distance 5, the distance between A and E.

The various clusters at each proximity bound threshold are:

Threshold RangeClusters
[Double.NEGATIVE_INFINITY,1){A}, {B}, {C}, {D}, {E}
[1,2) {A,B}, {C}, {D}, {E}
[2,4){A,B,C}, {D}, {E}
[4,5){A,B,C}, {D,E}
[5,Double.POSITIVE_INFINITY]{A,B,C,D,E}
The intervals show the clusters returned for thresholds within the specified interval. As usual, square brackets denote inclusive range bounds and round brackets exclusive bounds. For instance, if the distance threshold is 1.5, four clusters are returned, whereas if the threshold is 4.0, two clusters are returned.

Note that this example has the same clusters as that in CompleteLinkClusterer. A simple example where the results are different is the result of clustering four points on the x-axis, x1=1, x2=3, x3=6, x4=10. The distance is computed by subtraction, e.g. d(x2, x4) = x4- x2. This leads to the following distance matrix:

 x1x2x3x4
x10259
x2 037
x3  04
x4   0
For this example, the complete link clustering is {{x1,x2},{x3,x4}} whereas the single-link clustering is: {{{x1,x2},x3},x4}

Implementation Note: This clusterer is implemented using the minimum-spanning-tree-based algorithm described in:

Jain, Anil K. and Richard C. Dubes. 1988. Algorithms for Clustering Data. Prentice-Hall.
The minimum spanning tree is constructed using Kruskal's algorithm, which is described in:
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2001. Introduction to Algorithms. MIT Press.
This algorithm requires an efficient lookup of whether two nodes are already in the same connected component, which is implemented by the standard disjoint set algorithm with path compression (also described in Cormen et al.).

In brief, the algorithm sorts all pairs of objects in order of increasing distance (with n objects that involves sorting n2 pairs of objects, leading to an initial step with worst case time complexity bound O(n2 log n). The pairs are then merged in order of decreasing distance. The sets involved in Kruskal's algorithm translate directly into dendrograms with their dereferencing.

Since:
LingPipe2.0
Version:
3.8.3
Author:
Bob Carpenter

Constructor Summary
SingleLinkClusterer(Distance<? super E> distance)
          Construct a single-link clusterer with no distance bound.
SingleLinkClusterer(double maxDistance, Distance<? super E> distance)
          Construct a single-link clusterer with the specified distance bound.
 
Method Summary
 Dendrogram<E> hierarchicalCluster(Set<? extends E> elementSet)
          Return the array of clusters derived from the specified distance matrix by performing single-link clustering up to the specified maximum distance bound.
 
Methods inherited from class com.aliasi.cluster.AbstractHierarchicalClusterer
cluster, distance, getMaxDistance, setMaxDistance
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SingleLinkClusterer

public SingleLinkClusterer(double maxDistance,
                           Distance<? super E> distance)
Construct a single-link clusterer with the specified distance bound.

Parameters:
maxDistance - Maximum distance for clusters.
distance - Distance measure between objects to cluster.
Throws:
IllegalArgumentException - If the specified bound is not a non-negative number.

SingleLinkClusterer

public SingleLinkClusterer(Distance<? super E> distance)
Construct a single-link clusterer with no distance bound. The distance bound is set to Double.POSITIVE_INFINITY, which effectively removes the distance bound.

Parameters:
distance - Distance measure between objects to cluster.
Method Detail

hierarchicalCluster

public Dendrogram<E> hierarchicalCluster(Set<? extends E> elementSet)
Return the array of clusters derived from the specified distance matrix by performing single-link clustering up to the specified maximum distance bound. Every pair of elements in the matrix that are within the distance bound will fall in the same dendrogram in the result. Setting the maximum distance to Double.POSITIVE_INFINITY results in a complete clustering.

Specified by:
hierarchicalCluster in interface HierarchicalClusterer<E>
Specified by:
hierarchicalCluster in class AbstractHierarchicalClusterer<E>
Parameters:
elementSet - Set of elements to cluster.
Returns:
Clustering in the form of a set of sets of elements.
Throws:
IllegalArgumentException - If the set of elements is empty.