com.aliasi.cluster
Class SingleLinkClusterer<E>
java.lang.Object
com.aliasi.cluster.AbstractHierarchicalClusterer<E>
com.aliasi.cluster.SingleLinkClusterer<E>
- Type Parameters:
E - the type of object being clustered
- All Implemented Interfaces:
- Clusterer<E>, HierarchicalClusterer<E>
public class SingleLinkClusterer<E>
- extends AbstractHierarchicalClusterer<E>
A SingleLinkClusterer implements standard single-link
agglomerative clustering. Single link clustering is a greedy
algorithm in which the two closest clusters are always merged
up to a specified distance threshold. Distance between clustes for
single link clustering is defined to be the minimum of the distances
between the members of the clusters. See CompleteLinkClusterer
for a clusterer that takes the maximum rather than the minimum in
making clustering decisions.
For example, consider the following proximity matrix
representing distances between pairs of elements (the same example
is used in CompleteLinkClusterer):
| | A | B | C | D | E |
| A | 0 | 1 | 2 | 7 | 5 |
| B | | 0 | 3 | 8 | 6 |
| C | | | 0 | 5 | 9 |
| D | | | | 0 | 4 |
| E | | | | | 0 |
The result of single-link clustering is the following dendrogram (pardon
the ASCII art):
A B C D E
| | | | |
1 ----- | | |
2 ------- | |
3 | | |
4 | -----
5 ----------
First, the objects A and B are merged
at distance one. Then object C is merged into the
cluster at distance 2, because that is its distance from
A. Objects D and E are
merged next at distance 4, and finally, the two big clusters are
merged at distance 5, the distance between A and
E.
The various clusters at each proximity bound threshold are:
| Threshold Range | Clusters |
| [Double.NEGATIVE_INFINITY,1) | {A}, {B}, {C}, {D}, {E}
|
| [1,2) | {A,B}, {C}, {D}, {E} |
| [2,4) | {A,B,C}, {D}, {E} |
| [4,5) | {A,B,C}, {D,E} |
| [5,Double.POSITIVE_INFINITY] | {A,B,C,D,E} |
The intervals show the clusters returned for thresholds within
the specified interval. As usual, square brackets denote inclusive
range bounds and round brackets exclusive bounds. For instance,
if the distance threshold is 1.5, four clusters are returned,
whereas if the threshold is 4.0, two clusters are returned.
Note that this example has the same clusters as that in CompleteLinkClusterer. A simple example where the results are
different is the result of clustering four points on the x-axis,
x1=1, x2=3,
x3=6, x4=10.
The distance is computed by subtraction, e.g. d(x2,
x4) = x4- x2.
This leads to the following distance matrix:
| | x1 | x2 | x3 | x4 |
| x1 | 0 | 2 | 5 | 9 |
| x2 | | 0 | 3 | 7 |
| x3 | | | 0 | 4 |
| x4 | | | | 0 |
For this example, the complete link clustering is
{{x1,x2},{x3,x4}}
whereas the single-link clustering is:
{{{x1,x2},x3},x4}
Implementation Note: This clusterer is implemented using
the minimum-spanning-tree-based algorithm described in:
Jain, Anil K. and Richard C. Dubes. 1988. Algorithms for Clustering
Data. Prentice-Hall.
The minimum spanning tree is constructed using Kruskal's algorithm,
which is described in:
Cormen, Thomas H., Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
2001.
Introduction to Algorithms. MIT Press.
This algorithm requires an efficient lookup of whether two
nodes are already in the same connected component, which
is implemented by the standard disjoint set algorithm with
path compression (also described in Cormen et al.).
In brief, the algorithm sorts all pairs of objects in order of
increasing distance (with n objects that involves
sorting n2 pairs of objects,
leading to an initial step with worst case time complexity bound
O(n2 log n). The pairs are then
merged in order of decreasing distance. The sets involved in
Kruskal's algorithm translate directly into dendrograms with their
dereferencing.
- Since:
- LingPipe2.0
- Version:
- 3.8.3
- Author:
- Bob Carpenter
|
Method Summary |
Dendrogram<E> |
hierarchicalCluster(Set<? extends E> elementSet)
Return the array of clusters derived from the specified
distance matrix by performing single-link clustering up to the
specified maximum distance bound. |
| Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
SingleLinkClusterer
public SingleLinkClusterer(double maxDistance,
Distance<? super E> distance)
- Construct a single-link clusterer with the specified distance
bound.
- Parameters:
maxDistance - Maximum distance for clusters.distance - Distance measure between objects to cluster.
- Throws:
IllegalArgumentException - If the specified bound is not
a non-negative number.
SingleLinkClusterer
public SingleLinkClusterer(Distance<? super E> distance)
- Construct a single-link clusterer with no distance bound.
The distance bound is set to
Double.POSITIVE_INFINITY,
which effectively removes the distance bound.
- Parameters:
distance - Distance measure between objects to cluster.
hierarchicalCluster
public Dendrogram<E> hierarchicalCluster(Set<? extends E> elementSet)
- Return the array of clusters derived from the specified
distance matrix by performing single-link clustering up to the
specified maximum distance bound. Every pair of elements in
the matrix that are within the distance bound will fall in the
same dendrogram in the result. Setting the maximum distance
to
Double.POSITIVE_INFINITY results in a complete
clustering.
- Specified by:
hierarchicalCluster in interface HierarchicalClusterer<E>- Specified by:
hierarchicalCluster in class AbstractHierarchicalClusterer<E>
- Parameters:
elementSet - Set of elements to cluster.
- Returns:
- Clustering in the form of a set of sets of elements.
- Throws:
IllegalArgumentException - If the set of elements is empty.