com.aliasi.cluster
Class Dendrogram<E>

java.lang.Object
  extended by com.aliasi.cluster.Dendrogram<E>
Type Parameters:
E - the type of objects being clustered
All Implemented Interfaces:
Scored
Direct Known Subclasses:
LeafDendrogram, LinkDendrogram

public abstract class Dendrogram<E>
extends Object
implements Scored

A Dendrogram represents the result of a weighted hierarchical clustering of a set of objects. A dendrogram's cost is interpreted as the cost of forming the cluster of its members; this may be a join cost for a bottom-up (agglomerative) clusterer or a split cost for a top-down (divisive) clusterer. The minimum cost is 0.0, and each dendrogram's sub-dendrogram's must have a lower cost than it.

There are two subclasses of dendrogram, leafs and links. A LeafDendrogram is a single element cluster with a cost of 0.0. A LinkDendrogram is constructed by merging two sub-dendrograms with a given cost. The resulting link dendrogram then becomes the parent of the sub-dendrograms.

Dendrograms that are linked point to their parent dendrogram through the method parent(). The parent is always a link dendrogram. The method dereference() returns the result of following the chain of parent links to the final, top-level dendrogram. All of the members of a given dendrogram will dereference to the same parent.

Dendrograms return their set of elements through memberSet(). They may also be partitioned into clusterings in one of two ways. These clusterings are sets of sets that cover the members of the dendrogram. The method partitionDistance(double) returns all top-level dendrograms where all links are within the specified score. The method partitionK(int) returns a partition of the specified size. Both operate by breaking links in the dendrogram. The thresholded method breaks all links whose score are above a threshold. The size-based method breaks links in order of decreasing distance until the specified number of partitions is generated.

The method copheneticCorrelation(Distance) may be used to evaluate the distances between objects implied by the dendrogram and the distances in a specified distance function. The distances implied by the dendrogram are so-called cophenetic distances. The cophenetic distance between a pair of elements in the member set of a dendrogram is the distance of the sub-dendrogram linking them.

Given a distance measure and number of clusters, the method withinClusterScatter(int,Distance) evaluates the within-cluster scatter, as defined in the class documentation for ClusterScore.

Since:
LingPipe2.0
Version:
3.9.3
Author:
Bob Carpenter, Mike Ross

Field Summary
 
Fields inherited from interface com.aliasi.util.Scored
REVERSE_SCORE_COMPARATOR, SCORE_COMPARATOR
 
Method Summary
 boolean contains(E elt)
          Returns true if this dendrogram contains the specified object as a leaf.
 double copheneticCorrelation(Distance<? super E> distance)
          Returns the Pearson correlation coefficient between the inter-object distances and the cophenetic distances defined by this dendrogram.
 Dendrogram<E> dereference()
          Returns the top-level cluster of which this dendrogram is a member.
abstract  Set<E> memberSet()
          Returns the members of this dendrogram.
 LinkDendrogram<E> parent()
          Returns the dendrogram that immediately contains this dendrogram or null if this is a top-level dendrogram.
 Set<Set<E>> partitionDistance(double maxDistance)
          Returns the partition produced by cutting this dendrogram at the specified maximum distance.
 Set<Set<E>> partitionK(int numClusters)
          Return the partition of this dendrogram with the specified number of clusters.
 String prettyPrint()
          Returns a pretty-printed version of this dendrogram with entries on a single line and indenting with scores to show the merges.
abstract  double score()
          Returns the proximity between the components of this dendrogram.
 int size()
          Returns the size of this dendrogram measured by the number of objects.
static
<E> boolean
structurallyEquivalent(Dendrogram<E> dendrogram1, Dendrogram<E> dendrogram2)
          Return true if the specified dendrograms are structurally equivalent.
 String toString()
          Returns a string-based representation of this dendrogram.
 double withinClusterScatter(int numClusters, Distance<? super E> distance)
          Returns the within-cluster scatter relative to the specified distance of the clustering determined by splitting this dendrogram into the specified number of clusters.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Method Detail

parent

public LinkDendrogram<E> parent()
Returns the dendrogram that immediately contains this dendrogram or null if this is a top-level dendrogram.

Returns:
the parent of this dendrogram or null if this is a root dendrogram.

dereference

public Dendrogram<E> dereference()
Returns the top-level cluster of which this dendrogram is a member. A dendrogram with a null parent dereferences to itself.

Returns:
The top-level cluster of which this dendrogram is a member.

size

public int size()
Returns the size of this dendrogram measured by the number of objects. This result is consistent with the interpretation of the dendrogram as a set.

Returns:
The size of this dendrogram.

contains

public boolean contains(E elt)
Returns true if this dendrogram contains the specified object as a leaf.

Parameters:
elt - Object to test.
Returns:
true if this dendrogram contains the specified object as a leaf.

partitionK

public Set<Set<E>> partitionK(int numClusters)
Return the partition of this dendrogram with the specified number of clusters. This is computed by breaking links from the top down; graphically, this is like drawing a line through the dendrogram. Note that if two clusters have the same score, one will be chosen to split first.

Parameters:
numClusters - Number of clusters in returned partition.
Returns:
The partition with the specified number of clusters.
Throws:
IllegalArgumentException - If the specified number of clusters is less than one or larger than the number of elements.

partitionDistance

public Set<Set<E>> partitionDistance(double maxDistance)
Returns the partition produced by cutting this dendrogram at the specified maximum distance. Two elements will be in the same set within the partition if they were merged in the dendrogram at a distance less than or equal to the specified distance.

This operation may be understood in terms of the isomorphism between equivalence relations and partitions. An equivalence relation is a transitive, symmetric, reflexive relation between objects. An equivalence relation determines a partition with clusters corresponding to sets of mutually equivalent objects. For this method, objects are equivalent if the distance between them is less than or equal to the specified maximum distance in the dendrogram.

Parameters:
maxDistance - Maximum dendrogram distance at which elements are considered to be part of the same cluster.
Returns:
The partition of the elements.

prettyPrint

public String prettyPrint()
Returns a pretty-printed version of this dendrogram with entries on a single line and indenting with scores to show the merges.

Returns:
A pretty string representation of this dendrogram.

toString

public String toString()
Returns a string-based representation of this dendrogram.

Overrides:
toString in class Object
Returns:
A string-based representation of this dendrogram.

memberSet

public abstract Set<E> memberSet()
Returns the members of this dendrogram.

Returns:
The members of this dendrogram.

score

public abstract double score()
Returns the proximity between the components of this dendrogram.

Specified by:
score in interface Scored
Returns:
The proximity between this dendrogram's components.

withinClusterScatter

public double withinClusterScatter(int numClusters,
                                   Distance<? super E> distance)
Returns the within-cluster scatter relative to the specified distance of the clustering determined by splitting this dendrogram into the specified number of clusters. Thus the result is ClusterScore.withinClusterScatter(partitionK(numClusters),distance).

See ClusterScore.withinClusterScatter(Set,Distance) for more information on the scatter evaluation metric.

Throws:
IllegalArgumentException - If the number of clusters is not between 1 and the size of this dendrogram inclusive.

copheneticCorrelation

public double copheneticCorrelation(Distance<? super E> distance)
Returns the Pearson correlation coefficient between the inter-object distances and the cophenetic distances defined by this dendrogram.

The cophenetic distance between a pair of elements is defined to be the score of the dendrogram which joins the two elements together; that is, they are in distinct sub-dendrograms of the dendrogram.

A cophenetic distance forms what is known as an ultrametric. It forms a usual metric distance satisfying the following requirements:

 d(x,y) >= 0
 d(x,x) = 0
 d(x,y) = d(y,x)
 d(x,y) + d(y,z) >= d(x,z)
In addition, an ultrametric satisfies the following:
d(x,z) << max(d(x,y), d(y,z))
Geometrically, this means that for all triples x,y,z at least two of d(x,y),d(x,z),d(y,z) are the same.

Given the cophenetic distance ultrametric from a dendrogram, both single-link and complete-link clusterers will produce the same dendrogram when clustering based on the cophenetic distance.

Parameters:
distance - Underlying distance measure.

structurallyEquivalent

public static <E> boolean structurallyEquivalent(Dendrogram<E> dendrogram1,
                                                 Dendrogram<E> dendrogram2)
Return true if the specified dendrograms are structurally equivalent. Two dendrograms are structurally equivalent if and only if they have the same score, and they are either both leaf dendrograms with the same object, or they are both link dendrograms with equivalent daughters (in either order).

Note that this is not the same relation as Object.equals(Object), which is defined by reference, not by structural equivalence. Nor is it the same relation as defined by the score comparators.

Parameters:
dendrogram1 - First dendrogram to compare.
dendrogram2 - Second dendrogram to compare.
Returns:
true if the two dendrograms are equivalent.