What is Clustering?

Clustering is a technique for grouping objects based on similarity. For instance, Clusty provides clustering of the web search results it retrieves from other search engines. A popular technique for performing coreference between mentions of entities, such as telling whether the "Cpt. John J. Smith" in one document is the same as the "Captain J. J. Smith" in another document, is to compare the aliases using weighted edit distance and the contexts of words around the mentions using TF/IDF measures. Clustering then provides a way of grouping aliases by reference.

Simple Disjoint Clustering

The Clustering Interface

The Clusterer interface provides a single method:

interface Clusterer<E> {
    Set<Set<E>> cluster(Set<? extends E> elements);
}

Note that a clusterer is parameterized by the type E of the objects being clustered. For example, a clusterer that clusters strings would be declared as Clusterer<String>. The clustering method is defined to take a set of elements extending E, which means they are guaranteed to extend the E class or implement the E interface. The clustering is returned as a set of sets of Es, declared as type Set<Set<E>>.

Example

For example, a clusterer of strings might take a set of strings containing elements:

{ "aa", "aaa", "aaaaa", "bbb", "bbbb" }

and clustering by edit distances up to two, return the following clustering:

  {
    { "aa", "aaa", "aaaaa" },
    { "bbb", "bbbb" }
  }

The code for this example may be found in src/StringEdits.java; we return to it after introducing hierarchical clustering.

Partitions

For the two clustering implementations we provide in LingPipe, the resulting clusters form what is known as a partition of the input set. A set of sets is a partition of another set if each element of the original set is a member of exactly one set of the partition. In mathematical terms, the sets making up a partition are pairwise disjoint and their union is the original set. The example clustering in the previous section is a partition of the input set.
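
This definition is easy to check mechanically. The following sketch, using a hypothetical helper that is not part of LingPipe, verifies both conditions at once (it uses java.util.Set and java.util.HashSet):

static <E> boolean isPartition(Set<? extends Set<E>> clustering,
                               Set<E> elements) {
    int totalSize = 0;
    Set<E> union = new HashSet<E>();
    for (Set<E> cluster : clustering) {
        totalSize += cluster.size(); // counts elements with multiplicity
        union.addAll(cluster);
    }
    // pairwise disjoint: no element counted twice;
    // exhaustive: the union recovers the original set
    return totalSize == union.size() && union.equals(elements);
}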

Hierarchical Clustering and Dendrograms

The HierarchicalClusterer interface extends the clusterer interface with a single method:

interface HierarchicalClusterer<E> extends Clusterer<E> {
    Dendrogram<E> hierarchicalCluster(Set<? extends E> elements);
}

The input is the same as for simple clustering, a set of elements whose type extends the clusterer's element type. The output is an instance of com.aliasi.cluster.Dendrogram. A dendrogram is a binary tree over the elements being clustered, with a distance attached to each branch indicating the distance between the two subbranches.

Example (Continued)

Continuing the edit-distance based example from the last section, consider the following dendrogram produced by single-link clustering (explained below), as printed by the dendrogram to-string method:

{{{aaa+aa}:1.0+aaaaa}:2.0+{bbbb+bbb}:1.0}:3.0

and in indented binary tree notation:

3.0
    1.0
        bbbb
        bbb
    2.0
        1.0
            aaa
            aa
        aaaaa

and finally in the standard dendrogram notation:

3        ----------------
        |                |
2       |            --------
        |           |        |
1     -----       -----      |
     |     |     |     |     |
    bbbb  bbb   aaa   aa   aaaaa

There is a leaf for each of the five members of the input set. There are then four "links". First, bbbb and bbb are linked at their distance, 1. Similarly, aa and aaa are linked at distance 1. Then {aa,aaa} is linked to aaaaa at distance 2. Finally, the a cluster is linked to the b cluster at distance 3.

Clusterings from Dendrograms

There are two ways to extract standard clusterings from dendrograms. The simplest is to set a distance bound and retain every cluster formed at a distance less than or equal to that bound. This is equivalent to cutting the standard dendrogram notation at a specified height. The other way to construct a clustering is to continue cutting the highest-distance cluster until a specified number of clusters is obtained.

Single-Link and Complete-Link Clustering

Agglomerative Clustering

LingPipe provides the two standard agglomerative clustering algorithms, single-link and complete-link clustering. These clusterers are agglomerative in the sense that the algorithms begin with a single leaf dendrogram for each element being clustered and then build up to the complete dendrogram by repeatedly joining the closest pair of clusters.

The only difference between the approaches is that single-link clustering takes the distance between clusters to be the minimum distance between elements of the clusters, whereas complete-link clustering takes it to be the maximum. Thus single-link clustering tends to create highly separated clusters, whereas complete-link clustering tends to create more tightly centered clusters. In practice, complete-link clustering tends to be more useful, though it is more expensive to compute.

General Distance Functions

Single-link and complete-link clusterers are both based on a notion of distance between objects. This is made concrete by the com.aliasi.util.Distance interface:

public interface Distance<E> {
    public double distance(E e1, E e2);
}

The distance interface is parameterized by type E and provides a floating point distance between two objects of type E.

Each clusterer requires an appropriate distance function, which must be a distance between supertypes of its own types; that is, a single- or complete-link clusterer implementing Clusterer<E> requires a distance function implementing Distance<? super E>.
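
For instance, because String implements CharSequence, a Distance<CharSequence> such as the edit distance introduced below may drive a clusterer of strings. A minimal sketch, assuming the (maximum distance, distance function) constructor order described later in this tutorial:

Distance<CharSequence> editDistance = new EditDistance(false);
Clusterer<String> clusterer
    = new SingleLinkClusterer<String>(2.0, editDistance); // a Distance<? super String>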

Example: Strings By Edit Distance

Edit Distance

The edit distance between strings is the number of insert, delete, or substitution operations required to convert one string into the other. For instance, the edit distance between "Fred" and "Ted" is two, as it requires the substitution of T for F and the deletion of r.

In the following example, we will use LingPipe's implementation of simple edit distance, com.aliasi.spell.EditDistance, which computes edit distances between character sequences, and is thus specified to implement Distance<CharSequence>. Instances are created directly:

static final Distance<CharSequence> EDIT_DISTANCE
    = new EditDistance(false);

The boolean false indicates that transposition is not permitted as an edit operation.
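
As a quick check of the distance function, consider the "Fred"/"Ted" example from above:

double d = EDIT_DISTANCE.distance("Fred", "Ted"); // 2.0: substitute T for F, delete r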

Example: Strings to Cluster

In this section, we illustrate the differences between single- and complete-link clustering in the context of string edit distances. The following inputs provide an example where complete-link and single-link clustering produce different results:

{ "a", "aaa", "aaaaaa", "aaaaaaaaaa" }

That's 1, 3, 6 and 10 instances of the letter a. In this case, edit distance is just the distance in length, as that is the number of insertions required to convert the shorter string into the longer (equivalently deletions working from longer to shorter).
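
For reference in the step-by-step traces below, the full matrix of pairwise edit distances is:

             a   aaa   aaaaaa   aaaaaaaaaa
a            0    2      5          9
aaa          2    0      3          7
aaaaaa       5    3      0          4
aaaaaaaaaa   9    7      4          0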

Example: Clusterings

Here are the dendrograms produced by the complete-link and single-link clusterers.

Complete- and Single-Link Clusterings

Complete Link:

9.0
    4.0
        aaaaaa
        aaaaaaaaaa
    2.0
        aaa
        a

Single Link:

4.0
    3.0
        aaaaaa
        2.0
            aaa
            a
    aaaaaaaaaa

Note that the clusterings are different trees. The single-link clustering is deeper, whereas the complete-link clustering is more balanced.

Example: Step-by-Step

Here are the steps taken by single-link clustering:

a  aaa  aaaaaa  aaaaaaaaaa
{a aaa}:2  aaaaaa  aaaaaaaaaa
{{a aaa}:2  aaaaaa}:3  aaaaaaaaaa
{{{a aaa}:2 aaaaaa}:3  aaaaaaaaaa}:4

Note that aaaaaa is linked to {a aaa} at distance 3, which is the distance to aaa, not at distance 5, which is the distance to a. Similarly the final linkage of aaaaaaaaaa to the cluster of the smaller strings is at distance 4, its distance to aaaaaa, the smallest distance among the distances to a, aaa and aaaaaa.

Here are the steps taken by complete-link clustering:

a  aaa  aaaaaa  aaaaaaaaaa
{a aaa}:2  aaaaaa  aaaaaaaaaa
{a aaa}:2  {aaaaaa  aaaaaaaaaa}:4
{{a aaa}:2  {aaaaaa  aaaaaaaaaa}:4}:9

The main difference is that aaaaaa is linked to aaaaaaaaaa instead of to the cluster {a aaa}. That's because the maximum distance of aaaaaa to an element of {a aaa} is 5, whereas the distance between aaaaaa and aaaaaaaaaa is only 4. The final link is at distance 9, which is the maximum distance between elements of the two clusters, specifically between a and aaaaaaaaaa.

Dendrograms to Clusterings

Extracting K Clusters

The step-by-step example of clustering above suggests one way to generate K clusters from a dendrogram: simply run the agglomerative algorithm until exactly K clusters remain, then stop. The same result may be computed directly from a completed dendrogram by cutting the most expensive links until exactly K dendrograms are left.

Given a dendrogram, we may return the least-cost K clusters by calling the method partitionK(int). For instance, consider the continuation of the example above in src/StringEdits.java. Here is the code to generate partitions of size K:

for (int k = 1; k <= clDendrogram.size(); ++k) {
    Set<Set<String>> clKClustering = clDendrogram.partitionK(k);
    System.out.println(k + "  " + clKClustering);
}

resulting in the following output for the complete-link dendrogram listed above:

Complete Link Clusterings
1  [[aaaaaa, aaa, a, aaaaaaaaaa]]
2  [[aaa, a], [aaaaaa, aaaaaaaaaa]]
3  [[aaa, a], [aaaaaa], [aaaaaaaaaa]]
4  [[aaaaaaaaaa], [aaaaaa], [aaa], [a]]

The single-link results are a bit different:

Single Link Clusterings
1  [[aaaaaa, aaa, a, aaaaaaaaaa]]
2  [[aaaaaa, aaa, a], [aaaaaaaaaa]]
3  [[aaa, a], [aaaaaaaaaa], [aaaaaa]]
4  [[aaaaaa], [aaaaaaaaaa], [aaa], [a]]

Note that these clusterings exactly mirror the steps in the construction of the complete-link and single-link clusterings outlined above.

Extracting Clusters by Distance

The second built-in way to convert a dendrogram into a partition is to cut the dendrogram at a given distance. That is, all clusters are retained that are formed at or below the given distance.
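
Continuing the running example, here is a sketch of such a distance cut; we assume the cutting method on Dendrogram is partitionDistance(double), so check the class documentation for the exact name:

Set<Set<String>> clustering = clDendrogram.partitionDistance(2.0);
// for the complete-link dendrogram above: [[aaa, a], [aaaaaa], [aaaaaaaaaa]]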

Configuring Hierarchical Clusterers

The hierarchical clustering interface returns a dendrogram, whereas the simple clustering interface returns a set of sets. Both single-link and complete-link clustering are configured so that their simple clustering behavior is to return all clusters that were formed at or below a specified maximum distance. This maximum distance is set in the constructor for single-link and complete-link clusterers.
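
For example, a complete-link clusterer constructed with maximum distance 3.0 returns, via cluster(), every cluster formed at distance 3.0 or less. A sketch, again assuming the (maximum distance, distance function) argument order, where inputSet is a set of strings to cluster:

Clusterer<CharSequence> thresholdClusterer
    = new CompleteLinkClusterer<CharSequence>(3.0, EDIT_DISTANCE);
Set<Set<CharSequence>> clustering = thresholdClusterer.cluster(inputSet);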

Evaluating Clusters

The next part of this demo is about scoring clustering results. It is possible to score clusterings (sets of sets) derived from any two sources, called the reference and response partitions.

Scoring Demo

This section uses a running example found in src/ClusterScoreDemo.java.

Building Clusterings Directly

To show the generality of the clustering evaluation and to sidestep any concerns about particular clusterers, we build up a reference and response clustering by hand in the demo:

Set<String> referenceCluster1 = new HashSet<String>();
referenceCluster1.add("1");
referenceCluster1.add("2");
referenceCluster1.add("3");

Set<String> referenceCluster2 = new HashSet<String>();
referenceCluster2.add("4");

Set<String> referenceCluster3 = new HashSet<String>();
referenceCluster3.add("5");
referenceCluster3.add("6");

Set<Set<String>> referencePartition = new HashSet<Set<String>>();
referencePartition.add(referenceCluster1);
referencePartition.add(referenceCluster2);
referencePartition.add(referenceCluster3);

In set notation, the reference partition is {{1,2,3},{4},{5,6}}. Of course, this could be read from a file or created in other ways. We also create the response clustering directly.

Set<String> responseCluster1 = new HashSet<String>();
responseCluster1.add("1");
responseCluster1.add("2");

Set<String> responseCluster2 = new HashSet<String>();
responseCluster2.add("3");
responseCluster2.add("4");
responseCluster2.add("5");
responseCluster2.add("6");

HashSet<Set<String>> responsePartition = new HashSet<Set<String>>();
responsePartition.add(responseCluster1);
responsePartition.add(responseCluster2);

In set notation, the response is {{1,2},{3,4,5,6}}, which is a partition with two clusters. To summarize:

Reference: { {1,2,3}, {4}, {5,6} }
 Response: { {1,2}, {3,4,5,6} }

Scoring one clustering against the other is as simple as constructing a score object, which provides access to a broad range of clustering evaluations:

ClusterScore<String> score
    = new ClusterScore<String>(referencePartition,responsePartition);

Cluster scores can also be generated from arrays of sets instead of sets of sets. Cluster scores have an extensive toString() method, which prints essentially the same information the demo produces with the illustrative code below.

Traditional Clustering Measures

Traditional cluster scoring is based on the agreement between the clusters as measured over whether they put pairs of elements (a,b) in the same cluster or not. Using these scoring metrics is straightforward:

PrecisionRecallEvaluation prEval
  = score.equivalenceEvaluation();
System.out.println(prEval.toString());

The method name equivalenceEvaluation() reflects the fact that a partition induces an equivalence relation, from which all of these measures are derived, as we show below. The output is a classify.PrecisionRecallEvaluation. A precision-recall evaluation is based on the following counts over ordered pairs of elements (a,b):

  true positive (TP): a and b are in the same set in both the reference and response partitions;
  false positive (FP): a and b are in the same set in the response partition but not in the reference;
  false negative (FN): a and b are in the same set in the reference partition but not in the response;
  true negative (TN): a and b are in different sets in both partitions.

For instance, the pair (1,2) counts as a true positive, because the elements are in the same set of the reference partition, {1,2,3}, and in the same set of the response partition, {1,2}. The pair (2,1) also counts as a true positive, as does the pair (1,1). Since each element is paired with itself, there are always at least as many true positives as there are elements; each unordered pair of distinct elements is then counted twice, once in each order. The other two true positives are (5,6) and (6,5), for a total of ten true positives -- two unordered pairs counted in both orders, plus six self-pairs. The pair (3,4) is a false positive, as is (4,3), along with the other ordered pairs drawn from the response set {3,4,5,6} that are separated in the reference, for ten false positives in all. The pairs (1,3), (3,1), (2,3) and (3,2) are the false negatives. True negatives consist of the remaining cases, such as (1,6), where the elements are in different sets in both the reference and response partitions. The total number of cases is n², where n is the number of objects being partitioned; here, 6² = 36.
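
These counts may be verified by brute force. Here is a minimal sketch, not part of the demo, that loops over all ordered pairs using a hypothetical helper sameCluster():

static <E> boolean sameCluster(Set<Set<E>> partition, E a, E b) {
    for (Set<E> cluster : partition)
        if (cluster.contains(a) && cluster.contains(b))
            return true;
    return false;
}

String[] elts = { "1", "2", "3", "4", "5", "6" };
int tp = 0, fp = 0, fn = 0, tn = 0;
for (String a : elts) {
    for (String b : elts) {
        boolean ref = sameCluster(referencePartition,a,b);
        boolean resp = sameCluster(responsePartition,a,b);
        if (ref && resp) ++tp;
        else if (resp) ++fp;  // same in response only
        else if (ref) ++fn;   // same in reference only
        else ++tn;            // different in both
    }
}
// tp=10, fp=10, fn=4, tn=12, matching the report below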

From the counts, the following results are returned by prEval.toString():

Equivalence Relation Evaluation
Total=36
True Positive=10
False Negative=4
False Positive=10
True Negative=12
Positive Reference=14
Positive Response=20
Negative Reference=22
Negative Response=16
Accuracy=0.6111111111111112
Recall=0.7142857142857143
Precision=0.5
Rejection Recall=0.5454545454545454
Rejection Precision=0.75
F(1)=0.588235294117647
Fowlkes-Mallows=16.73320053068151
Jaccard Coefficient=0.4166666666666667
Yule's Q=0.5
Yule's Y=0.2679491924311227
Reference Likelihood=0.3888888888888889
Response Likelihood=0.5555555555555556
Random Accuracy=0.4876543209876543
Random Accuracy Unbiased=0.5015432098765432
kappa=0.24096385542168683
kappa Unbiased=0.2198142414860682
kappa No Prevalence=0.22222222222222232
chi Squared=2.3376623376623376
phi Squared=0.06493506493506493
Accuracy Deviation=0.08124967025363077

These are all explained with formulas in the PrecisionRecallEvaluation class documentation.

MUC and B3 Measures

In addition to the traditional measures based on the underlying equivalence relations generated by partitions, the computational linguistics community has developed two classes of measures of its own. The first is the scoring defined by Vilain et al. and used in the Message Understanding Conferences. It is accessed as follows:

System.out.println("\nMUC Measures");
System.out.println("  MUC Precision = "
                   + score.mucPrecision());
System.out.println("  MUC Recall = "
                   + score.mucRecall());
System.out.println("  MUC F(1) ="
                   + score.mucF());

This code produces the following output:

MUC Measures
MUC Precision = 0.5
MUC Recall = 0.6666666666666666
MUC F(1) = 0.5714285714285715

The other class of measures consists of the B3 measures of Bagga and Baldwin. These are accessed through:

System.out.println("\nB-Cubed Measures");
System.out.println("  Cluster Averaged Precision = "
                   + score.b3ClusterPrecision());
System.out.println("  Cluster Averaged Recall = "
                   + score.b3ClusterRecall());
System.out.println("  Cluster Averaged F(1) = "
                   + score.b3ClusterF());
System.out.println("  Element Averaged Precision = "
                   + score.b3ElementPrecision());
System.out.println("  Element Averaged Recall = "
                   + score.b3ElementRecall());
System.out.println("  Element Averaged F(1) = "
                   + score.b3ElementF());

and produce the output:

B-Cubed Measures
Cluster Averaged Precision = 0.6875
Cluster Averaged Recall = 0.8518518518518519
Cluster Averaged F(1) = 0.7609022556390977
Element Averaged Precision = 0.5833333333333333
Element Averaged Recall = 0.7777777777777777
Element Averaged F(1) = 0.6666666666666666

More information on the definitions of the MUC and B3 measures is provided in the ClusterScore class documentation.

Natural Language Clustering

In this section, we provide a widely used baseline approach to document clustering based on treating documents as term vectors and comparing them by cosine. The code may be found in src/TokenCosineDocCluster.java.

Sample Documents

As applications, we consider two tiny examples: clustering newsgroup posts by topic, and clustering documents that mention a John Smith by the particular John Smith to which they refer. The two data collections are used in the results sections below.

Our approach will differ from Bagga and Baldwin's in four important ways. First, we use whole documents rather than sentences linked by coreference; this was their baseline condition. Second, we use agglomerative clustering rather than Bagga and Baldwin's greedy method. Third, we take the square root of raw counts as our term frequencies rather than the raw counts themselves. Fourth, we do not use inverse-document-frequency normalization, although inverse document frequency can be important for discrimination.

Documents and Token Counts

As described above, the basis of single-link and complete-link clustering is a distance measure over objects to cluster. To this end, we define a Document class to represent documents in src/TokenCosineDocCluster.java. A document stores four fields: the file from which it was generated, the chars making up its text, a token counter, and a length:

static class Document {

    final File mFile;
    final char[] mText; // don't really need to store
    final ObjectToCounterMap<String> mTokenCounter
        = new ObjectToCounterMap<String>();
    final double mLength;
    ...

The constructor simply sets these variables given a specified file from which to build the document.

Document(File file) throws IOException {
    mFile = file;
    mText = Files.readCharsFromFile(file,Strings.UTF8);
    Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(mText,0,mText.length);
    String token;
    while ((token = tokenizer.nextToken()) != null)
        mTokenCounter.increment(token.toLowerCase());
    mLength = length(mTokenCounter);
}

Note that a utility method in com.aliasi.util.Files is used to read the data. Length is computed using the length(ObjectToCounterMap) method. Tokenizers are extracted from the constant tokenizer factory.

The tokenizer factory is generated and assigned to a constant in the usual way:

static final TokenizerFactory TOKENIZER_FACTORY = tokenizerFactory();

static TokenizerFactory tokenizerFactory() {
    TokenizerFactory factory = IndoEuropeanTokenizerFactory.INSTANCE;
    // factory = new LowerCaseTokenizerFactory(factory);
    // factory = new EnglishStopTokenizerFactory(factory);
    // factory = new PorterStemmerTokenizerFactory(factory);
    return factory;
}

The commented-out lines show filters that may be applied in sequence: conversion to lowercase, removal of stop words, and finally stemming. None of these operations improved performance in the case at hand, so we left them commented out.

Cosine Inter-document Distance

Our methods for computing lengths and cosine implicitly apply the standard "normalization" to term frequencies, namely taking their square root.

static double length(ObjectToCounterMap<String> otc) {
    double sum = 0.0;
    for (Counter counter : otc.values()) {
        double count = counter.doubleValue();
        sum += count;  // tf = sqrt(count), so tf * tf == count
    }
    return Math.sqrt(sum);
}

This is just the standard vector length (Euclidean norm) where the vector's values are the square roots of the counts; we have implicitly taken term frequencies to be the square roots of observed frequencies and unrolled the computation into the length method.
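
For example, a document with token counts {bank:4, loan:9} has length sqrt(4 + 9) = sqrt(13), roughly 3.61, which is exactly sqrt(2² + 3²) for the term frequencies 2 = sqrt(4) and 3 = sqrt(9).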

Computing cosines between two document vectors is also straightforward, as it's the usual vector product divided by the vectors' lengths:

double cosine(Document thatDoc) {
    return product(thatDoc) / (mLength * thatDoc.mLength);
}

Vector products are computed in the usual way:

double product(Document thatDoc) {
    double sum = 0.0;
    for (String token : mTokenCounter.keySet()) {
        int count = thatDoc.mTokenCounter.getCount(token);
        if (count == 0) continue;
        sum += Math.sqrt(count * mTokenCounter.getCount(token)); // tf = sqrt(count)
    }
    return sum;
}

As with length, we have converted counts into term frequencies by taking square roots.

Cosines measure the similarity of vector directions. Cosine is a proximity measure in that a higher cosine value between two vectors indicates that they point more nearly in the same direction. Thus, to define a distance, we subtract the cosine between document vectors from one:

static final Distance<Document> COSINE_DISTANCE
    = new Distance<Document>() {
        public double distance(Document doc1, Document doc2) {
            return 1.0 - doc1.cosine(doc2);
        }
    };

Performing Clustering

Now that we have a way to construct documents and a distance measure, clustering itself is trivial. Suppose that docSet is a set of documents of type Set<Document>. The complete code for complete-link clustering is:

HierarchicalClusterer<Document> clClusterer
    = new CompleteLinkClusterer<Document>(COSINE_DISTANCE);
Dendrogram<Document> completeLinkDendrogram
    = clClusterer.hierarchicalCluster(docSet);

Reading Reference Document Collection

The rest of the code is just housekeeping: I/O and scoring. To begin, we assume the documents are arranged by true cluster in a directory hierarchy. A specified top-level directory contains a single subdirectory for each cluster, and that subdirectory contains a set of text files corresponding to the members of that cluster. All of this is read into a reference clustering as a set of sets of documents as follows:

public static void main(String[] args) throws Exception {
    File dir = new File(args[0]);

    Set<Set<Document>> referencePartition
        = new HashSet<Set<Document>>();

    for (File catDir : dir.listFiles()) {
        Set<Document> docsForCat = new HashSet<Document>();
        referencePartition.add(docsForCat);
        for (File file : catDir.listFiles()) {
            Document doc = new Document(file);
            docsForCat.add(doc);
        }
    }

From the reference partition, we can recover the complete set of docs to cluster, removing the truth from our input (and thus not cheating):

Set<Document> docSet = new HashSet<Document>();
for (Set<Document> cluster : referencePartition)
    docSet.addAll(cluster);

Clustering, Again

The actual clustering is performed as indicated earlier, repeated here for continuity:

HierarchicalClusterer<Document> clClusterer
    = new CompleteLinkClusterer<Document>(COSINE_DISTANCE);
Dendrogram<Document> completeLinkDendrogram
    = clClusterer.hierarchicalCluster(docSet);

HierarchicalClusterer<Document> slClusterer
    = new SingleLinkClusterer<Document>(COSINE_DISTANCE);
Dendrogram<Document> singleLinkDendrogram
    = slClusterer.hierarchicalCluster(docSet);

Scoring

The rest of the code is just scoring. We compute the precision/recall-based pairwise relational evaluation of the reference partition (aka the truth, or gold standard) versus complete link clustering results, reference versus single link, and finally, single link versus complete link. We do this by splitting the dendrograms into all possible numbers of clusters and seeing what we get. The loop to do all the evaluations is as follows:

for (int k = 1; k <= docSet.size(); ++k) {
    Set<Set<Document>> clResponsePartition
        = completeLinkDendrogram.partitionK(k);
    Set<Set<Document>> slResponsePartition
        = singleLinkDendrogram.partitionK(k);

    ClusterScore<Document> scoreCL
        = new ClusterScore<Document>(referencePartition,
                                     clResponsePartition);
    PrecisionRecallEvaluation clPrEval = scoreCL.equivalenceEvaluation();

    ClusterScore<Document> scoreSL
        = new ClusterScore<Document>(referencePartition,
                                     slResponsePartition);
    PrecisionRecallEvaluation slPrEval = scoreSL.equivalenceEvaluation();

    ClusterScore<Document> scoreX
        = new ClusterScore<Document>(clResponsePartition,
                                     slResponsePartition);
    PrecisionRecallEvaluation xPrEval = scoreX.equivalenceEvaluation();
    ...

In each case, we take the dendrogram produced by hierarchical clustering and split it into the top K clusters. We then create the three evaluations. The rest is I/O.

Results for John Smith

Here are the results for the John Smith case, which can be generated by executing the Ant target john-smith:

C:\carp\mycvs\lingpipe\demos\tutorial\cluster>ant john-smith
Buildfile: build.xml

compile:
    [javac] Compiling 1 source file to C:\carp\mycvs\lingpipe\demos\tutorial\cluster\b

john-smith:
     [java] Category from file=..\..\data\johnSmith\0
     [java] Category from file=..\..\data\johnSmith\1
     [java] Category from file=..\..\data\johnSmith\10
     ...
     [java]
     [java]  --------------------------------------------------------
     [java] |  K  |  Complete      |  Single        |  Cross         |
     [java] |     |  P    R    F   |  P    R    F   |  P    R    F   |
     [java]  --------------------------------------------------------
     [java] |   1 | 0.23 1.00 0.38 | 0.23 1.00 0.38 | 1.00 1.00 1.00 |
     [java] |   2 | 0.45 0.96 0.61 | 0.23 1.00 0.38 | 0.50 0.99 0.66 |
     [java] |   3 | 0.45 0.96 0.61 | 0.24 1.00 0.38 | 0.50 0.99 0.66 |
     [java] |   4 | 0.54 0.96 0.69 | 0.24 1.00 0.39 | 0.42 0.99 0.59 |
     [java] |   5 | 0.44 0.55 0.49 | 0.24 1.00 0.39 | 0.30 0.97 0.45 |
     [java] |   6 | 0.44 0.55 0.49 | 0.24 0.99 0.39 | 0.30 0.97 0.45 |
     [java] |   7 | 0.42 0.48 0.45 | 0.24 0.99 0.39 | 0.27 0.95 0.42 |
     [java] |   8 | 0.42 0.47 0.44 | 0.25 0.98 0.39 | 0.27 0.95 0.42 |
     [java] |   9 | 0.40 0.43 0.42 | 0.25 0.98 0.40 | 0.25 0.94 0.40 |
     [java] |  10 | 0.40 0.43 0.42 | 0.25 0.98 0.40 | 0.25 0.92 0.40 |
     [java] |  15 | 0.55 0.32 0.40 | 0.27 0.98 0.42 | 0.14 0.92 0.25 |
     [java] |  20 | 0.65 0.29 0.41 | 0.28 0.95 0.43 | 0.12 0.91 0.21 |
     [java] |  25 | 0.80 0.28 0.42 | 0.29 0.91 0.44 | 0.10 0.92 0.18 |
     [java] |  30 | 0.82 0.23 0.36 | 0.29 0.84 0.43 | 0.09 0.89 0.16 |
     [java] |  35 | 0.83 0.23 0.36 | 0.31 0.83 0.45 | 0.09 0.88 0.16 |
     [java] |  40 | 0.82 0.18 0.30 | 0.31 0.79 0.45 | 0.08 0.91 0.15 |
     [java] |  45 | 0.83 0.17 0.28 | 0.58 0.79 0.67 | 0.14 0.93 0.24 |
     [java] |  50 | 0.85 0.17 0.28 | 0.60 0.77 0.67 | 0.14 0.93 0.25 |
     [java] |  60 | 0.87 0.15 0.25 | 0.77 0.68 0.72 | 0.18 0.95 0.30 |
     [java] |  61 | 0.87 0.15 0.25 | 0.77 0.68 0.72 | 0.18 0.95 0.30 |
     [java] |  62 | 0.88 0.15 0.25 | 0.77 0.68 0.72 | 0.18 0.95 0.30 |
     [java] |  63 | 0.85 0.12 0.21 | 0.83 0.68 0.75 | 0.17 0.94 0.28 |
     [java] |  64 | 0.85 0.12 0.21 | 0.86 0.68 0.76 | 0.16 0.89 0.26 |
     [java] |  65 | 0.85 0.11 0.20 | 0.87 0.68 0.76 | 0.15 0.88 0.26 |
     [java] |  66 | 0.85 0.11 0.20 | 0.87 0.68 0.76 | 0.15 0.89 0.26 |
     [java] |  67 | 0.85 0.11 0.20 | 0.93 0.68 0.79 | 0.15 0.85 0.26 |
     [java] |  68 | 0.85 0.11 0.20 | 0.93 0.66 0.77 | 0.16 0.84 0.26 |
     [java] |  69 | 0.85 0.11 0.20 | 0.93 0.66 0.77 | 0.15 0.84 0.26 |
     [java] |  70 | 0.86 0.11 0.20 | 0.93 0.62 0.74 | 0.16 0.83 0.27 |
     [java] |  71 | 0.86 0.10 0.18 | 0.92 0.52 0.67 | 0.17 0.82 0.29 |
     [java] |  72 | 0.85 0.10 0.18 | 0.92 0.51 0.65 | 0.17 0.80 0.28 |
     [java] |  73 | 0.85 0.10 0.17 | 0.91 0.42 0.58 | 0.20 0.80 0.32 |
     [java] |  80 | 0.92 0.09 0.17 | 0.94 0.40 0.56 | 0.21 0.87 0.33 |
     [java] |  90 | 0.98 0.09 0.16 | 0.97 0.30 0.46 | 0.27 0.94 0.42 |
     [java] | 100 | 0.98 0.07 0.14 | 0.96 0.19 0.32 | 0.36 0.95 0.52 |
     [java] | 125 | 1.00 0.06 0.12 | 1.00 0.07 0.13 | 0.85 0.96 0.90 |
     [java] | 150 | 1.00 0.04 0.08 | 1.00 0.04 0.08 | 0.93 0.97 0.95 |
     [java] | 175 | 1.00 0.03 0.06 | 1.00 0.03 0.06 | 1.00 1.00 1.00 |
     [java] | 197 | 1.00 0.02 0.04 | 1.00 0.02 0.04 | 1.00 1.00 1.00 |
     [java]  --------------------------------------------------------

BUILD SUCCESSFUL
Total time: 11 seconds

We have eliminated some of the less interesting lines. First, consider the case of K=35, which is the number of reference clusters. This is not the best scoring K for either the single- or complete-link clusterer. For complete-link clustering, the maximum F measure is 0.69, occurring at K=4; this result is somewhat of an anomaly, though, as the precision values are not monotonically increasing through this point (which is common for non-interpolated results such as these). Single-link clustering, on the other hand, has a maximum F measure of 0.79, occurring at K=67, roughly double the number of reference clusters. Further note that a precision of 0.25 is the precision expected from random clustering.

The complete-link and single-link clusterers present a range of possible operating points. For instance, complete-link is better for precision at high recall. Consider the best precision above 0.95 recall -- it's 0.54 for complete link and only 0.28 for single link. On the other hand, the best recall for 0.95 precision is reversed, with single-link providing a 0.30 recall at 0.95 precision, with complete link only having a 0.09 recall at 0.95 precision. Single link clearly dominates in total F measure as well as in balanced precision/recall (in the 0.70 range for single link versus the 0.40 range for complete link).

Note that the curves will also look different under different term frequency weightings, if stop lists or other normalizations are used, and so on; the results are actually very sensitive to such choices here.

Results for Four Newsgroups

For completeness, here's the same table for the four newsgroup example:

C:\carp\mycvs\lingpipe\demos\tutorial\cluster>ant four-news
...
 --------------------------------------------------------
|  K  |  Complete      |  Single        |  Cross         |
|     |  P    R    F   |  P    R    F   |  P    R    F   |
 --------------------------------------------------------
|   1 | 0.25 1.00 0.40 | 0.25 1.00 0.40 | 1.00 1.00 1.00 |
|   2 | 0.31 0.84 0.45 | 0.25 0.99 0.40 | 0.68 0.99 0.81 |
|   3 | 0.32 0.67 0.43 | 0.25 0.98 0.40 | 0.52 0.98 0.68 |
|   4 | 0.32 0.64 0.43 | 0.25 0.97 0.40 | 0.51 0.98 0.67 |
|   5 | 0.32 0.64 0.43 | 0.25 0.96 0.40 | 0.51 0.98 0.67 |
|   6 | 0.34 0.47 0.40 | 0.25 0.95 0.40 | 0.35 0.98 0.52 |
|   7 | 0.34 0.44 0.38 | 0.25 0.94 0.40 | 0.35 0.98 0.51 |
|   8 | 0.34 0.44 0.38 | 0.25 0.93 0.40 | 0.35 0.99 0.52 |
|   9 | 0.33 0.42 0.37 | 0.25 0.92 0.40 | 0.35 0.99 0.52 |
|  10 | 0.34 0.31 0.33 | 0.25 0.91 0.39 | 0.25 0.98 0.40 |
|  15 | 0.33 0.27 0.30 | 0.25 0.85 0.39 | 0.24 0.98 0.39 |
|  20 | 0.36 0.26 0.31 | 0.26 0.80 0.39 | 0.23 0.97 0.37 |
|  30 | 0.38 0.19 0.25 | 0.29 0.69 0.41 | 0.20 0.96 0.33 |
|  40 | 0.45 0.12 0.19 | 0.31 0.63 0.42 | 0.13 0.96 0.22 |
|  50 | 0.51 0.10 0.17 | 0.32 0.55 0.40 | 0.11 0.94 0.19 |
|  75 | 0.64 0.07 0.13 | 0.35 0.33 0.34 | 0.11 0.88 0.19 |
| 100 | 0.82 0.05 0.10 | 0.38 0.21 0.27 | 0.11 0.91 0.20 |
| 125 | 0.93 0.04 0.08 | 0.47 0.10 0.16 | 0.19 0.88 0.31 |
| 150 | 0.98 0.03 0.06 | 0.86 0.03 0.07 | 0.79 0.97 0.87 |
| 178 | 1.00 0.02 0.04 | 1.00 0.02 0.04 | 1.00 1.00 1.00 |
 --------------------------------------------------------

Here the true number of clusters is 4, and again, this is not the optimal performance point for either clusterer. Neither clusterer far exceeds the baseline result of just throwing everything into a single cluster (which, as indicated in the first line above, produces scores of P=0.25, R=1.00, F=0.40). By the time either approach has a precision that's much better than chance (0.25), recall is fairly poor. This is perhaps not surprising given the closeness of the topics in this document collection.

Clustering for Word Sense Bootstrapping

This idea of clustering documents referring to John Smith is implicitly doing a kind of unsupervised word sense disambiguation (WSD). For instance, rather than documents containing John Smith, one might choose documents containing other ambiguous words or phrases, either real or artificially generated.

Grouping together contexts for words in this way is often known as bootstrapping in the computational linguistics literature (not related to the statistical technique of bootstrap estimation). In the bootstrapping case, often a few supervised examples are provided as a starting point.

Latent Dirichlet Allocation

Latent Dirichlet allocation (LDA) differs from other document (or general multinomial) clustering techniques in that it assigns multiple topics to each document. For instance, an article about the steroid scandal in American baseball is naturally modeled as a mixture of topics involving baseball, the United States congress, scandals and illegal drug use. Although such topics may be correlated, covariance is not modeled in the basic LDA model.

The LDA model consists of a fixed number of topics, each of which is modeled as a distribution over words. A document under LDA is modeled as a distribution over topics. There is a Dirichlet prior on both the topic distributions over words and the document distributions over topics.

The implementation of LDA in LingPipe performs two basic functions. First, it estimates the parameters of an LDA model from an unlabeled corpus of documents given the two Dirichlet priors and a fixed number of topics. These estimates are performed by Gibbs sampling, which provides samples from the posterior distribution over topic distributions given the data. These samples may be used individually to construct an LDA model, or they may be used collectively to perform Bayesian reasoning over the uncertainty in the model estimation. The implementation of parameter estimation is handled by a single static method which returns a final model estimate and allows incremental sampling of model estimates and parameters as it goes.

The LDA class itself represents an estimated LDA model. Such a model may be used to infer the topic distribution for new documents given the document-topic Dirichlet prior. Thus it essentially functions as a multi-topic classifier over documents.

A given LDA model may also be used to do reasoning at the word level, estimating the likelihood of a word given a previous bag of words. This latter function may be used to model word associations. It may also be used to model collaborative filtering, where documents represent users and the words represent discrete ratings.

Synthetic LDA Example

We begin with an example drawn from Steyvers and Griffiths' paper on topic models (2007; see the references below). Synthetic examples play an important role in statistics: they test whether a model can recover the known parameters of the source that generated the data. An artificial data set is generated using known parameters, and the algorithm attempts to estimate those parameters accurately. For LDA models, the parameters of interest here are the topic distributions over words, as we treat the Dirichlet priors as fixed.

This particular example was generated from two topics, whose distribution over words is given in the following table.

Word Probabilities in Topics

           river   stream   bank   money   loan
  Topic 0    0       0      1/3     1/3    1/3
  Topic 1   1/3     1/3     1/3      0      0

For instance, the probability of generating the word "stream" in topic 1 is 1/3, but it is 0 in topic 0. Both topics assign probability 1/3 to the word "bank". Topic 0 generates words having to do with banking, whereas topic 1 generates words about rivers.
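
Under the LDA generative model, a document's words are drawn from a mixture of its topics' word distributions. For example, a document that draws half its tokens from each topic generates "bank" with probability 0.5 * 1/3 + 0.5 * 1/3 = 1/3, but "stream" with probability only 0.5 * 0 + 0.5 * 1/3 = 1/6.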

Running the Example

The example can be run from Ant:

> cd $LINGPIPE/demos/tutorial/cluster
> ant lda-fixed

Epoch=  0   elapsed time=:00
Epoch=  1   elapsed time=:00
Epoch=  2   elapsed time=:00
Epoch=  3   elapsed time=:00
Epoch=  4   elapsed time=:00
Epoch=  5   elapsed time=:00
Epoch=  6   elapsed time=:00
Epoch=  7   elapsed time=:00
Epoch=  8   elapsed time=:00
Epoch=  9   elapsed time=:00
Epoch= 10   elapsed time=:00
Epoch= 11   elapsed time=:00
Epoch= 12   elapsed time=:00
Epoch= 13   elapsed time=:00
Epoch= 14   elapsed time=:00
Epoch= 15   elapsed time=:00

Final Report
epoch=15
numDocs=16
numTokens=256
numWords=5
numTopics=2

TOPIC 0  (total count=101)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
     1           stream       42   0.416       4.3
     2             bank       32   0.317      -0.7
     0            river       27   0.267       3.3

TOPIC 1  (total count=155)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
     2             bank       63   0.406       0.7
     3            money       48   0.310       3.0
     4             loan       44   0.284       2.9

DOC 0
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

bank(1) loan(1) money(1) loan(1) loan(1) money(1) bank(1) loan(1) bank(1) loan(1) loan(1) money(1) bank(1) money(1) money(1) money(1)

DOC 1
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

loan(1) money(1) money(1) loan(1) bank(1) bank(1) money(1) bank(1) money(1) loan(1) money(1) money(1) bank(1) loan(1) bank(1) money(1)

DOC 2
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

loan(1) bank(1) money(1) money(1) bank(1) bank(1) loan(1) bank(1) money(1) bank(1) bank(1) loan(1) bank(1) money(1) money(1) loan(1)

DOC 3
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

loan(1) bank(1) money(1) bank(1) bank(1) money(1) bank(1) money(1) bank(1) money(1) bank(1) loan(1) money(1) loan(1) bank(1) money(1)

DOC 4
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

bank(1) money(1) bank(1) bank(1) loan(1) bank(1) loan(1) bank(1) loan(1) loan(1) loan(1) bank(1) loan(1) money(1) loan(1) bank(1)

DOC 5
TOPIC    COUNT    PROB
----------------------
    1       16   0.994

bank(1) bank(1) loan(1) money(1) bank(1) bank(1) bank(1) money(1) loan(1) loan(1) money(1) bank(1) loan(1) bank(1) bank(1) bank(1)

DOC 6
TOPIC    COUNT    PROB
----------------------
    1       15   0.932
    0        1   0.068

money(1) loan(1) money(1) bank(1) loan(1) money(1) river(0) money(1) bank(1) bank(1) loan(1) loan(1) bank(1) loan(1) money(1) money(1)

DOC 7
TOPIC    COUNT    PROB
----------------------
    1       13   0.809
    0        3   0.191

money(1) bank(1) stream(0) stream(0) bank(1) money(1) bank(1) loan(1) bank(1) loan(1) bank(1) money(1) loan(1) river(0) money(1) bank(1)

DOC 8
TOPIC    COUNT    PROB
----------------------
    1        9   0.562
    0        7   0.438

money(1) bank(0) money(1) loan(1) stream(0) bank(0) loan(1) bank(0) bank(1) stream(0) money(1) bank(1) bank(1) stream(0) river(0) money(1)

DOC 9
TOPIC    COUNT    PROB
----------------------
    1       10   0.623
    0        6   0.377

stream(0) bank(1) bank(1) loan(1) bank(0) stream(0) loan(1) bank(1) river(0) loan(1) money(1) stream(0) bank(1) river(0) bank(1) loan(1)

DOC 10
TOPIC    COUNT    PROB
----------------------
    1       10   0.623
    0        6   0.377

money(1) bank(1) bank(1) bank(1) loan(1) stream(0) river(0) river(0) bank(1) bank(1) bank(1) money(1) stream(0) stream(0) money(1) bank(0)

DOC 11
TOPIC    COUNT    PROB
----------------------
    0       15   0.932
    1        1   0.068

stream(0) stream(0) bank(0) stream(0) bank(0) stream(0) bank(0) bank(0) stream(0) money(1) river(0) river(0) river(0) bank(0) bank(0) stream(0)

DOC 12
TOPIC    COUNT    PROB
----------------------
    0       15   0.932
    1        1   0.068

bank(0) stream(0) bank(0) loan(1) river(0) stream(0) bank(0) bank(0) river(0) stream(0) bank(0) river(0) river(0) river(0) river(0) bank(0)

DOC 13
TOPIC    COUNT    PROB
----------------------
    0       16   0.994

bank(0) bank(0) stream(0) bank(0) stream(0) bank(0) stream(0) stream(0) stream(0) stream(0) bank(0) stream(0) river(0) bank(0) river(0) stream(0)

DOC 14
TOPIC    COUNT    PROB
----------------------
    0       16   0.994

river(0) stream(0) stream(0) stream(0) river(0) stream(0) stream(0) bank(0) bank(0) bank(0) bank(0) river(0) river(0) stream(0) bank(0) stream(0)

DOC 15
TOPIC    COUNT    PROB
----------------------
    0       16   0.994

stream(0) river(0) river(0) bank(0) stream(0) stream(0) stream(0) stream(0) bank(0) river(0) river(0) stream(0) bank(0) river(0) stream(0) bank(0)


Stepping through the output, we first see a report of each epoch and the elapsed time, the latter of which is negligible for this example. An epoch corresponds to a single Gibbs sample, which is itself an assignment of topics to each word in the corpus.

After the epoch-by-epoch report, we see a final report. This provides information about the final sample, in this case the one drawn at epoch 15 (the 16th epoch counting from 0). It reports the number of documents, tokens, words and topics.

Next up is a summary of the topic models, which are multinomials over words. Here we have two topic models, 0 and 1. At the top, there is a report of how many tokens were assigned to each topic in the corpus; topic 0 has 101 tokens and topic 1 has 155. Balance between the topics is influenced by the document-topic prior as well as the empirical distributions of words in documents. For each symbol with a non-zero count in a topic, we provide the symbol identifier (mediated by a symbol table in the application), the word for the symbol, and the number of times the word was assigned to the specified topic. For instance, the word "stream" has symbol 1 and has 42 instances assigned to topic 0. The probability of the word is also reported; it results from a combination of the topic-word prior and the sample counts. For instance, the probability of generating the word "loan" from topic 1 is 0.284. We also report a binomial Z score, which measures how far the word's rate within the topic deviates from independence. Note that the word "bank" is the least dependent on topic, as is to be expected.

The question now arises as to whether our sample has recreated our underlying topic distributions. First off, they have the right words with non-zero probability, but the overall balance is not exactly uniform. Although they were generated from a uniform distribution by Steyvers and Griffiths, the empirical counts have many more instances of "stream" than "river" (42 instances of the former versus only 27 of the latter).

Finally, each document is reported, first at the level of its topic distribution and then at the level of the topic assigned to each of its tokens. For instance, in document 11, fifteen words were generated by topic 0 and one word by topic 1. The report at the word level shows the tokens of the document in order, along with the topic assigned to each token. Here, the token "money" is assigned to topic 1, with all other tokens being assigned to topic 0. In document 1, all words were generated by topic 1, and in document 15, all words were generated by topic 0.

In document 8, we see a case where different tokens of the same word are assigned to different topics. In particular, three instances of "bank" are assigned to topic 0 and three to topic 1.

Code Walk Through

The code for the fixed example is in two files, namely src/LdaFixed.java, which contains the main, and the helper class src/LdaReportingHandler.java which does the reporting.

The code for the LdaFixed class begins by defining a bunch of static constants for the input corpus.

static String[] WORDS = new String[] {
    "river",  "stream",  "bank",  "money",  "loan"
};

static SymbolTable SYMBOL_TABLE = new MapSymbolTable();
static {
    for (String word : WORDS)
        SYMBOL_TABLE.getOrAddSymbol(word);
}

static final int[][] DOC_WORDS = new int[][] {
    { 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4 },
    { 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4 },
    { 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4 },
    { 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4 },
    { 2, 2, 2, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4 },
    { 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4 },
    { 0, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4 },
    { 0, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4 },
    { 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4 },
    { 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4 },
    { 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4 },
    { 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3 },
    { 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 4 },
    { 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2 },
    { 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 },
    { 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2 }
};

static {
    for (int[] words : DOC_WORDS)
        Arrays.permute(words,RANDOM);
}

This defines an array of five words which are used to populate a symbol table. The symbol table and words are only used for output. The real input is the DOC_WORDS array which represents 16 documents of 16 tokens each. For instance, the first document contains four instances of token 2 ("bank"), six instances of token 3 ("money") and six instances of token 4 ("loan"). Note that we then permute the arrays using our randomizer to more naturally match real document input; nothing hinges on this randomization.

Next, the class includes static constants for the parameters of LDA and search.

static short NUM_TOPICS = 2;
static double DOC_TOPIC_PRIOR = 0.1;
static double TOPIC_WORD_PRIOR = 0.01;

static int BURNIN_EPOCHS = 0;
static int SAMPLE_LAG = 1;
static int NUM_SAMPLES = 16;

static Random RANDOM = new Random(43);

Finally, the main() method contains a call to the static LDA Gibbs sampler method given all of these parameters along with a newly constructed sample handler.

public static void main(String[] args) throws Exception {

    LdaReportingHandler handler
        = new LdaReportingHandler(SYMBOL_TABLE);

    LatentDirichletAllocation.GibbsSample sample
        = LatentDirichletAllocation
        .gibbsSampler(DOC_WORDS,
                      NUM_TOPICS,
                      DOC_TOPIC_PRIOR,
                      TOPIC_WORD_PRIOR,
                      BURNIN_EPOCHS,
                      SAMPLE_LAG,
                      NUM_SAMPLES,
                      RANDOM,
                      handler);

    handler.fullReport(sample,5,2,true);
}

The handler, whose code we turn to shortly, is constructed with a reference to the symbol table; this will be used for reporting. The static gibbsSampler() method is called with a slew of arguments: the data, the number of topics, the two priors, the burnin and lag for the sampler, the total number of samples to handle, a randomizer, and finally, the handler which receives samples as callbacks.

LDA Reporting Handler Code

The samples themselves are instances of the class com.aliasi.cluster.LatentDirichletAllocation.GibbsSample. The samples are passed to the specified handler by the static Gibbs sampler method. The handler implements com.aliasi.corpus.ObjectHandler<GibbsSample>, which specifies a single method, handle(GibbsSample).

The handler generating the reports in the demos in this tutorial is provided in src/LdaReportingHandler.java. After the relevant imports, the handler is declared by:

class LdaReportingHandler implements ObjectHandler<LatentDirichletAllocation.GibbsSample>

Instances are constructed with a reference to the symbol table that will be used to convert multinomial outcomes, represented as integers, to words.

LdaReportingHandler(SymbolTable symbolTable) {
    mSymbolTable = symbolTable;
}

We have elided the time calculations, as they have nothing to do with the LDA interface per se. Again, not worrying about time, the handle() method is defined by:

public void handle(LatentDirichletAllocation.GibbsSample sample) {
    System.out.printf("Epoch=%3d\n",sample.epoch());
}

This handler simply prints out the epoch number as we go along. More useful handlers might store samples in files or memory to be used later for collective reasoning, or they might store estimates which are themselves combined across samples.
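
For instance, here is a sketch of a handler of our own invention that simply retains every sample for later collective reasoning (it uses java.util.List and java.util.ArrayList):

class SampleCollector
    implements ObjectHandler<LatentDirichletAllocation.GibbsSample> {

    final List<LatentDirichletAllocation.GibbsSample> mSamples
        = new ArrayList<LatentDirichletAllocation.GibbsSample>();

    // called once per epoch by the Gibbs sampler
    public void handle(LatentDirichletAllocation.GibbsSample sample) {
        mSamples.add(sample);
    }
}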

Most of the work for the example is done in the handler's fullReport() method, which is not specified by the ObjectHandler interface. Recall that the LatentDirichletAllocation.gibbsSampler() method returns a sample. This sample is passed to the handler's final report method to report on the sample. For instance, the final report could be generated for each sample to inspect how topic assignments evolve; this can be done by calling fullReport() for each sample sent to the handler.

The fullReport() method begins by reporting the top-level counts for the sample:

void fullReport(LatentDirichletAllocation.GibbsSample sample,
                int maxWordsPerTopic,
                int maxTopicsPerDoc,
                boolean reportTokens) {

    int numTopics = sample.numTopics();
    int numWords = sample.numWords();
    int numDocs = sample.numDocuments();
    int numTokens = sample.numTokens();
    ...

We leave out the print statements in this tutorial presentation, focusing on the structure of the sample. The two numerical arguments to fullReport() are used to limit the output for the topic and document reports, and the final boolean argument determines whether the topic assignment for each token in the corpus is reported.

The next thing the full report method does is report on the topic models.

    for (int topic = 0; topic < numTopics; ++topic) {
        int topicCount = sample.topicCount(topic);
        ObjectToCounterMap<Integer> counter = new ObjectToCounterMap<Integer>();
        for (int word = 0; word < numWords; ++word)
            counter.set(new Integer(word),sample.topicWordCount(topic,word));
        List<Integer> topWords = counter.keysOrderedByCountList();
    }

The loop iterates over the topics, first collecting the total count for the topic into the variable topicCount. Next, an object-to-counter map (in com.aliasi.util) is used to map the word identifiers to their counts in the current topic. The word identifiers are then ordered by count, from high to low, and collected into a list of integers. Then we consider each rank from most frequent to least frequent.

        for (int rank = 0; rank < maxWordsPerTopic && rank < topWords.size(); ++rank) {
            int wordId = topWords.get(rank);
            String word = mSymbolTable.idToSymbol(wordId);
            int wordCount = sample.wordCount(wordId);
            int topicWordCount = sample.topicWordCount(topic,wordId);
            double topicWordProb = sample.topicWordProb(topic,wordId);
            double z = binomialZ(topicWordCount,
                                 topicCount,
                                 wordCount,
                                 numTokens);
        }

We retrieve the word identifier from the top words list and assign it to wordId, and then retrieve the underlying string for the word from the class's symbol table. Next, we retrieve the total count for the word in the sample, as well as the count for the word in the topic. Next, the probability of the word in the topic is computed from the sample, which uses the topic-word prior in this calculation as well as the overall topic count.

Finally, we compute a binomial Z score of the independence of the word from the topic. This is computed using the counts of the word in the topic, the count of all words in the topic, the count of the word in the corpus and the count of all words in the corpus.
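
The binomialZ() helper is defined in the handler, not in the LingPipe API. A minimal sketch using the standard binomial normal approximation (the handler's actual formula may differ):

static double binomialZ(double wordCountInTopic, double topicCount,
                        double wordCount, double numTokens) {
    double p = wordCount / numTokens;              // corpus-wide rate of word
    double mean = topicCount * p;                  // expected count in topic
    double variance = topicCount * p * (1.0 - p);  // binomial variance
    return (wordCountInTopic - mean) / Math.sqrt(variance);
}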

Next, the full report method iterates over the documents in the corpus, finding the top ranked topics for each document.

    for (int doc = 0; doc < numDocs; ++doc) {
        int docCount = 0;
        for (int topic = 0; topic < numTopics; ++topic)
            docCount += sample.documentTopicCount(doc,topic);
        ObjectToCounterMap<Integer> counter = new ObjectToCounterMap<Integer>();
        for (int topic = 0; topic < numTopics; ++topic)
            counter.set(new Integer(topic),sample.documentTopicCount(doc,topic));
        List<Integer> topTopics = counter.keysOrderedByCountList();

We loop over the top-ranking topics in a document the same way as we looped over the top-ranking words in a topic, but here we calculate the count of the topic, the count of the topic in the document, and the estimate of the topic probability in the document; note that this latter calculation incorporates the document-topic prior.

        for (int rank = 0; rank < topTopics.size() && rank < maxTopicsPerDoc; ++rank) {
            int topic = topTopics.get(rank);
            int docTopicCount = sample.documentTopicCount(doc,topic);
            double docTopicPrior = sample.documentTopicPrior();
            double docTopicProb = (sample.documentTopicCount(doc,topic) + docTopicPrior)
                / (docCount + numTopics * docTopicPrior);

The final stage is the report of the topic assignment to each token in a document. This may be turned off using the reportTokens flag in the method call.

        if (!reportTokens) continue;
        int numDocTokens = sample.documentLength(doc);
        for (int tok = 0; tok < numDocTokens; ++tok) {
            int symbol = sample.word(doc,tok);
            String word = mSymbolTable.idToSymbol(symbol);
            short topic = sample.topicSample(doc,tok);
        }

Again, we elide the print statements to make the library calls clearer.

Corpus LDA Example: Wormbase Citations

Sampling for an LDA model over a corpus of natural language documents is almost as easy as running over synthetic multinomial data. The only difference is that the text needs to be parsed out of its source and then converted to symbol identifiers for the multinomial representation required by the LDA Gibbs sampler. Once the text is extracted, as a character sequence per document, there's a static convenience method in the LDA class itself to perform the conversion to integer symbols in a single step.
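
A sketch of that conversion step; we assume the convenience method is LatentDirichletAllocation.tokenizeDocuments(), so check the class documentation for the exact name and signature. Here articleTexts stands for the extracted per-document character sequences:

SymbolTable symbolTable = new MapSymbolTable();
int minTokenCount = 2;  // prune words occurring fewer than two times
int[][] docWords
    = LatentDirichletAllocation
      .tokenizeDocuments(articleTexts,       // CharSequence[]
                         TOKENIZER_FACTORY,  // any tokenizer factory
                         symbolTable,        // filled in as a side effect
                         minTokenCount);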

The Data

In this section, we show how to extract topics from a large corpus of documents about various aspects of the nematode worm (aka Caenorhabditis elegans, aka C. elegans). The Wormbase data repository distributes a curated bibliography that is regularly updated. You may download the gzipped bibliography file we used in this demo (you don't even need to unpack it); its name appears in the command line below.

The data file contains a (gzipped) sequence of references like the following one, with references separated by blank lines.

%0 Journal Article
%T Determination of life-span in Caenorhabditis elegans by four clock genes.
%A Lakowski, B
%A Hekimi, S
%D 1996
%V 272
%P 1010-1013
%J Science
%M WBPaper00002456
%X The nematode worm Caenorhabditis elegans is a model system for the study of the genetic basis of aging. Maternal-effect mutations in four genes--clk-1, clk-2, clk-3, and gro-1--interact genetically to determine both the duration of development and life-span. Analysis of the phenotypes of these mutants suggests the existence of a general physiological clock in the worm. Mutations in certain genes involved in dauer formation (an alternative larval stage induced by adverse conditions in which development is arrested) can also extend life-span, but the life extension of Clock mutants appears to be independent of these genes. The daf-2(e1370) clk-1(e2519) worms, which carry life-span-extending mutations from two different pathways, live nearly five times as long as wild-type worms.

The size of the Wormbase bibliography data set after pruning at a minimum token count of 2 is 26,888 documents, 52,437 words, and 3,862,644 total tokens. We used a topic-word prior of 0.01 and a document-topic prior of 0.1. We ran to 2000 samples, at which point the samples appeared to be reasonably stable in terms of conditional log likelihood given the topic and document multinomials.

Running the Demo

The demo may be run from Ant using target lda-worm. The gzipped data must be provided through the property wormbase.gz. Here's how we ran it:

> ant -Dwormbase.gz=e:\data\wormbase\dist\2007-12-01-wormbase-literature.endnote.gz lda-worm

The output is almost identical to that of the fixed example, but there are many more topics, many more words, and the documents are much longer. As an example, here's the output for the document whose citation was given above. As before, document output includes the document's topic distribution, followed by a listing of each token extracted by the tokenizer and the topic to which it was assigned.

DOC 2366
TOPIC    COUNT    PROB
----------------------
   11       29   0.378
   30       15   0.196
   42       14   0.183
   46       11   0.144
   20        1   0.014
   21        1   0.014
   45        1   0.014

determination(46) life-span(11) caenorhabditi(46) elegan(46) clock(21)
gene(11) nematode(11) worm(11) caenorhabditi(11) elegan(11) model(46)
system(46) study(46) genetic(42) basi(30) aging(11)
maternal-effect(30) mutation(30) clk-2(11) clk-3(11) genetically(30)
determine(11) duration(11) development(11) life-span(11) analysi(42)
phenotype(30) mutant(30) suggest(11) existence(46) general(46)
physiological(11) clock(11) worm(11) mutation(30) certain(30) gene(11)
involved(46) dauer(42) formation(42) alternative(42) larval(42)
stage(42) induced(42) adverse(42) condition(11) development(46)
arrested(42) extend(11) life-span(11) life(11) extension(11) clock(11)
mutant(30) appear(30) independent(30) gene(30) daf-2(42) e1370(42)
clk-1(11) e2519(11) worm(30) carry(46) mutation(30) different(20)
pathway(42) live(11) nearly(42) time(45) long(11) wild-type(30)
worm(11)

This document is modeled as being generated from a mixture of seven topics. 37.8% of the tokens were generated by topic 11, whereas the last three topics accounted for only one token each. There is substantial random variation from the sampling approach. This is intentional and an intrinsic property of Gibbs sampling. The reason sampling produces results like the one above is that it allows the variance of posterior statistics to be computed and used for Bayesian inference. The downside is that the topic distributions are not maximum likelihood estimates, and thus not our best estimates of the assignment of topics to a document. In this example, the token "clock" is assigned first to topic 21 and then topic 11. It would almost certainly be more likely for all the token instances of a given word to be generated from the same topic. With our prior of 0.1 for topics in documents, we expect peaky posterior distributions over topics.
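As a sanity check on the probability estimate described earlier, this document contains 72 tokens, so with 50 topics and a document-topic prior of 0.1, topic 11's estimated probability is (29 + 0.1) / (72 + 50 * 0.1) = 29.1 / 77, which rounds to the 0.378 reported in the table.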

The topical structure is much richer when LDA is run over a substantial real data set. Here's the dominant topic from the above document.

TOPIC 11  (total count=76046)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
 26488             life     2204   0.029      41.7
  9966             span     2150   0.028      43.9
 13546             stre     2056   0.027      42.7
  6545            aging     1916   0.025      41.6
   516           elegan     1547   0.020       2.4
 10576           mutant     1297   0.017       3.2
 27002         lifespan     1124   0.015      24.5
 22483             heat      866   0.011      23.7
 16877         response      847   0.011       9.4
 12775        oxidative      836   0.011      28.0
...
  5763        longevity      794   0.010      22.1
  9229           oxygen      757   0.010      26.5
  7932    mitochondrial      710   0.009      19.1
  9336        increased      698   0.009      14.1
 18944              age      668   0.009      22.1
  9899            shock      657   0.009      20.8
  7982       resistance      650   0.009      14.5
 21754         increase      634   0.008      11.9
  7633            clk-1      569   0.007      23.4
  6230             rate      551   0.007      10.0
  1938            age-1      433   0.006      13.5
 17131        extension      371   0.005      11.1
 21056           damage      350   0.005      10.2
 27155       long-lived      307   0.004      15.9
 16730        metabolic      302   0.004      12.2
 26006           ageing      247   0.003      14.0
  2988       superoxide      244   0.003      15.3
  1321               ro      234   0.003      13.5
 16080        life-span      220   0.003      13.7
 18997            mev-1      209   0.003      14.2
 26823            hif-1      205   0.003      14.0
   848            hsp70      200   0.003      13.8
 12054          dietary      194   0.003      11.5
 22639              sod      183   0.002      13.3
  8991          hypoxia      179   0.002      12.9
 18945           anoxia      168   0.002      12.7
 19807       senescence      162   0.002      12.0
  5026      age-related      158   0.002      12.3
 13166         reactive      158   0.002      11.3
 23352      antioxidant      156   0.002      12.2
 15571            mtdna      151   0.002      12.0
 10198        dismutase      135   0.002      11.4
 25739          radical      134   0.002      11.3
 18860  thermotolerance      133   0.002      11.3
  5474         paraquat      120   0.002      10.7
 13258          caloric      114   0.001      10.5

This topic is clearly related to longevity. The genes appearing in the topic, such as clk-1, are known to be aging-related.

We've only listed the ten most frequent terms and then any term in the top 500 most frequent terms for that topic with z-scores greater than 10 (the program actually prints out all 500 of the most frequent terms per topic).

The next topic is about inherited mutations and their effects.

TOPIC 30  (total count=166558)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
 15117         mutation    11964   0.072      74.1
 10576           mutant     8769   0.053      39.0
 18989           allele     7912   0.047      74.6
 21785        phenotype     7696   0.046      55.6
 16249             gene     5346   0.032      11.4
  8150           defect     3114   0.019      26.7
 20706       suppressor     2583   0.015      38.4
  4999           animal     2387   0.014      13.7
  8313             null     1759   0.011      30.8
 26652            cause     1499   0.009      18.3
...
 11491        wild-type     1317   0.008      16.5
   680         isolated     1299   0.008      14.7
 15960           suppre     1277   0.008      25.1
 14850         dominant     1222   0.007      28.5
 10544        lethality     1145   0.007      22.4
  8234           lethal     1131   0.007      19.1
  3835           double     1053   0.006      19.2
 19226      temperature      974   0.006      15.5
 25127        recessive      961   0.006      26.0
 17768        homozygou      905   0.005      22.3
 21647              cla      871   0.005      10.7
 22116           rescue      869   0.005      12.4
 17444        defective      795   0.005      11.8
 10263           affect      773   0.005      10.3
  2396      suppression      757   0.005      21.5
 16981          exhibit      658   0.004      11.1
  5635  loss-of-function      620   0.004      14.6
  7597           viable      597   0.004      18.7
 20959             locu      554   0.003      12.5
 21884         abnormal      541   0.003      10.4
 20834       suppressed      535   0.003      14.5
  4665          display      532   0.003      10.5
  3940        sensitive      529   0.003      12.4
 20064          rescued      518   0.003      12.9
 20318           severe      515   0.003      14.4
 19470             fail      510   0.003      10.0
 27607           strong      485   0.003      10.2
  7463             weak      479   0.003      14.1
 26052  temperature-sensitive      473   0.003      17.2
 17043       complement      471   0.003      14.7
 18265          sterile      462   0.003      11.8
  1240               ts      440   0.003      17.5
 21186              egl      421   0.003      17.3
 15536       homozygote      399   0.002      15.8
 23504          partial      391   0.002      10.7
  2127         missense      362   0.002      15.1
 16225        penetrant      339   0.002      13.0
  2645        revertant      320   0.002      10.7
 15098           unc-93      311   0.002      15.8
 12188           lesion      266   0.002      11.1
 12506  maternal-effect      243   0.001      10.7
  5130         nonsense      237   0.001      12.0
 26793       penetrance      237   0.001      10.6
  8756            sup-9      236   0.001      14.7
  6077       intragenic      235   0.001      13.7
 23991       extragenic      227   0.001      13.6
  3594      heterozygou      215   0.001      10.0
 23025      hypomorphic      204   0.001      11.3
  2387            amber      191   0.001      12.5
   953           sup-10      190   0.001      12.6
 27062    semi-dominant      170   0.001      12.4
  2957      restrictive      163   0.001      10.1

The next topic has to do with larval worms and genetic control of development. The daf family of genes is highly represented in this topic.

TOPIC 42  (total count=45197)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
 12927            dauer     3749   0.082      57.5
   597        formation     1254   0.028      17.1
  5220         receptor     1172   0.026      11.0
 10576           mutant      817   0.018       2.2
 24793           daf-12      777   0.017      27.5
  4312           larvae      758   0.017      15.9
 14289            stage      704   0.015       7.5
 26805          nuclear      686   0.015      10.9
 15093            daf-c      668   0.015      25.5
...
 20294          hormone      667   0.015      24.0
  2198            larva      520   0.011      20.8
  7427        pheromone      507   0.011      21.0
 14990            daf-7      507   0.011      21.8
 16249             gene      461   0.010      -1.0
 14995            daf-9      305   0.007      17.3
 15001            daf-3      258   0.006      15.9
 14991            daf-4      250   0.005      11.1
 15003            daf-1      235   0.005      14.4
  3903            fax-1      225   0.005      14.1
 21443          steroid      189   0.004      12.7
 14992            daf-5      182   0.004      13.3
 24705           daf-21      176   0.004      13.1
 11543  dauer-constitutive      172   0.004      13.0
  8124           nhr-25      157   0.003      12.4
 24794           daf-11      155   0.003      11.1
   774            hsp90      132   0.003      10.1
 23507              nhr      127   0.003      11.1
 24796           daf-14      127   0.003      11.0
 15106            daf-d      124   0.003      11.0
 10372             nhrs      123   0.003      11.0
 14993            daf-8      123   0.003      11.0
 25298  dauer-defective      119   0.003      10.8
   918               nr      109   0.002      10.3
 24449         ecdysone      108   0.002      10.3

Finally, the last substantially represented topic inferred for the document in the sample we took is the following:

TOPIC 46  (total count=141190)
SYMBOL             WORD    COUNT   PROB          Z
--------------------------------------------------
   516           elegan     5953   0.042      18.2
 23217          genetic     2883   0.020      22.8
 26479           system     2660   0.019      27.8
  5428         nematode     2267   0.016      17.1
 10322            model     2214   0.016      26.6
  3467    caenorhabditi     2161   0.015      14.2
 22362         organism     2082   0.015      29.9
 22518        molecular     1897   0.013      21.1
 26901        mechanism     1874   0.013      19.6
 13542             stud     1756   0.012      22.1
...
 11886            proce     1565   0.011      15.4
 10250            study     1291   0.009      18.1
  4888    understanding      983   0.007      22.4
 22068          provide      977   0.007      15.9
 12684         cellular      930   0.007      14.8
 27589    developmental      860   0.006      11.1
 12749           recent      770   0.005      19.2
  4633       biological      656   0.005      14.6
 25751          biology      564   0.004      14.5
 25358          insight      548   0.004      15.9
 11333           review      544   0.004      22.4
 13151         research      471   0.003      13.3
 24528           simple      446   0.003      13.5
 23483        approache      399   0.003      11.5
 20181              key      386   0.003      10.1
 18148             tool      362   0.003      11.6
  1587         question      361   0.003      10.2
 26428     invertebrate      339   0.002      10.5
  9406            discu      327   0.002      14.6
  2563       underlying      311   0.002      10.1
 19230             year      310   0.002      10.6
 24613     experimental      299   0.002      10.3
   173    multicellular      282   0.002      11.2
 10916      fundamental      265   0.002      11.9
  2683        knowledge      245   0.002      11.5
 16253         powerful      237   0.002      11.8
   196        discovery      215   0.002      10.9
 13782        excellent      182   0.001      10.2
  2382            offer      180   0.001      10.7
  3564     manipulation      180   0.001      10.9
  5930       physiology      180   0.001      10.4
 13283          advance      162   0.001      10.6

This topic is interesting because of how generic it is. General words related to research at an abstract level are highly represented here.

Sizing and Scaling LDA Gibbs Sampling

Note that the demo took several hours to run (on a notebook computer with good performance as of December 2006). This is partly because we ran to 2000 epochs to mirror Steyvers and Griffiths' large corpus examples. We ran only 50 topics, as that's what Blei et al. found worked well for their nematode bibliography.

Run time is directly proportional to the number of topics times the number of tokens in the corpus; the numbers of distinct words and documents don't really matter. Although LDA is highly scalable, it is fairly slow per epoch. Thus scaling up an order of magnitude (to 40 million tokens, or about 250MB of ASCII text before tokenization and stoplisting) would require a day of computer time. Scaling beyond 50M tokens would be prohibitive without some kind of parallelization.
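To put rough numbers on this run: resampling a token's topic requires evaluating a conditional probability for each topic, so 50 topics times roughly 3.86 million tokens comes to about 193 million evaluations per epoch, and 2000 epochs multiply that to roughly 4 x 10^11 basic operations, which is why even this moderately sized corpus takes hours.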

Code Walkthrough

The code is in src/LdaWormbase.java. It is mostly like the fixed example code. It begins to differ when reading the corpus:

public static void main(String[] args) throws Exception {
    File corpusFile = new File(args[0]);
    ...
    CharSequence[] articleTexts = readCorpus(corpusFile);

The details may be found in the implementation of the method readCorpus() in the source. It basically reads in the titles and bodies of the citations (the contents of the %T and %X fields).
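Here is a minimal sketch of what such a reader must do, based on the record format shown above; the demo's actual readCorpus() implementation may differ in its details:

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;

// Sketch: pull the %T (title) and %X (abstract) fields out of the
// gzipped, blank-line-separated reference format shown above.
static CharSequence[] readCorpus(File file) throws IOException {
    List<CharSequence> texts = new ArrayList<CharSequence>();
    BufferedReader in
        = new BufferedReader(
              new InputStreamReader(
                  new GZIPInputStream(new FileInputStream(file)),
                  "ASCII"));
    StringBuilder text = new StringBuilder();
    String line;
    while ((line = in.readLine()) != null) {
        if (line.length() == 0) {            // blank line ends a reference
            if (text.length() > 0)
                texts.add(text.toString());
            text = new StringBuilder();
        } else if (line.startsWith("%T ") || line.startsWith("%X ")) {
            text.append(line.substring(3)).append(' ');
        }
    }
    if (text.length() > 0)                   // don't drop the last reference
        texts.add(text.toString());
    in.close();
    return texts.toArray(new CharSequence[texts.size()]);
}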

The critical new step is converting the raw text into a sequence of words and then into a multinomial representation. A tokenizer converts text to a sequence of tokens (word instances), and a symbol table takes care of the conversion from words to numerical identifiers as required by the LDA Gibbs sampler.

    SymbolTable symbolTable
        = new MapSymbolTable();
    TokenizerFactory tokenizerFactory
        = new WormTokenizerFactory();
    int[][] docTokens
        = LatentDirichletAllocation
        .tokenizeDocuments(articleTexts,
                           tokenizerFactory,
                           symbolTable,
                           minTokenCount);

The static tokenizeDocuments() method in the LDA class takes the raw texts (an array of character sequences), a tokenizer factory with which to convert the texts into token streams, and a symbol table to convert tokens to integer identifiers. There is also a minimum count for tokens; all tokens occurring fewer times than this in the corpus are removed. Such rare tokens tend not to be very useful for topical analysis, though LDA can estimate topic distributions even for singletons, so in some cases no pruning is necessary and the minimum token count may be set to 1.

The tokenizer factory was designed to mirror the one used by Blei et al., to the extent that it was described (they only sketched their stoplisting). Tokenization is based on a regex tokenizer factory that is fairly coarse grained, so as to leave gene names like "clk-2" together as single tokens. There is a stop filter to remove stoplisted words and non-alphanumeric tokens. There is also a simple plural stemmer as described in the paper, and all strings are normalized to lower case. The details of the tokenizer may be found in the source file.
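As a rough illustration of such a pipeline, the following sketch chains LingPipe tokenizer factories together; the regular expression, the stop set, and the plural-stripping rule are placeholders of ours, not the demo's actual definitions:

import com.aliasi.tokenizer.*;
import java.util.Set;

static TokenizerFactory wormStyleTokenizerFactory(Set<String> stopSet) {
    // coarse-grained regex keeps hyphenated gene names like "clk-2" whole
    TokenizerFactory tf = new RegExTokenizerFactory("[a-zA-Z0-9\\-]+");
    tf = new LowerCaseTokenizerFactory(tf);     // normalize case
    tf = new StopTokenizerFactory(tf,stopSet);  // remove stoplisted words
    tf = new ModifyTokenTokenizerFactory(tf) {  // naive plural stemmer (illustrative)
        public String modifyToken(String token) {
            return (token.length() > 3 && token.endsWith("s"))
                ? token.substring(0,token.length()-1)
                : token;
        }
    };
    return tf;
}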

The rest of the code is exactly the same as for the fixed example. There is an additional static method reportCorpus(), which may be used to produce a token count report on a corpus; we used this method to determine the form of the stoplist and to get a feel for token distributions.

Topic Stability Evaluation

We replicate Steyvers and Griffiths' (2007) comparison of topic stability. They first drew two Gibbs samples from different random initializations (in theory, these could just come from a sufficiently large sample lag on the same chain). The basis for topic comparison is the symmetrized form of Kullback-Leibler divergence, which Steyvers and Griffiths computed over the discrete distributions over words defined by each topic in a sample.
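Concretely, for discrete distributions p and q over the same set of words, the symmetrized divergence is the average of the two directed divergences (the choice of logarithm base only scales the values):

    D_sym(p,q) = ( D(p||q) + D(q||p) ) / 2

where

    D(p||q) = SUM_w p(w) * log( p(w) / q(w) )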

Running the Demo

The topic evaluation may be run from Ant with the target lda-topic-sim, with the property wormbase.gz set to the path of the data.

The output very closely mirrors that in the Steyvers and Griffiths paper, though their graphical presentation by color intensity is rather difficult to decipher accurately. Here we just list the divergences in ascending order:

Epoch=1860   elapsed time=3:30:45
Epoch=1998   elapsed time=3:30:45
Epoch=1999   elapsed time=3:30:51
      log2 p(corpus|phi,theta)=-3.83662707520383E7     token cross-entropy rate=10.101570740784956
Epoch=1861   elapsed time=3:30:58
...
      log2 p(corpus|phi,theta)=-3.836600079880293E7     token cross-entropy rate=10.101499663986237
...
Epoch=1900   elapsed time=3:35:11
      log2 p(corpus|phi,theta)=-3.8365097949822724E7     token cross-entropy rate=10.101261950164616
...
Epoch=1950   elapsed time=3:40:41
      log2 p(corpus|phi,theta)=-3.836379238634189E7     token cross-entropy rate=10.10091820443172
...
Epoch=1990   elapsed time=3:45:04
      log2 p(corpus|phi,theta)=-3.8363853398416765E7     token cross-entropy rate=10.100934268484291
...
Epoch=1999   elapsed time=3:46:04

Computing Greedy Aligned Symmetrized KL Divergences
   0           0.339
   1           0.417
   2           0.529
   3           0.552
   4           0.604
   5           0.622
   6           0.646
   7           0.715
   8           0.727
   9           0.743
  10           0.756
  11           0.792
  12           0.820
  13           0.823
  14           0.923
  15           0.941
  16           1.127
  17           1.249
  18           1.400
  19           1.402
  20           1.419
  21           1.463
  22           1.484
  23           1.534
  24           1.638
  25           1.665
  26           1.708
  27           1.730
  28           1.869
  29           1.875
  30           2.037
  31           2.322
  32           2.347
  33           2.574
  34           2.658
  35           2.783
  36           3.270
  37           3.519
  38           3.568
  39           3.591
  40           3.737
  41           4.122
  42           4.809
  43           5.159
  44           6.169
  45           7.338
  46           8.924
  47           9.784
  48           9.923
  49          11.714

The initial set of values indicates a very good match: we're looking at multinomials over thousands of words with very low KL divergences. Further down the ranking, the divergences grow larger.

Multi-Threaded Code

The code for the evaluation is in src/LdaTopicSimilarity.java. The first thing to note about it is that it runs in multiple threads. The texts and their multinomial representations are created as before. But now, rather than calling the Gibbs sampling method directly, we start two threads:

LdaRunnable runnable1 = new LdaRunnable(docTokens,
                                        new LdaReportingHandler(symbolTable),
                                        new Random());

LdaRunnable runnable2 = new LdaRunnable(docTokens,
                                        new LdaReportingHandler(symbolTable),
                                        new Random());

Thread thread1 = new Thread(runnable1);
Thread thread2 = new Thread(runnable2);

thread1.start();
thread2.start();

thread1.join();
thread2.join();

LatentDirichletAllocation lda0 = runnable1.mLda;
LatentDirichletAllocation lda1 = runnable2.mLda;

First, we create the two runnables, which get the document-token arrays, their own handlers, and their own fresh randomizers. Then the runnables are embedded in threads and started. In order to wait for the threads to finish running, the current (calling) thread calls join() on each thread in turn. After both calls to join() return, we know both threads have finished. At that point, we pull the LDA objects out of the runnables.

The runnable's code just does the obvious sampling.

static class LdaRunnable implements Runnable {
    LatentDirichletAllocation mLda;
    final int[][] mDocTokens;
    final ObjectHandler<LatentDirichletAllocation.GibbsSample> mHandler;
    final Random mRandom;

    LdaRunnable(int[][] docTokens,
                ObjectHandler<LatentDirichletAllocation.GibbsSample> handler,
                Random random) {
        mDocTokens = docTokens;
        mHandler = handler;
        mRandom = random;
    }

    public void run() {
        mLda = sample(mDocTokens,mHandler,mRandom);
    }
}

The sample() method itself just runs sampling and then converts the sample to an LDA model:

LatentDirichletAllocation.GibbsSample sample
    = LatentDirichletAllocation
      .gibbsSampler(docTokens,
                    ...

return sample.lda();

We then take the two LDA instances and compare them for topic similarity using greedy alignment. Specifically, we compare each pair of topic distributions using symmetrized KL divergence, then repeatedly pair off the remaining unpaired topics with the lowest divergence (that is, the highest similarity) until we run out of topics.

For the sake of understanding LDA, the only part of the implementation that's important is the computation of divergence:

for (int i = 0; i < numTopics; ++i) {
    for (int j = 0; j < numTopics; ++j) {
        // all cross-model pairs are compared, including i == j: topic
        // numbering is arbitrary, so indices are not aligned across chains
        double divergence
            = Statistics
              .symmetrizedKlDivergence(lda0.wordProbabilities(i),
                                       lda1.wordProbabilities(j));
        ...

The static divergence method may be found in the com.aliasi.stats.Statistics class.
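To round out the picture, here is a sketch of the greedy alignment itself. The divergence matrix and the print statement are scaffolding of ours around the computation above, not the demo's exact code:

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// assume divergence[i][j] holds the symmetrized KL divergence between
// topic i of the first model and topic j of the second
List<double[]> pairs = new ArrayList<double[]>();
for (int i = 0; i < numTopics; ++i)
    for (int j = 0; j < numTopics; ++j)
        pairs.add(new double[] { divergence[i][j], i, j });
Collections.sort(pairs, new Comparator<double[]>() {
        public int compare(double[] a, double[] b) {
            return Double.compare(a[0],b[0]);  // ascending divergence
        }
    });
boolean[] used0 = new boolean[numTopics];
boolean[] used1 = new boolean[numTopics];
for (double[] pair : pairs) {
    int i = (int) pair[1];
    int j = (int) pair[2];
    if (used0[i] || used1[j]) continue;  // each topic aligned at most once
    used0[i] = used1[j] = true;
    System.out.printf("topic %2d <-> topic %2d  divergence=%6.3f%n",
                      i, j, pair[0]);
}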

LDA Document Similarity: More Like This

Just as we compared topics using KL divergence, we can compare documents by measuring the divergence of their distributions over topics. Within the sampling process, we have access to each document's Dirichlet posterior parameters, and thus to the discrete distribution representing its mean; either may be used to compare two documents.
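Here is a sketch of the within-sample comparison using only methods that have already appeared in this tutorial; the document indices doc1 and doc2 are hypothetical:

// estimate each document's topic distribution from the sample, smoothed
// by the document-topic prior, then compare by symmetrized KL divergence
double prior = sample.documentTopicPrior();
double[] topicProbs1 = new double[numTopics];
double[] topicProbs2 = new double[numTopics];
for (int topic = 0; topic < numTopics; ++topic) {
    topicProbs1[topic] = (sample.documentTopicCount(doc1,topic) + prior)
        / (sample.documentLength(doc1) + numTopics * prior);
    topicProbs2[topic] = (sample.documentTopicCount(doc2,topic) + prior)
        / (sample.documentLength(doc2) + numTopics * prior);
}
double divergence
    = Statistics.symmetrizedKlDivergence(topicProbs1,topicProbs2);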

Alternatively, an LDA model may be used to compare a pair of fresh documents. This merely requires first estimating their topic distributions, which may be done by sampling or by taking an aggregate mean over samples. Both approaches are available as methods in the LDA class.

References