com.aliasi.cluster
Class LatentDirichletAllocation.GibbsSample

java.lang.Object
  extended by com.aliasi.cluster.LatentDirichletAllocation.GibbsSample
Enclosing class:
LatentDirichletAllocation

public static class LatentDirichletAllocation.GibbsSample
extends Object

The LatentDirichletAllocation.GibbsSample class encapsulates all of the information related to a single Gibbs sample for latent Dirichlet allocation (LDA). A sample consists of the assignment of a topic identifier to each token in the corpus. The values returned by the other methods in this class are derived from the topic samples, the data being estimated, and LDA parameters such as the priors.

Instances of this class are created by the sampling method in the containing class, LatentDirichletAllocation. For convenience, the sample includes all of the data used to construct the sample, as well as the hyperparameters used for sampling.

As described in the class documentation for the containing class LatentDirichletAllocation, the primary content in a Gibbs sample for LDA is the assignment of a single topic to each token in the corpus. Cumulative counts for topics in documents and words in topics as well as total counts are also available; they do not entail any additional computation costs as the sampler maintains them as part of the sample.
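The relationship between the per-token topic assignments and the cumulative counts can be illustrated with a minimal sketch. The class TopicCounts and its field names are hypothetical and not part of LingPipe; the sketch simply tallies counts from a corpus of word identifiers and a parallel array of topic assignments, which is the kind of bookkeeping the sampler is described as maintaining.

```java
/** Hypothetical helper (not part of LingPipe): tallies counts from token-level topic assignments. */
final class TopicCounts {
    final int[][] docTopic;   // docTopic[doc][topic]: tokens in doc assigned to topic
    final int[][] topicWord;  // topicWord[topic][word]: tokens of word assigned to topic
    final int[] topicTotal;   // topicTotal[topic]: all tokens assigned to topic

    TopicCounts(int[][] docWords, short[][] topics, int numTopics, int numWords) {
        docTopic = new int[docWords.length][numTopics];
        topicWord = new int[numTopics][numWords];
        topicTotal = new int[numTopics];
        for (int doc = 0; doc < docWords.length; ++doc) {
            for (int tok = 0; tok < docWords[doc].length; ++tok) {
                int topic = topics[doc][tok];
                int word = docWords[doc][tok];
                ++docTopic[doc][topic];
                ++topicWord[topic][word];
                ++topicTotal[topic];
            }
        }
    }
}
```

In this sketch, summing topicWord[topic] over all words, or docTopic[doc][topic] over all documents, yields topicTotal[topic].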

The sample also contains meta-information about the state of the sampling procedure. The epoch at which the sample was produced is provided, as well as an indication of how many topic assignments changed between this sample and the previous sample. Note that this is the previous sample in the chain, not necessarily the previous sample handled by the LDA handler; the handler only gets the samples separated by the specified lag.

The sample may be used to generate an LDA model. The resulting model may then be used for estimation of unseen documents. Typically, models derived from several samples are used for Bayesian computations, as described in the class documentation above.

Since:
LingPipe3.3
Version:
3.3.0
Author:
Bob Carpenter

Method Summary
 double corpusLog2Probability()
          Returns an estimate of the log (base 2) likelihood of the corpus given the point estimates of topic and document multinomials determined from this sample.
 int documentLength(int doc)
          Returns the length of the specified document in tokens.
 int documentTopicCount(int doc, int topic)
          Returns the number of times the specified topic was assigned to the specified document in this sample.
 double documentTopicPrior()
          Returns the uniform Dirichlet concentration hyperparameter α for document distributions over topics from which this sample was produced.
 double documentTopicProb(int doc, int topic)
          Returns the estimate of the probability of the topic being assigned to a word in the specified document given the topic assignments in this sample.
 int epoch()
          Returns the epoch in which this sample was generated.
 LatentDirichletAllocation lda()
          Returns a latent Dirichlet allocation model corresponding to this sample.
 int numChangedTopics()
          Returns the total number of topic assignments to tokens that changed between the last sample and this one.
 int numDocuments()
          Returns the number of documents on which the sample was based.
 int numTokens()
          Returns the number of tokens in documents on which the sample was based.
 int numTopics()
          Returns the number of topics for this sample.
 int numWords()
          Returns the number of distinct words in the documents on which the sample was based.
 int topicCount(int topic)
          Returns the total number of tokens assigned to the specified topic in this sample.
 short topicSample(int doc, int token)
          Returns the topic identifier sampled for the specified token position in the specified document.
 int topicWordCount(int topic, int word)
          Returns the number of times tokens for the specified word were assigned to the specified topic.
 double topicWordPrior()
          Returns the uniform Dirichlet concentration hyperparameter β for topic distributions over words from which this sample was produced.
 double topicWordProb(int topic, int word)
          Returns the probability estimate for the specified word in the specified topic in this sample.
 int word(int doc, int token)
          Returns the word identifier for the specified token position in the specified document.
 int wordCount(int word)
          Returns the number of times tokens of the specified word appeared in the corpus.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

epoch

public int epoch()
Returns the epoch in which this sample was generated.

Returns:
The epoch for this sample.

numDocuments

public int numDocuments()
Returns the number of documents on which the sample was based.

Returns:
The number of documents for the sample.

numWords

public int numWords()
Returns the number of distinct words in the documents on which the sample was based.

Returns:
The number of words underlying the model.

numTokens

public int numTokens()
Returns the number of tokens in documents on which the sample was based. Each token is an instance of a particular word.

Returns:
The number of tokens in the documents for the sample.
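The distinction between tokens and words can be made concrete with a small sketch. The class CorpusSizes is hypothetical and operates on a plain array of word identifiers rather than on a sample. It counts distinct identifiers that actually occur; whether the library determines the number of words in exactly this way is an assumption not confirmed by this page.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper (not part of LingPipe): token and distinct-word counts for a corpus. */
final class CorpusSizes {

    /** Total number of token positions across all documents. */
    static int numTokens(int[][] docWords) {
        int n = 0;
        for (int[] doc : docWords)
            n += doc.length;
        return n;
    }

    /** Number of distinct word identifiers that occur at least once. */
    static int numDistinctWords(int[][] docWords) {
        Set<Integer> seen = new HashSet<>();
        for (int[] doc : docWords)
            for (int word : doc)
                seen.add(word);
        return seen.size();
    }
}
```

For a corpus such as { {0, 1, 1}, {2, 0} }, there are five tokens but only three distinct words.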


numTopics

public int numTopics()
Returns the number of topics for this sample.

Returns:
The number of topics for this sample.

topicSample

public short topicSample(int doc,
                         int token)
Returns the topic identifier sampled for the specified token position in the specified document.

Parameters:
doc - Identifier for a document.
token - Token position in the specified document.
Returns:
The topic assigned to the specified token in this sample.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the token is not between 0 (inclusive) and the number of tokens (exclusive) in the specified document.

word

public int word(int doc,
                int token)
Returns the word identifier for the specified token position in the specified document.

Parameters:
doc - Identifier for a document.
token - Token position in the specified document.
Returns:
The word found at the specified position in the specified document.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive), or if the token is not between 0 (inclusive) and the number of tokens (exclusive) in the specified document.

documentTopicPrior

public double documentTopicPrior()
Returns the uniform Dirichlet concentration hyperparameter α for document distributions over topics from which this sample was produced.

Returns:
The document-topic prior.

topicWordPrior

public double topicWordPrior()
Returns the uniform Dirichlet concentration hyperparameter β for topic distributions over words from which this sample was produced.

Returns:
The topic-word prior.


documentTopicCount

public int documentTopicCount(int doc,
                              int topic)
Returns the number of times the specified topic was assigned to the specified document in this sample.

Parameters:
doc - Identifier for a document.
topic - Identifier for a topic.
Returns:
The count of the topic in the document in this sample.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive) or if the topic identifier is not between 0 (inclusive) and the number of topics (exclusive).

documentLength

public int documentLength(int doc)
Returns the length of the specified document in tokens.

Parameters:
doc - Identifier for a document.
Returns:
The length of the specified document in tokens.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive).

topicWordCount

public int topicWordCount(int topic,
                          int word)
Returns the number of times tokens for the specified word were assigned to the specified topic.

Parameters:
topic - Identifier for a topic.
word - Identifier for a word.
Returns:
The number of tokens of the specified word assigned to the specified topic.
Throws:
IndexOutOfBoundsException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive), or if the word is not between 0 (inclusive) and the number of words (exclusive).

topicCount

public int topicCount(int topic)
Returns the total number of tokens assigned to the specified topic in this sample.

Parameters:
topic - Identifier for a topic.
Returns:
The total number of tokens assigned to the specified topic.
Throws:
IllegalArgumentException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive).

numChangedTopics

public int numChangedTopics()
Returns the total number of topic assignments to tokens that changed between the last sample and this one. Note that the last sample here is the previous sample in the chain, not necessarily the last sample passed to a handler, because handlers may not be configured to handle every sample.

Returns:
The number of topic assignments that changed in this sample relative to the previous sample.
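As a rough illustration of what this count measures, the following sketch compares the topic assignments of two consecutive states of the chain position by position. The class ChangedTopics and its method are hypothetical; the library is not documented here as computing the value this way, only as reporting it.

```java
/** Hypothetical helper (not part of LingPipe): compares two consecutive assignment states. */
final class ChangedTopics {

    /** Counts token positions whose topic differs between the two states. */
    static int countChanged(short[][] previous, short[][] current) {
        int changed = 0;
        for (int doc = 0; doc < current.length; ++doc) {
            for (int tok = 0; tok < current[doc].length; ++tok) {
                if (previous[doc][tok] != current[doc][tok])
                    ++changed;
            }
        }
        return changed;
    }
}
```

A count that shrinks over successive epochs is often read as an informal sign that the chain is settling, although it is not a convergence guarantee.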

topicWordProb

public double topicWordProb(int topic,
                            int word)
Returns the probability estimate for the specified word in the specified topic in this sample. This value is a maximum a posteriori estimate, computed as described in the class documentation for LatentDirichletAllocation from the topic assignment counts in this sample and the topic-word prior.

Parameters:
topic - Identifier for a topic.
word - Identifier for a word.
Returns:
The probability of generating the specified word in the specified topic.
Throws:
IndexOutOfBoundsException - If the specified topic is not between 0 (inclusive) and the number of topics (exclusive), or if the word is not between 0 (inclusive) and the number of words (exclusive).
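The exact formula is given in the enclosing class documentation rather than on this page. As a sketch only, the following assumes the common additive-smoothing form, in which the prior β is added to the topic-word count and the denominator is the topic total plus the number of words times β. The class TopicWordEstimate is hypothetical.

```java
/** Hypothetical helper (not part of LingPipe): smoothed topic-word estimate. */
final class TopicWordEstimate {

    /**
     * Assumed form: (count(topic,word) + beta) / (count(topic) + numWords * beta).
     * This is an illustration, not a confirmed copy of the library's computation.
     */
    static double prob(int topicWordCount, int topicCount, int numWords, double beta) {
        return (topicWordCount + beta) / (topicCount + numWords * beta);
    }
}
```

Under this assumed form, the estimates for a fixed topic sum to one over all words, and a word never assigned to the topic still receives a small nonzero probability when β is positive.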

wordCount

public int wordCount(int word)
Returns the number of times tokens of the specified word appeared in the corpus.

Parameters:
word - Identifier of a word.
Returns:
The number of tokens of the word in the corpus.
Throws:
IndexOutOfBoundsException - If the word identifier is not between 0 (inclusive) and the number of words (exclusive).

documentTopicProb

public double documentTopicProb(int doc,
                                int topic)
Returns the estimate of the probability of the topic being assigned to a word in the specified document given the topic assignments in this sample. This is the maximum a posteriori estimate, computed as described in the class documentation for LatentDirichletAllocation from the topic assignment counts in this sample and the document-topic prior.

Parameters:
doc - Identifier of a document.
topic - Identifier for a topic.
Returns:
An estimate of the probability of the topic in the document.
Throws:
IndexOutOfBoundsException - If the document identifier is not between 0 (inclusive) and the number of documents (exclusive) or if the topic identifier is not between 0 (inclusive) and the number of topics (exclusive).
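A per-document topic distribution can be sketched from the document's topic counts and the prior α. The form used below, with α added to each count and the document length plus the number of topics times α as the denominator, is an assumption; the authoritative definition is in the enclosing class documentation. The class DocumentTopicEstimate is hypothetical.

```java
/** Hypothetical helper (not part of LingPipe): smoothed document-topic distribution. */
final class DocumentTopicEstimate {

    /**
     * Assumed form for each topic:
     * (count(doc,topic) + alpha) / (docLength + numTopics * alpha).
     */
    static double[] distribution(int[] topicCountsInDoc, double alpha) {
        int docLength = 0;
        for (int count : topicCountsInDoc)
            docLength += count;
        double denominator = docLength + topicCountsInDoc.length * alpha;
        double[] probs = new double[topicCountsInDoc.length];
        for (int topic = 0; topic < probs.length; ++topic)
            probs[topic] = (topicCountsInDoc[topic] + alpha) / denominator;
        return probs;
    }
}
```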

corpusLog2Probability

public double corpusLog2Probability()
Returns an estimate of the log (base 2) likelihood of the corpus given the point estimates of topic and document multinomials determined from this sample.

This likelihood calculation uses the methods documentTopicProb(int,int) and topicWordProb(int,int) for estimating likelihoods according to the following formula:

 corpusLog2Probability()
 = Σ_{doc,i} log_2 Σ_{topic} p(topic|doc) * p(word[doc][i]|topic)

Note that this is not the complete corpus likelihood, which requires integrating over possible topic and document multinomials given the priors.

Returns:
The log (base 2) likelihood of the training corpus given the document and topic multinomials determined by this sample.
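The formula above can be sketched directly over plain arrays. The class CorpusLog2 and its parameter names are hypothetical; docTopicProb[doc][topic] and topicWordProb[topic][word] stand in for the values that documentTopicProb(int,int) and topicWordProb(int,int) would return for a sample.

```java
/** Hypothetical helper (not part of LingPipe): evaluates the point-estimate corpus likelihood. */
final class CorpusLog2 {

    private static final double LOG_OF_2 = Math.log(2.0);

    /** Sums, over every token, the log (base 2) of its probability marginalized over topics. */
    static double corpusLog2Probability(int[][] docWords,
                                        double[][] docTopicProb,
                                        double[][] topicWordProb) {
        double total = 0.0;
        for (int doc = 0; doc < docWords.length; ++doc) {
            for (int i = 0; i < docWords[doc].length; ++i) {
                int word = docWords[doc][i];
                double tokenProb = 0.0;
                for (int topic = 0; topic < topicWordProb.length; ++topic)
                    tokenProb += docTopicProb[doc][topic] * topicWordProb[topic][word];
                total += Math.log(tokenProb) / LOG_OF_2; // natural log converted to base 2
            }
        }
        return total;
    }
}
```

Because each token contributes the log of a probability, the result is never positive, and values closer to zero indicate a better fit of the point estimates to the corpus.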

lda

public LatentDirichletAllocation lda()
Returns a latent Dirichlet allocation model corresponding to this sample. The topic-word probabilities are calculated according to topicWordProb(int,int), and the document-topic prior is as specified in the call to LDA that produced this sample.

Returns:
The LDA model for this sample.