com.aliasi.cluster
Class LatentDirichletAllocation

java.lang.Object
  extended by com.aliasi.cluster.LatentDirichletAllocation
All Implemented Interfaces:
Serializable

public class LatentDirichletAllocation
extends Object
implements Serializable

A LatentDirichletAllocation object represents a latent Dirichlet allocation (LDA) model. LDA provides a Bayesian model of document generation in which each document is generated by a mixture of topical multinomials. An LDA model specifies the number of topics, a Dirichlet prior over topic mixtures for a document, and a discrete distribution over words for each topic.

A document is generated from an LDA model by first selecting a multinomial over topics given the Dirichlet prior. Then for each token in the document, a topic is generated from the document-specific topic distribution, and then a word is generated from the discrete distribution for that topic. Note that document length is not generated by this model; a fully generative model would need a way of generating lengths (e.g. a Poisson distribution) or terminating documents (e.g. a distinguished end-of-document symbol).

An LDA model may be estimated from an unlabeled training corpus (collection of documents) using a second Dirichlet prior, this time over word distributions in a topic. This class provides a static inference method that produces (collapsed Gibbs) samples from the posterior distribution of topic assignments to words, any one of which may be used to construct an LDA model instance.

An LDA model can be used to infer the topic mixture of unseen text documents, which can be used to compare documents by topical similarity. A fixed LDA model may also be used to estimate the likelihood of a word occurring in a document given the other words in the document. A collection of LDA models may be used for fully Bayesian reasoning at the corpus (collection of documents) level.

LDA may be applied to arbitrary multinomial data. To apply it to text, a tokenizer factory converts a text document to a bag of words and a symbol table converts these tokens to integer outcomes. The static method tokenizeDocuments(CharSequence[],TokenizerFactory,SymbolTable,int) is available to carry out the text to multinomial conversion with pruning of low counts.
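
For example, a corpus of texts might be converted as follows. This is only a sketch; it assumes LingPipe's MapSymbolTable and IndoEuropeanTokenizerFactory, and the minimum count is illustrative.

 // assumes com.aliasi.symbol.MapSymbolTable, com.aliasi.symbol.SymbolTable,
 // com.aliasi.tokenizer.IndoEuropeanTokenizerFactory, com.aliasi.tokenizer.TokenizerFactory
 CharSequence[] texts = new CharSequence[] {
     "cats chase mice and mice hide",
     "dogs chase cats and dogs bark"
 };
 SymbolTable symbolTable = new MapSymbolTable();
 TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.INSTANCE;
 int minTokenCount = 1;   // keep every token in this toy corpus; real corpora typically prune more
 int[][] docWords
     = LatentDirichletAllocation.tokenizeDocuments(texts, tokenizerFactory,
                                                   symbolTable, minTokenCount);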

LDA Generative Model

LDA operates over a fixed vocabulary of discrete outcomes, which we call words for convenience, and represent as a set of integers:

 words = { 0, 1, ..., numWords-1 }

A corpus is a ragged array of documents, each document consisting of an array of words:

 int[][] words = { words[0], words[1], ..., words[numDocs-1] }
A given document words[i], i < numDocs, is itself represented as a bag of words, each word being represented as an integer:
 int[] words[i] = { words[i][0], words[i][1], ..., words[i][words[i].length-1] }
The documents do not all need to be the same length, so the two-dimensional array words is ragged.
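
For example, a toy corpus over a four-word vocabulary (numWords = 4) with three documents of different lengths might look like the following; the values are purely illustrative.

 int[][] words = new int[][] {
     { 0, 2, 2, 3 },       // document 0: four tokens
     { 1, 0 },             // document 1: two tokens
     { 3, 3, 1, 2, 0 }     // document 2: five tokens
 };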

A particular LDA model is defined over a fixed number of topics, also represented as integers:

 topics = { 0, 1, ..., numTopics-1 }

For a given topic topic < numTopics, LDA specifies a discrete distribution φ[topic] over words:

 φ[topic][word] >= 0.0

 Σword < numWords φ[topic][word] = 1.0

In an LDA model, each document in the corpus is generated from a document-specific mixture of topics θ[doc]. The distribution θ[doc] is a discrete distribution over topics:

 θ[doc][topic] >= 0.0

 Σtopic < numTopics θ[doc][topic] = 1.0

Under LDA's generative model for documents, a document-specific topic mixture θ[doc] is generated from a uniform Dirichlet distribution with concentration parameter α. The Dirichlet is the conjugate prior for the multinomial in which α acts as a prior count assigned to a topic in a document. Typically, LDA is run with a fairly diffuse prior with concentration α < 1, leading to skewed posterior topic distributions.

Given a topic distribution θ[doc] for a document, tokens are generated (conditionally) independently. For each token in the document, a topic topics[doc][token] is generated according to the topic distribution θ[doc], then the word instance words[doc][token] is generated given the topic using the topic-specific distribution over tokens φ[topics[doc][token]].

For estimation purposes, LDA places a uniform Dirichlet prior with concentration parameter β on each of the topic distributions φ[topic]. The first step in modeling a corpus is to generate the topic distributions φ from a Dirichlet parameterized by β.

In sampling notation, the LDA generative model is expressed as follows:

 φ[topic]           ~ Dirichlet(β)
 θ[doc]             ~ Dirichlet(α)
 topics[doc][token] ~ Discrete(θ[doc])
 words[doc][token]  ~ Discrete(φ[topic[doc][token]])
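
As an illustration of the last two sampling steps, the following sketch generates one document of a given length, assuming a topic mixture theta and topic-word distributions phi have already been drawn from their Dirichlets; the helper sampleDiscrete is a hypothetical name, not part of this class.

 // illustration only: generate one document given theta (the document's topic
 // distribution) and phi (the per-topic word distributions)
 static int[] generateDocument(double[] theta, double[][] phi, int docLength,
                               java.util.Random random) {
     int[] words = new int[docLength];
     for (int tok = 0; tok < docLength; ++tok) {
         int topic = sampleDiscrete(theta, random);        // topics[doc][tok] ~ Discrete(θ[doc])
         words[tok] = sampleDiscrete(phi[topic], random);  // words[doc][tok] ~ Discrete(φ[topic])
     }
     return words;
 }

 // hypothetical helper: draw an index from a discrete distribution
 static int sampleDiscrete(double[] probs, java.util.Random random) {
     double u = random.nextDouble();
     for (int i = 0; i < probs.length; ++i) {
         u -= probs[i];
         if (u <= 0.0) return i;
     }
     return probs.length - 1;   // guard against floating-point rounding
 }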

A generative Bayesian model is based on the joint probability of all observed and unobserved variables, given the priors. Given a text corpus, only the words are observed. The unobserved variables include the assignment of a topic to each word, the assignment of a topic distribution to each document, and the assignment of word distributions to each topic. We let topic[doc][token], doc < numDocs, token < docLength[doc], be the topic assigned to the specified token in the specified document. This leaves two Dirichlet priors, one parameterized by α for topics in documents, and one parameterized by β for words in topics. These priors are treated as hyperparameters for purposes of estimation; cross-validation may be used to provide so-called empirical Bayes estimates for the priors.

The full LDA probability model for a corpus follows the generative process outlined above. First, topic distributions are chosen at the corpus level for each topic given their Dirichlet prior, and then the remaining variables are generated given these topic distributions:

 p(words, topics, θ, φ | α, β)
 = Πtopic < numTopics
        p(φ[topic] | β)
        * p(words, topics, θ | α, φ)
Note that because the multinomial parameters for φ[topic] are continuous, p(φ[topic] | β) represents a density, not a discrete distribution. Thus it does not make sense to talk about the probability of a given multinomial φ[topic]; non-zero results only arise from integrals over measurable subsets of the multinomial simplex. It is possible to sample from a density, so the generative model is well founded.

A document is generated by first generating its topic distribution given the Dirichlet prior, and then generating its topics and words:

 p(words, topics, θ | α, φ)
 = Πdoc < numDocs
        p(θ[doc] | α)
        * p(words[doc], topics[doc] | θ[doc], φ)
The topic and word are generated from the multinomials θ[doc] and the topic distributions φ using the chain rule, first generating the topic given the document's topic distribution and then generating the word given the topic's word distribution.
 p(words[doc], topics[doc] | θ[doc], φ)
 = Πtoken < words[doc].length
        p(topics[doc][token] | θ[doc])
        * p(words[doc][token] | φ[topics[doc][token]])

Given the topic and document multinomials, this distribution is discrete, and thus may be evaluated. It may also be marginalized by summation:

 p(words[doc] | θ[doc], φ)
 = Π token < words[doc].length
       Σtopic < numTopics p(topic | θ[doc]) * p(words[doc][token] | φ[topic])

Conditional probabilities are computed in the usual way by marginalizing out other variables through integration. Unfortunately, this simple mathematical operation is often computationally intractable.

Estimating LDA with Collapsed Gibbs Sampling

This class uses a collapsed form of Gibbs sampling over the posterior distribution of topic assignments given the documents and Dirichlet priors:

 p(topics | words, α, β)

This distribution may be derived from the joint distribution by marginalizing (also known as "collapsing" or "integrating out") the contributions of the document-topic and topic-word distributions.

The Gibbs sampler used to estimate LDA models produces samples that consist of a topic assignment to every token in the corpus. The conjugacy of the Dirichlet prior for multinomials makes the sampling straightforward.

An initial sample is produced by randomly assigning topics to tokens. Then, the sampler works iteratively through the corpus, one token at a time. At each token, it samples a new topic assignment to that token given all the topic assignments to other tokens in the corpus:

 p(topics[doc][token] | words, topics')
The notation topics' represents the set of topic assignments other than to topics[doc][token]. This collapsed posterior conditional is estimated directly:
 p(topics[doc][token] = topic | words, topics')
 = (count'(doc,topic) + α) / (docLength[doc]-1 + numTopics*α)
 * (count'(topic,word) + β) / (count'(topic) + numWords*β)
The counts are all defined relative to topics'; that is, the current topic assignment for the token being sampled is not considered in the counts. Note that the two factors are estimates of θ[doc] and φ[topic] with all data other than the assignment to the current token. Note how the prior concentrations arise as additive smoothing factors in these estimates, a result of the Dirichlet's conjugacy to the multinomial. For the purposes of sampling, the document-length normalization in the denominator of the first term is not necessary, as it remains constant across topics.
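
In code, the unnormalized sampling weights for one token might be computed as in the following sketch; the count-array names (docTopicCount, topicWordCount, topicCount) are assumed, and all counts exclude the token currently being resampled.

 // sketch: unnormalized collapsed Gibbs weights for re-assigning one token
 static double[] topicWeights(int doc, int word,
                              int[][] docTopicCount, int[][] topicWordCount,
                              int[] topicCount, double alpha, double beta,
                              int numWords) {
     int numTopics = topicCount.length;
     double[] weight = new double[numTopics];
     for (int topic = 0; topic < numTopics; ++topic)
         weight[topic]
             = (docTopicCount[doc][topic] + alpha)        // ∝ estimate of θ[doc][topic]
             * (topicWordCount[topic][word] + beta)
             / (topicCount[topic] + numWords * beta);     // estimate of φ[topic][word]
     // the document-length denominator is dropped; it is constant across topics
     return weight;
 }
 // the new topic is then drawn with probability proportional to weight[topic]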

The posterior Dirichlet distributions may be computed using just the counts. For instance, the posterior distribution for topics in documents is estimated as:

 p(θ[doc] | α, β, words)
 = Dirichlet(count(doc,0)+α, count(doc,1)+α, ..., count(doc,numTopics-1)+α)

The sampling distribution is defined from the maximum a posteriori (MAP) estimates of the multinomial distribution over topics in a document:

 θ*[doc] = ARGMAXθ[doc] p(θ[doc] | α, β, words)
which we know from the Dirichlet distribution is:
 θ*[doc][topic]
 = (count(doc,topic) + α) / (docLength[doc] + numTopics*α)
By the same reasoning, the MAP word distribution in topics is:
 φ*[topic][word]
 = (count(topic,word) + β) / (count(topic) + numWords*β)
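
In code, these point estimates are simple smoothed relative frequencies; the count-array names below are assumed, as in the sampling sketch above.

 double thetaStar = (docTopicCount[doc][topic] + alpha)
                  / (docLength[doc] + numTopics * alpha);   // θ*[doc][topic]
 double phiStar   = (topicWordCount[topic][word] + beta)
                  / (topicCount[topic] + numWords * beta);  // φ*[topic][word]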

A complete Gibbs sample is represented as an instance of LatentDirichletAllocation.GibbsSample, which provides access to the topic assignment to every token, as well as methods to compute θ* and φ* as defined above. A sample also maintains the original priors and word counts. Just the estimates of the topic-word distributions φ[topic] and the prior topic concentration α are sufficient to define an LDA model. Note that the imputed values of θ*[doc] used during estimation are part of a sample, but are not part of the LDA model itself. The LDA model contains enough information to estimate θ* for an arbitrary document, as described in the next section.

The Gibbs sampling algorithm starts with a random assignment of topics to words, then simply iterates through the tokens in turn, sampling topics according to the distribution defined above. After each run through the entire corpus, a callback is made to a handler for the samples. This setup may be configured for an initial burnin period, essentially just discarding the first batch of samples. Then it may be configured to sample only periodically thereafter to avoid correlations between samples.

LDA as Multi-Topic Classifier

An LDA model consists of a topic distribution Dirichlet prior α and a word distribution φ[topic] for each topic. Given an LDA model and a new document words = { words[0], ..., words[length-1] } consisting of a sequence of words, the posterior distribution over topic weights is given by:

 p(θ | words, α, φ)
Although this distribution is not solvable analytically, it is easy to estimate using a simplified form of the LDA estimator's Gibbs sampler. The conditional distribution of a topic assignment topics[token] to a single token given an assignment topics' to all other tokens is given by:
 p(topic[token] | topics', words, α, φ)
 ∝ p(topic[token], words[token] | topics', α, φ)
 = p(topic[token] | topics', α) * p(words[token] | φ[topic[token]])
 = (count(topic[token]) + α) / (words.length - 1 + numTopics * α)
   * p(words[token] | φ[topic[token]])
This leads to a straightforward sampler over posterior topic assignments, from which we may directly compute the Dirichlet posterior over topic distributions or a MAP topic distribution.

This class provides a method to sample these topic assignments, which may then be used to form Dirichlet distributions or MAP point estimates of θ* for the document words.
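
For example, the topic distribution of a new text might be estimated as in the following sketch, where lda denotes a trained model, tokenizerFactory and symbolTable are those used to prepare the training corpus, and the sampling parameters are illustrative.

 int[] tokens
     = LatentDirichletAllocation.tokenizeDocument(text, tokenizerFactory, symbolTable);
 int numSamples = 100;
 int burnin = 10;
 int sampleLag = 1;
 Random random = new Random(42L);
 double[] topicDistribution
     = lda.bayesTopicEstimate(tokens, numSamples, burnin, sampleLag, random);
 for (int topic = 0; topic < topicDistribution.length; ++topic)
     System.out.printf("topic %d: %5.3f%n", topic, topicDistribution[topic]);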

LDA as a Conditional Language Model

An LDA model may be used to estimate the likelihood of a word given a previous bag of words:

 p(word | words, α, φ)
 = ∫ p(word | θ, φ) p(θ | words, α, φ) dθ
This integral is easily evaluated using sampling over the topic distributions p(θ | words, α, φ) and averaging the word probability determined by each sample. The word probability for a sample θ is defined by:
 p(word | θ, φ)
 = Σtopic < numTopics p(topic | θ) * p(word | φ[topic])
Although this approach could theoretically be applied to generate the probability of a document one word at a time, the cost would be prohibitive, as there are quadratically many samples required because samples for the n-th word consist of topic assignments to the previous n-1 words.
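
The averaging may be sketched with this class's API as follows, assuming lda is a trained model, tokens holds the conditioning bag of words, word is the identifier of the word being scored, and the sampling parameters are illustrative.

 int numSamples = 100, burnin = 10, sampleLag = 1;   // illustrative sampling parameters
 Random random = new Random(42L);
 short[][] samples = lda.sampleTopics(tokens, numSamples, burnin, sampleLag, random);
 double alpha = lda.documentTopicPrior();
 int numTopics = lda.numTopics();
 double sum = 0.0;
 for (short[] sampledTopics : samples) {
     // estimate θ for this sample by smoothed relative frequency of topics
     double[] theta = new double[numTopics];
     for (short topic : sampledTopics)
         theta[topic] += 1.0;
     double p = 0.0;
     for (int topic = 0; topic < numTopics; ++topic) {
         theta[topic] = (theta[topic] + alpha) / (sampledTopics.length + numTopics * alpha);
         p += theta[topic] * lda.wordProbability(topic, word);
     }
     sum += p;
 }
 double wordProb = sum / samples.length;   // Monte Carlo average over samples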

Bayesian Calculations and Exchangeability

An LDA model may be used for a variety of statistical calculations. For instance, it may be used to determine the distribution of topics to words, and using these distributions, may determine word similarity. Similarly, document similarity may be determined by the topic distributions in a document.

Point estimates are derived using a single LDA model. For Bayesian calculation, multiple samples are taken to produce multiple LDA models. The results of a calculation on these models are then averaged to produce a Bayesian estimate of the quantity of interest. This sampling methodology effectively computes the integral over the posterior numerically.

Bayesian calculations over multiple samples are complicated by the exchangeability of topics in the LDA model. In particular, there is no guarantee that topics are the same between samples, so it is not valid to combine samples for topic-level reasoning. For instance, it does not make sense to estimate the probability of a topic in a document using multiple samples.

Non-Document Data

The "words" in an LDA model don't necessarily have to represent words in documents. LDA is basically a multinomial mixture model, and any multinomial outcomes may be modeled with LDA. For instance, a document may correspond to a baseball game and the words may correspond to the outcomes of at-bats (some might occur more than once). LDA has also been used for gene expression data, where expression levels from mRNA microarray experiments is quantized into a multinomial outcome.

LDA has also been applied to collaborative filtering. Movies act as words, and each user is modeled as a document consisting of the bag of movies they have seen. Given an LDA model and a user's films, the user's topic distribution may be inferred and used to estimate the likelihood of the user watching as-yet-unseen films.

Since:
LingPipe 3.3
Version:
3.9.2
Author:
Bob Carpenter
See Also:
Serialized Form

Nested Class Summary
static class LatentDirichletAllocation.GibbsSample
          The LatentDirichletAllocation.GibbsSample class encapsulates all of the information related to a single Gibbs sample for latent Dirichlet allocation (LDA).
 
Constructor Summary
LatentDirichletAllocation(double docTopicPrior, double[][] topicWordProbs)
          Construct a latent Dirichlet allocation (LDA) model using the specified document-topic prior and topic-word distributions.
 
Method Summary
 double[] bayesTopicEstimate(int[] tokens, int numSamples, int burnin, int sampleLag, Random random)
          Return the Bayesian point estimate of the topic distribution for a document consisting of the specified tokens, using Gibbs sampling with the specified parameters.
 double documentTopicPrior()
          Returns the concentration value of the uniform Dirichlet prior over topic distributions for documents.
static Iterator<LatentDirichletAllocation.GibbsSample> gibbsSample(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, Random random)
          Return an iterator over Gibbs samples for the specified document-word corpus, number of topics, priors and randomizer.
static LatentDirichletAllocation.GibbsSample gibbsSampler(int[][] docWords, short numTopics, double docTopicPrior, double topicWordPrior, int burninEpochs, int sampleLag, int numSamples, Random random, ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
          Run Gibbs sampling for the specified multinomial data, number of topics, priors, search parameters, randomization and callback sample handler.
 double[] mapTopicEstimate(int[] tokens, int numSamples, int burnin, int sampleLag, Random random)
          Deprecated. Use bayesTopicEstimate(int[],int,int,int,Random) instead.
 int numTopics()
          Returns the number of topics in this LDA model.
 int numWords()
          Returns the number of words on which this LDA model is based.
 short[][] sampleTopics(int[] tokens, int numSamples, int burnin, int sampleLag, Random random)
          Returns the specified number of Gibbs samples of topics for the specified tokens using the specified number of burnin epochs, the specified lag between samples, and the specified randomizer.
static int[] tokenizeDocument(CharSequence text, TokenizerFactory tokenizerFactory, SymbolTable symbolTable)
          Tokenizes the specified text document using the specified tokenizer factory returning only tokens that exist in the symbol table.
static int[][] tokenizeDocuments(CharSequence[] texts, TokenizerFactory tokenizerFactory, SymbolTable symbolTable, int minCount)
          Tokenize an array of text documents represented as character sequences into a form usable by LDA, using the specified tokenizer factory and symbol table.
 double[] wordProbabilities(int topic)
          Returns an array of the probabilities of words in the specified topic.
 double wordProbability(int topic, int word)
          Returns the probability of the specified word in the specified topic.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

LatentDirichletAllocation

public LatentDirichletAllocation(double docTopicPrior,
                                 double[][] topicWordProbs)
Construct a latent Dirichlet allocation (LDA) model using the specified document-topic prior and topic-word distributions.

The topic-word probability array topicWordProbs represents a collection of discrete distributions topicWordProbs[topic] for topics, and thus must satisfy:

 topicWordProbs[topic][word] >= 0.0

 Σword < numWords topicWordProbs[topic][word] = 1.0

Warning: These requirements are not checked by the constructor.

See the class documentation above for an explanation of the parameters and what can be done with a model.
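
A toy construction with made-up values, two topics over a three-word vocabulary:

 double docTopicPrior = 0.1;
 double[][] topicWordProbs = new double[][] {
     { 0.7, 0.2, 0.1 },   // topic 0; each row must sum to 1.0 (not checked)
     { 0.1, 0.3, 0.6 }    // topic 1
 };
 LatentDirichletAllocation lda
     = new LatentDirichletAllocation(docTopicPrior, topicWordProbs);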

Parameters:
docTopicPrior - The document-topic prior.
topicWordProbs - Array of discrete distributions over words, indexed by topic.
Throws:
IllegalArgumentException - If the document-topic prior is not finite and positive, or if the topic-word probability arrays are not all the same length with entries between 0.0 and 1.0 inclusive.
Method Detail

numTopics

public int numTopics()
Returns the number of topics in this LDA model.

Returns:
The number of topics in this model.

numWords

public int numWords()
Returns the number of words on which this LDA model is based.

Returns:
The number of words in this model.

documentTopicPrior

public double documentTopicPrior()
Returns the concentration value of the uniform Dirichlet prior over topic distributions for documents. This value is effectively a prior count for topics used for additive smoothing during estimation.

Returns:
The prior count of topics in documents.

wordProbability

public double wordProbability(int topic,
                              int word)
Returns the probability of the specified word in the specified topic. The values returned should be non-negative and finite, and should sum to 1.0 over all words for a specified topic.

Parameters:
topic - Topic identifier.
word - Word identifier.
Returns:
Probability of the specified word in the specified topic.

wordProbabilities

public double[] wordProbabilities(int topic)
Returns an array of the probabilities of words in the specified topic. The probabilities are indexed by word identifier.

The returned result is a copy of the underlying data in the model so that changing it will not change the model.

Parameters:
topic - Topic identifier.
Returns:
Array of probabilities of words in the specified topic.

sampleTopics

public short[][] sampleTopics(int[] tokens,
                              int numSamples,
                              int burnin,
                              int sampleLag,
                              Random random)
Returns the specified number of Gibbs samples of topics for the specified tokens using the specified number of burnin epochs, the specified lag between samples, and the specified randomizer. The array returned is an array of samples, each sample consisting of a topic assignment to each token in the specified list of tokens. The tokens must all be in the appropriate range of word identifiers for this model.

See the class documentation for more information on how the samples are computed.

Parameters:
tokens - The tokens making up the document.
numSamples - Number of Gibbs samples to return.
burnin - The number of samples to take and throw away during the burnin period.
sampleLag - The interval between samples after burnin.
random - The random number generator to use for this sampling process.
Returns:
The selection of topic samples generated by this sampler.
Throws:
IndexOutOfBoundsException - If there are tokens whose value is less than zero, or whose value is greater than or equal to the number of words in this model.
IllegalArgumentException - If the number of samples is not positive, the sample lag is not positive, or if the burnin period is negative.

mapTopicEstimate

@Deprecated
public double[] mapTopicEstimate(int[] tokens,
                                 int numSamples,
                                 int burnin,
                                 int sampleLag,
                                 Random random)
Deprecated. Use bayesTopicEstimate(int[],int,int,int,Random) instead.

Replaced by method bayesTopicEstimate() because of original misnaming.

Warning: This is actually not a maximum a posteriori (MAP) estimate as suggested by the name.

Parameters:
tokens - The tokens making up the document.
numSamples - Number of Gibbs samples to return.
burnin - The number of samples to take and throw away during the burnin period.
sampleLag - The interval between samples after burnin.
random - The random number generator to use for this sampling process.
Returns:
An estimate of the topic distribution for the specified tokens.
Throws:
IndexOutOfBoundsException - If there are tokens whose value is less than zero, or whose value is greater than or equal to the number of words in this model.
IllegalArgumentException - If the number of samples is not positive, the sample lag is not positive, or if the burnin period is negative.

bayesTopicEstimate

public double[] bayesTopicEstimate(int[] tokens,
                                   int numSamples,
                                   int burnin,
                                   int sampleLag,
                                   Random random)
Return the Bayesian point estimate of the topic distribution for a document consisting of the specified tokens, using Gibbs sampling with the specified parameters. The Gibbs topic samples are simply averaged to produce the Bayesian estimate, which minimizes expected square loss.

See the method sampleTopics(int[],int,int,int,Random) and the class documentation for more information on the sampling procedure.

Parameters:
tokens - The tokens making up the document.
numSamples - Number of Gibbs samples to return.
burnin - The number of samples to take and throw away during the burnin period.
sampleLag - The interval between samples after burnin.
random - The random number generator to use for this sampling process.
Returns:
The Bayesian point estimate of the topic distribution for the specified tokens.
Throws:
IndexOutOfBoundsException - If there are tokens whose value is less than zero, or whose value is greater than or equal to the number of words in this model.
IllegalArgumentException - If the number of samples is not positive, the sample lag is not positive, or if the burnin period is negative.

gibbsSampler

public static LatentDirichletAllocation.GibbsSample gibbsSampler(int[][] docWords,
                                                                 short numTopics,
                                                                 double docTopicPrior,
                                                                 double topicWordPrior,
                                                                 int burninEpochs,
                                                                 int sampleLag,
                                                                 int numSamples,
                                                                 Random random,
                                                                 ObjectHandler<LatentDirichletAllocation.GibbsSample> handler)
Run Gibbs sampling for the specified multinomial data, number of topics, priors, search parameters, randomization and callback sample handler. Gibbs sampling provides samples from the posterior distribution of topic assignments given the corpus and prior hyperparameters. A sample is encapsulated as an instance of class LatentDirichletAllocation.GibbsSample. This method will return the final sample and also send intermediate samples to an optional handler.

The class documentation above explains Gibbs sampling for LDA as used in this method.

The primary input is an array of documents, where each document is represented as an array of integers representing the tokens that appear in it. These tokens should be numbered contiguously from 0 for space efficiency. The topic assignments in the Gibbs sample are aligned as parallel arrays to the array of documents.

The next three parameters are the hyperparameters of the model, specifically the number of topics, the prior count assigned to topics in a document, and the prior count assigned to words in topics. A rule of thumb for the document-topic prior is to set it to 5 divided by the number of topics (or less if there are very few topics; 0.1 is typically the maximum value used). A good general value for the topic-word prior is 0.01. Both of these priors will be diffuse and tend to lead to skewed posterior distributions.

The following three parameters specify how the sampling is to be done. First, the sampler is "burned in" for a number of epochs specified by the burnin parameter. After burn-in, samples are taken after a fixed number of epochs to avoid correlation between samples; the sampling frequency is specified by the sample lag. Finally, the number of samples to be taken is specified. For instance, if the burnin is 1000, the sample lag is 250, and the number of samples is 5, then samples are taken after 1000, 1250, 1500, 1750 and 2000 epochs. If a non-null handler object is specified in the method call, its handle(GibbsSample) method is called with each of the samples produced as above.

The final sample in the chain of samples is returned as the result. Note that this sample will also have been passed to the specified handler as the last sample for the handler.

A random number generator must be supplied as an argument. This may just be a new instance of Random or a custom extension. It is used for all randomization in this method.
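
The following sketch shows a typical call; the parameter values are illustrative, docWords is a tokenized corpus as produced by tokenizeDocuments, and ObjectHandler is assumed to be com.aliasi.corpus.ObjectHandler.

 short numTopics = 50;
 double docTopicPrior = 0.1;        // rule of thumb: 5/numTopics, at most 0.1
 double topicWordPrior = 0.01;
 int burninEpochs = 1000;
 int sampleLag = 250;
 int numSamples = 5;
 Random random = new Random(42L);
 ObjectHandler<LatentDirichletAllocation.GibbsSample> handler
     = new ObjectHandler<LatentDirichletAllocation.GibbsSample>() {
           public void handle(LatentDirichletAllocation.GibbsSample sample) {
               // inspect, score, or store the intermediate sample here
           }
       };
 LatentDirichletAllocation.GibbsSample finalSample
     = LatentDirichletAllocation.gibbsSampler(docWords, numTopics,
                                              docTopicPrior, topicWordPrior,
                                              burninEpochs, sampleLag, numSamples,
                                              random, handler);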

Parameters:
docWords - Corpus of documents to be processed.
numTopics - Number of latent topics to generate.
docTopicPrior - Prior count of topics in a document.
topicWordPrior - Prior count of words in a topic.
burninEpochs - Number of epochs to run before taking a sample.
sampleLag - Frequency between samples.
numSamples - Number of samples to take before exiting.
random - Random number generator.
handler - Handler to which the samples are sent.
Returns:
The final Gibbs sample.

gibbsSample

public static Iterator<LatentDirichletAllocation.GibbsSample> gibbsSample(int[][] docWords,
                                                                          short numTopics,
                                                                          double docTopicPrior,
                                                                          double topicWordPrior,
                                                                          Random random)
Return an iterator over Gibbs samples for the specified document-word corpus, number of topics, priors and randomizer. These are the same Gibbs samples as would be produced by the method gibbsSampler(int[][],short,double,double,int,int,int,Random,ObjectHandler). See that method and the class documentation for more details.
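
A usage sketch, with the caller applying its own burn-in and lag policy; the parameter values are illustrative and docWords is a tokenized corpus.

 Iterator<LatentDirichletAllocation.GibbsSample> it
     = LatentDirichletAllocation.gibbsSample(docWords, (short) 50,
                                             0.1, 0.01, new Random(42L));
 for (int epoch = 0; epoch < 2000; ++epoch) {
     LatentDirichletAllocation.GibbsSample sample = it.next();
     if (epoch >= 1000 && epoch % 250 == 0) {
         // a sample taken after burn-in at the chosen lag; use it here
     }
 }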

Parameters:
docWords - Corpus of documents to be processed.
numTopics - Number of latent topics to generate.
docTopicPrior - Prior count of topics in a document.
topicWordPrior - Prior count of words in a topic.
random - Random number generator.

tokenizeDocuments

public static int[][] tokenizeDocuments(CharSequence[] texts,
                                        TokenizerFactory tokenizerFactory,
                                        SymbolTable symbolTable,
                                        int minCount)
Tokenize an array of text documents represented as character sequences into a form usable by LDA, using the specified tokenizer factory and symbol table. The symbol table should be constructed fresh for this application, but may be used after this method is called for further token-to-symbol conversions. Only tokens whose corpus count is equal to or greater than the specified minimum count are included; only these tokens are added to the symbol table, producing a compact set of symbol assignments to tokens for downstream processing.

Warning: With some tokenizer factories and/or minimum count thresholds, there may be documents with no tokens in them.

Parameters:
texts - The text corpus.
tokenizerFactory - A tokenizer factory for tokenizing the texts.
symbolTable - Symbol table used to convert tokens to identifiers.
minCount - Minimum count for a token to be included in a document's representation.
Returns:
The tokenized form of the documents, suitable as input to LDA.

tokenizeDocument

public static int[] tokenizeDocument(CharSequence text,
                                     TokenizerFactory tokenizerFactory,
                                     SymbolTable symbolTable)
Tokenizes the specified text document using the specified tokenizer factory returning only tokens that exist in the symbol table. This method is useful within a given LDA model for tokenizing new documents into lists of words.

Parameters:
text - Character sequence to tokenize.
tokenizerFactory - Tokenizer factory for tokenization.
symbolTable - Symbol table to use for converting tokens to symbols.
Returns:
The array of integer symbols for tokens that exist in the symbol table.