Part of my dissertation is based on the Latent Dirichlet Allocation (LDA) model, and I have searched extensively for, and played with, various LDA implementations.
Gibbs Sampling implementation of LDA in C++
The input is a single text file with one line per document, and each line contains the document's actual terms. I found this input format bulky. The code itself is very clean and easy to understand. Gibbs sampling is inherently a little slow, and nothing can be done about that.
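To see why Gibbs sampling is slow, here is a minimal sketch of a collapsed Gibbs sampler for LDA in Python. It is not the code from the C++ package; the function name, the default values and the use of integer word ids instead of raw terms are all assumptions made for illustration. The point is that every sweep resamples the topic of every single token in the corpus.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, eta=0.01, iters=50, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns the document-topic and topic-word count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Random initial topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1

                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + eta)
                    / (topic_total[t] + vocab_size * eta)
                    for t in range(num_topics)
                ]

                # Sample a new topic and add the token back.
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word
```

The cost of one sweep grows with the number of tokens times the number of topics, and many sweeps are usually needed, which is where the time goes.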
Gibbs Sampling Implementation of LDA in Java
Written by the same folks who wrote the C++ implementation. Apart from the programming language, everything else is the same.
Variational Inference of LDA in Matlab and C
This is easy-to-use code whose input format follows that of the SVMLight software: a text file in which each line contains a sequence of tuples of the form feature_id:count, where feature_id is the word_id from a dictionary and count is the number of times that word appears in the document. I have one comment on the implementation, though: I was hoping the algorithm would also estimate the \eta parameter, the smoothing parameter on \beta.
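A sparse representation like this can be built from tokenized documents roughly as follows. This is only a sketch: the function name to_sparse_lines is hypothetical, and it assumes each tuple is written as feature_id:count in the SVMLight style. The exact layout a given tool expects may differ in details, so check its README before feeding it these lines.

```python
from collections import Counter

def to_sparse_lines(docs):
    """Turn tokenized documents into 'feature_id:count' lines.

    docs: list of documents, each a list of string tokens.
    Returns (vocab, lines), where vocab maps each term to its word id.
    """
    vocab = {}
    lines = []
    for tokens in docs:
        counts = Counter()
        for tok in tokens:
            # Assign the next free id the first time a term is seen.
            word_id = vocab.setdefault(tok, len(vocab))
            counts[word_id] += 1
        # One line per document, tuples ordered by word id.
        lines.append(" ".join(f"{wid}:{c}" for wid, c in sorted(counts.items())))
    return vocab, lines
```

For example, the two documents ["a", "b", "a"] and ["b", "c"] become the lines "0:2 1:1" and "1:1 2:1". Compared with writing out the actual terms, each distinct word appears only once per document.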
MALLET toolbox
This is written in Java and accepts plain text or SVMLight format as input. It implements Gibbs sampling for LDA and allows optimization of the hyperparameters \alpha and \eta after the burn-in iterations. This gives us greater control over the impact of topics on the entire collection. Given that LDA itself is slow and that hyperparameter estimation has to wait until the burn-in iterations are done, I feel discouraged from using this tool.
Hybrid approach to LDA inference
This approach apparently combines variational inference and Gibbs sampling to learn the parameters. The author also provides a vanilla LDA implementation.
I will keep updating this post as I find more useful stuff.
What about Gensim? Have you tried it out?