jonaschn/awesome-topic-models

Awesome Topic Models

A curated list of amazing topic modelling libraries.

Contents

Libraries & Toolkits

  • gensim - Python library for topic modelling
  • scikit-learn - Python library for machine learning
  • tomotopy - Python extension for tomoto, a Gibbs-sampling-based topic modeling library written in C++
  • tomoto - Ruby extension for tomoto, a Gibbs-sampling-based topic modeling library written in C++
  • OCTIS - Python package to integrate, optimize and evaluate topic models
  • tmtoolkit - Python topic modeling toolkit with parallel processing power
  • Mallet - Java-based package for topic modeling
  • TopicModel4J - Java-based package for topic modeling
  • BIDMach - CPU and GPU-accelerated machine learning library
  • BigARTM - Fast topic modeling platform
  • TopicNet - A high-level Python interface for BigARTM library
  • stm - R package for the Structural Topic Model
  • RMallet - R package to interface with the Java machine learning tool MALLET
  • R-lda - R package for topic modelling (LDA, sLDA, corrLDA, etc.)
  • topicmodels - R package with interface to C code for LDA and CTM
  • lda++ - C++ library for LDA and (fast) supervised LDA (sLDA/fsLDA) using variational inference

Models

Implementations differ widely in performance and scalability, and in their support for advanced features such as hyperparameter tuning or evaluation.

Truncated Singular Value Decomposition (SVD) / Latent Semantic Analysis (LSA) / Latent Semantic Indexing (LSI)

Non-Negative Matrix Factorization (NMF or NNMF)

Latent Dirichlet Allocation (LDA) 📄

  • scikit-learn - Python implementation using online variational Bayes inference 📄
  • lda - Python implementation using collapsed Gibbs sampling which follows scikit-learn interface 📄
  • lda-gensim - Python implementation using online variational inference 📄
  • ldamulticore-gensim - Parallelized Python implementation using online variational inference 📄
  • GibbsSamplingLDA-TopicModel4J - Java implementation using collapsed Gibbs sampling 📄
  • CVBLDA-TopicModel4J - Java implementation using collapsed variational Bayesian (CVB) inference 📄
  • Mallet - Parallelized Java implementation using Gibbs sampling 📄📄
  • gensim-wrapper-Mallet - Python wrapper for Mallet's implementation 📄📄
  • PartiallyCollapsedLDA - Various fast parallelized samplers for LDA, including Partially Collapsed LDA, LightLDA, Partially Collapsed Light LDA and a very efficient Polya-Urn LDA
  • Vowpal Wabbit - C++ implementation using online variational Bayes inference 📄
  • tomotopy - Python binding for C++ implementation using Gibbs sampling and different term-weighting options 📄
  • topicmodel-lib - Cython library for online/streaming LDA (Online VB, Online CVB0, Online CGS, Online OPE, Online FW, Streaming VB, Streaming OPE, Streaming FW, ML-OPE, ML-CGS, ML-FW)
  • jsLDA - JavaScript implementation of LDA topic modeling in the browser
  • lda-nodejs - Node.js implementation of LDA topic modeling
  • lda-purescript - PureScript, browser-based implementation of LDA topic modeling
  • TopicModels.jl - Julia implementation of LDA
  • turicreate - C++ LDA and aliasLDA implementation with export to Apple's Core ML for use in iOS, macOS, watchOS, and tvOS apps
  • MeTA - C++ implementation of (parallel) collapsed Gibbs sampling, CVB0 and SCVB
  • Fugue - Java implementation of collapsed Gibbs sampling with slice sampling for hyper-parameter optimization
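
Many of the implementations above are described as using collapsed Gibbs sampling. As a rough illustration of that idea only (not the code of any listed library), here is a minimal pure-Python sketch. The function name `gibbs_lda`, its parameters, the toy corpus and the default values for `alpha` and `beta` are all assumptions made for this example; real libraries add sparsity tricks, parallelism and hyperparameter optimization.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    # Count tables: document-topic, topic-word, and tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional p(z = k | all other assignments).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                # Resample the topic and add the token back to the counts.
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed point estimates of the topic-word distributions.
    phi = [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return phi, doc_topic


# Toy corpus (hypothetical): word ids 0-2 and 3-5 never share a document.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
phi, doc_topic = gibbs_lda(docs, num_topics=2, vocab_size=6)
```

Each row of `phi` is a probability distribution over the vocabulary for one topic, and each row of `doc_topic` holds the per-topic token counts of one document. The final sample is used as a point estimate here for brevity; averaging over several samples after a burn-in period is a common refinement.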

Hyperparameter optimization

  • GA-LDA - R scripts using Genetic Algorithms (GA) for hyper-parameter optimization, based on Panichella 📄
  • Search-Based-LDA - R scripts using Genetic Algorithms (GA) for hyper-parameter optimization by Panichella 📄
  • Dodge - Python tuning tool that ignores redundant tunings 📄
  • LDADE - Python tuning tool using differential evolution 📄
  • ldatuning - R package to find optimal number of topics for LDA 📄
  • Scalable - Scalable Hyperparameter Selection for LDA 📄

Evaluation

CPU-based high performance implementations

  • LDA* - Tencent's hybrid sampler that uses different samplers for different types of documents in combination with an asymmetric parameter server 📄
  • FastLDA - C++ implementation of LDA 📄
  • dmlc - Single- and multi-threaded C++ implementations of lightLDA, F+LDA, AliasLDA, forestLDA and many more
  • SparseLDA - Java algorithm and data structure for evaluating Gibbs sampling distributions used in Mallet 📄
  • warpLDA - C++ cache efficient LDA implementation which samples each token in O(1) 📄
  • lightLDA - C++ implementation using O(1) Metropolis-Hastings sampling 📄
  • F+LDA - C++ implementation of F+LDA using an appropriately modified Fenwick tree 📄
  • AliasLDA - C++ implementation using Metropolis-Hastings and alias method 📄
  • Yahoo-LDA - Yahoo!'s topic modelling framework 📄
  • PLDA+ - Google's C++ implementation using data placement and pipeline processing 📄
  • Familia - A toolkit for industrial topic modeling (LDA, SentenceLDA and Topical Word Embedding) ⚠️ 📄

GPU-based high performance implementations

  • SaberLDA - GPU-based system that implements a sparsity-aware algorithm to achieve sublinear time complexity
  • GS-LDA-BIDMach - CPU and GPU-accelerated Scala implementation using Gibbs sampling
  • VB-LDA-BIDMach - CPU and GPU-accelerated Scala implementation using online variational Bayes inference

Hierarchical Dirichlet Process (HDP) 📄

  • gensim - Python implementation using online variational inference 📄
  • tomotopy - Python extension for C++ implementation using Gibbs sampling 📄
  • Mallet - Java-based package for topic modeling using Gibbs sampling
  • TopicModel4J - Java implementation using Gibbs sampling based on Chinese restaurant franchise metaphor
  • hca - C implementation using Gibbs sampling with/without burstiness modelling
  • bnp - Cython reimplementation based on online-hdp following scikit-learn's API.
  • Scalable HDP - interesting paper

Hierarchical LDA (hLDA) 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling
  • Mallet - Java implementation using Gibbs sampling
  • hlda - Python package based on Mallet's Gibbs sampler having a fixed depth on the nCRP tree
  • hLDA - C implementation of hierarchical LDA by David Blei

Dynamic Topic Model (DTM) 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling based on FastDTM
  • FastDTM - Scalable C++ implementation using Gibbs sampling with Stochastic Gradient Langevin Dynamics (MCMC-based) 📄
  • ldaseqmodel-gensim - Python implementation using online variational inference 📄
  • dtm-BigTopicModel - C++ engine for running large-scale topic models
  • tca - C implementation using Gibbs sampling with/without burstiness modelling 📄
  • DETM - Python implementation of the Dynamic Embedded Topic Model 📄

Author-topic Model (ATM) 📄

Labeled Latent Dirichlet Allocation (LLDA, Labeled-LDA, L-LDA) 📄

Partially Labeled Dirichlet Allocation (PLDA) / Dirichlet Process (PLDP) 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling
  • TopicModel4J - Java implementation using collapsed Gibbs sampling
  • STMT - Scala implementation of PLDA & PLDP by Daniel Ramage

Dirichlet Multinomial Regression (DMR) topic model 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling
  • Mallet - Java-based package for topic modeling

Generalized Dirichlet Multinomial Regression (g-DMR) topic model 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling

Link LDA

  • PTM - implemented as benchmark 📄
  • TopicModel4J - Java implementation using collapsed Gibbs sampling

Correlated Topic Model (CTM) a.k.a. logistic-normal topic models

  • tomotopy - Python extension for C++ implementation using Gibbs sampling 📄
  • ctm-c - Original C implementation of the correlated topic model by David Blei 📄
  • BigTopicModel - C++ engine for running large-scale CTM 📄
  • stm - R package for the Structural Topic Model (CTM in case of no covariates) 📄

Relational Topic Model (RTM)

Supervised LDA (sLDA) 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling
  • R-lda - R implementation using collapsed Gibbs sampling
  • slda - Cython implementation of Gibbs sampling for LDA and various sLDA variants
    • supervised LDA (linear regression)
    • binary logistic supervised LDA (logistic regression)
    • binary logistic hierarchical supervised LDA (trees)
    • generalized relational topic models (graphs)
  • YWWTools - Java implementation using Gibbs sampling for LDA and various sLDA variants:
    • BS-LDA: Binary SLDA
    • Lex-WSB-BS-LDA: BS-LDA with Lexical Weights and Weighted Stochastic Block Priors
    • Lex-WSB-Med-LDA: Lex-WSB-BS-LDA with Hinge Loss
  • sLDA - C++ implementation of supervised topic models with a categorical response

Topic Models for short documents

Sentence-LDA / SentenceLDA / Sentence LDA 📄

  • TopicModel4J - Java implementation of Sentence-LDA using collapsed Gibbs sampling
  • Familia - Apply inference on pre-trained SentenceLDA models ⚠️ 📄

Dirichlet Multinomial Mixture Model (DMM) 📄

  • GPyM_TM - Python implementation of DMM and Poisson model
  • TopicModel4J - Java implementation using collapsed Gibbs sampling 📄
  • jLDADMM - Java implementation using collapsed Gibbs sampling 📄

Dirichlet Process Multinomial Mixture Model (DPMM)

Pseudo-document-based Topic Model (PTM) 📄

  • tomotopy - Python extension for C++ implementation using Gibbs sampling
  • TopicModel4J - Java implementation using collapsed Gibbs sampling

Biterm topic model (BTM)

  • TopicModel4J - Java implementation using collapsed Gibbs sampling
  • BTM - Original C++ implementation using collapsed Gibbs sampling 📄
  • BurstyBTM - Original C++ implementation of the Bursty BTM (BBTM) 📄
  • OnlineBTM - Original C++ implementation of online BTM (oBTM) and incremental BTM (iBTM) 📄
  • R-BTM - R package wrapping the C++ code from BTM

Others

  • STTM - Java implementation and evaluation of DMM, WNTM, PTM, ETM, GPU-DMM, GPU-DPMM, LF-DMM 📄
  • SATM - Java implementation of Self-Aggregation Topic Model 📄
  • shorttext - Python implementation of various algorithms for Short Text Mining

Miscellaneous topic models

  • trLDA - Python implementation of streaming LDA based on trust-regions 📄
  • Logistic LDA - TensorFlow implementation of Discriminative Topic Modeling with Logistic LDA 📄
  • EnsTop - Python implementation of ENSemble TOPic modelling with pLSA
  • Dual-Sparse Topic Model - implemented in TopicModel4J using collapsed variational Bayes inference 📄
  • Multi-Grain-LDA - MG-LDA implemented in tomotopy using collapsed Gibbs sampling 📄
  • lda++ - C++ library for LDA and (fast) supervised LDA (sLDA/fsLDA) using variational inference 📄 📄
  • discLDA - C++ implementation of discLDA based on GibbsLDA++ 📄
  • GuidedLDA - Python implementation that can be guided by setting some seed words per topic (using Gibbs sampling) 📄
  • seededLDA - R package that implements seeded-LDA for semi-supervised topic modeling
  • keyATM - R package for Keyword Assisted Topic Models.
  • hca - C implementation of non-parametric topic models (HDP, HPYP-LDA, etc.) with focus on hyperparameter tuning
  • BayesPA - Python interface for streaming implementation of MedLDA, maximum entropy discrimination LDA (max-margin supervised topic model) 📄
  • sailing-pmls - Parallel LDA and medLDA implementation
  • BigTopicModel - C++ engine for running large-scale MedLDA models 📄
  • DAPPER - Python implementation of Dynamic Author Persona (DAP) topic model 📄
  • ToT - Python implementation of Topics Over Time (A Non-Markov Continuous-Time Model of Topical Trends) 📄
  • MLTM - C implementation of multilabel topic model (MLTM) 📄
  • sequence-models - Java implementation of block HMM and the mixed membership Markov model (M4)
  • Entropy-Based Topic Modeling - Java implementation of Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections
  • ST-LDA - Single Topic LDA 📄
  • MTM - Java implementation of Multilingual Topic Model 📄
  • YWWTools - Java-based package for various topic models by Weiwei Yang

Exotic models

  • TEM - Topic Expertise Model 📄
  • PTM - Prescription Topic Model for Traditional Chinese Medicine Prescriptions 📄 (interesting benchmark models)
  • KGE-LDA - Knowledge Graph Embedding LDA 📄
  • LDA-SP - A Latent Dirichlet Allocation Method for Selectional Preferences 📄
  • LDA+FFT - LDA and FFTs (Fast and Frugal Trees) for better comprehensibility 📄

Embedding based Topic Models

  • BERTopic - Supports guided, (semi-)supervised, and dynamic topic modeling and visualization 📄
  • CTM - CTMs combine contextualized embeddings (e.g., BERT) with topic models
  • ETM - Embedded Topic Model 📄
  • D-ETM - Dynamic Embedded Topic Model 📄
  • ProdLDA - Original TensorFlow implementation of Autoencoding Variational Inference (AEVI) for Topic Models 📄
  • pytorch-ProdLDA - PyTorch implementation of ProdLDA 📄
  • CatE - Discriminative Topic Mining via Category-Name Guided Text Embedding 📄
  • Top2Vec - Python implementation that learns jointly embedded topic, document and word vectors 📄
  • lda2vec - Mixing Dirichlet topic models and word embeddings to make lda2vec 📄
  • lda2vec-pytorch - PyTorch implementation of lda2vec
  • G-LDA - Java implementation of Gaussian LDA using word embeddings 📄
  • MG-LDA - Python implementation of (Multi-lingual) Gaussian LDA 📄
  • MetaLDA - Java implementation using Gibbs sampling that leverages document metadata and word embeddings 📄
  • LFTM - Java implementation of latent feature topic models (improving LDA and DMM with word embeddings) 📄
  • CorEx - Recover latent factors with Correlation Explanation (CorEx) 📄
  • Anchored CorEx - Hierarchical Topic Modeling with Minimal Domain Knowledge 📄
  • Linear CorEx - Latent Factor Models Based on Linear Total CorEx 📄

Probabilistic Programming Languages (PPL) (a.k.a. Build your own Topic Model)

  • Stan - Platform for statistical modeling and high-performance statistical computation, e.g., LDA 📄
  • PyMC3 - Python package for Bayesian statistical modeling and probabilistic machine learning, e.g., LDA 📄
  • Turing.jl - Julia library for general-purpose probabilistic programming 📄
  • TFP - Probabilistic reasoning and statistical analysis in TensorFlow, e.g., LDA 📄
  • edward2 - Simple PPL with core utilities in the NumPy and TensorFlow ecosystem 📄
  • pyro - PPL built on PyTorch, e.g., prodLDA 📄
  • edward - A PPL built on TensorFlow, e.g., LDA 📄
  • ZhuSuan - A PPL for Bayesian deep learning and generative models, built on TensorFlow, e.g., LDA 📄

Research Implementations

  • lda-c - C implementation using variational EM by David Blei
  • sLDA - C++ implementation of supervised topic models with a categorical response.
  • onlineldavb - Python online variational Bayes implementation by Matthew Hoffman 📄
  • HDP - C++ implementation of hierarchical Dirichlet processes by Chong Wang
  • online-hdp - Python implementation of online hierarchical Dirichlet processes by Chong Wang
  • ctr - C++ implementation of collaborative topic models by Chong Wang
  • dtm - C implementation of dynamic topic models by David Blei & Sean Gerrish
  • ctm-c - C implementation of the correlated topic model by David Blei
  • diln - C implementation of Discrete Infinite Logistic Normal (with HDP option) by John Paisley
  • hLDA - C implementation of hierarchical LDA by David Blei
  • turbotopics - Python implementation that finds significant multiword phrases in topics by David Blei
  • Stanford Topic Modeling Toolbox - Scala implementation of LDA, labeledLDA, PLDA, PLDP by Daniel Ramage and Evan Rosen
  • LDAGibbs - Java implementation of LDA using Gibbs sampling by Liu Yang
  • Matlab Topic Modeling Toolbox - Matlab implementations of LDA, ATM, HMM-LDA, LDA-COL (Collocation) models by Mark Steyvers and Tom Griffiths
  • cvbLDA - Python C extension implementation of collapsed variational Bayesian inference for LDA
  • fast - A Fast And Scalable Topic-Modeling Toolbox (Fast-LDA, CVB0) by Arthur Asuncion and colleagues 📄

Popular Implementations (but not maintained anymore)

Learning Implementations (hopefully easy to understand)

  • topic_models - Python implementation of LSA, PLSA and LDA
  • Topic-Model - Python implementation of LDA, Labeled LDA, ATM, Temporal Author-Topic Model using Gibbs sampling

Visualizations

  • LDAvis - R package for interactive topic model visualization
  • pyLDAvis - Python library for interactive topic model visualization
  • scalaLDAvis - Scala port of pyLDAvis
  • dtmvisual - Python package for visualizing DTM (trained with gensim)
  • TMVE online - Online Django variant of topic model visualization engine (TMVE)
  • TMVE - Original topic model visualization engine (LDA trained with lda-c) 📄
  • topicmodel-lib - Python wrapper for TMVE for visualizing LDA (trained with topicmodel-lib)
  • wordcloud - Python package for visualizing topics via word_cloud
  • Mallet-GUI - GUI for creating and analyzing topic models produced by MALLET
  • TWiC - Topic Words in Context is a highly-interactive, browser-based visualization for MALLET topic models
  • dfr-browser - Explore Mallet's topic models of texts in a web browser
  • Termite - Explore topic models using term-topic matrix, group-in-a-box visualization or scatter plot.
  • Topics - Python library for topic modeling and visualization
  • TopicsExplorer - Explore your own text collection with a topic model – without prior knowledge 📄
  • topicApp - A Simple Shiny App for Topic Modeling
  • stminsights - A Shiny Application for Inspecting Structural Topic Models

Dirichlet hyperparameter optimization techniques

Resources

  • David Blei - David Blei's Homepage with introductory materials

Related awesome lists

Contribute

Contributions welcome! Read the contribution guidelines first.

License

CC0

To the extent possible under law, Jonathan Schneider has waived all copyright and related or neighboring rights to this work.