Topic model

Statistical model


In statistics and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear approximately equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words. The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is.
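The mixing intuition above can be illustrated with a minimal generative sketch. The topic names, word lists, and probabilities below are made-up assumptions for illustration, not values from any fitted model: each word is drawn by first picking a topic according to the document's topic proportions, then picking a word from that topic's distribution.

```python
import random
from collections import Counter

# Hypothetical per-topic word distributions (illustrative numbers only).
TOPICS = {
    "dogs": {"dog": 0.4, "bone": 0.3, "the": 0.15, "is": 0.15},
    "cats": {"cat": 0.4, "meow": 0.3, "the": 0.15, "is": 0.15},
}


def generate_document(topic_mix, n_words, rng):
    """Draw each word by first sampling a topic, then a word from that topic."""
    topic_names = list(topic_mix)
    topic_weights = [topic_mix[t] for t in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(TOPICS[topic])
        probs = [TOPICS[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs, k=1)[0])
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
doc = generate_document({"dogs": 0.9, "cats": 0.1}, n_words=20000, rng=rng)
counts = Counter(doc)
dog_words = counts["dog"] + counts["bone"]
cat_words = counts["cat"] + counts["meow"]
ratio = dog_words / cat_words
print(f"dog words: {dog_words}, cat words: {cat_words}, ratio: {ratio:.1f}")
```

For a long enough document the ratio of dog words to cat words should land near 9, while "the" and "is" appear at about the same rate regardless of the mix. Fitting a topic model is the inverse problem: recovering the topic distributions and the per-document proportions from the observed words alone.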

Topic models are also called probabilistic topic models, a term for statistical algorithms that discover the latent semantic structures of an extensive text body. In the age of information, the amount of written material we encounter each day is beyond our processing capacity. Topic models can help organize large collections of unstructured text and offer insights into them. Originally developed as a text-mining tool, topic models have also been used to detect instructive structures in data such as genetic information, images, and networks, with applications in fields such as bioinformatics and computer vision.

History

An early topic model was described by Papadimitriou, Raghavan, Tamaki and Vempala in 1998.

Topic models for context information

Approaches for temporal information include Block and Newman's determination of the temporal dynamics of topics in the Pennsylvania Gazette during 1728–1800. Griffiths & Steyvers used topic modeling on abstracts from the journal PNAS to identify topics that rose or fell in popularity from 1991 to 2001, whereas Lamba & Madhusudhan used topic modeling on full-text research articles retrieved from the DESIDOC Journal of Library and Information Technology (DJLIT) from 1981 to 2018. In the field of library and information science, Lamba & Madhusudhan applied topic modeling to different Indian resources such as journal articles and electronic theses and dissertations (ETDs). Nelson has been analyzing change in topics over time in the Richmond Times-Dispatch to understand social and political changes and continuities in Richmond during the American Civil War. Yang, Torget and Mihalcea applied topic modeling methods to newspapers from 1829 to 2008. Mimno used topic modeling with 24 journals on classical philology and archaeology spanning 150 years to look at how topics in the journals change over time and how the journals become more different or similar over time.
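Studies of this kind typically reduce to tracking how much of each period's text is assigned to each topic. The following is a minimal sketch under stated assumptions: the years, topic names, and proportions are invented, and the per-document topic proportions are assumed to come from an already-fitted topic model. It does not reproduce the method of any study named above.

```python
from collections import defaultdict

# Hypothetical input: (publication year, per-topic proportions) for each document.
documents = [
    (1991, {"genetics": 0.7, "ecology": 0.3}),
    (1991, {"genetics": 0.5, "ecology": 0.5}),
    (2001, {"genetics": 0.2, "ecology": 0.8}),
    (2001, {"genetics": 0.4, "ecology": 0.6}),
]


def topic_share_by_year(docs):
    """Average each topic's proportion over all documents from the same year."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for year, mix in docs:
        counts[year] += 1
        for topic, share in mix.items():
            totals[year][topic] += share
    return {
        year: {topic: total / counts[year] for topic, total in topics.items()}
        for year, topics in totals.items()
    }


shares = topic_share_by_year(documents)
for year in sorted(shares):
    print(year, {topic: round(share, 2) for topic, share in shares[year].items()})
```

A topic whose average share grows from one period to the next is rising in prevalence; one whose share shrinks is falling.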

Yin et al. introduced a topic model for geographically distributed documents, where document positions are explained by latent regions which are detected during inference.

Chang and Blei included network information between linked documents in the relational topic model, to model the links between websites.

The author-topic model by Rosen-Zvi et al. models the topics associated with authors of documents to improve the topic detection for documents with authorship information.

Hierarchical latent tree analysis (HLTA) was applied to a collection of recent research papers published at major AI and machine learning venues. The resulting model is called The AI Tree. Its topics are used to index the papers at aipano.cse.ust.hk, to help researchers track research trends and identify papers to read, and to help conference organizers and journal editors identify reviewers for submissions.

To improve the quality and coherence of generated topics, some researchers have explored "coherence scores", which measure how well computer-extracted clusters (i.e. topics) align with a human benchmark. Coherence scores are also used as metrics for choosing the number of topics to extract from a document corpus.
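One family of coherence measures scores a topic by how often its top words co-occur in the same documents. The sketch below uses average normalized pointwise mutual information (NPMI) over word pairs. This is an illustrative choice, not necessarily the metric used in the works cited here, and the function name, toy corpus, and edge-case conventions are assumptions.

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Average NPMI over all pairs of topic words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # words never co-occur: lowest NPMI value
        elif p12 == 1:
            scores.append(0.0)  # assumption: words found in every document carry no signal
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy corpus: each document is a list of tokens.
corpus = [
    ["dog", "bone", "bark"],
    ["dog", "bone", "walk"],
    ["dog", "bark", "walk"],
    ["cat", "meow", "purr"],
    ["cat", "meow", "nap"],
    ["cat", "purr", "nap"],
]
coherent_topic = ["dog", "bone", "bark"]
mixed_topic = ["dog", "meow", "nap"]

print("coherent:", round(npmi_coherence(coherent_topic, corpus), 3))
print("mixed:   ", round(npmi_coherence(mixed_topic, corpus), 3))
```

A topic whose words tend to appear together scores higher than one that mixes unrelated words. When choosing the number of topics, one option is to fit models with several candidate values and compare the average coherence of their topics.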

Algorithms

In practice, researchers attempt to fit appropriate model parameters to the data corpus using one of several heuristics for maximum likelihood fit. A survey by D. Blei describes this suite of algorithms. Several groups of researchers, starting with Papadimitriou et al., have attempted to design algorithms with provable guarantees. Assuming that the data were actually generated by the model in question, they try to design algorithms that provably find the model that was used to create the data. Techniques used here include singular value decomposition (SVD) and the method of moments. In 2012, an algorithm based on non-negative matrix factorization (NMF) was introduced that also generalizes to topic models with correlations among topics.
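To make the matrix-factorization view concrete, the sketch below factors a small document-term count matrix V into a non-negative document-topic matrix W and topic-term matrix H, so that V ≈ WH. It uses the classic multiplicative-update heuristic for NMF, not the provable 2012 algorithm mentioned above. The helper names, the toy matrix, and the choice of two topics are assumptions for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V (docs x terms) as W (docs x k) times H (k x terms)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h


vocab = ["dog", "bone", "cat", "meow", "the"]
doc_term = [  # rows are documents, columns are word counts in `vocab` order
    [4, 3, 0, 0, 2],
    [5, 2, 0, 0, 3],
    [0, 0, 4, 3, 2],
    [0, 0, 3, 5, 3],
    [2, 1, 2, 2, 2],
]

w, h = nmf(doc_term, k=2)
for t, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"topic {t}:", [word for _, word in top])
print("reconstruction error:", round(frobenius_error(doc_term, w, h), 3))
```

Each row of H can be read as a topic (weights over words) and each row of W as a document's mix of topics. The multiplicative updates keep every entry non-negative, but they only find a local optimum and carry none of the recovery guarantees discussed above.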

Since 2017, neural networks have been leveraged in topic modeling to improve the speed of inference, leading to further advancements such as vONTSS, which allows humans to incorporate domain knowledge via weakly supervised learning.

In 2018, a new approach to topic models was proposed based on the stochastic block model.

Topic modeling has also leveraged large language models (LLMs) through contextual embeddings and fine-tuning.

Applications of topic models

To quantitative biomedicine

Topic models are also used in other contexts. For example, uses of topic models have emerged in biology and bioinformatics research. Recently, topic models have been used to extract information from datasets of cancer genomic samples. In this case, topics are biological latent variables to be inferred.

To analysis of music and creativity

Topic models can be used to analyze continuous signals such as music. For instance, they have been used to quantify how musical styles change over time and to identify the influence of specific artists on later music creation.

References

  1. (April 2012). "Probabilistic Topic Models". Communications of the ACM.
  2. Cao, Liangliang, and Li Fei-Fei. (2007). "Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes". 2007 IEEE 11th International Conference on Computer Vision. IEEE. http://www.ifp.illinois.edu/~cao4/papers/CaoFei-Fei_ICCV2007_final.pdf
  3. Lamba, Manika. (June 2019). "Mapping of topics in DESIDOC Journal of Library and Information Technology, India: a study". Scientometrics.
  4. Lamba, Manika. (June 2019). "Metadata Tagging and Prediction Modeling: Case Study of DESIDOC Journal of Library and Information Technology (2008-2017)". World Digital Libraries.
  5. Lamba, Manika. (May 2019). "Author-Topic Modeling of DESIDOC Journal of Library and Information Technology (2008-2017), India". Library Philosophy and Practice.
  6. Lamba, Manika. (September 2018). "Metadata Tagging of Library and Information Science Theses: Shodhganga (2013-2017)".
  7. "Mining the Dispatch". Digital Scholarship Lab, University of Richmond.
  8. Yin, Zhijun. (2011). "Proceedings of the 20th international conference on World wide web".
  9. Chang, Jonathan. (2009). "Relational Topic Models for Document Networks". Aistats.
  10. Rosen-Zvi, Michal. (2004). "The author-topic model for authors and documents". Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence.
  11. Nikolenko, Sergey. (2017). "Topic modelling for qualitative studies". Journal of Information Science.
  12. Reverter-Rambaldi, Marcel. (2022). "Topic Modelling in Spontaneous Speech Data". Australian National University.
  13. Newman, David. (2010). "Automatic evaluation of topic coherence". Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics.
  14. (2017). "Discovering Discrete Latent Topics with Neural Variational Inference". PMLR.
  15. (2023). "vONTSS: vMF based semi-supervised neural topic modeling with optimal transport". Association for Computational Linguistics.
  16. (2021). "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)". Association for Computational Linguistics.
  17. (2023). "DeTiME: Diffusion-Enhanced Topic Modeling using Encoder-decoder based LLM". Association for Computational Linguistics.
  18. (2013-05-13). "Modeling Musical Influence with Topic Models". PMLR.

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
