Organizing Knowledge Using Topic Models

Last modified by Michael Hamann on 2023/02/03 17:20

Estimated workload

350 hours (Large size project)




A topic model can be used to discover the topics in a collection of documents. The idea is that a document consists of different topics or, more precisely, that its words are drawn from different topics. A topic, in turn, is a collection of words with a probability distribution over them, so that different words can have different importance within the topic. Analyzing the topics covered by the documents of a wiki can help organize the wiki, e.g., by assigning tags to the documents of a topic or by grouping the documents of a topic in a space (a concept in XWiki that is similar to a directory in a filesystem). You can read more about how topic discovery can be applied to knowledge management in this article.
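To make the "words are drawn from different topics" idea concrete, here is a minimal sketch in plain Java. The topic names, words, and probabilities are invented for illustration and do not come from any real analysis. It computes how likely a word is in a document as the sum, over all topics, of the topic's weight in the document times the word's probability in that topic.

```java
import java.util.List;
import java.util.Map;

/** Toy illustration of the mixture view behind topic models (all numbers are made up). */
class TopicMixtureSketch {

    /**
     * Probability of a word in a document:
     * sum over topics k of P(topic k | document) * P(word | topic k).
     * Words that a topic does not contain contribute probability 0 for that topic.
     */
    static double wordProbability(String word, double[] docTopicWeights,
                                  List<Map<String, Double>> topics) {
        double p = 0.0;
        for (int k = 0; k < topics.size(); k++) {
            p += docTopicWeights[k] * topics.get(k).getOrDefault(word, 0.0);
        }
        return p;
    }

    public static void main(String[] args) {
        // Two hypothetical topics, each a probability distribution over words.
        Map<String, Double> installation = Map.of("install", 0.5, "server", 0.3, "database", 0.2);
        Map<String, Double> editing = Map.of("page", 0.5, "editor", 0.3, "macro", 0.2);
        List<Map<String, Double>> topics = List.of(installation, editing);

        // A hypothetical document that is 70% about installation and 30% about editing.
        double[] docTopicWeights = {0.7, 0.3};

        for (String word : List.of("install", "page", "macro")) {
            System.out.println(word + ": " + wordProbability(word, docTopicWeights, topics));
        }
    }
}
```

A topic model such as LDA works in the opposite direction: it only sees the words of the documents and estimates both the topics and the per-document topic weights.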

The idea of this project is to integrate into XWiki an existing Java library that implements a topic model such as LDA (latent Dirichlet allocation). Since the analysis will take quite some time, in particular for large wikis, the idea would be to run it as a background job that writes the result (e.g., to some document, details to be discussed) and then to have a UI for inspecting the result. This UI should offer various options such as assigning a tag to all documents of a topic or moving all documents of a topic to a space.
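One possible way to get from the analysis result to actions such as "tag all documents of a topic" is to assign each document to its highest-weighted topic and then group the documents by that topic. The following is only a sketch in plain Java under assumptions: the shape of the result (a map from a document name to an array of topic weights) and the document names are hypothetical, and the actual tagging or moving through XWiki's API is not shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Groups documents by their dominant topic (assumed result format, not an XWiki API). */
class DominantTopicGrouping {

    /** Returns the index of the largest weight; ties go to the lowest index. */
    static int dominantTopic(double[] weights) {
        if (weights.length == 0) {
            throw new IllegalArgumentException("A document needs at least one topic weight");
        }
        int best = 0;
        for (int k = 1; k < weights.length; k++) {
            if (weights[k] > weights[best]) {
                best = k;
            }
        }
        return best;
    }

    /** Maps each topic index to the documents for which it is the dominant topic. */
    static Map<Integer, List<String>> groupByDominantTopic(Map<String, double[]> docTopicWeights) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (Map.Entry<String, double[]> entry : docTopicWeights.entrySet()) {
            int topic = dominantTopic(entry.getValue());
            groups.computeIfAbsent(topic, k -> new ArrayList<>()).add(entry.getKey());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Hypothetical analysis result: document name -> weight of each topic in that document.
        Map<String, double[]> result = new LinkedHashMap<>();
        result.put("Docs.Installation", new double[] {0.8, 0.2});
        result.put("Docs.Upgrade", new double[] {0.6, 0.4});
        result.put("Docs.EditingPages", new double[] {0.1, 0.9});

        // A UI could list these groups and offer "tag" or "move to space" per group.
        groupByDominantTopic(result).forEach(
            (topic, docs) -> System.out.println("Topic " + topic + ": " + docs));
    }
}
```

Using only the dominant topic is a simplification. A document that mixes several topics evenly might be better served by a weight threshold, which is one of the details a proposal could discuss.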

As a task to get started, you should find a library that implements LDA or a similar topic model and create a prototype with a macro in Java that takes a list of documents, gets their text content (find out how to render to plain text or just take the text without rendering) and then feeds these documents into the library. It should then display the result of the analysis in textual form. This doesn't need to be clean and nice, but it should show us that you're able to work with XWiki's Java API and that the library you chose is working as expected.
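Between getting the plain text and feeding it to the library, the text usually has to be turned into words. Whether you need to do this yourself depends on the library you choose, as some libraries bring their own tokenization. As a rough sketch of that step in plain Java, the following counts the words of a plain-text string. The stop word list and the minimum word length are arbitrary assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Turns plain text into word counts, a typical input form for topic models. */
class BagOfWordsSketch {

    // Tiny illustrative stop word list; a real one would be larger and language-specific.
    private static final Set<String> STOP_WORDS =
        Set.of("an", "and", "the", "of", "to", "in", "is");

    static Map<String, Integer> countWords(String plainText) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Lowercase the text and split on every run of characters that are not Unicode letters.
        for (String token : plainText.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            // Skip empty tokens, single letters and stop words.
            if (token.length() < 2 || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(countWords("The wiki is a wiki of pages."));
    }
}
```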

In your proposal, you should detail how you want to transform this prototype into a background job for the analysis and a UI for presenting the result. Depending on the details, the implementation could become a lot of work. One suggestion would be to keep the UI parts as simple as possible for now, as advanced users could always write their own scripts to further process the results.

If you like this idea of using machine learning to organize the content of the wiki but don't want to use topic models, have a look at the related project about AI-based tagging of pages.

Developer profile

Web developer who knows Java and ideally also has some background in natural language processing.




