30 Mar 2021

[Dissertation] Conversational Pattern Mining using Motif Detection

Master's dissertation by Nicolle Garber, Faculty of Engineering, Built Environment and Information Technology, University of Pretoria, Pretoria


Nicolle Garber, MITC Big Data Science


Dr. Vukosi Marivate


Conversational mining has become a subject of great interest due to the explosion in the consumption and generation of social and other related online media. Supplementing this are advances in pre-trained language models and other embedding techniques, which have helped us to leverage these important sources of information. Conversation is an interesting domain to analyze because of its complexity and its value in society. Complexity arises because a conversation can be asynchronous and can involve multiple parties. Additionally, it is computationally intensive to process. A lot of value can be derived from mining conversations, particularly in goal-oriented domains. We use unsupervised methods in our work to develop a conversational pattern mining technique. This avoids time-consuming, knowledge-demanding and resource-intensive labeling exercises. The aim of our work is to extract short segments of repeating ideas in conversations, also known as motifs. These can be useful for searching for particular conversational cues as well as for clustering, understanding and characterizing sets of conversations. The task of identifying repeating patterns in sequences is well researched in the field of Bioinformatics. In our work, we adapt this to the field of Natural Language Processing and make several extensions to a motif detection algorithm. We split the work into two major parts. The first is sequence creation, which entails converting sequences of text (conversations) into meaningful numeric sequences. The second is the emphasis of the work, in which we focus on the explanation of the motif detection algorithm, its adaptation and extension. In this research, we use the Gibbs Sampling algorithm. In order to use this algorithm in a language setting, we modify it to cater for vector-valued sequences. These sequences are representations of conversations.
They are constructed by grouping phrase-level embeddings via a community detection algorithm on a graph, in which relationships between phrases are edge weights. These groups of phrases form logical phrase abstractions, which we call phrase classes. In order to use phrase classes as a base for the motif detection algorithm, we reduced the dimensionality of the phrase classes using the UMAP algorithm. We demonstrated the application of the algorithm on a dynamic, real-world, open-source film script data set. In this study, we ran an exploratory investigation into the types of motifs we were able to mine.
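To make the motif detection step more concrete, here is a minimal sketch of the classic Gibbs sampling motif search from Bioinformatics, applied to sequences of discrete symbols. It is not the dissertation's method: the abstract describes an extension to vector-valued sequences, whereas this sketch assumes each conversation has already been reduced to a list of integer phrase-class IDs. The function names, parameters and toy data are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def build_profile(motifs, alphabet, pseudocount=1.0):
    """Column-wise symbol probabilities for equal-length motif instances."""
    width = len(motifs[0])
    total = len(motifs) + pseudocount * len(alphabet)
    profile = []
    for col in range(width):
        counts = Counter(m[col] for m in motifs)
        # Pseudocounts keep every probability strictly positive.
        profile.append({s: (counts[s] + pseudocount) / total for s in alphabet})
    return profile


def window_probability(window, profile):
    """Probability of a window under the position-specific profile."""
    p = 1.0
    for symbol, column in zip(window, profile):
        p *= column[symbol]
    return p


def gibbs_motif_search(sequences, width, iterations=500, seed=0):
    """Return one sampled motif start position per sequence.

    Assumes at least two sequences, each at least `width` symbols long.
    """
    rng = random.Random(seed)
    alphabet = sorted({s for seq in sequences for s in seq})
    # Start from a random window in every sequence.
    positions = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(iterations):
        # Hold one sequence out and build a profile from the others.
        i = rng.randrange(len(sequences))
        others = [
            seq[p:p + width]
            for j, (seq, p) in enumerate(zip(sequences, positions))
            if j != i
        ]
        profile = build_profile(others, alphabet)
        # Resample the held-out window in proportion to its profile fit.
        seq = sequences[i]
        weights = [
            window_probability(seq[s:s + width], profile)
            for s in range(len(seq) - width + 1)
        ]
        positions[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


# Toy "conversations" encoded as hypothetical phrase-class IDs.
conversations = [
    [3, 1, 4, 7, 7, 2, 5],
    [7, 7, 2, 0, 6, 1],
    [5, 0, 7, 7, 2, 3],
]
positions = gibbs_motif_search(conversations, width=3)
motifs = [seq[p:p + 3] for seq, p in zip(conversations, positions)]
```

Because the sampler is stochastic, a single run may settle on a weak motif. In practice it is usually run from several seeds, keeping the result whose windows agree best.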