Automated extraction of discussed topics using topic modelling
This post looks at an unsupervised approach to extracting discussed topics, creating a summary of what was discussed within our social media dataset. In our case it also helps us navigate the data in pursuit of understanding what misinformation may have been created in the lead-up to the 2021 South African Municipal Elections.
- Prior Posts
- What is topic modelling?
- LDA vs NMF
- Understanding the NMF Topic Modelling
- Resources and References
What is topic modelling?
Topic modelling and topic extraction together give us the ability to identify, isolate, and tally specific topics using a variety of approaches. To be specific, topic modelling is the analytics performed on text data by identifying the frequency and relationships of the text relative to a given concept. Topic extraction, on the other hand, is the ability to identify topics from text and label them accordingly. To do both of these, a few things are needed, starting with the data. In the previous blog posts we extracted and processed social media data using the Twint and Twarc tools in Python.
We also combined the data we extracted into one consolidated data frame. If you recall, a data frame is a useful way to categorise and structure information into a table with rows and columns. In this post, we are going to run some Python code on the data we have already extracted so that we can perform topic modelling and topic extraction. Specifically, we want to introduce the concept of Non-Negative Matrix Factorization (or NMF for short). NMF is an unsupervised technique, which means that the model is trained without any prior labelling of topics. Furthermore, NMF makes use of what we call factorisation of "high dimensional vectors" into "low dimensional vectors". This approach can work with a variety of variable types as the vectors, including things like the count of a specific word or a specific distance between associated words.
What is important about the approach we are taking is that the code can be put into an application so that topic modelling and topic extraction can take place in an automated manner.
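As a concrete illustration of the factorisation idea, here is a minimal sketch with scikit-learn on a toy corpus. The corpus and the number of topics are illustrative only, not the post's real data or parameters:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

corpus = [
    "register to vote in the local elections",
    "the mayoral candidate spoke about service delivery",
    "vote for change in your municipality",
    "service delivery protests continue in the city",
]

# High-dimensional document vectors (one TF-IDF weight per word)...
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# ...factorised into low-dimensional topic vectors: X is approximated by W @ H,
# where W is documents-by-topics and H is topics-by-words.
nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)  # document-topic weights
H = nmf.components_       # topic-word weights

print(W.shape, H.shape)
```

Because every entry of W and H is non-negative, each document is expressed as an additive mix of topics, which is what makes the topics interpretable.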
LDA vs NMF
We also provide a separate notebook that goes through using LDA for topic modelling, but as mentioned in the introduction, we want an unsupervised technique in which topics are modelled according to the way they are extracted and categorised. For this, some vital data preprocessing steps are needed. As mentioned in a previous post, these preprocessing steps make the data easier to use and analyse downstream.
Data Preprocessing
- We remove all URLs (web links).
- We remove any mentions (@username) from all of the Twitter social media posts.
- We remove hashtags.
- All words are changed to lower case.
- You can view our other blog post on discovering patterns in URLs.
- We limit ourselves to the top 20,000 occurring words in the full dataset.
- We only keep tweets that have more than 10 words.
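The steps above can be sketched as a small cleaning function. This is an illustration only; the exact regular expressions in our notebook may differ:

```python
import re

def clean_tweet(text: str) -> str:
    """Apply the preprocessing steps listed above (a sketch)."""
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    text = re.sub(r"@\w+", "", text)          # remove @mentions
    text = re.sub(r"#\w+", "", text)          # remove hashtags
    return text.lower().strip()               # lower case

tweet = "Go vote on 1 November! @IECSouthAfrica #LGE2021 https://example.com"
cleaned = clean_tweet(tweet)
print(cleaned)

# Keep only tweets with more than 10 words after cleaning.
long_enough = len(cleaned.split()) > 10
```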
As mentioned in a previous post, we need to look at how much of our data has more than 10 words. Below is a count of how many posts have more than 10 words (True) after the cleaning discussed earlier.
df['long_text'].value_counts()
Understanding the NMF Topic Modelling
Parameters
- 100 Topics
- 15 top words for visualisation
- Term frequency–inverse document frequency (TF-IDF) representation
- Nonnegative Double Singular Value Decomposition (NNDSVD) initialization
Look at the results
Here we show the top 15 words for the 100 topics. See if you can classify each topic given the words.
For example, we can give the topics labels. Here are a few:
- Topic 3: Go Vote!
- Topic 8: Vote EFF
- Topic 25: Vote ANC
- Topic 35: DA Mayoral Candidate for Johannesburg
These are just some examples. It is important to understand that with NMF, labelling and classification happen after the modelling process. Take a look and see if you can interpret the topics below.
print("NMF Topics - TFIDF")
df_topics_nmf_tfidf = display_topics_wrapper()
print("--------------")
Zoom out - How are the topics distributed?
We can look at how many of the Twitter social media posts fall within each topic. We do this by calculating the weighting. Here is an example of a Twitter post from the IEC:
1/2 The number of voters on voters’ roll decreases between general elections after rate of mortality is taken into account (+-30k deaths/month pre-COVID) & in absence of a registration event ahead of #LGE2021 on 27 Oct there has not been an opportunity to reverse the decline.
— IEC South Africa (@IECSouthAfrica) September 3, 2021
You can see our cleaned version below.
We can then check which topic this text mostly falls within by checking the topic weightings. The plot below shows the weighting of the topics for the IEC example Twitter post.
From this we can see that topic 71 has the highest weighting.
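In code, finding the topic a single post mostly falls within amounts to taking the largest value in that post's row of the document-topic matrix W. Here is a sketch using made-up weights, not the real model output:

```python
import numpy as np

# Hypothetical document-topic matrix W (rows = tweets, columns = topics),
# as produced by nmf.transform(tfidf) on a fitted model.
W = np.array([
    [0.01, 0.00, 0.80, 0.05],  # example IEC tweet
    [0.00, 0.60, 0.02, 0.10],
])

# The topic a tweet mostly falls within is the column with the
# highest weight in its row.
top_topic = int(np.argmax(W[0]))
print(top_topic)  # 2
```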
Topic 71 has these words
voters days left eligible till urges 2021 local urge anclge2021 1st roll voteanc elections 25
We can deduce that this topic has to do with General Election Information, exactly what we would expect from the IEC.
Expanding to all Twitter posts
We can do this now for all Twitter posts in our data and show a summed result. See below.
These are the top topics by weighting. That is, they have high summed weights across all the collected Twitter social media posts.
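The summed result can be sketched as follows, again on a made-up document-topic matrix rather than the real one:

```python
import numpy as np

# Hypothetical document-topic weights for three tweets and four topics.
W = np.array([
    [0.1, 0.0, 0.8, 0.1],
    [0.0, 0.6, 0.3, 0.1],
    [0.2, 0.1, 0.6, 0.1],
])

# Sum the weights of each topic across all collected posts,
# then rank topics from highest to lowest total weight.
summed = W.sum(axis=0)
top_topics = np.argsort(summed)[::-1]
print(top_topics)
```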
Let us now see how the top 10 topics change over time. See the plot below. Note how the daily topic graph increases over time as more Twitter posts are created closer to election day (1 November 2021).
We can also normalise all of the topics daily (that is, we weight them between 0 and 1 each day) and view them in a flatter form instead of an increasing one. See below. It should now be easier to see which topic dominates on a daily basis and how this changes over time.
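The daily normalisation can be sketched with pandas, using made-up daily topic weights:

```python
import pandas as pd

# Hypothetical summed topic weights per day (rows = days, columns = topics).
daily = pd.DataFrame(
    {"topic_0": [1.0, 4.0], "topic_1": [3.0, 1.0]},
    index=pd.to_datetime(["2021-10-01", "2021-10-02"]),
)

# Divide each day's weights by that day's total, so every row sums to 1.
# This gives a share-of-day view instead of an increasing trend.
normalised = daily.div(daily.sum(axis=1), axis=0)
print(normalised)
```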
Sentiment per topic
One of the other things we can do is try to understand what the sentiment per topic is. For this we will use a pre-trained sentiment model: XLM-T, a multilingual language model toolkit for Twitter. This model takes in a Twitter post and returns whether the post is Neutral, Positive, or Negative. This is a form of opinion mining/sentiment analysis.
Note: We will be using this as a heuristic; these models have not been fine-tuned for South African English or code mixing.
With the model we can calculate the sentiment of all of our Twitter social media posts in the dataset. We keep to the social media posts that have 10 or more words. We use the original text (not the cleaned one). Here is the overall view of the sentiment.
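Once the model's predicted labels are in a data frame, the overall and per-topic sentiment views are simple aggregations. Here is a sketch with made-up labels; the `sentiment_pred_label` column name follows the one used later in this post:

```python
import pandas as pd

# Hypothetical predictions, standing in for the XLM-T model output.
df = pd.DataFrame({
    "topic": [71, 71, 68, 68, 68],
    "sentiment_pred_label": ["Neutral", "Neutral", "Negative",
                             "Negative", "Positive"],
})

# Overall view of the sentiment across all posts.
overall = df["sentiment_pred_label"].value_counts()
print(overall)

# Sentiment broken down per topic.
per_topic = df.groupby("topic")["sentiment_pred_label"].value_counts()
print(per_topic)
```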
We can look back at our Topic 71, the general election information topic. We note that it has more Neutral posts.
Let us look again at that IEC example from earlier. What was its sentiment?
Actually, one of the things we note about the IEC is that their Twitter posts are very neutral. See below.
iec_example_sentiment.sentiment_pred_label.value_counts()
If we now go to Topic 68, a topic that is mostly about corruption, we notice a different pattern (see below). Most of the Twitter social media posts about corruption are Negative. This would be expected.
We can check if this might be a pattern with such topics. Let us check Topic 4.
We find the same pattern. Let's look at Topics 25 and 35 (both connected to individual political party campaigning, so informational).
We now notice that these topics are more informational and are more neutral.
Resources and References
- A Moodley, V Marivate. Topic Modelling of News Articles for Two Consecutive Elections in South Africa. 2019 6th International Conference on Soft Computing & Machine Intelligence (ISCMI). [Paper URL][Preprint]
- V Marivate, A Moodley, A Saba. Extracting and categorising the reactions to COVID-19 by the South African public -- A social media study. Proceedings of IEEE AFRICON 2021 (To Appear) [Preprint]