Prior Posts

You can also look at our earlier posts (in decreasing relevance to this post) on:

What is topic modelling?

Topic modelling and topic extraction are ways to identify, isolate, and tally specific topics in text using a variety of approaches. To be specific, topic modelling is the analysis performed on text data to identify how often, and in what way, the text relates to a given concept. Topic extraction, on the other hand, is the ability to identify topics from text and label them accordingly. To do both, a few things are needed, starting with the data. In the previous blog posts we extracted and processed social media data using the Twint and Twarc tools in Python.

We also combined the extracted data into one consolidated data frame. If you recall, a data frame is a useful way to structure information into a table with rows and columns. In this post, we run Python code on the data we have already extracted so that we can perform topic modelling and topic extraction. Specifically, we introduce Non-Negative Matrix Factorization (NMF for short). NMF is an unsupervised technique, which means the model is not trained on any prior labelling of topics. NMF factorises "high-dimensional vectors" into "low-dimensional vectors": a large non-negative document-by-word matrix is approximated by the product of two much smaller non-negative matrices, one holding the weight of each topic in each document and one holding the weight of each word in each topic. The vectors can be built from a variety of variable types, such as a count of a specific word or a distance between associated words.

What is important about the approach we are taking is that the code can be put into an application so that topic modelling and topic extraction can take place in an automated manner.

LDA vs NMF

We also provide a separate notebook that goes through using LDA for topic modelling, but as mentioned in the introduction, what we are trying to do here is use an unsupervised technique to model topics in a way that is tied to how the topics are extracted and categorised. For this, some vital data preprocessing steps are needed. As mentioned in a previous post, the preprocessing steps make the data easier to use downstream and make the analysis easier.

Data Preprocessing

  • We remove all URLs (web links). You can view our other blog post on discovering patterns in URLs.
  • We remove any mentions [@username] from all of the Twitter social media posts.
  • We remove hashtags.
  • All words are changed to lower case.
  • We limit ourselves to the top 20,000 occurring words in the full dataset.
  • We only keep tweets that have more than 10 words.
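The cleaning steps above can be sketched in a few lines of pandas. This is a minimal illustration and not the exact code from our notebook: the regular expressions, the `clean_tweet` helper, the `text` and `clean_text` column names and the two example posts are all assumptions made for the sketch. Only the `long_text` column name comes from the output shown in this post. The 20,000-word limit is applied later, when the text is vectorised.

```python
import re

import pandas as pd

# Assumed patterns for the three kinds of tokens we strip out.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")


def clean_tweet(text: str) -> str:
    """Remove URLs, mentions and hashtags, lower-case, and tidy whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = text.lower()
    return " ".join(text.split())


# Two made-up posts standing in for the consolidated data frame.
df = pd.DataFrame({"text": [
    "Go VOTE on Monday! https://example.com @IECSouthAfrica #LGE2021",
    "Remember to check your voting station and registration status "
    "before election day comes around @someone",
]})

df["clean_text"] = df["text"].apply(clean_tweet)
# Flag posts with more than 10 words after cleaning.
df["long_text"] = df["clean_text"].str.split().str.len() > 10

print(df[["clean_text", "long_text"]])
```

Filtering with `df[df["long_text"]]` would then keep only the longer posts for the topic model.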

As mentioned in a previous post, we need to look at how much of our data has more than 10 words. In the output below, True is the count of posts that have more than 10 words after the cleaning discussed earlier.

df['long_text'].value_counts()
True     838315
False    379844
Name: long_text, dtype: int64

Understanding the NMF Topic Modelling

Parameters

  1. 100 Topics
  2. 15 top words for visualisation
  3. term frequency–inverse document frequency (TFIDF) Representation
  4. Nonnegative Double Singular Value Decomposition (NNDSVD) initialization

Look at the results

Here we show the top 15 words for the 100 topics. See if you can classify each topic given the words.

For example, we can give the topics labels. Here are a few:

  • Topic 3: Go Vote!
  • Topic 8: Vote EFF
  • Topic 25: Vote ANC
  • Topic 35: DA Mayoral Candidate for Johannesburg

These are just some examples. It is important to understand that with NMF the labelling and classification of topics happens after the model has been fitted. Take a look and see if you can interpret the topics below.

print("NMF Topics - TFIDF")
df_topics_nmf_tfidf = display_topics_wrapper()
print("--------------")

NMF Topics - TFIDF
Topic 0:
𝚈𝙾𝚄𝚁 formalise forgotte forgotten foriegners fork forked form formal formally forgiving format formation formations formed
Topic 1:
community meeting addressing landandjobsmanje castle address botshabelo sports enjoli mdantsane bothaville park madibeng mangaung cic
Topic 2:
elections local entrances shut eswatini border sol head mbalula fikile upcoming breaking coming called municipal
Topic 3:
vote registered cast secret special station voted remember wisely register monday nto future pitched granted
Topic 4:
anc mediocrity voteanc disrespect corrupt mbalula voted fikile iec called failed voteeff youth listen problem
Topic 5:
africa south wildest entrances border eswatini shut sol johnvuligate vula gate enemy cities foreigners world
Topic 6:
judge jsc underway interviews court currently ncic division gauteng high happening asks constitutional rammaka dod
Topic 7:
eff manifesto home case members municipalities command leadership fighter statement students bounds harassment missed emails
Topic 8:
voteeff landandjobsmanje addressing https kzn fighters voteeffon1november effredfriday home red enew manje fikile efftshelathuparally mzansi
Topic 9:
ready regulations lead lockdown march level adjusted spreader roads ngovernment end super response landand 19
Topic 10:
don forget care miss understand worry lie investment live platform home wait problem money means
Topic 11:
ward candidate councillor concluded seshego 22 11 coordinators pr conveners 60 johannesburg cllr 14 mangaung
Topic 12:
november 1st 2021 voteactionsa cut non neck remember 01 future voteanc voteifp come date nov
Topic 13:
president mr deputy zuma mbeki comrade shikwambana mandla thabo address blf trail cyrilramaphosa mabuza fighter
Topic 14:
time wildest long try ve election talk boo discredit 25 got roll cut waste idea
Topic 15:
da led municipalities run racist mayoral coalition candidate posters things phoenix getthingsdonerally record town conditions
Topic 16:
special salutation regional commissars revolution salute provincial branch leadership amp votes secretary chairperson pitched granted
Topic 17:
party ruling agents agent political governing things voteifp record gets opposition guarding track racist saying
Topic 18:
cic moves dance northern bushbuckridge wraps addressed hold definitely nwatch rally managed telling watch gathered
Topic 19:
happening program welcome revolutionary ncic arrives lge continues address ncommunity eningi ndp mali wa gathered
Topic 20:
sa action voteactionsa mayor jhb mashaba clean justice nvote law letsfixsouthafrica putsouthafricansfirst citizens enforcement foreigners
Topic 21:
just imagine does guys campaigning point remind tense using thing old getting away money sleepless
Topic 22:
cape town western northern eastern metro khayelitsha trouble phokwane bombarding difficulties experienced province program cmsr
Topic 23:
leadership drivers truck providing illega fruits sacrifices enjoyed efftshelathuparally black dancing eningi mali big zulu
Topic 24:
voteanc campaign trail buildingbettercommunities anclge2021 comrade ancintshwane communities ancinwesterncape ancinjoburg ancingqeberha siyanqoba push final building
Topic 25:
door campaign volunteers conducting region tshwane encouraging tg fighters sub postering ward spend 53 41
Topic 26:
voting station stations special registered tomorrow check pitched nto granted remember told votes voted monday
Topic 27:
africans south head mbalula fikile called breaking nretweet voteeff voteactionsa fellow tension dividing embraced causing
Topic 28:
want white afri borders votes talk open nonsense live tension dividing embraced world causing nas
Topic 29:
know week didn case really probably hey man works guides negotiable dont r15500 connect tried
Topic 30:
like looks look nthis ve got try powerful discredit boo things mitigation 25 sentencing matters
Topic 31:
today program earlier lge mpumalanga continues joined campaign trail ndp state tg conversation sabc hani
Topic 32:
going johnvuligate vula gate jub lorch khune frog tembisa10 efftshelathuparally bring hire degree matric landandjobsma
Topic 33:
pictures addressing sg efftshelathuparally commissar dp jub stage gathered victory head addressed volunteers tg northern
Topic 34:
malema julius sello children called crowd commander chief matric praising acts discrimination house degree future
Topic 35:
voteda drmphoformayor things thedagetsthingsdone gets johnvuligate gate vula dagetsthingsdone blue viva getthingsdonerally residents ready join
Topic 36:
government local comrade mpumalanga demolished leads apartheid 2021 houses program upcoming enemy till lead urges
Topic 37:
poor long unemployment services tolerated exploita nfor healthcare high advert watch elections rate eff money
Topic 38:
drive nicol william mandela winnie bye renamed stal struggle renaming welcomes hello madikizela nelson billboard
Topic 39:
wrong cut fool decisively went mockery dealt bloody landandjobsmanje theirs deal monday voteeff voteactionsa deported
Topic 40:
let fix clear tomorrow future fighters enemy honest chance continue act credit clean lend children
Topic 41:
people white houses listen love living telling true longer conditions care intentions hide clarion happy
Topic 42:
years 27 past ago later promises old deliver ve remember water failed majority 35 20
Topic 43:
make sure trading helped 000 trade profit r800 expert weeks introducing bitcoin money believe sense
Topic 44:
2021 october program lge local continues ncic manifesto alert 10th ndp 01 29th ngovernment results
Topic 45:
ballot free paper fair dear refusal nif taking state ni stan iec papers logo feature
Topic 46:
million seen officially crew mec rank ec tsomo opening just r15 taxi ve stadium r4
Topic 47:
lge2021 voteblf voteforyourlife blf sabcnews voter registered live manifesto sadecides2021 check newzroom405 kingmaker year soweto
Topic 48:
municipality led madibeng saldahna stadium conditions run living water local 38 mgijima enoch 2016 address
Topic 49:
good morning money investing governance lovers platform life house investment beautiful hands luck fighters madikizela
Topic 50:
service delivery march poor umhlathuze leading sg complain thusi pearl straight important areas nit municipalities
Topic 51:
city johannesburg inner buildings building hope story hijacked ide nfrom york 𝘁𝗵𝗲 joburg buffalo house
Topic 52:
national chairperson provincial leading song ahead kzn congress manifesto accompanied iyakhala neff commissar manje progressive
Topic 53:
need understand really njulius vibe music doesn user testing cc twitter le rebuild help leaders
Topic 54:
say follow sunday reality opened leads 1000 refuse iec nas times welcome fair doesn powerful
Topic 55:
land jobs manje create clear blf expropriation children return owners andile freedom mngxitama compensation indignity
Topic 56:
come support open world nonsense borders afri nas numbers south happy mess large continue listen
Topic 57:
country foreigners fix run reality opened cities follow sunday leads best towns illegal rebuild mess
Topic 58:
political parties education established getting year giving sleepless nights old start imagine head deployee cct
Topic 59:
day special hey wait votes second nyou khayelitsha choice everyday ve new job sleep smallanyana
Topic 60:
did countries point future campaigning tense using tell pay apartheid zuma council municipal exist ask
Topic 61:
media social tonight platforms tv advert called th 20h00 shared branch live statement alert drop
Topic 62:
domestic flights doesn sa airways saa british baffles britain cu workers matter mean care airline
Topic 63:
election stop logo pr wildest yes disappointed nour ruling campaign best thee friend congratulations brother
Topic 64:
candidates councillor candidate ekurhuleni region mayoral sekhukhune addressing councilor pledge mopani encouraging nanc volunteers meeting
Topic 65:
thank hope red team message brigade delivering crisscrossing courageously selflessly country amp flag fly ama2000
Topic 66:
right thing choice complain exercise basics voteifp democratic constitutional thusi pearl choose straight democracy heard
Topic 67:
power cut voteactionsa voted mediocrity disrespect listen monday theirs loadshedding socialist dark fourth scene sitting
Topic 68:
corrupt useless incompetent looting lying thieving dishonourable violent wisely sa voetsek scum thug cadres vote
Topic 69:
african south congress gqeberha law enforcement agree anti taxi ngizwe mchunu driver national countries lovers
Topic 70:
better communities building life build pledge serve deliver future voteifp ancinjoburg ancfreestate roll phofung non
Topic 71:
voters days left eligible till urges 2021 local urge anclge2021 1st roll voteanc elections 25
Topic 72:
nthe kzn enemy clear municipalities wraps foreigners red excitement overwhelming mpumalanga receive attem absolutely hold
Topic 73:
help way god thanks blessing explain able living stop bitcoin twitter bless tha account vaccinecertificate
Topic 74:
think really marketing doesn tweet reply stupid bad belongs hard makes hight thread rosebank group
Topic 75:
king dalindyebo buyelekhaya addressed speaking meeting abathembu majesty community nation humbled attended breaking efftshelathuparally cic
Topic 76:
watch says committing artists nhe newzroom405 wants dp relationship artist asks reporter koko tg vinny
Topic 77:
cde members candidate region knowyourcouncillorcandidate deputy provincial member nec regional interacting chairperson rec anclge2021 voteanc
Topic 78:
video duma ntando vinny uncle cooper fifi tshedi mholo cornet mamabolo joined stage alaska stellenbosch
Topic 79:
nwe numbers efftshelathuparally love ekurhuleni getting nlet sleepless nights trending year established shocked yellow giving
Topic 80:
member build took stage asked thokozile mazibuko community hou houses house home loadshedding nec living
Topic 81:
black white racist person man child heroes blf anti majority whites racism blacks ngizwe mchunu
Topic 82:
arrived meeting councillors coordinators conveners th accompanied stellenbosch region kayamnandi structures landandjobsmanje concluded umdoni ugu
Topic 83:
actionsa voteactionsa mashaba iec contesting logo herman viva love proud letsfixsouthafrica tshwane mr didn podcasta
Topic 84:
university src victory won rally final results venda effsc nelson place neffsc mandela campus campuses
Topic 85:
leader politics reporter area newzroom405 poster mooidraai village house senior challenges explain john called steenhuisen
Topic 86:
ground forces sports fighters new look seshego morning 14 beautiful valley overcome 42 mother consolidating
Topic 87:
said listen welcome times refuse 1000 nas shedding load comment 2015 shut entrances eswatini border
Topic 88:
corruption news breaking exposed 847 sap 115 software licences excl r58 95 end municipalities fighting
Topic 89:
work hard ethic thank working gracing inspiration presence continue doesn leaders speak bitcoin place vacc
Topic 90:
change voteifp living bring conditions lives tha waiting life adding mitchelsplain things abolishment unveiling winds
Topic 91:
young old youth water students ruling tshwane man command women certify future electric achieved rate
Topic 92:
west north home nnational chair numbers madibeng 38 province program lie continues rand sello tlokwe
Topic 93:
amp trust ur used beyin samuel loan entrepreneur bought haibo capital herman white organisations managed
Topic 94:
covid believe 19 vaccinated vaccinate force science really trials vaccine vaccines vaccination mandatory passports forced
Topic 95:
doing job understand njulius vibe dog issue music shampoo huh selling walkabout pamphleteering life youth
Topic 96:
general secretary commissar deputy treasurer comrade head duarte jessie dr omphile maotwe leading desk provincial
Topic 97:
ll reason year life issue dog selling shampoo huh ve voteanc amazed rained posters washed
Topic 98:
ramaphosa cyril members ethekwini tshwane nation zuma supporters pres efolweni unhappy arrested citizens 250 months
Topic 99:
voetsekanc remember efftshelathuparally nonsense voetsekeff voteeff voteactionsa voetsek demolished houses reply voetsekramaphosa belongs mpumalanga lousy
--------------

Zoom out - How are the topics Distributed?

We can look at how many of the Twitter social media posts fall within each topic. We do this by calculating the topic weighting of each post. Here is an example of a Twitter post from the IEC.

You can then see our cleaned version below:

'1/2 the number of voters on voters’ roll decreases between general elections after rate of mortality is taken into account (+-30k deaths/month pre-covid) & in absence of a registration event ahead of #lge2021 on 27 oct there has not been an opportunity to reverse the decline.'

We can then check which topic this text mostly falls within by checking the topic weighting. The plot below shows the weighting of the topics for the IEC example Twitter post.

From this we can see that topic 71 has the highest weighting.

71

Topic 71 has these words

voters days left eligible till urges 2021 local urge anclge2021 1st roll voteanc elections 25

We can deduce that this topic has to do with General Election Information, exactly what we would expect from the IEC.
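Finding the highest-weighted topic for a single post can be sketched as follows. This is a minimal example, assuming a fitted scikit-learn `TfidfVectorizer` and `NMF` model. The toy corpus, the 2-topic model and the `topic_weights` helper are assumptions made for illustration, so the topic numbers will not match the 100-topic model in this post.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus and a tiny 2-topic model (assumptions for the sketch).
corpus = [
    "register to vote at your voting station",
    "check the voters roll before the elections",
    "voters can register at any voting station",
    "corrupt officials are looting the municipality",
    "looting and corrupt tenders in the municipality",
    "the corrupt municipality failed on service delivery",
]
vectorizer = TfidfVectorizer(stop_words="english")
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
nmf.fit(vectorizer.fit_transform(corpus))


def topic_weights(text):
    """Project one cleaned post onto the fitted topics."""
    return nmf.transform(vectorizer.transform([text]))[0]


weights = topic_weights(
    "the number of voters on the voters roll decreases between elections"
)
# The topic with the largest weight is the one the post mostly falls within.
dominant_topic = int(np.argmax(weights))
print(weights, dominant_topic)
```

Plotting `weights` as a bar chart gives the kind of per-post topic weighting plot described above.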

Expanding to all Twitter posts

We can do this now for all Twitter posts in our data and show a summed result. See below.

These are the top topics by weighting. That is, they have high summed weights across all the collected Twitter social media posts.

[41, 4, 7, 3, 30, 93, 21, 13, 15, 10]
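One way to get such a ranking is to sum each topic's weight over all posts and sort. The sketch below uses a small hand-made weight matrix (an assumption standing in for the real posts-by-topics matrix) and keeps the top 2 topics rather than the top 10.

```python
import numpy as np

# Rows are posts, columns are topics (made-up weights for illustration).
doc_topic = np.array([
    [0.1, 0.0, 0.9],
    [0.2, 0.1, 0.7],
    [0.0, 0.5, 0.1],
])

# Sum each topic's weight across all posts.
topic_totals = doc_topic.sum(axis=0)

# Sort topics from highest to lowest summed weight and keep the top k.
top_k = 2  # the post keeps the top 10
top_topics = np.argsort(topic_totals)[::-1][:top_k].tolist()
print(top_topics)
```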

Topic <> Word Table

We can now make it easier to look at the topics as a table. Let us see the Topic (rows) <> Words (columns) view of our topics.

So the first row is Topic 41, and the columns in that row are the top 15 words in that topic.

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Topic 41 people white houses listen love living telling true longer conditions care intentions hide clarion happy
Topic 4 anc mediocrity voteanc disrespect corrupt mbalula voted fikile iec called failed voteeff youth listen problem
Topic 7 eff manifesto home case members municipalities command leadership fighter statement students bounds harassment missed emails
Topic 3 vote registered cast secret special station voted remember wisely register monday nto future pitched granted
Topic 30 like looks look nthis ve got try powerful discredit boo things mitigation 25 sentencing matters
Topic 93 amp trust ur used beyin samuel loan entrepreneur bought haibo capital herman white organisations managed
Topic 21 just imagine does guys campaigning point remind tense using thing old getting away money sleepless
Topic 13 president mr deputy zuma mbeki comrade shikwambana mandla thabo address blf trail cyrilramaphosa mabuza fighter
Topic 15 da led municipalities run racist mayoral coalition candidate posters things phoenix getthingsdonerally record town conditions
Topic 10 don forget care miss understand worry lie investment live platform home wait problem money means

Word <> Topic Table

We can also look at the Word (rows) <> Topic (column) visualisation. We do this so you can get an intuition of what a topic model gives us.

    Topic 41    Topic 4     Topic 7         Topic 3     Topic 30    Topic 93       Topic 21     Topic 13        Topic 15            Topic 10
0   people      anc         eff             vote        like        amp            just         president       da                  don
1   white       mediocrity  manifesto       registered  looks       trust          imagine      mr              led                 forget
2   houses      voteanc     home            cast        look        ur             does         deputy          municipalities      care
3   listen      disrespect  case            secret      nthis       used           guys         zuma            run                 miss
4   love        corrupt     members         special     ve          beyin          campaigning  mbeki           racist              understand
5   living      mbalula     municipalities  station     got         samuel         point        comrade         mayoral             worry
6   telling     voted       command         voted       try         loan           remind       shikwambana     coalition           lie
7   true        fikile      leadership      remember    powerful    entrepreneur   tense        mandla          candidate           investment
8   longer      iec         fighter         wisely      discredit   bought         using        thabo           posters             live
9   conditions  called      statement       register    boo         haibo          thing        address         things              platform
10  care        failed      students        monday      things      capital        old          blf             phoenix             home
11  intentions  voteeff     bounds          nto         mitigation  herman         getting      trail           getthingsdonerally  wait
12  hide        youth       harassment      future      25          white          away         cyrilramaphosa  record              problem
13  clarion     listen      missed          pitched     sentencing  organisations  money        mabuza          town                money
14  happy       problem     emails          granted     matters     managed        sleepless    fighter         conditions          means

Let us now see how the top 10 topics change over time. See the plot below. Note how the daily topic weights increase over time as more Twitter posts are created closer to election day (1 November 2021).

We can also normalise all of the topics daily (that is, we weight them between 0 and 1 for each day) and see them in a flatter view instead of an increasing one. See below. It should now be easier to see, on a daily basis, which topic dominates and how that changes over time.
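The daily view can be sketched with a pandas `groupby`. The dates, column names and weights below are made up for illustration. We also assume that "normalising daily" means dividing each topic's daily total by the sum over all topics for that day, so that each day's shares add up to 1. Other choices, such as min-max scaling per day, are possible.

```python
import pandas as pd

# One row per post: its date and its weight for two topics (made-up values).
weights = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-10-30", "2021-10-30",
        "2021-10-31", "2021-10-31", "2021-10-31",
    ]),
    "topic_4": [0.2, 0.4, 0.3, 0.3, 0.6],
    "topic_41": [0.6, 0.0, 0.9, 0.9, 0.0],
})

# Raw daily totals: these grow as more posts are created each day.
daily = weights.groupby("date").sum()

# Daily shares: divide each row by its own total so every day sums to 1.
daily_share = daily.div(daily.sum(axis=1), axis=0)
print(daily_share)
```

Plotting `daily` gives the increasing view, while plotting `daily_share` gives the flatter, normalised view.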

Sentiment per topic

One of the other things we can do is try to understand the sentiment per topic. For this we use a pre-trained sentiment model, XLM-T (A Multilingual Language Model Toolkit for Twitter). The model takes in a Twitter post and returns whether the post is Neutral, Positive or Negative. This is a form of opinion mining, or sentiment analysis.

Note: We use this as a heuristic; these models have not been fine-tuned for South African English or code mixing.

With the model we can calculate the sentiment of all of the Twitter social media posts in our dataset. We keep to the posts that have more than 10 words, and we use the original text (not the cleaned one). Here is the overall view of the sentiment.

Negative    369531
Neutral     312577
Positive    156207
Name: sentiment_pred_label, dtype: int64

We can look back at our Topic 71, the general election information topic. We note that it has more Neutral posts.

Neutral     3095
Negative    1917
Positive     981
Name: sentiment_pred_label, dtype: int64
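Counts like these can be produced by filtering the posts on their topic and counting the predicted labels. The sketch below is illustrative only. The `dominant_topic` column, the `sentiment_for_topic` helper and the six example rows are assumptions, and we assume each post is assigned to its highest-weighted topic. The `sentiment_pred_label` column name is the one shown in the outputs in this post.

```python
import pandas as pd

# Made-up posts: the topic each post mostly falls within, plus its label.
posts = pd.DataFrame({
    "dominant_topic": [71, 71, 71, 68, 68, 4],
    "sentiment_pred_label": [
        "Neutral", "Neutral", "Negative",
        "Negative", "Negative", "Positive",
    ],
})


def sentiment_for_topic(df, topic):
    """Count the predicted sentiment labels for one topic."""
    mask = df["dominant_topic"] == topic
    return df.loc[mask, "sentiment_pred_label"].value_counts()


counts_71 = sentiment_for_topic(posts, 71)
print("Topic: 71")
print(counts_71)

# All topics at once: rows are topics, columns are sentiment labels.
per_topic = pd.crosstab(posts["dominant_topic"], posts["sentiment_pred_label"])
print(per_topic)
```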

Let us look again at that IEC example from earlier. What was its sentiment?

Twitter Post: @DawieScholtz 1/2 The number of voters on voters’ roll decreases between general elections after rate of mortality is taken into account (+-30k deaths/month pre-COVID) &amp; in absence of a registration event ahead of #LGE2021 on 27 Oct there has not been an opportunity to reverse the decline. 
Label: Neutral

One thing we note about the IEC is that its Twitter posts are very neutral. See below.

iec_example_sentiment.sentiment_pred_label.value_counts()
Neutral     861
Negative    203
Positive     84
Name: sentiment_pred_label, dtype: int64

If we now go to Topic 68, a topic that is mostly about corruption, we notice a different pattern (see below). Most of the Twitter social media posts about corruption are Negative, as we would expect.

Topic: 68
Negative    5145
Neutral       58
Positive      54
Name: sentiment_pred_label, dtype: int64

We can check whether this is a pattern with such topics. Let us check Topic 4.

Topic: 4
Negative    6881
Neutral     1755
Positive    1200
Name: sentiment_pred_label, dtype: int64

We find the same pattern. Let us look at Topics 25 and 35 (both connected to individual political party campaigning, so informational).

Topic: 25
Neutral     5839
Positive     971
Negative     920
Name: sentiment_pred_label, dtype: int64
Topic: 35
Neutral     3390
Positive    2746
Negative    1338
Name: sentiment_pred_label, dtype: int64

We now notice that these topics are more informational and more neutral.

Resources and References

  • A Moodley, V Marivate. Topic Modelling of News Articles for Two Consecutive Elections in South Africa. 2019 6th International Conference on Soft Computing & Machine Intelligence (ISCMI).
  • V Marivate, A Moodley, A Saba. Extracting and categorising the reactions to COVID-19 by the South African public: A social media study. Proceedings of IEEE AFRICON 2021 (To Appear).