Checking our data
In this post, we load our data sets, check their sample sizes, and combine them. In other words, we check what is in our data and whether it meets our expectations before we perform the analysis. We collected microblog (Twitter) post data using both the Twint and Twarc tools in Python, and once the data are deemed satisfactory and meet all the requirements, we combine them into a single data object for further analysis. **Note:** *This post was updated on 1 March 2022 to take into account the removal of retweeted content.*
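The combining step can be sketched with pandas, which is a common choice for this kind of work (the column names and sample rows below are assumptions, not the actual export schema; a real run would read the tool exports from disk, e.g. with `pd.read_csv`):

```python
import pandas as pd

# Stand-in extracts from each tool; real exports would be loaded from disk,
# e.g. twint_df = pd.read_csv("twint_posts.csv"). Columns are assumed names.
twint_df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2021-09-01", "2021-09-02", "2021-10-05"],
    "username": ["a", "b", "a"],
    "tweet": ["post one", "post two", "post three"],
})
twarc_df = pd.DataFrame({
    "id": [4, 5],
    "date": ["2021-10-10", "2021-10-11"],
    "username": ["c", "a"],
    "tweet": ["post four", "post five"],
})

# Keep only the columns the two exports share, stack them into one object,
# and drop any posts both tools happened to collect (same tweet id)
shared = ["id", "date", "username", "tweet"]
combined = pd.concat(
    [twint_df[shared], twarc_df[shared]], ignore_index=True
).drop_duplicates(subset="id")

print(len(combined))
```

De-duplicating on the tweet id matters here because the two tools covered overlapping date ranges, so the same post can appear in both extracts.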
To find the relevant information, we used a variety of search terms and microblog characteristics to collect the data. Below are the search terms we used to gather the microblog (Twitter) social media posts.
Twint vs. Twarc data collection
We collected data using both the Twint Tool and the Twarc Tool. Each tool uses a different method to extract microblog information from the Twitter platform.
Twint
First, we needed to establish how many social media posts we collected using the Twint Tool. In total, we extracted more than 500k microblog posts.
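Counting the collected posts is a one-liner once the export is loaded; the sketch below uses a tiny stand-in DataFrame so it is self-contained (the file name in the comment is an assumption):

```python
import pandas as pd

# In the real workflow this would be the Twint export, e.g.:
# posts = pd.read_csv("twint_output.csv")
# Here we use a small stand-in sample instead.
posts = pd.DataFrame({"id": [1, 2, 3, 4], "tweet": ["a", "b", "c", "d"]})

# The sample size is simply the number of rows
print(f"Number of posts collected: {len(posts)}")
```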
Next, we needed to stratify this information by the number of daily social media posts, to contextualise the collection over time. Let's visualise the daily number of posts.
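A minimal sketch of the daily stratification, assuming a timestamp column named `created_at` (the column name and sample values are assumptions):

```python
import pandas as pd

# Stand-in sample; a real run would use the collected posts
posts = pd.DataFrame({
    "created_at": [
        "2021-09-01 10:00", "2021-09-01 15:30",
        "2021-09-02 09:12", "2021-10-05 18:45",
    ],
})
posts["created_at"] = pd.to_datetime(posts["created_at"])

# Count posts per calendar day
daily = posts.groupby(posts["created_at"].dt.date).size()
print(daily)

# A simple line chart of daily volume (requires matplotlib):
# daily.plot(kind="line", xlabel="Date", ylabel="Posts per day")
```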
Discussion
As we can see above, the Twint Tool yielded fewer daily social media posts, but we were able to collect from 1 September to 31 October 2021. With the Twarc Tool we collected more data, but only from 1 to 31 October 2021. In other words, the Twint Tool gave us broader coverage in terms of duration, while the Twarc Tool gave us larger quantities over a shorter period. This difference is useful because it lets us see how the frequency of related microblog posts changed over time.
User Information
One of the first analyses we performed was identifying the number of unique users. This is important because some users posted multiple times, and contextualising this adds value in understanding, among other things, the number of original posts per unique user. In total, more than 130k unique users contributed to the dataset.
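With the posts in a DataFrame, the unique-user count reduces to `nunique` on the user column (the column name and sample data below are assumptions):

```python
import pandas as pd

# Stand-in sample; "username" is an assumed column name in the export
posts = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol", "bob", "alice"],
})

# Number of distinct users behind the posts
unique_users = posts["username"].nunique()
print(unique_users)
```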
Next, we needed to identify the number of microblog posts generated by the top users. This gives us an additional way to aggregate the data and to understand how many of the posts in the dataset came from the top users we identified in a previous blog post. Let us see how many Twitter posts the top users sent.
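A per-user post count is a `value_counts` on the user column; the sketch below also computes what share of the corpus the top users account for (sample data and the `username` column name are assumptions):

```python
import pandas as pd

# Stand-in sample; "username" is an assumed column name
posts = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol", "alice", "bob"],
})

# Posts per user, most active first
per_user = posts["username"].value_counts()
top_users = per_user.head(2)
print(top_users)

# Share of the corpus coming from the top users
share = top_users.sum() / len(posts)
print(f"Top users account for {share:.0%} of posts")
```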
Note: In this work we only reveal the names of users whom we consider public persons, public figures, or organisations (see Guide).
Below we show some of the public persons/organisations in the top 100.