Number of Duplicates by content:  12246

To find the relevant posts, we collected data using a variety of search terms and microblog characteristics. The table below lists the search terms we used to gather the microblog (Twitter) social media posts.

    party                 party_twitter_search  leader_search     hashtag_search
1   ANC                   myanc                 CyrilRamaphosa    voteanc
2   DA                    our_da                jsteenhuisen      voteda
3   EFF                   effsouthafrica        Julius_S_Malema   voteeff
4   IFP                   IFPinParliament       MkhulekoHlengwa   voteifp
5   FFPlus                VFPlus                pieter_mulder     StemVFPlus,voteffplus
6   ACDP                  A_C_D_P               RevMeshoe         voteacdp
7   UDM                   UDmRevolution         BantuHolomisa     voteudm
8   ATM                   ATMovement_SA         ZungulaVuyo       voteatm
9   GOOD                  forgoodza             PatriciaDeLille   votegood
10  Plan of Action Party  partyofaction         Billy Nyaku
11  IEC South Africa      IECSouthAfrica
12  ActionSA              Action4SA             hermanmashaba     voteactionsa
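
As an illustration of how one row of this table could be turned into a search query, here is a minimal sketch. The `PartyTerms` class, the `build_query` helper and the choice to join terms with `OR` are assumptions made for this example; the post does not state how the terms were actually combined or passed to the collection tools.

```python
from dataclasses import dataclass


@dataclass
class PartyTerms:
    # Hypothetical container for one row of the search-term table.
    party: str
    party_handle: str = ""
    leader_handle: str = ""
    hashtags: tuple = ()


def build_query(terms):
    """Join the non-empty handles and hashtags of one row with OR (assumed query form)."""
    parts = [f"@{h}" for h in (terms.party_handle, terms.leader_handle) if h]
    parts += [f"#{tag}" for tag in terms.hashtags]
    return " OR ".join(parts)


# Two rows taken from the table: one with two hashtags, one with only a party handle.
ffplus = PartyTerms("FFPlus", "VFPlus", "pieter_mulder", ("StemVFPlus", "voteffplus"))
iec = PartyTerms("IEC South Africa", "IECSouthAfrica")

ffplus_query = build_query(ffplus)
iec_query = build_query(iec)
print(ffplus_query)
print(iec_query)
```

Rows with missing cells, such as the IEC row, simply produce a shorter query because empty fields are skipped.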

Twint vs. Twarc data collection

We collected data using both the Twint Tool and the Twarc Tool. Each of these tools uses a different method to extract microblog information from the Twitter platform.

Twint

Firstly, we needed to establish the number of social media posts that we collected using the Twint Tool. In total, we extracted more than 500k microblog posts.

Number of social media posts collected using Twint:  572213

Next, we needed to stratify this information to contextualise it in terms of the number of daily social media posts.
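
A minimal sketch of this daily stratification is shown below. It uses plain Python rather than a dataframe so that it stays self-contained, and it assumes each post is a dictionary with an ISO-formatted `created_at` field; both the field name and the sample posts are hypothetical.

```python
from collections import Counter
from datetime import datetime


def daily_counts(posts):
    """Count posts per calendar day, returned in chronological order."""
    counts = Counter(
        datetime.fromisoformat(post["created_at"]).date() for post in posts
    )
    return dict(sorted(counts.items()))


# Made-up sample posts, for illustration only.
sample_posts = [
    {"created_at": "2021-10-30T08:15:00", "text": "a"},
    {"created_at": "2021-10-30T21:40:00", "text": "b"},
    {"created_at": "2021-10-31T06:05:00", "text": "c"},
]

per_day = daily_counts(sample_posts)
for day, count in per_day.items():
    print(day, count)
```

The resulting day-to-count mapping is what a daily bar or line chart would be plotted from.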

[Figure: number of daily social media posts collected using Twint. Watermark: https://dsfsi.github.io/zaelection2021/]

Twarc collected data

As before, we repeated the same steps, this time using the Twarc Tool in Python. We extracted more than 900K microblog posts, of which roughly 450K remained after removing duplicates by content.

Number of social media posts collected using Twarc: 939164
Number of Duplicates by content twarc:  487102
Number of social media posts collected using Twarc after removing duplicates: 452062
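
The three figures above are consistent with removing duplicates by content: 939164 - 487102 = 452062. A minimal sketch of such a step follows, assuming that two posts count as duplicates when their `text` field matches exactly; the field name, the helper and the sample data are hypothetical, and the post does not say how duplicates were actually detected.

```python
def drop_duplicates_by_content(posts, key="text"):
    """Keep the first post for each distinct content value; also return how many were dropped."""
    seen = set()
    unique = []
    for post in posts:
        content = post[key]
        if content not in seen:
            seen.add(content)
            unique.append(post)
    return unique, len(posts) - len(unique)


# Made-up sample posts, for illustration only.
sample = [
    {"id": 1, "text": "Vote on 1 November"},
    {"id": 2, "text": "Vote on 1 November"},
    {"id": 3, "text": "Results are in"},
]
unique_posts, n_duplicates = drop_duplicates_by_content(sample)
print(len(unique_posts), n_duplicates)

# Check of the reported Twarc figures: collected minus duplicates.
remaining = 939164 - 487102
print(remaining)
```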

Once this was established, we needed to stratify the information according to the number of daily microblog posts. Now, let's visualise the daily number of social media posts.

[Figure: number of daily social media posts collected using Twarc. Watermark: https://dsfsi.github.io/zaelection2021/]

Discussion

As we can see above, the Twint Tool gives us fewer daily social media posts, but we were able to collect from 1 September to 31 October 2021. With the Twarc Tool we were able to collect more data, but only for 1-31 October 2021. This means that the Twint Tool gave us longer coverage in terms of duration, while the Twarc Tool gave us larger quantities over a shorter period of time. This difference is useful since it lets us see how the frequency of related microblog posts changed over time.
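
The trade-off can be made concrete with some back-of-the-envelope arithmetic on the figures reported above. This is only a rough sketch: it assumes the Twint count of 572213 and the deduplicated Twarc count of 452062 are spread over the full date ranges stated in the text, and `days_covered` is a helper introduced here for illustration.

```python
from datetime import date


def days_covered(start, end):
    """Number of calendar days in an inclusive date range."""
    return (end - start).days + 1


twint_days = days_covered(date(2021, 9, 1), date(2021, 10, 31))
twarc_days = days_covered(date(2021, 10, 1), date(2021, 10, 31))

# Average posts per day, using the totals reported earlier in this post.
twint_avg = 572213 / twint_days
twarc_avg = 452062 / twarc_days

print(twint_days, round(twint_avg))
print(twarc_days, round(twarc_avg))
```

Under these assumptions Twint covers 61 days at roughly 9,400 posts per day, while Twarc covers 31 days at roughly 14,600 posts per day.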

Combine Dataset

To ensure that we contextualised the information, we needed to create a consolidated dataframe using both datasets from the Twint and Twarc Tools. This gave us a combined dataset of more than 700k microblog posts to work with.

Total Twitter social media posts:  727083
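
The reported total is lower than the sum of the two collections (572213 + 452062 = 1024275), which would be consistent with overlapping posts being dropped during consolidation, although the post does not describe the merge itself. The sketch below shows one way this could work, assuming each post carries a unique `id` and that the first copy seen is kept; the `combine` helper, the `source` label and the sample data are hypothetical.

```python
def combine(twint_posts, twarc_posts, key="id"):
    """Merge two collections, keeping one copy of each post id (first source wins)."""
    combined = {}
    for source, posts in (("twint", twint_posts), ("twarc", twarc_posts)):
        for post in posts:
            if post[key] not in combined:
                # Record which tool the kept copy came from.
                combined[post[key]] = {**post, "source": source}
    return list(combined.values())


# Made-up sample posts; id 2 appears in both collections.
twint_sample = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
twarc_sample = [{"id": 2, "text": "b"}, {"id": 3, "text": "c"}]

all_posts = combine(twint_sample, twarc_sample)
print(len(all_posts))
```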

User Information

One of the first analyses we needed to perform was identifying the number of unique users. This is important because some users may have posted multiple times, and taking this into account adds value to the analysis, for example in understanding the number of original posts per unique user. In total, more than 100k unique users contributed to the dataset.

Number of unique users:  100645
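
Counting unique users amounts to collecting the distinct usernames in the dataset. A minimal sketch follows, assuming each post has a `username` field and that usernames are compared case-insensitively; both are assumptions for this example, as are the sample posts.

```python
def unique_users(posts):
    """Return the set of distinct usernames, compared case-insensitively."""
    return {post["username"].lower() for post in posts}


# Made-up sample posts; "MyANC" and "myanc" count as the same user.
sample_posts = [
    {"username": "MyANC", "text": "a"},
    {"username": "myanc", "text": "b"},
    {"username": "Our_DA", "text": "c"},
]

users = unique_users(sample_posts)
posts_per_user = len(sample_posts) / len(users)
print(len(users), posts_per_user)
```

Applied to the reported totals, 727083 posts from 100645 users works out to roughly 7 posts per user on average.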

Next, we needed to identify the number of microblog posts generated by the top users. This is an additional way to aggregate the data, and it helps us understand how many of the posts in the dataset came from the top users we identified in a previous blog post. Let us see how many Twitter posts the top users sent.

Note: In this work we only reveal the names of users if we regard them as public persons, public figures or organisations (see Guide).

Below we show some of the public persons/organisations in the top 100:

username count
1 effsouthafrica 2411
10 ancncape 1526
11 myanc 1485
12 our_da 1395
14 cyrilramaphosa 1286
17 iecsouthafrica 1160
22 action4sa 936
29 sabcnews 882
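
A table like this can be produced by ranking users by post count and then keeping only the accounts regarded as public. The sketch below illustrates the idea with made-up data; the `top_users` and `public_only` helpers, the zero-based rank and the allowlist of public accounts are all assumptions for this example.

```python
from collections import Counter


def top_users(posts, n=100):
    """Rank usernames by number of posts, most active first."""
    counts = Counter(post["username"].lower() for post in posts)
    return counts.most_common(n)


def public_only(ranked, public_accounts):
    """Keep (rank, username, count) rows for accounts on the public allowlist."""
    return [
        (rank, user, count)
        for rank, (user, count) in enumerate(ranked)
        if user in public_accounts
    ]


# Made-up sample: one private account posting most often, two public accounts.
sample_posts = (
    [{"username": "some_private_user"}] * 3
    + [{"username": "EFFSouthAfrica"}] * 2
    + [{"username": "MYANC"}]
)
ranked = top_users(sample_posts)
public_rows = public_only(ranked, {"effsouthafrica", "myanc"})
for row in public_rows:
    print(*row)
```

Filtering after ranking keeps the original rank of each public account, which is why the ranks in such a table have gaps.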
