Loading our data

To find the most frequent words and phrases in the data, we first need to process it, which involves a few data cleaning steps. First, let us remove all duplicates from the data. If we skip this step, the frequency counts will not accurately reflect the unique words and phrases, because a single word or phrase may be repeated thousands of times across a series of online replies.

Next, we load the data into a data frame. A data frame is a structured way to represent data using rows and columns. It is one of the most common ways to store data, and it is a convenient structure when further analysis is to be performed on the data to gain insights.
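As a minimal sketch of these two steps, the snippet below loads a small, made-up CSV into a pandas data frame and removes exact duplicates. The column names `party` and `text`, the party labels and the post texts are all assumptions for illustration; the real dataset may be structured differently.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real export. The columns "party" and "text"
# and every row below are invented for this sketch.
raw_csv = io.StringIO(
    "party,text\n"
    "Party A,Fix the water supply in our ward\n"
    "Party B,Create jobs for young people\n"
    "Party B,Create jobs for young people\n"
    "Party C,Register to vote before the deadline\n"
)

# Load the posts into a data frame: one row per post, one column per field.
posts = pd.read_csv(raw_csv)

# Drop exact repeats so that each post is only counted once.
unique_posts = posts.drop_duplicates(subset=["party", "text"]).reset_index(drop=True)

print(len(posts), "rows loaded,", len(unique_posts), "rows after removing duplicates")
```

Passing `subset` makes the definition of a duplicate explicit; leaving it out compares all columns.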

Now let us see what is in the data frame by looking at the variables and some of the general structural features of the data.
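A quick inspection along these lines might look like the sketch below. It uses a toy data frame with assumed `party` and `text` columns rather than the real data, and prints the shape, the column types, the first rows, the posts per party and the number of duplicated rows.

```python
import pandas as pd

# Toy data frame; the column names and contents are assumptions.
posts = pd.DataFrame(
    {
        "party": ["Party A", "Party B", "Party B", "Party C"],
        "text": [
            "Fix the water supply in our ward",
            "Create jobs for young people",
            "Support small business to create jobs",
            "Register to vote before the deadline",
        ],
    }
)

print(posts.shape)                    # (number of rows, number of columns)
print(posts.dtypes)                   # type of each variable
print(posts.head())                   # first few rows
print(posts["party"].value_counts())  # number of posts per party

# Rows that exactly repeat an earlier row.
num_duplicates = int(posts.duplicated().sum())
print("Number of Duplicates: ", num_duplicates)
```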

Number of Duplicates:  0

Visualise Textual Data

Although significant time is spent on loading and processing the data, it is just as important to visualise it so that the information makes sense. Data visualisation represents the information in the data frame so that the dataset can be understood at a glance. Note that there are multiple ways to present data beyond traditional data visualisation techniques.
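One simple option, sketched here under stated assumptions, is a bar chart of the most frequent words or two-word phrases. The stop-word list, the tokenisation rule and the example posts are illustrative choices, not the ones used for the figures in this post.

```python
import re
from collections import Counter

import matplotlib.pyplot as plt

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "to", "and", "of", "in", "for", "is", "we", "our"}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def top_ngrams(texts, n=1, k=5):
    """Return the k most common n-word phrases across all texts."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Stop words are removed first, so a phrase can join words that were
        # separated by a stop word in the original post.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


def plot_counts(pairs, title):
    """Draw a horizontal bar chart with the most frequent item at the top."""
    labels = [label for label, _ in pairs]
    values = [value for _, value in pairs]
    fig, ax = plt.subplots(figsize=(6, 3))
    ax.barh(labels[::-1], values[::-1])
    ax.set_xlabel("Occurrences")
    ax.set_title(title)
    fig.tight_layout()
    return fig


# Made-up posts for illustration only.
texts = [
    "Register to vote in the local elections",
    "Local elections are coming, register to vote",
    "Clean water for every ward",
]

words = top_ngrams(texts, n=1, k=3)
phrases = top_ngrams(texts, n=2, k=3)
fig = plot_counts(words, "Most frequent words")
```

In a notebook the figure is displayed automatically; in a script, call `plt.show()` or `fig.savefig(...)`.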

[Figure: plot annotated with the source URL https://dsfsi.github.io/zaelection2021/]

Visualising Twitter Posts from 10 parties

For this visualisation, we randomly sampled 2000 posts from 10 different political parties. We used t-SNE to reduce the data to two dimensions, because the dataset is complex; the same data could also be illustrated in a three-dimensional space. To navigate the visualisation, hover over the graph to see the post represented by each node.
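The general shape of such a pipeline can be sketched with scikit-learn as follows. How the posts were actually turned into vectors is not stated here, so the TF-IDF features and the intermediate `TruncatedSVD` step are assumptions, as are the example posts, the party labels and every parameter value.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Made-up posts and party labels for illustration only.
posts = [
    "fix the water supply in our ward",
    "water outages must end in every ward",
    "clean water and working taps for all",
    "reliable water supply is a basic right",
    "create jobs for young people",
    "youth unemployment needs urgent action",
    "jobs and skills training for the youth",
    "support small business to create jobs",
    "register to vote before the deadline",
    "check your voting station and register",
    "every vote counts in local elections",
    "vote for change in the local elections",
]
parties = ["Party A"] * 4 + ["Party B"] * 4 + ["Party C"] * 4

# Assumption: represent each post as a TF-IDF vector (one dimension per term).
tfidf = TfidfVectorizer().fit_transform(posts)

# Compress the sparse TF-IDF matrix to a few dense dimensions before t-SNE.
reduced = TruncatedSVD(n_components=5, random_state=0).fit_transform(tfidf)

# Project to two dimensions. Perplexity must be smaller than the number of
# posts, so it is kept very low for this tiny sample.
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
coords = tsne.fit_transform(reduced)

print(coords.shape)  # one (x, y) pair per post
```

Each row of `coords` can then be plotted as a point coloured by its entry in `parties`. With a sample of around 2000 posts, a larger perplexity (the scikit-learn default is 30) would be more typical.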


An example of what you can get if you hover over a Twitter post (a point on the graph) is shown below.

[Screenshot of the hover tooltip for a single post]

The tooltip shows that this was a Twitter post by ATM, along with the text of the post.

What can one read from such a visualisation?

We can look at Twitter posts that are similar (they will be near each other) and look for patterns in how the content of each party's posts differs. You can see clusters that each party dominates. The example circled in red below is an EFF cluster.
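The idea of a cluster that one party dominates can also be checked numerically. The sketch below is an illustration rather than part of the original analysis: for each point in a 2D embedding it finds the `k` nearest neighbours and reports which party is most common among them. The coordinates and labels are toy values, and the helper name `neighbourhood_majority` is hypothetical.

```python
from collections import Counter

import numpy as np


def neighbourhood_majority(coords, labels, k=3):
    """For each point, return (most common party among its k nearest
    neighbours, share of those neighbours belonging to that party)."""
    coords = np.asarray(coords, dtype=float)
    # Pairwise Euclidean distances between all points.
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    nearest = np.argsort(dists, axis=1)[:, :k]
    result = []
    for idx in nearest:
        party, count = Counter(labels[j] for j in idx).most_common(1)[0]
        result.append((party, count / k))
    return result


# Toy embedding: two well-separated groups of four points each.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = ["Party A"] * 4 + ["Party B"] * 4

majority = neighbourhood_majority(coords, labels, k=3)
for label, (party, share) in zip(labels, majority):
    print(label, "->", party, f"{share:.0%}")
```

Points whose neighbours mostly share their own party sit inside a party-dominated cluster. Because t-SNE mainly preserves local neighbourhoods, distances between far-apart clusters should be read with caution.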


Now you can navigate the full dataset below.

Resources and References

  • A Moodley, V Marivate. Topic Modelling of News Articles for Two Consecutive Elections in South Africa. 2019 6th International Conference on Soft Computing & Machine Intelligence (ISCMI).
  • V Marivate, A Moodley, A Saba. Extracting and categorising the reactions to COVID-19 by the South African public: A social media study. Proceedings of IEEE AFRICON 2021 (to appear).