A first dive into our text data
In this post, we explore what information and context can be found in the most frequent words and phrases in our data. To do so, we look at how we obtained the data and how we visualised it, and then use a visualisation technique to illustrate posts from 10 different political parties. **Note:** *This post was updated on 1 March 2022 to take into account the removal of retweeted content.*
- Loading our data
- Visualise Textual Data
- Visualising Twitter Posts from 10 parties
- Resources and References
Loading our data
To understand the type of information we are working with in terms of the most frequent words and phrases, the data first needs to be processed. This involves a few data cleaning steps. First, we remove all duplicates from the data. If we skip this step, we will not get an accurate picture of how frequent each word or phrase really is, because a word or phrase may be repeated thousands of times over a series of online replies.
Next, we load our data into a data frame. A data frame is a structured way to represent data using rows and columns. It is one of the most common ways to store data, and it is a useful structure if further analysis is to be performed on the data to gain insights.
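As a minimal sketch of why de-duplication matters, the snippet below counts word frequencies before and after removing exact duplicate posts. The sample posts and the simple whitespace tokenisation are illustrative assumptions, not the cleaning pipeline used for this post.

```python
from collections import Counter

# Hypothetical sample posts; the second and third are exact duplicates
posts = [
    "vote for change today",
    "register to vote",
    "register to vote",
    "change starts with your vote",
]


def word_counts(texts):
    """Count lowercase, whitespace-separated words across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


# dict.fromkeys keeps the first occurrence of each post and preserves order
unique_posts = list(dict.fromkeys(posts))

raw_counts = word_counts(posts)
clean_counts = word_counts(unique_posts)

# Compare how often "register" is counted with and without duplicates
print(raw_counts["register"], clean_counts["register"])
```

In this toy example the duplicated reply doubles the count for "register", which is the kind of inflation the cleaning step is meant to avoid.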
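To make the idea of a data frame concrete, here is a small sketch using pandas. The choice of pandas, the records, and the column names `party` and `text` are assumptions made for illustration; the actual dataset may be stored and named differently.

```python
import pandas as pd

# Hypothetical records; the column names "party" and "text" are assumptions
records = [
    {"party": "Party A", "text": "register to vote"},
    {"party": "Party A", "text": "register to vote"},
    {"party": "Party B", "text": "change starts with your vote"},
]

# Each dictionary becomes a row, each key becomes a column
df = pd.DataFrame(records)

# Drop rows whose text repeats an earlier row, keeping the first occurrence
df = df.drop_duplicates(subset="text").reset_index(drop=True)

print(df)
```

With real data, the records would typically come from a file (for example via `pd.read_csv`) rather than being typed in by hand.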
Now let us see what is in the data frame by looking at the variables and some of the general structural features of the data.
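A quick way to look at the variables and general structure of a data frame is sketched below, again assuming pandas. The small data frame and its column names are hypothetical stand-ins for the real data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data
df = pd.DataFrame({
    "party": ["Party A", "Party B", "Party B"],
    "text": [
        "register to vote",
        "change starts with your vote",
        "join us on Saturday",
    ],
})

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # data type of each column
print(df.head())   # first few rows

# Number of posts per party
posts_per_party = df["party"].value_counts()
print(posts_per_party)
```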
Visualise Textual Data
A significant amount of time is spent on loading and processing the data, but it is just as important to visualise the data so that the information makes sense. Data visualisation is a way to represent the information in the data frame so that the dataset can be understood at a glance. It is important to note that there are multiple ways to present data, beyond traditional data visualisation techniques.
Visualising Twitter Posts from 10 parties
For this visualisation, we randomly sampled 2000 posts from 10 different political parties. We used t-SNE (t-distributed stochastic neighbour embedding) to project the data into two dimensions, because the dataset is complex and has too many dimensions to plot directly. To navigate the visualisation, you can hover over the graph to see the post represented by each node.
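The sketch below shows one way such a two-dimensional projection can be produced with scikit-learn. It is not the code behind the visualisation in this post: the sample posts are made up, and the use of TF-IDF features is an assumption, since the post does not say how the text was turned into vectors.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

# Hypothetical posts; the real visualisation used a sample of 2000 posts
posts = [
    "register to vote this weekend",
    "check your voter registration today",
    "our manifesto puts jobs first",
    "we will create jobs for the youth",
    "join our rally on Saturday",
    "thousands attended the rally today",
    "land reform is our priority",
    "we stand for land and jobs",
]

# Turn each post into a high-dimensional TF-IDF vector (an assumed choice)
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(posts).toarray()

# Project to two dimensions; perplexity must be smaller than the number of posts
tsne = TSNE(n_components=2, perplexity=3, init="random", random_state=0)
coords = tsne.fit_transform(features)

print(coords.shape)  # one (x, y) pair per post
```

Each row of `coords` can then be plotted as a point, with the party as the colour and the post text as the hover label.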
An example of what you can get if you hover over a Twitter post (a point on the graph) is shown below.
The hover text shows that this was a Twitter post by ATM, along with the text of the post.
What can one read from such a visualisation?
We can look at Twitter posts that are similar (they will be near each other) and look for patterns in how the content of each party's posts differs. You can also see clusters that a single party dominates. In the example below, the cluster circled in red is an EFF cluster.
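One rough way to put a number on "clusters that a party dominates" is to check, for each point, how many of its nearest neighbours belong to the same party. The sketch below does this with NumPy on made-up coordinates and labels; the function name `same_party_share` and the choice of `k` are hypothetical and not part of the original analysis.

```python
import numpy as np

# Hypothetical 2D coordinates (for example, t-SNE output) and party labels
coords = np.array([
    [0.0, 0.0], [0.1, 0.2], [0.2, 0.1],   # a tight group of Party A posts
    [5.0, 5.0], [5.1, 5.2], [5.2, 4.9],   # a mixed group
])
parties = np.array(
    ["Party A", "Party A", "Party A", "Party B", "Party B", "Party A"]
)


def same_party_share(coords, parties, k=2):
    """For each point, the fraction of its k nearest neighbours from the same party."""
    # Pairwise Euclidean distances between all points
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    # Indices of the k closest other points for each point
    nearest = np.argsort(dists, axis=1)[:, :k]
    return (parties[nearest] == parties[:, None]).mean(axis=1)


shares = same_party_share(coords, parties, k=2)
print(shares)
```

Points with a share close to 1 sit inside a region dominated by their own party. Keep in mind that t-SNE mainly preserves local neighbourhoods, so this kind of check is more trustworthy for nearby points than for distances between far-apart clusters.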
Now you can navigate the full dataset below.
Resources and References
- A Moodley, V Marivate. Topic Modelling of News Articles for Two Consecutive Elections in South Africa. 2019 6th International Conference on Soft Computing & Machine Intelligence (ISCMI). [Paper URL] [Preprint]
- V Marivate, A Moodley, A Saba. Extracting and categorising the reactions to COVID-19 by the South African public -- A social media study. Proceedings of IEEE AFRICON 2021 (To Appear) [Preprint]