Checking our data
In this post, we load our data sets, check their sample sizes, and combine them. In other words, we check what is in our data and whether it meets our expectations before we perform the analysis. We collected microblog (Twitter) post data using both the Twint and Twarc tools in Python, and once the data are deemed satisfactory and meet all the requirements, we combine them into a single data object for further analysis. **Note:** *This post was updated on 1 March 2022 to take into account the removal of retweeted content.*
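The combining step can be sketched with pandas, which is a common choice for this kind of work (the column names and sample rows below are assumptions, not the actual export schema; a real run would read the tool exports from disk, e.g. with `pd.read_csv`):

```python
import pandas as pd

# Stand-in extracts from each tool; real exports would be loaded from disk,
# e.g. twint_df = pd.read_csv("twint_posts.csv"). Columns are assumed names.
twint_df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2021-09-01", "2021-09-02", "2021-10-05"],
    "username": ["a", "b", "a"],
    "tweet": ["post one", "post two", "post three"],
})
twarc_df = pd.DataFrame({
    "id": [4, 5],
    "date": ["2021-10-10", "2021-10-11"],
    "username": ["c", "a"],
    "tweet": ["post four", "post five"],
})

# Keep only the columns the two exports share, stack them into one object,
# and drop any posts both tools happened to collect (same tweet id)
shared = ["id", "date", "username", "tweet"]
combined = pd.concat(
    [twint_df[shared], twarc_df[shared]], ignore_index=True
).drop_duplicates(subset="id")

print(len(combined))
```

De-duplicating on the tweet id matters here because the two tools covered overlapping date ranges, so the same post can appear in both extracts.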
To find the relevant information, we used a variety of search terms and microblog characteristics to collect the data. Below are the search terms we used to gather the microblog (Twitter) social media posts.
Twint vs. Twarc data collection
We collected data using both the Twint Tool and the Twarc Tool. Each tool uses a different method to extract microblog information from the Twitter platform.
Twint
First, we needed to establish how many social media posts we collected using the Twint Tool. In total, we extracted more than 500k microblog posts.
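Counting the collected posts is a one-liner once the export is loaded; the sketch below uses a tiny stand-in DataFrame so it is self-contained (the file name in the comment is an assumption):

```python
import pandas as pd

# In the real workflow this would be the Twint export, e.g.:
# posts = pd.read_csv("twint_output.csv")
# Here we use a small stand-in sample instead.
posts = pd.DataFrame({"id": [1, 2, 3, 4], "tweet": ["a", "b", "c", "d"]})

# The sample size is simply the number of rows
print(f"Number of posts collected: {len(posts)}")
```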
Next, we needed to stratify this information by the number of daily social media posts, to contextualise the collection over time. Let's visualise the daily number of posts.
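A minimal sketch of the daily stratification, assuming a timestamp column named `created_at` (the column name and sample values are assumptions):

```python
import pandas as pd

# Stand-in sample; a real run would use the collected posts
posts = pd.DataFrame({
    "created_at": [
        "2021-09-01 10:00", "2021-09-01 15:30",
        "2021-09-02 09:12", "2021-10-05 18:45",
    ],
})
posts["created_at"] = pd.to_datetime(posts["created_at"])

# Count posts per calendar day
daily = posts.groupby(posts["created_at"].dt.date).size()
print(daily)

# A simple line chart of daily volume (requires matplotlib):
# daily.plot(kind="line", xlabel="Date", ylabel="Posts per day")
```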
Discussion
As we can see above, the Twint Tool yielded fewer daily social media posts, but we were able to collect from 1 September to 31 October 2021. With the Twarc Tool we collected more data, but only from 1 to 31 October 2021. In other words, the Twint Tool gave us broader coverage in terms of duration, while the Twarc Tool gave us larger quantities over a shorter period. This difference is useful because it lets us see how the frequency of related microblog posts changed over time.
User Information
One of the first analyses we performed was identifying the number of unique users. This is important because some users posted multiple times, and contextualising this adds value in understanding, among other things, the number of original posts per unique user. In total, more than 130k unique users contributed to the dataset.
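With the posts in a DataFrame, the unique-user count reduces to `nunique` on the user column (the column name and sample data below are assumptions):

```python
import pandas as pd

# Stand-in sample; "username" is an assumed column name in the export
posts = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol", "bob", "alice"],
})

# Number of distinct users behind the posts
unique_users = posts["username"].nunique()
print(unique_users)
```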
Next, we needed to identify the number of microblog posts generated by the top users. This gives us an additional way to aggregate the data and to understand how many of the posts in the dataset came from the top users we identified in a previous blog post. Let us see how many Twitter posts the top users sent.
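A per-user post count is a `value_counts` on the user column; the sketch below also computes what share of the corpus the top users account for (sample data and the `username` column name are assumptions):

```python
import pandas as pd

# Stand-in sample; "username" is an assumed column name
posts = pd.DataFrame({
    "username": ["alice", "bob", "alice", "carol", "alice", "bob"],
})

# Posts per user, most active first
per_user = posts["username"].value_counts()
top_users = per_user.head(2)
print(top_users)

# Share of the corpus coming from the top users
share = top_users.sum() / len(posts)
print(f"Top users account for {share:.0%} of posts")
```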
Note: In this work we only reveal the names of users whom we consider public persons, public figures, or organisations (see Guide).
Below we show some of the public persons/organisations in the top 100.