(1) Overview

Introduction

Social media data can provide insights into people’s perceptions of or preferences for specific topics [], and thus has the potential to impact many aspects of our society, such as policies and infrastructure design []. While obtaining a representative sample is often difficult or sometimes impossible [, , , , ], the text data available on platforms like Twitter can be valuable for research, especially given that many entities, from politicians and companies to influential individuals, use the platform to spread ideas, strategies, plans, and proposals. To make this data available to researchers, Twitter developed its own API [], which is free for academic research. However, dealing with authentication, API calls, and response handling can be overwhelming for researchers who have little to no coding experience but who would still benefit from this type of data. In addition, the geographical component associated with geo-tagged tweets requires careful manipulation, given that the text attributes of the tweets come packed in one list in the data entry of the response while the geographical attributes come in a separate list. For these reasons, we developed GTdownloader, a high-level package that offers easy access to the full-archive-search Twitter API endpoint and compiles the retrieved data in standard formats so it can be easily manipulated and analyzed.
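To make the two-list structure concrete, the sketch below illustrates, with made-up tweets and places, the kind of join needed to reattach geographical attributes to tweet texts; the field names follow the Twitter API v2 response layout, but this is an illustration, not the package's actual internal code.

```python
# Minimal sketch of rejoining tweet texts ("data") with their geographical
# attributes ("includes.places"), linked through geo.place_id. The tweets
# and places below are invented for illustration.
response = {
    "data": [
        {"id": "1", "text": "Great bike lanes!", "geo": {"place_id": "abc"}},
        {"id": "2", "text": "Commuting by bike", "geo": {"place_id": "xyz"}},
    ],
    "includes": {
        "places": [
            {"id": "abc", "full_name": "Chicago, IL", "country": "United States"},
            {"id": "xyz", "full_name": "Portland, OR", "country": "United States"},
        ]
    },
}

# Index the places by id, then attach each tweet's place record to the tweet.
places = {p["id"]: p for p in response["includes"]["places"]}
joined = [
    {**tweet, "place": places.get(tweet.get("geo", {}).get("place_id"))}
    for tweet in response["data"]
]
```

GTdownloader performs this kind of reconciliation automatically before writing its output files.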

Although other interfaces exist to retrieve and analyze data from Twitter [, ], they are either not available in Python or they are not compatible with the current version of the API. The closest package identified in our search is TTLocVis [], which also offers geographical data pre-visualization; however, for the most part, it offers static visualizations, and its main focus is topic modeling, which is out of the scope of GTdownloader.

Implementation and architecture

The software implementation can be understood in terms of its two main classes: the TweetDownloader() class and the GeoMethods() class. The TweetDownloader class serves as the interface between the user and the API. Instead of having to build the entire body of the request, the user can interact with the API through Python methods with familiar, Pythonic signatures. For instance, a query in the form:


{'query': query,
 'start_time': start_time,
 'end_time': end_time,
 'expansions': 'geo.place_id,author_id',
 'place.fields': 'contained_within,country,country_code,full_name,geo,id',
 'tweet.fields': 'created_at,author_id,id,public_metrics,conversation_id',
 'user.fields': 'id,location,name,username,public_metrics',
 'max_results': max_page}

is equivalent to the following class method call:


get_tweets(query=query, start_time=start_time,
        end_time=end_time, max_tweets=max_tweets
        )

Some of the query parameters of the request body are not exposed as Python arguments. We implemented it this way because we consider these parameters crucial for research; hence, they are included in all requests made by the package. In addition to simplifying the body of the requests into the get_tweets() method, the optional query parameters are also included as arguments in the same method. To illustrate, the following query searches for tweets mentioning the FIFA World Cup that are written in English and retrieves only original tweets (i.e., no retweets):


query = "(FIFA World Cup) -is:retweet lang:en"

Using GTdownloader, it would translate into this:


get_tweets("FIFA World Cup", lang="en", include_retweets=False)
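The mapping between keyword arguments and the raw query syntax can be sketched with a small helper; note that build_query is a hypothetical function written for illustration, not part of the package's public API.

```python
# Hypothetical helper showing how keyword arguments could be translated into
# the Twitter API's raw query operators (-is:retweet, lang:).
def build_query(text, lang=None, include_retweets=True):
    parts = [f"({text})"]
    if not include_retweets:
        parts.append("-is:retweet")  # exclude retweets from the results
    if lang:
        parts.append(f"lang:{lang}")  # restrict to one language
    return " ".join(parts)

build_query("FIFA World Cup", lang="en", include_retweets=False)
# -> '(FIFA World Cup) -is:retweet lang:en'
```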

The Twitter API can retrieve a maximum of 500 tweets per call. If more tweets are needed, it is necessary to handle the API response pagination to get one page of results at a time by keeping track of the token generated in each response to identify the corresponding next page. GTdownloader takes care of this process so the user does not have to deal with this limitation regardless of the desired number of tweets.
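The token-tracking loop described above can be sketched as follows; this is a simplified illustration of the pagination pattern, assuming a fetch_page callable that wraps one API call and returns the parsed JSON response, not the package's actual internals.

```python
# Sketch of token-based pagination: each response carries a meta.next_token
# identifying the following page; downloading stops when the token is absent
# or enough tweets have been collected.
def download_all(fetch_page, max_tweets):
    """Collect tweets across pages until max_tweets or the last page."""
    tweets, token = [], None
    while len(tweets) < max_tweets:
        page = fetch_page(next_token=token)
        tweets.extend(page.get("data", []))
        token = page.get("meta", {}).get("next_token")
        if token is None:  # no next_token means the last page was reached
            break
    return tweets[:max_tweets]
```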

Figure 1 illustrates how both classes interact with each other and the API, and it lists the possible outputs from their methods.

Figure 1 

Package logical structure and outputs.

Aside from providing output files in standard formats for data analysis and Geographic Information Systems (GIS) post-processing, we leverage matplotlib [] and Plotly [] to include methods that allow users to visualize the tweets and their locations in both static and interactive graphs. Further, we use the Wordcloud package [] to generate a word cloud of the most commonly used words; the plotting function takes stopwords (words to exclude from the plot) as an argument.

Demonstration of functionality

Before downloading tweets from the API, users must ensure they have access to the API user keys provided by Twitter for researchers. Academic Researcher access applications can be submitted through the Twitter developer portal. Applicants must provide details of the project they wish to use the data for and demonstrate they are academic researchers. Once the access is granted, Twitter keys can be obtained from the developer portal. GTdownloader reads the access tokens from a .yaml file so the keys are not exposed in the code.
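GTdownloader builds on the searchtweets-v2 library, whose credential files typically follow the fragment below; the exact key names shown here are an assumption based on that library's conventions and should be checked against its documentation.

```yaml
# twitter_keys.yaml (placeholders, never commit real keys)
search_tweets_v2:
  endpoint: https://api.twitter.com/2/tweets/search/all
  consumer_key: <YOUR_CONSUMER_KEY>
  consumer_secret: <YOUR_CONSUMER_SECRET>
  bearer_token: <YOUR_BEARER_TOKEN>
```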


from gtdownloader import TweetDownloader
 
gtd = TweetDownloader(credentials='twitter_keys.yaml', name='Bike_commuting_project')
gtd.get_tweets(
        query='bike commuting',
        lang='en',
        start_time='01/01/2019',
        end_time='12/31/2021',
        max_tweets=2000
        )

After importing GTdownloader and creating a TweetDownloader() instance by passing the path to the Twitter keys, the get_tweets() method can be executed with its corresponding query and arguments. In this demonstration, we intend to retrieve tweets in English on bike commuting that were generated between January 1st, 2019 and December 31st, 2021. We set the maximum number of tweets to 2000.

While executing the method, the console displays progress messages indicating the downloaded pages, the next page token, the number of tweets gathered, and the name of the file containing the downloaded tweets. A sample of this output is displayed below:


Downloading tweets…
Current progress saved at: Bike_commuting_downloads\temp_Bike_commuting_project_08032022_171311.csv
Ending page 1 with next_token=b26v89c19zqg8o3fpz2m17r4qqlvzhsejuwhysusao1a5. 496 tweets retrieved (496 total)
Current progress saved at: Bike_commuting_downloads\temp_Bike_commuting_project_08032022_171311.csv
Tweets download done. A total of 766 tweets were retrieved.
csv files: Bike_commuting_downloads\Bike_commuting_project_tweets_08032022_171311.csv, Bike_commuting_downloads\Bike_commuting_project_places_08032022_171311.csv, and Bike_commuting_downloads\Bike_commuting_project_authors_08032022_171311.csv were generated

After downloading the tweets, the centroids of the tweets can be quickly reviewed and mapped (see Figure 2) by calling the preview_tweet_locations() method:

Figure 2 

Simple map pre-visualization of tweets using bounding box centroids as the geographical unit.


gtd.preview_tweet_locations()
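The bounding-box centroid used as the geographical unit in Figure 2 can be computed as the midpoint of the box; the sketch below assumes the [west, south, east, north] coordinate order used by GeoJSON-style bounding boxes, and the example coordinates are illustrative.

```python
# Illustrative computation of a bounding-box centroid, assuming the
# [west, south, east, north] coordinate order of GeoJSON-style bboxes.
def bbox_centroid(bbox):
    west, south, east, north = bbox
    return ((west + east) / 2, (south + north) / 2)  # (lon, lat)

bbox_centroid([-87.94, 41.64, -87.52, 42.02])  # approximate Chicago-area bbox
```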

The resulting interactive map displays tweet data such as text and location upon hovering. Panning, zooming in and out, and snapshot saving are available in the visualization. Figure 3 shows an aggregated version of the interactive map that groups the tweets by location and displays bubbles whose size depends on the number of tweets at the given location:

Figure 3 

Interactive visualization displaying aggregated tweets using point size to represent tweet counts per location.


gtd.interactive_map_agg()
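The aggregation behind the bubble sizes amounts to counting tweets per location, which can be sketched as follows (the place names are invented for illustration):

```python
from collections import Counter

# Count tweets per place; each count would drive one bubble's size on the map.
tweet_places = ["Chicago, IL", "Portland, OR", "Chicago, IL", "Chicago, IL"]
bubble_sizes = Counter(tweet_places)
```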

As shown in Figure 4, a time_unit (year, month, day, hour, minute, or second) can be selected for a map-based animation of the tweets:

Figure 4 

Interactive animation using a user-defined time unit to display the evolution of tweets aggregated per location.


gtd.map_animation(time_unit='month')
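One way to key animation frames by a user-selected time unit is to truncate each tweet's timestamp with a per-unit format string; frame_key and its format table below are hypothetical, written only to illustrate the grouping idea.

```python
from datetime import datetime

# Hypothetical mapping of timestamps to animation-frame keys, one key per
# user-selected time unit (only three units shown for brevity).
FRAME_FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d"}

def frame_key(ts, time_unit):
    """Truncate an ISO timestamp to the chosen time unit."""
    return datetime.fromisoformat(ts).strftime(FRAME_FORMATS[time_unit])

frame_key("2019-05-17T10:30:00", "month")  # -> '2019-05'
```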

Finally, as shown in Figure 5, a wordcloud function lets us visualize the most common words in the downloaded tweets. Notice that we use the custom_stopwords parameter to exclude the query words and the http and https tags that may arise from URL posting.

Figure 5 

Wordcloud generated from the tweets and excluding user defined stopwords.


gtd.wordcloud(
        custom_stopwords=['bike', 'commuting', 'http', 'https'],
        background_color='white')

Quality control

GTdownloader is tested using a unit testing framework. Unit tests have been included in the software repository and are available to all users. The tests cover two main components: API transactions and data exports. The first component tests that all queries are built correctly from the input parameters and that a successful response is obtained. The second component ensures the response is handled correctly and that the output formats are built correctly.
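The shape of the export-side tests can be illustrated with a minimal unittest case; tweets_to_rows is a made-up helper standing in for the package's actual export logic.

```python
import unittest

# Hypothetical helper standing in for the package's export logic: flatten the
# response's "data" list into csv-ready (id, text) rows.
def tweets_to_rows(response):
    return [(t["id"], t["text"]) for t in response.get("data", [])]

class TestDataExports(unittest.TestCase):
    def test_rows_built_from_response(self):
        resp = {"data": [{"id": "1", "text": "hello"}]}
        self.assertEqual(tweets_to_rows(resp), [("1", "hello")])

    def test_empty_response_yields_no_rows(self):
        self.assertEqual(tweets_to_rows({}), [])
```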

(2) Availability

Operating system

Works on all operating systems supporting Python.

Programming language

Python 3.5 or higher

Dependencies

searchtweets-v2, Plotly, Geopandas, Wordcloud

List of contributors

Juan Acosta-Sequeda, Sybil Derrible

Software location

Archive

Name: PyPI

Persistent identifier: https://pypi.org/project/gtdownloader/

Licence: MIT

Publisher: Juan Acosta-Sequeda

Version published: 0.1.19

Date published: 12/10/22

Name: conda-forge

Persistent identifier: https://anaconda.org/conda-forge/gtdownloader

Licence: MIT

Publisher: Juan Acosta-Sequeda

Version published: 0.1.15

Date published: 13/10/22

Code repository

Name: Zenodo

Identifier: 10.5281/zenodo.7710329

Licence: MIT

Date published: 03/09/23

Name: GitHub

Identifier: https://github.com/jugacostase/gtdownloader

Licence: MIT

Date published: 02/08/22

Language

English

(3) Reuse potential

This package is useful for researchers seeking to use the Twitter API to retrieve tweets from the full Twitter archive. The geographical data associated with each tweet enables both map visualizations and the use of geostatistics for an in-depth analysis of the data with inference potential.

Given that GTdownloader offers high-level methods and access to the query parameters through function arguments, it is well suited to be incorporated into more complex pipelines that might include automated searches, text and sentiment analysis models, metrics tracking, and geographical dashboards.

The full documentation and reference of the package are provided online, as well as installation instructions. Moreover, as authors we will do our best to provide support for user requests, which can be submitted in the form of GitHub issues.