    Consuming, Parsing, and Processing Tweets with Python

    posted on 27 October 2014 by Jeff Kolb


    Introduction

    Python is a powerful, easily readable, and well-documented scripting
    language that is well suited for data exploration and analysis.
    The purpose of this article is to explore a Python script that
    performs a simple but complete parsing of JSON-formatted social media data,
    such as would be streamed or downloaded from a Gnip API endpoint.
    The article is meant to be accessible to readers with little to no previous experience with Python.
    In this example, we’ll start by loading some Twitter data, then extract the Tweet actor ID
    and the language identification, and finally print the comma-separated output.

    The general form of this script is useful for the extraction of any combination of
    tweet elements, and it is easily modified to perform conditional logic,
    data manipulation, or arithmetic and other aggregations.
    By interfacing with Python’s many add-on packages,
    the user can incorporate a database interface, robust statistical treatments,
    machine learning, data visualization,
    and many other complex techniques.

    Data Input

    The first thing to do is to get data into our program.
    In Python, interaction with the operating system is typically accomplished with the sys module.
    Here we consider two input methods that make use of this module: reading from a file
    whose name is given as a command-line argument, and reading from the standard input stream.

    In both cases, we must first import the sys module. Positional parameters from the
    command line invocation of python are stored in the variable sys.argv, which is an array, or list.
    For example, the Unix command:

    $ python my_script.py par_1 par_2
    

    will create a Python environment in which the variable sys.argv is a list containing three elements:
    the script name my_script.py, followed by par_1 and par_2.
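
    To see this behavior directly, consider a minimal script that simply prints the argument list
    (the file name show_args.py is hypothetical, used only for illustration):

    import sys
    
    # sys.argv[0] is the script name; any remaining elements are the
    # positional parameters from the command line
    print(sys.argv)
    

    Running python show_args.py par_1 par_2 prints ['show_args.py', 'par_1', 'par_2'].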

    In our script we want the option of providing the input file name as a command line parameter.
    We’ll test the length of sys.argv to see if this parameter exists,
    remembering that the first element of the list always contains the name of the script being run.
    If the input file name is found, we’ll open it and store the resulting object in the variable line_generator.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    

    If, instead of providing an input file name, we pipe the data directly into the script via standard input,
    we’ll associate line_generator with the stdin stream, again using the sys module.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    

    The data can then be processed, line-by-line, in a for loop.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    for line in line_generator:
        pass  ### analyze line ###
    

    Tweet Parsing

    Now that we have the tweets at our fingertips, let’s do something interesting with them.
    Because each tweet is represented by a JSON-formatted string on a single line,
    the first analysis task is to transform this string into a more useful Python object.
    Since the JSON format is specified in terms of key/value pairs,
    we’ll use Python’s dictionary type. The json module
    provides a mapping from JSON-formatted strings to dictionaries with its loads function.

    import json
    for line in line_generator:
        line_object = json.loads(line)
    

    The resulting line_object variable is a dictionary.

    Here’s an example that extracts the actor’s ID and language for each tweet.
    More information about the keys needed to extract a particular element of a record
    can be found in the Twitter data format documentation.

    import json
    for line in line_generator:
        line_object = json.loads(line)
    
        actor_id_string = line_object["actor"]["id"]
        actor_id = int( actor_id_string.split(":")[2] )
        language_code = line_object["twitter_lang"]
    

    In the code snippet above, note that the actor_id_string variable is split into three colon-separated pieces,
    the third of which is a string containing the actual ID. To extract this number,
    we use the split method of the str (string) type, which returns a list of substrings.
    We use the zero-based index 2 in square brackets to select the third of the colon-separated substrings,
    which is then cast to an integer.
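
    As a concrete illustration, here is the same logic applied to a standalone string;
    the sample value follows the id:twitter.com:<number> form that actor IDs take in Gnip records:

    # a sample actor ID string; the numeric part is taken from the output shown later
    actor_id_string = "id:twitter.com:562505486"
    
    pieces = actor_id_string.split(":")   # ['id', 'twitter.com', '562505486']
    actor_id = int( pieces[2] )           # 562505486, now an integer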

    Error Handling

    What happens if the line_object object does not contain, for example, a key called twitter_lang?
    Python will raise a KeyError exception. We can handle the error condition
    by catching the exception with the try/except construction:

    import json
    for line in line_generator:
        line_object = json.loads(line)
        try:
            actor_id_string = line_object["actor"]["id"]
            actor_id = int( actor_id_string.split(":")[2] )
            language_code = line_object["twitter_lang"]
        except KeyError:
            actor_id = -1
            language_code = "Null"
    

    In the case that a KeyError exception is raised, we’ve chosen to set both variables to values
    that indicate that an error has occurred.

    Data Output

    So what do we do with the data once they’re extracted into variables?
    Here, we will use the format method of the str data type,
    which inserts values into a custom-formatted string.

    print_string = "{0:12d}, {1:2s}".format(actor_id,language_code)
    

    The curly brace pairs are replacement fields which indicate where in the string
    to insert the arguments to the format method.
    Inside the braces, the number to the left of the colon is the index of the argument to be inserted.
    The number directly after the colon is the minimum width of the field, and the letter gives the
    presentation type: d for a decimal integer and s for a string.
    Note that the comma and the space that follows it are the only characters printed literally
    from our template. A newline will be added by the print call below.
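
    To make the format specification concrete, here is the result for one of the sample records
    shown later in this article, along with the error-case values chosen above:

    print_string = "{0:12d}, {1:2s}".format(562505486, "ja")
    # print_string is now "   562505486, ja"; the integer is right-aligned in a 12-character field
    
    error_string = "{0:12d}, {1:2s}".format(-1, "Null")
    # error_string is now "          -1, Null"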

    To print to the screen, we simply call the print function on the print_string variable.
    Let’s put all the code together:

    import sys
    import json
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    for line in line_generator:
        line_object = json.loads(line)
        try:
            actor_id_string = line_object["actor"]["id"]
            actor_id = int( actor_id_string.split(":")[2] )
            language_code = line_object["twitter_lang"]
        except KeyError:
            actor_id = -1
            language_code = "Null"
        print_string = "{0:12d}, {1:2s}".format(actor_id,language_code)
        print(print_string)
    
    

    A call to this script might look like:

    $ cat my_tweets.json | python my_script.py
       562505486, ja
      1624484822, en
       291831569, es
       592115739, en
      1287328680, fr
      1646041412, ar
      1337618864, es
       423833423, en
      2378519401, ja
      1881685154, ja
    

    An equivalent call, using a command line argument to specify the input file name, is:

    $ python my_script.py my_tweets.json
    

    Doing More

    This script can be easily extended to perform more interesting analyses.
    For example, one might initialize a dictionary type variable to count
    the number of tweets associated with each language. Python’s datetime module
    provides a set of convenient data structures for storing dates and times.
    With this capability, a user might extract the posted time of the tweets and count,
    on an hourly basis, the number of tweets per language.
    The matplotlib module has a good tutorial
    for plotting time series data.
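
    As a sketch of that idea, the following variation counts tweets per language in hourly buckets.
    It assumes that each record carries a postedTime field in the ISO-8601 form found in Gnip
    activity records (e.g. 2014-10-27T13:00:00.000Z); treat it as a starting point rather than
    a definitive implementation.

    import sys
    import json
    from collections import Counter
    from datetime import datetime
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    hourly_language_counts = Counter()
    
    for line in line_generator:
        line_object = json.loads(line)
        try:
            language_code = line_object["twitter_lang"]
            # parse the assumed timestamp form, e.g. "2014-10-27T13:00:00.000Z"
            posted_time = datetime.strptime(line_object["postedTime"],
                                            "%Y-%m-%dT%H:%M:%S.%fZ")
        except (KeyError, ValueError):
            # skip records missing either field, or with an unexpected timestamp format
            continue
        # truncate the timestamp to the hour and count (hour, language) pairs
        hour = posted_time.replace(minute=0, second=0, microsecond=0)
        hourly_language_counts[(hour, language_code)] += 1
    
    for (hour, language_code), count in sorted(hourly_language_counts.items()):
        print("{0}, {1:2s}, {2:6d}".format(hour.isoformat(), language_code, count))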

    Here is a short list of useful Python modules, including those mentioned above:

    • sys - interact with the operating system
    • os - interact with the filesystem
    • json - translate between JSON-formatted data and Python dictionaries
    • datetime - work with dates and times
    • matplotlib - create data visualizations
    • collections - numerous data container classes
    • hashlib - hash functions
    • yaml - work with the YAML data format
    • fileinput - simplified interface to much of what was presented in the example above (see the sketch after this list)
    • logging - manage the posting, the severity, and the storage of log messages
    • zlib - compress/decompress data
    • dataset - simple relational database interface
    • scipy - scientific computing tools (the closely related numpy package is particularly useful)
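
    As an example of the fileinput entry above: the input-handling logic at the top of our script
    can be replaced by a single call to fileinput.input(), which iterates over lines from the files
    named on the command line, or from standard input when no file names are given:

    import fileinput
    import json
    
    # fileinput.input() reads from the files listed in sys.argv[1:],
    # falling back to sys.stdin when no file names are given
    for line in fileinput.input():
        line_object = json.loads(line)
        ### analyze line_object as before ###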