    Consuming, Parsing, and Processing Tweets with Python

    posted on 27 October 2014 by Jeff Kolb


    Introduction

    Python is a powerful, easily readable, and well-documented scripting
    language that is well suited for data exploration and analysis.
    The purpose of this article is to explore a Python script that
    performs a simple but complete parsing of JSON-formatted social media data,
    such as would be streamed or downloaded from a Gnip API endpoint.
    The article is meant to be accessible to readers with little to no previous experience with Python.
    In this example, we’ll start by loading some Twitter data, then extract the Tweet actor ID
    and the language identification, and finally print the comma-separated output.

    The general form of this script is useful for the extraction of any combination of
    tweet elements, and it is easily modified to perform conditional logic,
    data manipulation, or arithmetic and other aggregations.
    By interfacing with Python’s many add-on packages,
    the user can incorporate a database interface, robust statistical treatments,
    machine learning, data visualization,
    and many other complex techniques.

    Data Input

    The first thing to do is to get data into our program.
    In Python, interaction with the operating system is typically accomplished with the sys module.
    Here we consider two input methods that make use of this module: reading from a file
    whose name is given as a command-line argument, and reading from the standard input stream.

    In both cases, we must first import the sys module. Positional parameters from the
    command line invocation of python are stored in the variable sys.argv, which is an array, or list.
    For example, the Unix command:

    $ python my_script.py par_1 par_2
    

    will create a Python environment in which the variable sys.argv is a list containing three elements:
    the script name my_script.py, followed by par_1 and par_2.
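
    To see this behavior directly, consider a minimal script that simply prints the argument list
    (the file name show_args.py is hypothetical, used only for illustration):

    import sys
    
    # sys.argv[0] is the script name; any remaining elements are the
    # positional parameters from the command line
    print(sys.argv)
    

    Running python show_args.py par_1 par_2 prints ['show_args.py', 'par_1', 'par_2'].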

    In our script we want the option of providing the input file name as a command line parameter.
    We’ll test the length of sys.argv to see if this parameter exists,
    remembering that the first element of the list always contains the name of the script being run.
    If the input file name is found, we’ll open it and store the resulting object in the variable line_generator.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    

    If, instead of providing an input file name, we pipe the data directly into the script via standard input,
    we’ll associate line_generator with the stdin stream, again using the sys module.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    

    The data can then be processed, line-by-line, in a for loop.

    import sys
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    for line in line_generator:
        pass  ### analyze line ###
    

    Tweet Parsing

    Now that we have the tweets at our fingertips, let’s do something interesting with them.
    Because each tweet is represented by a JSON-formatted string on a single line,
    the first analysis task is to transform this string into a more useful Python object.
    Since the JSON format is specified in terms of key/value pairs,
    we’ll use Python’s dictionary type. The json module
    provides a mapping from JSON-formatted strings to dictionaries with its loads function.

    import json
    for line in line_generator:
        line_object = json.loads(line)
    

    The resulting line_object variable is a dictionary.

    Here’s an example that extracts the actor’s ID and language for each tweet.
    More information about the keys needed to extract a particular element of a record
    can be found in the Twitter data format documentation.

    import json
    for line in line_generator:
        line_object = json.loads(line)
    
        actor_id_string = line_object["actor"]["id"]
        actor_id = int( actor_id_string.split(":")[2] )
        language_code = line_object["twitter_lang"]
    

    In the code snippet above, note that the actor_id_string variable is split into three colon-separated pieces,
    the third of which is a string containing the actual ID. To extract this number,
    we use the split method of the str (string) type, which returns a list of substrings.
    We use the zero-based index 2 in square brackets to select the third of the colon-separated substrings,
    which is then cast to an integer.
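
    As a concrete illustration, here is the same logic applied to a standalone string;
    the sample value follows the id:twitter.com:<number> form that actor IDs take in Gnip records:

    # a sample actor ID string; the numeric part is taken from the output shown later
    actor_id_string = "id:twitter.com:562505486"
    
    pieces = actor_id_string.split(":")   # ['id', 'twitter.com', '562505486']
    actor_id = int( pieces[2] )           # 562505486, now an integer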

    Error Handling

    What happens if the line_object object does not contain, for example, a key called twitter_lang?
    Python will raise a KeyError exception. We can handle the error condition
    by catching the exception with the try/except construction:

    import json
    for line in line_generator:
        line_object = json.loads(line)
        try:
            actor_id_string = line_object["actor"]["id"]
            actor_id = int( actor_id_string.split(":")[2] )
            language_code = line_object["twitter_lang"]
        except KeyError:
            actor_id = -1
            language_code = "Null"
    

    In the case that a KeyError exception is raised, we’ve chosen to set both variables to values
    that indicate that an error has occurred.

    Data Output

    So what do we do with the data once they’re extracted into variables?
    Here, we will use the format method of the str data type,
    which inserts values into a custom-formatted string.

    print_string = "{0:12d}, {1:2s}".format(actor_id,language_code)
    

    The curly brace pairs are replacement fields which indicate where in the string
    to insert the arguments to the format method.
    Inside the braces, the number to the left of the colon is the index of the argument to be inserted.
    The number directly after the colon is the minimum width of the field, and the letter gives the
    presentation type: d for a decimal integer and s for a string.
    Note that the comma and the space that follows it are the only characters printed literally
    from our template. A newline will be added by the print call below.
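
    To make the format specification concrete, here is the result for one of the sample records
    shown later in this article, along with the error-case values chosen above:

    print_string = "{0:12d}, {1:2s}".format(562505486, "ja")
    # print_string is now "   562505486, ja"; the integer is right-aligned in a 12-character field
    
    error_string = "{0:12d}, {1:2s}".format(-1, "Null")
    # error_string is now "          -1, Null"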

    To print to the screen, we simply call the print function on the print_string variable.
    Let’s put all the code together:

    import sys
    import json
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    for line in line_generator:
        line_object = json.loads(line)
        try:
            actor_id_string = line_object["actor"]["id"]
            actor_id = int( actor_id_string.split(":")[2] )
            language_code = line_object["twitter_lang"]
        except KeyError:
            actor_id = -1
            language_code = "Null"
        print_string = "{0:12d}, {1:2s}".format(actor_id,language_code)
        print(print_string)
    
    

    A call to this script might look like:

    $ cat my_tweets.json | python my_script.py
       562505486, ja
      1624484822, en
       291831569, es
       592115739, en
      1287328680, fr
      1646041412, ar
      1337618864, es
       423833423, en
      2378519401, ja
      1881685154, ja
    

    An equivalent call, using a command line argument to specify the input file name, is:

    $ python my_script.py my_tweets.json
    

    Doing More

    This script can be easily extended to perform more interesting analyses.
    For example, one might initialize a dictionary type variable to count
    the number of tweets associated with each language. Python’s datetime module
    provides a set of convenient data structures for storing dates and times.
    With this capability, a user might extract the posted time of the tweets and count,
    on an hourly basis, the number of tweets per language.
    The matplotlib module has a good tutorial
    for plotting time series data.
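
    As a sketch of that idea, the following variation counts tweets per language in hourly buckets.
    It assumes that each record carries a postedTime field in the ISO-8601 form found in Gnip
    activity records (e.g. 2014-10-27T13:00:00.000Z); treat it as a starting point rather than
    a definitive implementation.

    import sys
    import json
    from collections import Counter
    from datetime import datetime
    
    if len(sys.argv) > 1:
        line_generator = open(sys.argv[1])
    else:
        line_generator = sys.stdin
    
    hourly_language_counts = Counter()
    
    for line in line_generator:
        line_object = json.loads(line)
        try:
            language_code = line_object["twitter_lang"]
            # parse the assumed timestamp form, e.g. "2014-10-27T13:00:00.000Z"
            posted_time = datetime.strptime(line_object["postedTime"],
                                            "%Y-%m-%dT%H:%M:%S.%fZ")
        except (KeyError, ValueError):
            # skip records missing either field, or with an unexpected timestamp format
            continue
        # truncate the timestamp to the hour and count (hour, language) pairs
        hour = posted_time.replace(minute=0, second=0, microsecond=0)
        hourly_language_counts[(hour, language_code)] += 1
    
    for (hour, language_code), count in sorted(hourly_language_counts.items()):
        print("{0}, {1:2s}, {2:6d}".format(hour.isoformat(), language_code, count))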

    Here is a short list of useful Python modules, including those mentioned above:

    • sys - interact with the operating system
    • os - interact with the filesystem
    • json - translate between JSON-formatted data and Python dictionaries
    • datetime - work with dates and times
    • matplotlib - create data visualizations
    • collections - numerous data container classes
    • hashlib - hash functions
    • yaml - work with the YAML data format
    • fileinput - simplified interface to much of what was presented in the example above (see the sketch after this list)
    • logging - manage the posting, the severity, and the storage of log messages
    • zlib - compress/decompress data
    • dataset - simple relational database interface
    • scipy - scientific computing tools (the closely related numpy package is particularly useful)
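
    As an example of the fileinput entry above: the input-handling logic at the top of our script
    can be replaced by a single call to fileinput.input(), which iterates over lines from the files
    named on the command line, or from standard input when no file names are given:

    import fileinput
    import json
    
    # fileinput.input() reads from the files listed in sys.argv[1:],
    # falling back to sys.stdin when no file names are given
    for line in fileinput.input():
        line_object = json.loads(line)
        ### analyze line_object as before ###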