The goal of data visualization is to transform numbers into insights. However, default data visualization output often disappoints. Sometimes, the graph shows irrelevant data or misses important aspects; sometimes, the graph lacks context; sometimes, it’s difficult to read. Often, data practitioners “feel” that something isn’t right with the graph, but cannot pinpoint the problem.
In this post, I’ll share the process of visualizing a complex issue using a simple plot. Although the final plot looks elementary and straightforward, it took me several hours of trial and error to achieve the result. By sharing this process, I hope to accomplish two goals: to offer my perspectives and approaches to data visualization, and to learn from the alternatives you suggest. You’ll find the code and the data used in this post here.
Plotting power distribution in the Knesset
This post is devoted to a graph I created to explore power concentration in the Israeli Parliament (the Knesset) over time. For context, I’ll share some facts about Israeli politics. The State of Israel was founded in 1948 as a parliamentary republic. The Israeli Parliament is elected based on proportional nationwide representation. For many years, the larger parties claimed that the smaller ones had disproportionately high influence on Israeli politics because governments could only be formed based on fragile multi-party coalitions. In an attempt to solve this issue, in 1996 Israeli citizens got the right to select their prime minister in a direct vote. Contrary to the original intent, the direct vote for a prime minister spurred small-party growth at the expense of the larger ones, leading to even more stagnation. The legislators reverted the change, but the genie had already left the bottle: the small parties are not small anymore, and creating coalitions is harder today than it was in the past.
To explore unequal seat distribution in the Israeli Parliament, I used the Gini coefficient, a metric originally used to measure income disparity between people. When all the income belongs to a single entity, the Gini coefficient equals one. When the income is equally distributed, the coefficient equals zero. I adapted the Gini coefficient to measure seat disparity over time. In the context of the Knesset’s composition, a high Gini coefficient means a few large, strong parties and some small ones; a lower Gini coefficient indicates a more equal power distribution.
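To make the metric concrete, here is a minimal sketch of one common way to compute a Gini coefficient from a list of seat counts. The function name `gini` and the seat splits in the usage lines are hypothetical, and the formula may differ in detail from the one used for the plots in this post.

```python
def gini(seats):
    """Gini coefficient of a list of non-negative seat counts.

    0 means every party holds the same number of seats. With n parties,
    the maximum is (n - 1) / n, which approaches 1 as n grows.
    """
    values = sorted(seats)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one party with a non-zero seat count")
    # Rank-weighted sum: larger parties get larger ranks after sorting.
    weighted = sum(rank * v for rank, v in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical 120-seat parliaments with four parties each.
equal_split = gini([30, 30, 30, 30])   # 0.0: perfectly equal
one_party = gini([0, 0, 0, 120])       # 0.75: the maximum for four parties
```

Note that with a finite number of parties, the value for "one party holds everything" is slightly below one; it only reaches one in the limit of many parties.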
Having collected the data, I could start plotting. For graphs in this post, I used matplotlib, a powerful visualization framework for Python. This is how a default matplotlib plot looks:
Personally, I don’t like the aesthetics of the default matplotlib plot, specifically, that the plot area is surrounded by four spines. In matplotlib terminology, spines are lines that are drawn around the plot. The default plot has four such spines.
A short but important diversion
Let us divert from the story and talk about data-ink. The term “data-ink” was defined by Professor Edward Tufte as: “the non-erasable core of a graphic, the non-redundant ink arranged in response to variation in the numbers represented.” According to Tufte, “Ink that fails to depict statistical information does not have much interest to the viewer.” You may find a short description of this approach, along with criticism of it, in this document and in this research paper.
One example of avoiding unnecessary “ink” is the X axis of our plot. Since most people will immediately recognize numbers such as 1957, 1977, or 2007 as years of the current era, I intentionally didn’t add a title to that axis, resulting in less “ink” used to convey the same information.
Improving aesthetics with background color
Let us go back to the plot spines. Taking the data-ink ratio principle to the extreme, one might think that the spines provide no useful information and thus should be removed from any plot. The truth is that the spines form the document’s structure, aiding reading and interpretation. However, I think that there are better ways to achieve the same goals. One alternative to using spines is to fill the plot area with a pale neutral color. I used Seaborn, a Python visualization library based on matplotlib that provides a high-level interface for drawing attractive statistical graphics. By merely importing Seaborn, I improved the plot’s aesthetics:
If we plan to use the graph in a printed or on-screen document, we may use the same color to fill the entire plot area. By doing so, we help the reader to distinguish between the plot and the accompanying text:
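For readers who want to reproduce the filled background, here is a minimal sketch using matplotlib alone. The color value and the plotted numbers are placeholders chosen for illustration, not the ones used in this post. Note also that recent Seaborn releases no longer restyle plots on import; there, an explicit call such as `sns.set_theme()` is needed to get the Seaborn look.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

BACKGROUND = "#eaeaf2"  # pale neutral tone (an assumption, not the post's exact color)

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [0.50, 0.55, 0.45, 0.40])  # placeholder values

fig.patch.set_facecolor(BACKGROUND)  # area around the axes, including the labels
ax.set_facecolor(BACKGROUND)         # the plot area itself
for spine in ax.spines.values():
    spine.set_visible(False)         # the fill replaces the four border lines
```

Using the same color for the figure and the axes makes the whole graph read as one block next to the surrounding text.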
I liked the newer version, but there is more that we can do to improve this graph. Between the elections, the Knesset seat distribution, and thus the Gini coefficient, remains approximately the same. To reflect this, a step plot is a better choice for the data.
In matplotlib, one of the arguments to the step function is the step position. I used the post option so that the line bends after the election date. Next, I changed the plot aspect ratio. Since the graph depicts the evolution of a number over time, I feel that the figure’s width should be much greater than its height. Although several objective ways to determine plot aspect ratio have been proposed (see for example this paper), I used subjective judgment, which resulted in the following:
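The two changes described above, the step plot and the wide aspect ratio, can be sketched as follows. The years are real election years, but the coefficient values are made up for illustration, and the 12 by 3 inch figure size is an assumption rather than the size used in this post.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data: made-up coefficient values, not the post's data.
years = [1992, 1996, 1999, 2003, 2006]
gini_values = [0.62, 0.55, 0.48, 0.52, 0.47]

# A wide, short figure suits a single quantity evolving over time.
fig, ax = plt.subplots(figsize=(12, 3))

# where="post": the value stays constant after each election and
# changes only at the next election date.
line, = ax.step(years, gini_values, where="post")
```

With `where="pre"`, the line would instead change before each data point, which would wrongly suggest that the new seat distribution applied before the election took place.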
Readability and Context
Did you notice that most plotting libraries use vertical orientation for Y axis labels? This is not merely a matter of taste. Rotated text labels are a serious readability problem: different studies show that rotated text slows the reading of short phrases by 54% to 200%.
Fortunately, fixing such a problem is relatively easy with matplotlib:
ax.set_ylabel(
    'Knesset seats Gini Index',
    rotation=0,
    ha='right',  # short for horizontalalignment='right'
)
You will notice that although the Gini coefficient can take values between zero and one, I include only the range that represents my data. Such a range restriction is not always the best choice. In our case, however, we are interested in changes over time, so the entire theoretical range would provide us no useful information. A much bigger problem is the fact that most people are not familiar with the Gini coefficient. To help them interpret the graph, I changed the axis label to something most of us will understand, keeping the technical definition in place:
Also, notice how I reduced the number of ticks on the Y axis. Three ticks are sufficient to provide axis dimensions, as well as to demonstrate the fact that the axis is linear and not, for example, logarithmic.
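A minimal sketch of restricting the Y range and keeping only three ticks might look like this. The data, the axis limits, and the tick positions are placeholders chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data: made-up coefficient values, not the post's data.
years = [1992, 1996, 1999, 2003, 2006]
gini_values = [0.62, 0.55, 0.48, 0.52, 0.47]

fig, ax = plt.subplots(figsize=(12, 3))
ax.step(years, gini_values, where="post")

ax.set_ylim(0.4, 0.7)             # show only the range the data occupies
ax.set_yticks([0.4, 0.55, 0.7])   # bottom, middle, top: enough to show a linear scale
```

Three evenly spaced ticks give the reader the axis dimensions and make it clear that the scale is linear.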
At this point, I wanted to provide better historical context. Traditionally, we tend to make the numbers on our axes (the ticks) equally spaced. However, when the quantity of unique values is small enough, we may use the actual numbers as axis ticks, thus providing more details. In our case, offering such details adds to the viewers’ knowledge of Israeli politics and better connects the graph to other historical events.
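Placing the ticks at the actual data values takes a single call in matplotlib. The sketch below uses a few real election years with made-up coefficient values, so treat the data as a placeholder.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data: made-up coefficient values, not the post's data.
years = [1992, 1996, 1999, 2003, 2006]
gini_values = [0.62, 0.55, 0.48, 0.52, 0.47]

fig, ax = plt.subplots(figsize=(12, 3))
ax.step(years, gini_values, where="post")

# One tick per election instead of evenly spaced "round" years.
ax.set_xticks(years)
```

Each tick now marks an event the reader can look up, rather than an arbitrary round number.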
The problem is, due to political instabilities, there is not always enough time between several election campaigns, resulting in label overlap. Traditionally, the overlap is solved by rotating the labels. However, as we know, rotated labels are hard to read. Instead, I altered the label positions. In matplotlib, doing so is really easy:
for i, l in enumerate(ax.get_xticklabels()):
    if i % 2:
        l.set_y(-0.1)
… and the result is a graph with readable, non-overlapping axis ticks. Can you spot the political instability represented by the uneven gaps between the elections?
Now, we are almost ready to analyze the effect the direct prime ministerial elections had on how equal (or unequal) seat distribution is inside the Israeli Knesset. To make it easier to locate the relevant election campaigns on the graph, without the need to search Wikipedia, I emphasize the relevant campaigns using a different color:
Note that I didn’t use any arrows in my annotation. Instead, I placed the text annotation next to the relevant data and used the same color for both the data and the text. The proximity between the label and the data, as well as the similarity in color, deliver all the required information, without the need to use additional non-data ink.
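The arrow-free annotation described above can be sketched as follows. The coefficient values, the accent color, the label text, and the choice of 1996 to 2003 as the highlighted span are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data: made-up coefficient values, not the post's data.
years = [1988, 1992, 1996, 1999, 2003, 2006]
gini_values = [0.60, 0.62, 0.55, 0.48, 0.52, 0.47]
HIGHLIGHT = "tab:red"  # accent color (an assumption)

fig, ax = plt.subplots(figsize=(12, 3))
ax.step(years, gini_values, where="post", color="gray")

# Redraw only the span of interest on top of the gray line, in the accent color.
start, end = years.index(1996), years.index(2003)
highlight_line, = ax.step(
    years[start:end + 1], gini_values[start:end + 1],
    where="post", color=HIGHLIGHT,
)

# Text sits right above the highlighted span and shares its color, so no arrow is needed.
note = ax.text(
    1996, max(gini_values[start:end + 1]) + 0.02,
    "Direct prime ministerial elections",
    color=HIGHLIGHT, va="bottom",
)
```

Proximity and shared color do the work an arrow would otherwise do, without adding extra non-data ink.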
Now, we can clearly see the effect the direct prime ministerial elections had on Israeli politics. During most of Israel’s history before the 1996 change, Knesset seats were distributed between the parties in a significantly unequal fashion, meaning few large parties and some small ones. The direct elections resulted in an increased power for formerly niche parties, at the expense of the historically large factions, resulting in a less centralized power distribution. However, reverting the change did not revert the effect: many voters stayed with the niche parties. To complete the visualization process, I add some explanatory text and a title:
Note that the plot title does not describe the data. Doing so would be an unnecessary duplication and waste of non-data ink. Instead, the title says what the data means, thus guiding the reader to the conclusion.
Let us recap the changes
| Feature | Before | After |
| --- | --- | --- |
| Chart area | Enclosed by four spines. Labels are separate from the chart | The entire plot is enclosed within a neutrally filled area; no additional lines are needed |
| Y-axis title | Too technical | Explanatory |
| | Rotated, hard to read | Horizontal, no head tilting required |
| X-axis “ticks” | Generic, omitting historical context | Precise, provide historical context |
| Event of interest | Absent, prior knowledge or search required | Embedded in a non-distracting way |
| Title | | Descriptive title leads to the conclusion |
The default plot already provided most of the information contained in the modified version. However, excessive axis ticks and spines gave us superfluous non-data ink. The rotated label also posed readability challenges. Although not connected to default settings, the original graph’s label text was too technical. Usually, coming up with precise labels that aren’t too technical is a non-trivial task. It is much easier to write the technical term, hoping that the reader will fill the gap on their own. Also, the original chart lacked the context needed to tell a story to people without the in-depth knowledge of the data and the domain.
The changes described here are far from trivial. Depending on your technical expertise, these changes may take from 15 minutes to several hours, even if you know how the final result should look. Is spending several hours on such an improvement a justified effort? It depends. I would not spend this much effort if I had to present the data to my close friends and colleagues who have intimate knowledge of the data and the domain. However, if you want to master your data visualization tool of choice, or if you’re preparing an important presentation or a blog post, two hours may be worth it. In my case, I see the time I’ve spent on improving this plot as an investment that has already paid off. Thanks to this plot, I now have a much deeper knowledge of one of my most essential working tools.
And now, over to you — what techniques can you share based on your data visualization experiments? Go to this link, grab my code and my data, make your suggestions and improvements, and let’s discuss here in the comments.
Dog photo by Andrea Arden; Featured image is based on an image by Zeev Veez. Both images are used under the CC-by-2.0 license