
What the NSA can do with “big data”

The NSA can't capture everything that crosses the Internet—but it doesn't need to.

NSA Headquarters in Fort Meade, MD.

One organization's data centers hold the contents of much of the visible Internet—and much of it that isn't visible just by clicking your way around. It has satellite imagery of much of the world and ground-level photography of homes and businesses and government installations tied into a geospatial database that is cross-indexed to petabytes of information about individuals and organizations. And its analytics systems process the Web search requests, e-mail messages, and other electronic activities of hundreds of millions of people.

No one at this organization actually "knows" everything about what individuals are doing on the Web, though there is certainly the potential for abuse. By policy, all of the "knowing" happens in software, while the organization's analysts generally handle exceptions (like violations of the law) picked from the flotsam of the seas of data that their systems process.

I'm talking, of course, about Google. Most of us are okay with what Google does with its vast supply of "big data," because we largely benefit from it—though Google does manage to make a good deal of money off of us in the process. But if I were to backspace over Google's name and replace it with "National Security Agency," that would leave a bit of a different taste in many people's mouths.

Yet the NSA's PRISM program and the capture of phone carriers' call metadata are essentially about the same sort of business: taking massive volumes of data and finding relationships within it without having to manually sort through it, and surfacing "exceptions" that analysts are specifically looking for. The main difference is that with the NSA, finding these exceptions can result in Foreign Intelligence Surveillance Act (FISA) warrants to dig deeper—and FBI agents knocking at your door.

So what is it, exactly, that the NSA has in its pile of "big data," and what can the agency do with it?

Drinking from the fire hose

Let's set aside what US law allows the NSA to do for a moment and focus on some other laws that constrain the intelligence agency: the laws of physics and Moore's Law, to start with. The NSA has long been able to collect massive amounts of data on traffic over switched phone networks and the Internet, thanks to cooperation from the phone companies themselves, deep packet inspection and packet capture hardware, and other signals monitoring capabilities. But the ability to truly capture that data en masse, store it, and retain it indefinitely is relatively recent, due in part to work started at Google and Yahoo.

We know some of this thanks to an earlier whistleblower—former AT&T employee Mark Klein, who revealed in 2006 that AT&T had helped the NSA install a tap into the fiber backbone for AT&T's WorldNet, "splitting" the traffic to run into a Narus Insight Semantic Traffic Analyzer. (The gear has since been rebranded as "Intelligence Traffic Analyzer," or ITA.)

The "secret room" in AT&T's Folsom Street office in San Francisco is believed to be one of several Internet wiretapping facilities at AT&T offices around the country feeding data to the NSA.
The "secret room" in AT&T's Folsom Street office in San Francisco is believed to be one of several Internet wiretapping facilities at AT&T offices around the country feeding data to the NSA.
Mark Klein

Narus' gear was also used by the FBI as a replacement for its homegrown "Carnivore" system. It scans packets for "tag pairs"—sets of packet attributes and values that are being monitored for—and then grabs the data for packets that match the criteria. In a September 2012 interview, Neil Harrington, Narus' director of product management for cyber analytics, told me the company's Insight systems can analyze and sort gigabits of data each second. "Typically with a 10 gigabit Ethernet interface, we would see a throughput rate of up to 12 gigabits per second with everything turned on. So out of the possible 20 gigabits, we see about 12. If we turn off tag pairs that we’re not interested in, we can make it more efficient."
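Narus hasn't published its internals, but the tag-pair idea itself is simple to model. Here's a minimal Python sketch of the concept (the attribute names, filters, and packets are all invented for illustration, and this is not Narus' actual code):

```python
# Illustrative model of "tag pair" filtering: each filter is a set of
# (packet attribute, value) criteria, and a packet is captured when it
# matches every pair in any active filter. Disabling filters you aren't
# interested in means less matching work per packet.

def matches(packet, criteria):
    """True if the packet satisfies every tag pair in one filter."""
    return all(packet.get(attr) == value for attr, value in criteria.items())

# Hypothetical criteria: watch SMTP traffic to one mail host.
active_filters = [
    {"protocol": "smtp", "dst_host": "mail.example.com"},
]

# Stand-in for the decoded packet stream coming off a tap.
packet_stream = [
    {"protocol": "http", "dst_host": "www.example.com", "payload": b"GET /"},
    {"protocol": "smtp", "dst_host": "mail.example.com", "payload": b"MAIL FROM:<a@b>"},
]

captured = [p for p in packet_stream
            if any(matches(p, f) for f in active_filters)]
print(len(captured), "packet(s) captured")  # 1 packet(s) captured
```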

A single Narus ITA is capable of processing the full contents of 1.5 gigabytes of packet data per second. That's 5,400 gigabytes per hour, or 129.6 terabytes per day, for each 10-gigabit network tap. All that data gets shoveled off to a set of logic servers using a proprietary messaging protocol; those servers process and reassemble the contents of the packets, turning petabytes per day into gigabytes of tabular data about traffic—the metadata of the packets passing through the box—and captured application data.
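Those figures follow directly from the line rate. Here's the arithmetic as a quick Python check, using Harrington's 12 gigabits per second:

```python
# Back-of-the-envelope check of the per-tap volumes cited above.
bits_per_second = 12e9                        # ~12 Gb/s effective, per Harrington
bytes_per_second = bits_per_second / 8        # 1.5 GB/s
gb_per_hour = bytes_per_second * 3600 / 1e9   # 5,400 GB/hour
tb_per_day = gb_per_hour * 24 / 1e3           # 129.6 TB/day

print(f"{bytes_per_second / 1e9:.1f} GB/s, {gb_per_hour:,.0f} GB/h, {tb_per_day:.1f} TB/day")
```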

The NSA operates many of these network taps, both in the US and around the world. But that's a massive fire hose of data to digest in any meaningful way, and in the early days of packet capture, the NSA faced a few major problems with that vast stream: storing it, indexing it, and analyzing it in volume required technology beyond what was generally available commercially. And considering that, according to Cisco, total world Internet traffic for 2012 was about 1.1 exabytes per day, it is physically impossible, let alone practical, for the NSA to capture and retain more than a small fraction of the world's Internet traffic on a daily basis.
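To see why, the same arithmetic works in reverse. Assuming a purely hypothetical fleet of 100 taps, each capturing the full 129.6 terabytes per day cited above:

```python
# What slice of Cisco's 1.1 exabytes/day could a fleet of taps even hold?
world_tb_per_day = 1.1e6     # 1.1 EB/day = 1,100,000 TB/day (Cisco, 2012)
tap_tb_per_day = 129.6       # per-tap volume from the arithmetic above
num_taps = 100               # hypothetical fleet size, purely for illustration

fraction = num_taps * tap_tb_per_day / world_tb_per_day
print(f"{fraction:.2%} of daily world traffic")  # 1.18% of daily world traffic
```

Even a hundred fully saturated taps would see barely one percent of a single day's global traffic, before the storage bill for retaining it even comes due.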

There's also the issue of intercepting packets protected by Secure Sockets Layer (SSL) encryption. Breaking the encryption of SSL-protected traffic is, under the best of circumstances, computationally costly and can't be applied across the whole of Internet traffic (despite the apparent certificate-cracking success demonstrated by the Flame malware attack on Iran). So while the NSA can probably do it, it probably can't do it in real time.
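For a sense of the computational cost, consider brute force as a worst-case, illustrative baseline (real attacks, like Flame's, exploited certificate weaknesses instead, precisely because this math is hopeless):

```python
# Order-of-magnitude look at why bulk, real-time SSL breaking is implausible:
# exhausting a 128-bit key space by brute force, even at a generous
# (hypothetical) trillion guesses per second, takes geological ages.
keyspace = 2 ** 128
guesses_per_second = 1e12
seconds_per_year = 3600 * 24 * 365

years = keyspace / guesses_per_second / seconds_per_year
print(f"{years:.2e} years")  # ~1.08e+19 years
```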

The original social network

Internet monitoring wasn't the only NSA data collection exposed in 2006. In May of that same year, details emerged about the NSA's phone call database, obtained from phone carriers. Comprising call data records—data on the time and length of calls, the phone numbers involved, and location data for mobile devices, among other things—the database's collection started shortly after the terrorist attacks of September 11, 2001, with the cooperation of AT&T, Verizon, and BellSouth. Long-distance provider Qwest declined to participate in the program without the issuance of a FISA warrant.

According to reporting by USA Today, the NSA used the database for "social network analysis." While the target of the analysis was intended to be calls connecting to individuals overseas, the NSA scooped up the entire database from these companies, including domestic calls.

That database, or at least its successor, is called MARINA, according to reporting by The Week's Marc Ambinder. And according to documents revealed by the Guardian last week, the NSA is still collecting call data records for all domestic calls and calls between US and foreign numbers—except now the agency is armed with FISA warrants. That includes (according to the FISA order) "comprehensive communications routing information, including but not limited to session identifying information (e.g., originating and terminating telephone number, International Mobile Subscriber Identity (IMSI) number, International Mobile station Equipment Identity (IMEI) number, etc.), trunk identifier, telephone calling card numbers, and time and duration of call."
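To make that list concrete, here's a hypothetical Python sketch of what one such record might look like. The field names and types are my own guesses based on the order's wording, not an actual carrier or NSA schema:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical shape of one call data record, with fields drawn from the
# FISA order's enumeration above. Illustrative only.
@dataclass
class CallDataRecord:
    originating_number: str          # e.g., "+15555550100"
    terminating_number: str
    imsi: Optional[str]              # International Mobile Subscriber Identity
    imei: Optional[str]              # International Mobile Equipment Identity
    trunk_identifier: Optional[str]
    calling_card_number: Optional[str]
    start_time: str                  # e.g., an ISO 8601 timestamp
    duration_seconds: int
```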

In 2006, USA Today called the call database "the largest database in the world." That transactional record data for billions upon billions of phone calls presents a storage and processing problem on a scale similar to the one the NSA encountered in its Internet monitoring—or perhaps, initially, an even larger one. Finding the relationships implied by phone calls between people requires massive amounts of columnar data to be indexed and analyzed.
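In practice, "social network analysis" over such records means treating each call as an edge in a graph and walking outward from a number of interest. Here's a minimal Python sketch with invented numbers, showing a two-hop contact chain:

```python
from collections import defaultdict, deque

# Each record reduced to one edge: (caller, callee). Numbers are invented.
calls = [
    ("+15550001", "+15550002"),
    ("+15550002", "+442079460000"),  # a call to a UK number
    ("+15550003", "+15550001"),
]

# Undirected adjacency list: who has talked to whom.
graph = defaultdict(set)
for a, b in calls:
    graph[a].add(b)
    graph[b].add(a)

def contacts_within(graph, start, hops):
    """Breadth-first walk: every number reachable within `hops` calls."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        number, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in graph[number]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen - {start}

# Two "hops" out from a number of interest.
print(contacts_within(graph, "+15550001", 2))
```

Doing that walk across billions of edges, rather than three, is precisely the indexing and analysis problem described above.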
