Jeff Kelly, Big Data Research Analyst at Wikibon.org
The Washington Post posted a story about the NSA spying on citizens and tapping into the servers of leading U.S. Internet companies. The agency is also collecting huge volumes of information (call metadata) from Verizon associated with all domestic and international calls made by the company’s customers, covering three months starting in mid-April (see the court order here). Less attention has been paid to what exactly the government does with all that data or to the technology supporting it.
While details are sketchy (neither the NSA nor the White House will even acknowledge the existence of the program), it is important to take a step back and understand that the NSA cannot indiscriminately analyze, mine or otherwise explore this vast new trove of data. In order to analyze the data at hand, the NSA must get a court order justified by the reasonable suspicion of an imminent terrorist act.
Even then, the NSA may only access and analyze segments of the call metadata that relate specifically to the potential threat in question and only specific individuals within the NSA can access the data. Among those individuals, access levels vary based on legitimate need to see and analyze the specific data sets covered by the court order.
There are three points to watch:
1. Assuming that you agree that the federal government should be allowed to mine communications data such as call metadata to investigate legitimate terrorist threats, then from a purely practical perspective it makes sense for the NSA to collect such data in advance. Verizon Wireless alone has somewhere north of 75 million wireless subscribers. It would be next to impossible for NSA agents to collect, integrate and analyze that much data (75 million callers x multiple calls by each caller per day x weeks or months = Big Data) at a moment’s notice. By collecting the data ahead of time, the NSA is able to load the data into its platform and have it ready for analysis when it obtains a court order justified by a legitimate threat. Consider that the NSA is likely collecting similar call metadata from other wireless providers, all of which must be merged in order to “connect all the dots.” Slapping together a database with call metadata from hundreds of millions of callers is not something you do overnight.
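A rough back-of-the-envelope calculation (the per-call figure here is an illustrative assumption, not a disclosed number) shows why: 75 million subscribers each averaging, say, five calls a day produce 375 million call records daily, or on the order of 34 billion records over the three-month window covered by the court order – and that is before merging in data from other carriers.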
2. Once a court order for analysis is obtained, the NSA needs the technical capability to limit its analysis to the data sets in question and to control which agents have access to the data. This is where we get to the technology behind the data mining program. While we don’t know for certain, the NSA is almost certainly using Accumulo, its homegrown scale-out NoSQL database, to process, store and analyze the call metadata. The NSA developed Accumulo several years ago when it couldn’t find a database that met its stringent requirements, among them fine-grained security and access controls. Accumulo, which is often run on top of Hadoop and is modeled on Google’s BigTable, was built from the ground up with cell-level security capabilities. This allows administrators, among other things, to grant data access on a cell-by-cell and user-by-user basis, rather than being forced to give users “all-or-nothing” access. Accumulo is the only scalable database I have come across that offers this capability, without which the NSA could not perform this data analysis and still follow the law. One company, Sqrrl, is seeking to commercialize Accumulo and, according to its website, aims to bring these cell-level security capabilities to other industries, such as healthcare and finance.
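To make cell-level security concrete, here is a minimal sketch using Accumulo’s Java client API. The instance name, table name, credentials and visibility labels are my own illustrative assumptions, not anything disclosed about the NSA’s systems. The idea is simply that every cell is written with a visibility expression, and a scan returns only the cells whose expression is satisfied by the authorizations the scanning user presents.

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class CellLevelSecuritySketch {
    public static void main(String[] args) throws Exception {
        // Connect to a hypothetical Accumulo instance.
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("admin", new PasswordToken("secret"));

        if (!conn.tableOperations().exists("call_metadata")) {
            conn.tableOperations().create("call_metadata");
        }

        // Write one call record; the cell is visible only to users whose
        // authorizations satisfy the expression "TS & ORDER_13_80".
        BatchWriter writer = conn.createBatchWriter("call_metadata", new BatchWriterConfig());
        Mutation m = new Mutation("15551230001");           // row = calling number
        m.put("called", "15559870002",                      // qualifier = called number
              new ColumnVisibility("TS&ORDER_13_80"),       // hypothetical labels
              new Value("2013-06-01T12:00:00Z,173s".getBytes()));  // timestamp, duration
        writer.addMutation(m);
        writer.close();

        // A user may only scan with authorizations an admin has granted it.
        conn.securityOperations().changeUserAuthorizations("admin",
                new Authorizations("TS", "ORDER_13_80"));

        // A scan presenting both labels sees the cell ...
        Scanner cleared = conn.createScanner("call_metadata",
                new Authorizations("TS", "ORDER_13_80"));
        for (Entry<Key, Value> e : cleared) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }

        // ... while a scan without the court-order label returns nothing.
        Scanner notCleared = conn.createScanner("call_metadata", new Authorizations("TS"));
        for (Entry<Key, Value> e : notCleared) {
            System.out.println("should never print: " + e.getKey());
        }
    }
}
```

The security decision is made server-side at read time, which is what allows one physical table to serve users with very different clearances – the “all-or-nothing” database-level grant simply disappears as a constraint.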
3. As for the type of analysis the NSA performs on the call metadata, graph analysis is surely one such type. Graph analysis allows you to visualize and uncover relationships between distinct entities hidden among large volumes of data. The resulting visuals are made up of nodes, which represent the entities, and edges, the lines that connect the nodes and represent the relationships between them. Graph analysis is a popular way to better understand the dynamics of social networks (and is the basis of Facebook’s Graph Search, rolled out earlier this year) but is equally effective when trying to ferret out terrorist networks. And we know that the NSA has successfully tested Accumulo’s graph analysis capabilities on some huge data sets – in one case on a 1,200-node Accumulo cluster with over a petabyte of data and 70 trillion edges. (A minimal sketch of this kind of call-graph traversal follows the slide below.)
Slide from May 2013 presentation by the NSA on its use of Accumulo for graph analysis.
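As an illustration of the technique itself (not of the NSA’s actual code), here is a minimal sketch in plain Java: treat each phone number as a node and each call record as an edge, then run a breadth-first traversal to find every number within two hops of a suspect number. The numbers and the two-hop limit are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class CallGraphSketch {
    public static void main(String[] args) {
        // Each call record is a (caller, callee) pair -- an edge in the graph.
        List<String[]> calls = List.of(
                new String[]{"15550001", "15550002"},
                new String[]{"15550002", "15550003"},
                new String[]{"15550003", "15550004"},
                new String[]{"15550005", "15550006"});

        // Build an undirected adjacency map: number -> numbers it exchanged calls with.
        Map<String, Set<String>> graph = new HashMap<>();
        for (String[] call : calls) {
            graph.computeIfAbsent(call[0], k -> new HashSet<>()).add(call[1]);
            graph.computeIfAbsent(call[1], k -> new HashSet<>()).add(call[0]);
        }

        // Breadth-first search: every number within two hops of the seed.
        String seed = "15550001";
        int maxHops = 2;
        Map<String, Integer> hops = new HashMap<>();
        hops.put(seed, 0);
        Queue<String> queue = new ArrayDeque<>();
        queue.add(seed);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int depth = hops.get(current);
            if (depth == maxHops) continue;  // do not expand past the hop limit
            for (String neighbor : graph.getOrDefault(current, Set.of())) {
                if (!hops.containsKey(neighbor)) {
                    hops.put(neighbor, depth + 1);
                    queue.add(neighbor);
                }
            }
        }

        // Prints the seed plus 15550002 (1 hop) and 15550003 (2 hops);
        // 15550004 is three hops away and 15550005/15550006 are disconnected.
        hops.forEach((number, d) -> System.out.println(number + " @ " + d + " hops"));
    }
}
```

At NSA scale the same traversal would be distributed across an Accumulo cluster rather than held in an in-memory map, but the logic – expand outward from a suspect node along call edges – is the same.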
The types of workloads in question are not just for intelligence and security agencies, either. Fine-grained access control in particular is, I believe, a critical feature for Big Data platforms in the enterprise. This is especially true as Big Data experiments and proofs of concept graduate to production-grade deployments, and as Hadoop adds support for additional computational models beyond MapReduce. As Hadoop becomes more robust and easier for non-MapReduce experts to use (such as by adding SQL-like and search capabilities), more and more users in the enterprise will interact with the platform. Not all users are created equal, and enterprises will need to implement fine-grained access controls that restrict data access based on role, security authorization and other criteria.
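In Accumulo terms, role-based restriction of this kind could look something like the sketch below. The user names, labels and healthcare framing are hypothetical; the point is that an administrator grants each user a set of authorizations, and cells carry boolean visibility expressions over those labels – the capability Sqrrl is pitching to industries like healthcare and finance.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.accumulo.core.security.ColumnVisibility;

public class RoleBasedAccessSketch {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("instance", "zk1:2181")
                .getConnector("admin", new PasswordToken("secret"));

        // Map enterprise roles to label sets (hypothetical labels):
        // a clinician may read protected health information, while a
        // billing clerk may read billing data only.
        conn.securityOperations().changeUserAuthorizations(
                "dr_smith", new Authorizations("PHI", "CLINICIAN"));
        conn.securityOperations().changeUserAuthorizations(
                "clerk_jones", new Authorizations("BILLING"));

        // Cells can then be labeled with boolean combinations of those
        // roles, e.g. readable by billing staff OR by clinicians with PHI.
        ColumnVisibility vis = new ColumnVisibility("BILLING|(PHI&CLINICIAN)");
        System.out.println("example cell visibility: " + vis);
    }
}
```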
But back to the matter at hand. Balancing national security against the protection of civil liberties is obviously a difficult challenge, and it’s a conversation we as a nation must continue having. It is important to realize, however, that the technologies used to identify threats and keep us safe are maturing and developing rapidly. Technology is not a cure-all for the security vs. civil liberties challenge, but as the tools become more sophisticated it becomes easier to zero in on the bad actors without compromising the privacy of the rest of us.