Jeff Jonas visited us again at the start of November and gave a talk about some of the new work that he is doing. Jeff is our first return speaker, and this time he gave us an update on his thinking about sensemaking systems and how that is affecting his ongoing work in developing a new technology.
Jeff mainly works on building sensemaking systems that can reconcile large amounts of data in real time. In brief, a sensemaking system is one that, in contrast to a data warehousing solution, does something active with each piece of data as it is acquired, rather than only storing the data for later re-use. Identity disambiguation is a problem that this class of systems has been applied to in the past; however, the new technique will be more generally applicable. One of the difficulties with the sensemaking problem is that any individual piece of data that arrives is, on its own, hard to evaluate for relevance. Each piece of data somehow needs to be contextualised first. Jeff illustrated the underlying mechanics of such a system with an analogy to jigsaw puzzle solving.
When solving a jigsaw puzzle we make an assertion about each new puzzle piece that we pick up: either it fits perfectly in some place in the evolving solution space, or it belongs to a similar set of pieces but we don’t yet know exactly how, or we have no idea where it goes, so it is placed anywhere.
When asserting that a new piece fits against an existing piece, we always favor the false negative: we never put pieces together unless we are really sure that they go together.
When we get a new connecting piece, we reconsider whether, now that we know this, other pieces already considered have a better placement.
Sometimes a new piece reverses an earlier assertion – e.g., determining where a piece belongs reveals a connected piece that, upon closer inspection, really did not belong. In this case, the misplaced piece is removed.
The working space needed during the process of solving the puzzle is much larger than the final solution space.
From this description it sounded to me that Jeff’s system accumulates attributes of the things we are interested in into folders, and then makes connections between these folders as new pieces of information come into the system.
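To make the folder idea concrete, here is a minimal sketch in Python of how such accumulation and matching might work. The `Folder` and `Sensemaker` names, the attribute dictionaries, and the two-attribute match threshold are all my own illustrative assumptions, not Jeff’s actual design:

```python
# Illustrative sketch only: each Folder accumulates attributes believed to
# describe one real-world entity, and a new observation either joins an
# existing folder (a confident match) or opens a new one.

class Folder:
    """Accumulated attributes for one presumed entity."""
    def __init__(self, folder_id):
        self.folder_id = folder_id
        self.attributes = {}  # e.g. {"name": {"J. Smith"}, "phone": {"555-0101"}}

    def add(self, observation):
        for key, value in observation.items():
            self.attributes.setdefault(key, set()).add(value)

    def matches(self, observation):
        # Favor the false negative: only claim a match on strong evidence,
        # crudely modeled here as two or more attribute values in common.
        shared = sum(1 for k, v in observation.items()
                     if v in self.attributes.get(k, set()))
        return shared >= 2

class Sensemaker:
    def __init__(self):
        self.folders = []

    def ingest(self, observation):
        for folder in self.folders:
            if folder.matches(observation):
                folder.add(observation)
                return folder
        folder = Folder(len(self.folders))
        folder.add(observation)
        self.folders.append(folder)
        return folder

sm = Sensemaker()
sm.ingest({"name": "J. Smith", "phone": "555-0101"})
sm.ingest({"name": "J. Smith", "phone": "555-0101", "email": "js@example.com"})
sm.ingest({"name": "J. Smith", "phone": "555-9999"})  # only one shared value: new folder
print(len(sm.folders))  # 2
```

The deliberately conservative `matches` rule mirrors the jigsaw principle above of favoring the false negative: a folder only absorbs an observation when the evidence is strong.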
One of the important characteristics of a usable sensemaking system is that it needs to be able to change its decision state as new information comes in. This ensures that the system does not drift from the truth as newly arriving data invalidates earlier assertions.
Systems like this end up expressing bias based on the observations they have received. So, in theory, an organization could ingest slightly different data sets into different instances of the program in parallel and poll them for their views. One would be able to see dissent between these different instances.
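As a toy illustration of this dissent idea, one could run the same naive resolver over slightly different slices of a data stream and poll the instances for their answers. The exact-phone-number matching rule here is my own simplification, not anything Jeff described:

```python
# Two "instances" of a trivially naive entity counter see slightly
# different slices of the same stream and give different answers:
# that disagreement is the dissent one could poll for.

def count_entities(observations):
    """Naive resolver: observations sharing a phone number are one entity."""
    return len({obs["phone"] for obs in observations})

stream = [
    {"name": "J. Smith", "phone": "555-0101"},
    {"name": "John Smith", "phone": "555-0101"},
    {"name": "J. Smith", "phone": "555-9999"},
]

instance_a = count_entities(stream)      # sees all three observations
instance_b = count_entities(stream[:2])  # sees a slightly smaller slice

print(instance_a, instance_b)  # 2 1 -> the instances dissent
```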
A key issue in sensemaking is the ability of the system to count discrete objects. If you can’t count the discrete entities that you are interested in, then you can’t expect to produce high-quality predictions. This is the key principle behind the new technology that Jeff is developing, in that this new work is a general counting engine. With this in mind he is currently looking for hard science problems that such an engine could be applied to, and this was one of the reasons for his visit to our offices, so if anyone reading this has some ideas, please post them and we’ll pass them along.
One good way to disambiguate things is to track their spacetime and life arcs. The same thing cannot be in two places at the same time (at least not if it is large enough to be unconcerned with quantum mechanical effects), and the path something takes over space and time (its life arc) can itself be a discriminating signature of identity. Science produces very large data sets, and some of these data sets are produced quickly. Jeff hopes to find problems that would benefit from the disambiguation techniques that he is working on. Trying to imagine which types of data in the scientific realm would be good candidates for this kind of analysis raises some interesting questions. Most science is communicated through publication, which is a slow process and not very real time. That said, PubMed indexed a new paper about every 40 seconds in 2009, which is quasi-real time. Often it’s not individual members of a class of objects that we are interested in: it’s not a given Higgs boson that interests us, but rather the characteristics of all Higgs bosons. That said, one of the most important jobs of detectors at particle accelerators is exactly this kind of event disambiguation of particle trails, using spacetime paths as the key discriminating factor.
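The “cannot be in two places at once” principle lends itself to a small worked example. The sketch below (my own illustration, not Jeff’s engine) flags two sightings as belonging to different entities when the travel speed they would imply is physically implausible; the 900 km/h ceiling is an assumed airliner-scale bound:

```python
# Toy spacetime disambiguation: two sightings can belong to the same
# entity only if the implied travel speed between them is plausible.

import math

def implied_speed_kmh(sighting_a, sighting_b):
    """Sightings are (x_km, y_km, t_hours) on a flat plane."""
    (xa, ya, ta), (xb, yb, tb) = sighting_a, sighting_b
    distance = math.hypot(xb - xa, yb - ya)
    elapsed = abs(tb - ta)
    if elapsed == 0:
        return math.inf if distance > 0 else 0.0
    return distance / elapsed

def could_be_same_entity(a, b, max_speed_kmh=900.0):
    # 900 km/h is roughly airliner cruise speed, an assumed upper bound.
    return implied_speed_kmh(a, b) <= max_speed_kmh

a = (0.0, 0.0, 0.0)
b = (600.0, 0.0, 0.5)  # 600 km in half an hour: 1200 km/h, implausible
c = (30.0, 40.0, 1.0)  # 50 km in an hour: plausible

print(could_be_same_entity(a, b))  # False -> must be two different entities
print(could_be_same_entity(a, c))  # True
```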
I wondered whether, in the context of scientifically interesting objects, one could try to do this disambiguation of paths by projecting into a more general higher-dimensional parameter space. Jeff was very clear on the point that, as far as he was concerned, spacetime and life arcs are the gold standard in this regard, and I’d have to agree with that; however, I think the idea of using higher-dimensional parameter spaces has some merit.
As with his last visit, Jeff reserved his most thought-provoking idea till last. Quite recently he has been fascinated by the growing number of systems used by some companies to track mobile phone trails (life arcs). There are 600 billion transactions generated daily in the US that contain geospatial data. Your travel patterns reveal where you spend your time and who you spend it with, and they are highly predictive. The data is being de-identified and shared with third parties; however, re-identification of an individual is, in most cases, trivial.
This data can also be seen in real time. It can give real-time analytics on the health of a store: how many people are visiting right now, what is their average journey distance to get to that store, and is that number going up or down? Jeff suggested a number of ways to raise consumer awareness of the power of this kind of information. He suggested that phone companies should provide information such as the first name and first initial of the last name of the ten individuals you spend the most time with outside of work and home (notably: if there is a name on the list you do not recognize, they are probably following you). There has been some research into analysing these trails, but it’s clear that we are just beginning to scratch the surface.