Ubimo blog

Managing Collisions

June 29, 2017 / Eyal Schneider

As a DSP handling big data at a high request rate (and as in any busy, complex server), we find it useful to keep in-memory mappings for data that we access on a per-request basis. Typically these mappings are stored in hash map data structures, whose performance is highly influenced by the distribution of the data across buckets, or, put differently, by the number of collisions.

A second use case of interest that also involves collisions is mapping a set of N unique ids into a new domain of size B, in an attempt to reduce the space used for storing ids in internal data structures and data stores.

In both cases collisions should be taken into account. In the case of a hash map we are dealing with a trade-off between memory consumption and running time: a larger table means faster lookups and updates, but slower iteration and greater memory consumption. In the mapping scenario, we save space by reducing the id signature size, but at the same time we lose data due to collisions. In both cases we would like to be able to control the number of collisions, keeping them below some predefined threshold.

Some Theory

For the purpose of this post, we examine a sequence of N insertions into B bins, and we define a 'collision' as the event where an item is mapped to a bin that is already occupied. Assuming that the mapping function is uniformly distributed over all bins, we can compute the expected number of collisions as follows.

The probability of a specific bin remaining empty after N insertions is:

(1 - 1/B)^N

Using the linearity of expectation, we can sum over all bins and see that the expected number of empty bins is:

B * (1 - 1/B)^N

Therefore the expected number of non-empty bins is:

B * (1 - (1 - 1/B)^N)

Now, every item beyond the number of occupied bins is by definition a collision, so we can conclude that the expected number of collisions is:

collisions(N, B) = N - B * (1 - (1 - 1/B)^N)    (1)

Note that collisions(1, B) = 0 and collisions(N, 1) = N - 1, as expected.

Let's also define the collision ratio as follows:

r(N, B) = collisions(N, B) / N    (2)
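Formulas (1) and (2) are easy to sanity-check with a quick Monte Carlo simulation. The sketch below (class and method names are mine) computes the expected collision count analytically and compares it against one run of randomly throwing N items into B bins:

```java
import java.util.Random;

public class CollisionMath {

    // Formula (1): expected collisions when inserting n items into b bins
    public static double expectedCollisions(long n, long b) {
        double emptyProb = Math.pow(1.0 - 1.0 / b, n);   // P(specific bin stays empty)
        double occupiedBins = b * (1.0 - emptyProb);     // expected non-empty bins
        return n - occupiedBins;
    }

    // Formula (2): expected fraction of insertions that collide
    public static double collisionRatio(long n, long b) {
        return expectedCollisions(n, b) / n;
    }

    // Count actual collisions for one random assignment of n items to b bins
    public static int simulateCollisions(int n, int b, long seed) {
        Random rnd = new Random(seed);
        boolean[] occupied = new boolean[b];
        int collisions = 0;
        for (int i = 0; i < n; i++) {
            int bin = rnd.nextInt(b);
            if (occupied[bin]) collisions++;
            else occupied[bin] = true;
        }
        return collisions;
    }

    public static void main(String[] args) {
        int n = 100_000, b = 1 << 17;
        System.out.printf("expected=%.0f simulated=%d%n",
                expectedCollisions(n, b), simulateCollisions(n, b, 42));
    }
}
```

For any reasonable N and B the simulated count lands very close to the analytical value, since the collision count is tightly concentrated around its mean.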

What happens if we map N items to N bins?

In this case r(N, N) = (1 - 1/N)^N, and as N grows this expression converges to 1/e (~0.37). In other words, for a large N, if we use only N bins to store N items, we should expect about 37% of the lookups to cost extra time due to collisions!
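The convergence to 1/e is quick, as a few lines of Java (illustrative only) show:

```java
public class EulerLimit {
    public static void main(String[] args) {
        // (1 - 1/N)^N is the probability that a specific bin stays empty
        // after N insertions into N bins; it converges to 1/e.
        for (int n : new int[]{10, 100, 10_000, 1_000_000}) {
            double p = Math.pow(1.0 - 1.0 / n, n);
            System.out.printf("N=%-9d (1-1/N)^N = %.6f%n", n, p);
        }
        System.out.printf("1/e       = %.6f%n", 1 / Math.E);
    }
}
```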

How do standard map implementations provide collision control? A common metric for this purpose is the load factor, defined as N/B, i.e. the average number of items per bin. It is common to specify the maximum load factor at construction time, forcing the map to increase the number of buckets and rehash whenever this value is exceeded. In Java's java.util.HashMap the default load factor is 0.75, while C++'s std::unordered_map uses a default of 1.0. The latter results in about 37% collisions, as computed above. In the former, we have:

r = 1 - (1 - e^(-0.75)) / 0.75 ≈ 0.2965

I.e., reducing the load factor from 1.0 to 0.75 reduces the collision ratio by only about 7 percentage points (from roughly 0.37 to roughly 0.30).

Generalising this, and using the approximation (1 - 1/B)^N ≈ e^(-L) for large B, we can compute the relation between the load factor L and the collision ratio as follows:

r(L) = 1 - (1 - e^(-L)) / L    (3)

[Figure: collision ratio r as a function of the load factor L]
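For a sense of scale, here is formula (3) evaluated at a few common load factors (a small sketch; the class and helper names are mine):

```java
public class LoadFactorTable {
    // Formula (3): collision ratio as a function of the load factor L = N/B,
    // using the large-B approximation (1 - 1/B)^N ≈ e^(-L).
    public static double collisionRatio(double loadFactor) {
        return 1.0 - (1.0 - Math.exp(-loadFactor)) / loadFactor;
    }

    public static void main(String[] args) {
        for (double l : new double[]{0.25, 0.5, 0.75, 1.0, 2.0}) {
            System.out.printf("L=%.2f -> r=%.4f%n", l, collisionRatio(l));
        }
    }
}
```

Note how flat the curve is around typical defaults: halving the load factor from 1.0 to 0.5 only brings the ratio from ~0.37 down to ~0.21.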

Measuring Collisions (Java)

How can we measure collisions at runtime? In Java this can be done using reflection. The following code snippet is a simplified version of the approach described in http://www.javaspecialists.eu/archive/Issue235.html (note that on Java 9 and later, reflective access to java.util internals may require running with --add-opens java.base/java.util=ALL-UNNAMED):

private static void analyze(HashMap<?,?> m) throws ReflectiveOperationException {
  Map.Entry<?,?>[] entries = getTable(m);
  System.out.println("Bins: " + entries.length);
  System.out.println("Items: " + m.size());
  Field nextField = getNextField(entries);

  // A chain of k entries in one bin accounts for k - 1 collisions
  int collisions = 0;
  for (Map.Entry<?,?> e : entries) {
    if (e != null) {
      Object next = e;
      while ((next = nextField.get(next)) != null) {
        collisions++;
      }
    }
  }
  System.out.println("Collisions: " + collisions);
}

private static Map.Entry<?,?>[] getTable(Map<?,?> map) throws ReflectiveOperationException {
  Field tableField = map.getClass().getDeclaredField("table");
  tableField.setAccessible(true);
  return (Map.Entry<?,?>[]) tableField.get(map);
}

private static Field getNextField(Map.Entry<?,?>[] mapTable) throws ReflectiveOperationException {
  Class<?> nodeType = mapTable.getClass().getComponentType();
  Field nextNodeField = nodeType.getDeclaredField("next");
  nextNodeField.setAccessible(true);
  return nextNodeField;
}

Let’s try it on a HashMap with many random strings as keys. Internally, HashMap uses powers of 2 for table sizes.

If we want a saturated map with 2^20 = 1,048,576 bins, we need to insert 0.75 * 1,048,576 = 786,432 items:

public static void main(String[] args) throws ReflectiveOperationException {
  final int N = 786_432;

  HashMap<String, Void> m = new HashMap<>();
  for (int i = 0; i < N; i++) {
    m.put(UUID.randomUUID().toString(), null);
  }

  analyze(m);
}

The output is:

  • Bins: 1048576
  • Items: 786432
  • Collisions: 233003

Our collision formula (1) estimates 233,168 collisions, which is pretty close. In addition, since the map is saturated, the actual collision ratio (0.296278636) is very close to the theoretical one computed above (0.296488737).

Mapping IDs

Suppose that a system is required to work with a set of one billion locations of interest around the world, where each location is identified by an id in UUID format. Since this format uses 36 characters per id, while a 4-byte integer would suffice for representing a billion ids, it would be wise to convert every location id to an integer and use this compact representation.

The mapping can easily be accomplished using a good hash function. An integer can store 2^32 different values, which is more than four times the actual number of ids. This may look promising, but using (2) we observe that r(10^9, 2^32) ≈ 0.11, meaning that 11% of the locations will be "merged" with other locations, losing their uniqueness. If we can't afford this inaccuracy, we can use an 8-byte integer instead and get r(10^9, 2^64), which is practically zero!
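These numbers can be reproduced with formula (2), with one numerical caveat: for B = 2^64 the term (1 - 1/B)^N rounds to exactly 1.0 in double precision, so the naive formula breaks down. The sketch below (my own variant, not from the original post) uses the large-B approximation together with Math.expm1 for numerical stability:

```java
public class IdMappingLoss {
    // Collision ratio via the large-B approximation of formula (2):
    // r(N, B) ≈ 1 - (B/N) * (1 - e^(-N/B)), computed with expm1 so that
    // tiny N/B values (e.g. B = 2^64) don't lose all precision.
    public static double collisionRatio(double n, double b) {
        return 1.0 - (b / n) * -Math.expm1(-n / b);
    }

    public static void main(String[] args) {
        double n = 1e9;  // one billion location ids
        System.out.printf("32-bit ids: r = %.4f%n", collisionRatio(n, Math.pow(2, 32)));
        System.out.printf("64-bit ids: r = %.2e%n", collisionRatio(n, Math.pow(2, 64)));
    }
}
```

The 32-bit case comes out at roughly 0.108, matching the 11% figure above, while the 64-bit case is on the order of 10^-11.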

The Birthday Problem

When setting B = 365, our discussion becomes closely related to the famous Birthday Problem. There are many variations of questions that can be asked about this problem. If we treat N as a parameter and ask what value of N results in an average of one collision, then this is simply solving collisions(N, 365) = 1, which gives N = 28. Note that a different formulation of the problem would consider N to be a random variable and ask for its mean value given that there is one collision; in that case we get N = 24.6 (see computation details in the link above).
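The N = 28 figure can be found numerically by searching for the smallest N whose expected collision count, per formula (1), reaches 1 (a quick sketch, names mine):

```java
public class BirthdaySearch {
    // Formula (1): expected collisions after n insertions into b bins
    public static double expectedCollisions(int n, int b) {
        return n - b * (1.0 - Math.pow(1.0 - 1.0 / b, n));
    }

    // Smallest n whose expected collision count reaches the given target
    public static int smallestNFor(double targetCollisions, int b) {
        int n = 1;
        while (expectedCollisions(n, b) < targetCollisions) n++;
        return n;
    }

    public static void main(String[] args) {
        System.out.println(smallestNFor(1.0, 365));  // prints 28
    }
}
```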

Summary

Even with a great hash function, collisions are part of the game, and their magnitude can sometimes be surprising. The maximum load factor is something to be aware of if we want to control the collision ratio in a HashMap. Java programmers can also use the code above to examine the actual collisions in their hash maps.