A distributed hash table (DHT) is a class of decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key. Responsibility for maintaining the mapping from keys to values is distributed among the nodes, in such a way that a change in the set of participants causes a minimal amount of disruption. This allows a DHT to scale to extremely large numbers of nodes and to handle continual node arrivals, departures, and failures.
DHTs form an infrastructure that can be used to build more complex services, such as distributed file systems, peer-to-peer file sharing and content distribution systems, cooperative web caching, multicast, anycast, domain name services, and instant messaging. Notable distributed networks that use DHTs include BitTorrent's distributed tracker, the Kad network, the Storm botnet, YaCy, and the Coral Content Distribution Network.
History
DHT research was originally motivated, in part, by peer-to-peer systems such as Napster, Gnutella, and Freenet, which took advantage of resources distributed across the Internet to provide a single useful application. In particular, they took advantage of increased bandwidth and hard disk capacity to provide a file sharing service.
These systems differed in how they found the data their peers contained:
Napster had a central index server: each node, upon joining, would send a list of locally held files to the server, which would perform searches and refer the querier to the nodes that held the results. This central component left the system vulnerable to attacks and lawsuits.
Gnutella and similar networks moved to a flooding query model—in essence, each search would result in a message being broadcast to every other machine in the network. While avoiding a single point of failure, this method was significantly less efficient than Napster.
Finally, Freenet is also fully distributed, but employs a heuristic key-based routing in which each file is associated with a key, and files with similar keys tend to cluster on a similar set of nodes. Queries are likely to be routed through the network to such a cluster without needing to visit many peers. However, Freenet does not guarantee that data will be found.
Distributed hash tables use a more structured key-based routing in order to attain both the decentralization of Gnutella and Freenet, and the efficiency and guaranteed results of Napster. One drawback is that, like Freenet, DHTs only directly support exact-match search, rather than keyword search, although that functionality can be layered on top of a DHT.
In 2001, four systems—CAN, Chord, Pastry, and Tapestry—ignited DHTs as a popular research topic, and this area of research remains active. Outside academia, DHT technology has been adopted as a component of BitTorrent and in the Coral Content Distribution Network.
Properties
DHTs characteristically emphasize the following properties:
* Decentralization: the nodes collectively form the system without any central coordination.
* Scalability: the system should function efficiently even with thousands or millions of nodes.
* Fault tolerance: the system should be reliable (in some sense) even with nodes continuously joining, leaving, and failing.
A key technique used to achieve these goals is that any one node needs to coordinate with only a few other nodes in the system – most commonly, O(log n) of the participants (see below) – so that only a limited amount of work needs to be done for each change in membership.
Some DHT designs seek to be secure against malicious participants and to allow participants to remain anonymous, though this is less common than in many other peer-to-peer (especially file sharing) systems; see anonymous P2P.
Finally, DHTs must deal with more traditional distributed systems issues such as load balancing, data integrity, and performance (in particular, ensuring that operations such as routing and data storage or retrieval complete quickly).
Structure
The structure of a DHT can be decomposed into several main components. The foundation is an abstract keyspace, such as the set of 160-bit strings. A keyspace partitioning scheme splits ownership of this keyspace among the participating nodes. An overlay network then connects the nodes, allowing them to find the owner of any given key in the keyspace.
Once these components are in place, a typical use of the DHT for storage and retrieval might proceed as follows. Suppose the keyspace is the set of 160-bit strings. To store a file with a given filename and data in the DHT, the SHA-1 hash of filename is generated, producing a 160-bit key k, and a message put(k, data) is sent to any node participating in the DHT. The message is forwarded from node to node through the overlay network until it reaches the single node responsible for key k as specified by the keyspace partitioning. That node then stores the key and the data. Any other client can then retrieve the contents of the file by again hashing filename to produce k and asking any DHT node to find the data associated with k with a message get(k). The message will again be routed through the overlay to the node responsible for k, which will reply with the stored data.
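As an illustration of this put/get flow, consider the following minimal Python sketch. It is not drawn from any particular DHT implementation: `route` is an assumed primitive standing in for the overlay routing described below, and `DHTNode`, `put`, and `get` are hypothetical names.

```python
import hashlib

def sha1_key(filename: str) -> int:
    """Hash a filename into the 160-bit keyspace, producing the key k."""
    return int.from_bytes(hashlib.sha1(filename.encode()).digest(), "big")

class DHTNode:
    """Hypothetical node; `store` holds the (key, data) pairs this node owns."""
    def __init__(self):
        self.store = {}

def put(route, filename: str, data: bytes) -> None:
    """put(k, data): forward to the node responsible for k, which stores the pair."""
    k = sha1_key(filename)
    route(k).store[k] = data  # route(k) returns the owner under the keyspace partitioning

def get(route, filename: str) -> bytes:
    """get(k): route to the owner of k, which replies with the stored data."""
    k = sha1_key(filename)
    return route(k).store[k]
```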
The keyspace partitioning and overlay network components are described below with the goal of capturing the principal ideas common to most DHTs; many designs differ in the details.
Keyspace partitioning
Most DHTs use some variant of consistent hashing to map keys to nodes. This technique employs a function δ(k1, k2) that defines an abstract notion of the distance from key k1 to key k2, which is unrelated to geographical distance or network latency. Each node is assigned a single key called its identifier (ID). A node with ID i owns all the keys k for which i is the closest ID, measured according to δ(k, i).
Example. The Chord DHT treats keys as points on a circle, and δ(k1, k2) is the distance traveling clockwise around the circle from k1 to k2. Thus, the circular keyspace is split into contiguous segments whose endpoints are the node identifiers. If i1 and i2 are two adjacent IDs, then the node with ID i2 owns all the keys that fall between i1 and i2.
Consistent hashing has the essential property that removal or addition of one node changes only the set of keys owned by the nodes with adjacent IDs, and leaves all other nodes unaffected. Contrast this with a traditional hash table in which addition or removal of one bucket causes nearly the entire keyspace to be remapped. Since any change in ownership typically corresponds to bandwidth-intensive movement of objects stored in the DHT from one node to another, minimizing such reorganization is required to efficiently support high rates of churn (node arrival and failure).
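The Chord-style distance and ownership rule can be sketched as follows; this is a minimal illustration assuming a 160-bit circular keyspace and a sorted list of node IDs, with all names hypothetical.

```python
import bisect

KEYSPACE = 2 ** 160  # size of the circular 160-bit keyspace

def clockwise_distance(k1: int, k2: int) -> int:
    """δ(k1, k2): distance traveling clockwise around the circle from k1 to k2."""
    return (k2 - k1) % KEYSPACE

def owner(node_ids, key: int) -> int:
    """Return the ID owning `key`: the first node ID at or after the key,
    wrapping around the circle (i.e., the ID minimizing the clockwise
    distance from the key). `node_ids` must be sorted."""
    i = bisect.bisect_left(node_ids, key)
    return node_ids[i % len(node_ids)]
```

Note that adding or removing one ID in `node_ids` changes the result of `owner` only for keys in the segment adjacent to that ID, which is exactly the consistent-hashing property described above.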
Locality-preserving hashing ensures that similar keys are assigned to nearby nodes, which can enable more efficient execution of range queries.
Self-Chord decouples object keys from peer IDs and sorts keys along the ring with a statistical approach based on the swarm intelligence paradigm. Sorting ensures that similar keys are stored by neighbouring nodes and that discovery procedures, including range queries, can be performed in logarithmic time.
Overlay network
Each node maintains a set of links to other nodes (its neighbors or routing table). Together, these links form the overlay network. A node picks its neighbors according to a certain structure, called the network's topology.
All DHT topologies share some variant of the most essential property: for any key k, each node either has a node ID that owns k or has a link to a node whose node ID is closer to k, in terms of the keyspace distance defined above. It is then easy to route a message to the owner of any key k using the following greedy algorithm (which is not necessarily globally optimal): at each step, forward the message to the neighbor whose ID is closest to k. When there is no such neighbor, we must have arrived at the closest node, which is the owner of k as defined above. This style of routing is sometimes called key-based routing.
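The greedy algorithm itself is short. The sketch below reuses `clockwise_distance` from the earlier sketch and assumes each node object exposes hypothetical `node_id` and `neighbors` attributes.

```python
def route_greedy(start, key: int):
    """Forward toward the owner of `key`: at each step, move to the
    neighbor whose ID is closest to the key; stop when the current node
    is at least as close as all of its neighbors, at which point the
    topology property guarantees it owns the key."""
    node = start
    while True:
        nearest = min([node] + list(node.neighbors),
                      key=lambda n: clockwise_distance(key, n.node_id))
        if nearest is node:
            return node  # no neighbor is closer: this node owns the key
        node = nearest
```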
Beyond basic routing correctness, two important constraints on the topology are to guarantee that the maximum number of hops in any route (route length) is low, so that requests complete quickly; and that the maximum number of neighbors of any node (maximum node degree) is low, so that maintenance overhead is not excessive. Of course, having shorter routes requires higher maximum degree. Some common choices for maximum degree and route length are as follows, where n is the number of nodes in the DHT, using Big O notation:
* Degree O(1), route length O(n)
* Degree O(log n), route length O(log n / log log n)
* Degree O(log n), route length O(log n)
* Degree O(√n), route length O(1)
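To put these asymptotics in perspective: in a network of about one million nodes, an O(log n) design needs only on the order of 20 links per node and 20 hops per lookup, since log2(10^6) ≈ 20.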
The third choice is the most common, even though it is not quite optimal in terms of the degree/route length tradeoff, because such topologies typically allow more flexibility in choice of neighbors. Many DHTs use that flexibility to pick neighbors that are close in terms of latency in the physical underlying network.
Maximum route length is closely related to diameter: the maximum number of hops in any shortest path between nodes. Clearly the network's route length is at least as large as its diameter, so DHTs are limited by the degree/diameter tradeoff which is fundamental in graph theory. Route length can be greater than diameter since the greedy routing algorithm may not find shortest paths.
Algorithms for overlay networks
Aside from routing, there exist many algorithms which exploit the structure of the overlay network for sending a message to all nodes, or a subset of nodes, in a DHT. These algorithms are used by applications to do overlay multicast, range queries, or to collect statistics. Two systems that are based on this approach are Structella, which implements flooding and random walks on a Pastry overlay, and DQ-DHT, which implements a dynamic querying search algorithm over a Chord network.
DHT implementations
The most notable differences encountered in practical DHT implementations include at least the following:
* The address space is a parameter of the DHT. Several real-world DHTs use a 128-bit or 160-bit keyspace.
* Some real-world DHTs use hash functions other than SHA-1.
* In the real world the key could be a hash of a file's content rather than a hash of a file's name, providing content-addressable storage, so that renaming the file does not prevent users from finding it.
* Some DHTs may also publish objects of different types. For example, the key could be a node ID and the associated data could describe how to contact this node. This allows publication of presence information and is often used in instant-messaging applications. In the simplest case, the ID is just a random number that is directly used as a key (so in a 160-bit DHT the ID will be a 160-bit number, usually randomly chosen). In some DHTs, publishing of node IDs is also used to optimize DHT operations.
* Redundancy can be added to improve reliability. The (k, data) key pair can be stored on more than one node corresponding to the key. Usually, rather than selecting just one node, real-world DHT algorithms select i suitable nodes, with i being an implementation-specific parameter of the DHT (see the sketch after this list). In some DHT designs, nodes agree to handle a certain keyspace range, the size of which may be chosen dynamically rather than hard-coded.
* Some advanced DHTs such as Kademlia perform iterative lookups through the DHT first in order to select a set of suitable nodes, and send put(k, data) messages only to those nodes. This drastically reduces useless traffic, since published messages are sent only to nodes that seem suitable for storing the key k, and iterative lookups cover just a small set of nodes rather than the entire DHT, reducing useless forwarding. In such DHTs, forwarding of put(k, data) messages may occur only as part of a self-healing algorithm: if a target node receives a put(k, data) message but believes that k is out of its handled range and a closer node (in terms of DHT keyspace) is known, the message is forwarded to that node; otherwise, the data are indexed locally. This leads to somewhat self-balancing DHT behavior. Of course, such an algorithm requires nodes to publish their presence data in the DHT so the iterative lookups can be performed.
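As a sketch of the redundancy idea, the following hypothetical helper stores a pair on the i closest candidate nodes. Here `candidates` stands in for the node set an iterative lookup (as in Kademlia) would return, and the clockwise metric from the earlier sketch is reused for illustration, even though Kademlia itself uses an XOR distance.

```python
def put_replicated(candidates, key: int, data: bytes, i: int = 3) -> None:
    """Store (key, data) on the i nodes whose IDs are closest to the key,
    so the pair survives individual node failures; i is the
    implementation-specific replication parameter."""
    closest = sorted(candidates,
                     key=lambda n: clockwise_distance(key, n.node_id))[:i]
    for node in closest:
        node.store[key] = data  # each selected node indexes the pair locally
```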
Examples
DHT protocols and implementations
* Apache Cassandra
* BitTorrent DHT: based on Kademlia, as provided by Khashmir
* CAN (Content Addressable Network)
* Chord
* Kademlia
* Pastry
* P-Grid
* Tapestry
Applications employing DHTs
* Codeen: web caching
* Coral Content Distribution Network
* Freenet: a censorship-resistant anonymous network
* Dijjer: Freenet-like distribution network
* FAROO: peer-to-peer web search engine
* GNUnet: Freenet-like distribution network including a DHT implementation
* WebSphere eXtreme Scale: proprietary DHT implementation by IBM, used for object caching
* JXTA: open-source P2P platform
* YaCy: distributed search engine
* maidsafe: C++ implementation of Kademlia, with NAT traversal and crypto libraries. Its home page describes it as "Available as a technology licence and a software solution written in cross platform C++."
See also
* membase: a persistent, replicated, clustered distributed object storage system compatible with the memcached protocol
* memcached: a high-performance distributed memory object caching system
* Prefix hash tree: sophisticated querying over DHTs
External links
* Distributed Hash Tables, Part 1, by Brandon Wiley
* Distributed Hash Tables links: Carles Pairot's page on DHT and P2P research
* Archive.org snapshots of kademlia.scs.cs.nyu.edu
* Tangosol Coherence: includes a structure similar to a DHT, though all nodes have knowledge of the other participants
* Hazelcast: open-source DHT implementation
Category:Distributed data storage
Category:File sharing