BitVault is a content-addressable distributed storage system, developed by Microsoft Research in China. BitVault uses peer-to-peer technology to distribute the tasks of storing and managing data. As such, there is no central authority responsible for management of the system. Rather, it is self-managing, provides high availability, reliability and scales up in a self-organizing manner, with low administrative overhead, which is almost constant irrespective of the size of the distributed overlay network.
BitVault system is best suited for reference data, which is large amount of data which changes very infrequently. Such data include archives of out-of-date data, as well as multimedia data like music and video, which, even though might be frequently used, changes very rarely.
Every participating peer node in BitVault architecture is a Smart Brick, which is a trimmed down PC with large disks. All Smart Bricks in a BitVault system are connected by a high-bandwidth, low latency network. A BitVault system can be easily scaled up – any computer can be configured to act as a Smart Brick by simply installing the BitVault software, and connecting it to the network, without any need for interrupting the already working nodes.
BitVault stores immutable data objects, i.e., objects which cannot be changed. The physical location of the objects is not fixed and can be on any of the bricks. Its location changes depending on its frequency of access; it can even be replicated at more than one brick. To get around this problem of changing locations, BitVault makes it accessible by means of a 160-bit key, which is unique for each object. The system dynamically references the location from which the object can be retrieved most efficiently, by using the key, and makes the object available. The unique key is generated from a hash of the data of the object, thus making the system content-addressable, as opposed to location-addressable. The hashes of the objects (key) are mapped to the physical addresses using hash tables, which are internally managed by the system and do not need any user intervention. Different sets of nodes maintain different sets of hash tables, which concern with only the data in that set of nodes, thereby giving rise to an overlay network in which the location of the data is tracked by a distributed hash table (DHT) architecture.
The BitVault architecture is composed of multiple bricks which constitute a logical 160 bit address space, each associated with hash of some data. The association is maintained in a Distributed Hash Table (DHT). The DHT partitions the entire hash table into smaller hash tables. For example, if there are n peers, the hash table would be divided into n hash tables, each starting from the row next to where its immediate predecessor ended. Each DHT has its associated brick, and the extent of the logical address space a brick is responsible for is called its Zone. The bricks communicate using peer-to-peer technology, over the Membership and Routing Layer (MRL). Lookup of any data object can be done by n bricks in parallel, in its own zone, giving an efficiency of O (log N).
Multiple copies of a single object, called replica, are stored in the BitVault system, to give enough redundancy. If any index is damaged, the nearest replica can be notified to start its repair. And if the index notices that the replica is damaged, it can initiate the repair of the replica. This method of error recovery is called the Object Driven Repair model. In order for this to work, there needs to be a membership service running which will give a logical ordering to the peers. This is achieved using the MRL. The membership service guarantees that any addition or removal of a brick is eventually and reliably informed to every other live bricks. The MRL is also responsible to route messages to and from bricks and its associated DHTs.
The MRL uses a one hop DHT to perform routing, i.e., it never takes more than one hop over a peer to route messages, when the BitVault system is stable, i.e., no new bricks are added, nor is any load balancing or repair going on. The MRL is implemented using an XRing architecture, which maintains a distributed routing table which facilitates one-hop routing.
Single brick architecture
A brick registers itself with the MRL with a 160 bit key that forms its identifier, and its zone in the DHT is from its id to just before the id of its next logical successor. The brick architecture is divided into two parts – the Index Module and the Data Module. The index module keeps a list of the list of all the replicas that are cached by the disc, mapped with their hashes. In addition, for each object that is stored, the IM also keeps a list of locations of all other replicas of the object. IM listens to the MRL and updates itself according to membership changes and also according to data being entered into BitVault system or being retrieved from it. The IM is also responsible to initiate repair of replicas once it is informed of a damaged one, and to ask for repair of replicas in its store. The IM is connected to a small Access Module, which serves as the gateway to external clients. Data module stores replicas of objects to a local disc. Along with the object, its metadata such as its hash key and its degree of replication in the BitVault system is also kept.
Inserting data into the BitVault system is called Check In. A Check In requires the object, its key and an initial replication degree. The MRL routes the object and all its parameters to some brick. The brick then stores the data onto its Data Module and starts the job of replicating the object, by publishing it to random bricks, to achieve the specified replication degree. When the object has achieved the required replication degree, its index is said to be complete, otherwise it is partial. The brick must do further replication of an object which has partial index. Bricks also periodically verify that the index of the object is still complete.
Check Out is the process of retrieving data from the BitVault system. The application which uses BitVault as its datastore gives the hash key of the object to be retrieved, which is sent by the MRL to any brick. If the brick doesn’t have the object, it passes the request on to other bricks, in parallel. If the brick has the object, it is retrieved from its Data Module and routed to the requestor.
BitVault faults can be either transient or permanent. A transient failure will occur when a brick is experiencing temporary failure such as a software crash forcing a reboot. A permanent failure indicates errors such as hardware failure. Whenever any fault is detected, other bricks which have a replica of the affected object update the entry of the object in the index to be partial, and thus triggering further replication. All the other bricks containing replicas collaboratively send different parts of the object data, in parallel, to a new brick which will hold the replica. This parallel replication speeds up the repair of a damaged index to get it back to the complete state.
Whenever a new brick is added to the BitVault system, it takes up a random ID and contacts other bricks. The bricks will then include this new brick in their list of members. The newly added brick also gets a response from those bricks which added this to their membership list. The new brick adds the respondents to its membership list. Background load balancing of the system kicks in to populate the new brick with live replicas.
Bricks periodically query other bricks about the load condition in them. The brick then transfers some replicas onto the low-load bricks to get a more or less balanced load on each brick. It also issues messages to other bricks to update their indices to reflect the change.
- Zheng Zhang, Qiao Lian, Shiding Lin, Wei Chen, Yu Chen, Chao Jin (December 2005). BitVault: a Highly Reliable Distributed Data Retention Platform (PDF). Technical Report MSR-TR-2005-179 (Report). Microsoft Research Asia. doi:10.1145/1243418.1243423.