By placing data near its clients, one can reduce the number of remote accesses, significantly lower the latency of operations, and avoid network congestion. By taking into account node capacity and workload characteristics, one can avoid overloading a few nodes that would otherwise become a bottleneck for the entire system. By considering the failure probability of individual nodes, one can place data so as to maximize its availability, while reducing the overhead of monitoring and of restoring multiple replicas. Most of these criteria impose conflicting requirements, and each application must decide which to prioritize when optimizing placement. This work addresses these criteria, among others, both independently and in combination.
However, the benefits achieved by clever data placement must be weighed against the cost of data lookup. In fact, to support full flexibility in data placement, one needs to resort to some form of distributed directory that stores the mapping between data items and nodes. Unfortunately, the cost of performing lookups on the directory and the overhead of keeping the directory up to date can easily become a bottleneck. For this reason, many practical systems use simple data placement strategies, such as consistent hashing.
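To make the tradeoff concrete, the following is a minimal sketch of consistent hashing, the placement strategy mentioned above: each node is hashed onto a ring (here with virtual nodes to smooth load), and a key is assigned to the first node clockwise from the key's hash. No directory is needed, and adding or removing a node only remaps the keys that node owned. The class and parameter names are illustrative, not taken from any system discussed in this work.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the first node
    whose (virtual-node) hash is at or after the key's hash."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node (illustrative default)
        self.ring = []        # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        # Any uniform hash works; MD5 is used here only for convenience.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Insert `vnodes` points for this node, keeping the ring sorted.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Dropping a node only affects keys whose successor was this node.
        self.ring = [(h, n) for (h, n) in self.ring if n != node]

    def lookup(self, key):
        # First ring point with hash >= key hash, wrapping around at the end.
        idx = bisect.bisect_left(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

Note the contrast with a directory: `lookup` is a purely local computation, but placement is fixed by the hash function, so data cannot be moved to exploit locality, capacity, or failure characteristics.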
This thesis proposes techniques that offer different tradeoffs between plain consistent hashing schemes and full directory systems, targeting systems of different scales. The main goal is to provide better options between strong flexibility with limited scalability (typically found in datacenter systems) and good scalability with limited flexibility (the usual choice for Internet-scale systems).