By placing data near its clients, one can reduce the number of remote accesses, significantly lower the latency of operations, and avoid network congestion. By taking into account node capacity and workload characteristics, one can avoid overloading a few nodes that would otherwise become a bottleneck for the entire system. By considering the failure probability of individual nodes, one can place data so as to maximize its availability, while reducing the overhead of monitoring and of restoring multiple replicas. Most of these criteria impose conflicting requirements, and each application must decide which to prioritize when optimizing placement. This work addresses these criteria, among others, both independently and in combination.
However, the benefits achieved by clever data placement must be weighed against the cost of data lookup. In fact, to support full flexibility in data placement, one needs to resort to some form of distributed directory that stores the mapping between data items and nodes. Unfortunately, the cost of performing lookups on the directory and the overhead of keeping the directory up to date can easily become a bottleneck. For this reason, many practical systems use simple data placement strategies, such as consistent hashing.
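To make the tradeoff concrete, the following is a minimal sketch of consistent hashing, the placement strategy mentioned above: each node is hashed onto a ring (here with virtual nodes to smooth load), and a key is assigned to the first node clockwise from the key's hash. No directory is needed, and adding or removing a node only remaps the keys that node owned. The class and parameter names are illustrative, not taken from any system discussed in this work.

```python
import bisect
import hashlib


class ConsistentHashRing:
    """Minimal consistent-hash ring: a key maps to the first node
    whose (virtual-node) hash is at or after the key's hash."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per physical node (illustrative default)
        self.ring = []        # sorted list of (hash, node) pairs
        for node in nodes:
            self.add_node(node)

    def _hash(self, value):
        # Any uniform hash works; MD5 is used here only for convenience.
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def add_node(self, node):
        # Insert `vnodes` points for this node, keeping the ring sorted.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):
        # Dropping a node only affects keys whose successor was this node.
        self.ring = [(h, n) for (h, n) in self.ring if n != node]

    def lookup(self, key):
        # First ring point with hash >= key hash, wrapping around at the end.
        idx = bisect.bisect_left(self.ring, (self._hash(key), ""))
        if idx == len(self.ring):
            idx = 0
        return self.ring[idx][1]
```

Note the contrast with a directory: `lookup` is a purely local computation, but placement is fixed by the hash function, so data cannot be moved to exploit locality, capacity, or failure characteristics.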
This thesis proposes techniques that offer different tradeoffs between plain consistent hashing schemes and full directory systems, targeting systems of different scales. The main goal is to provide better options between strong flexibility with limited scalability (typically found in datacenter systems) and good scalability with limited flexibility (the usual choice for Internet-scale systems).