SICS Distributed Systems Laboratory
Swedish Institute of Computer Science

 
 
  CONTENTS
  Overview
Members
Projects
Software
  RESEARCH
  Peer-to-Peer Research
Fully Decentralized Distributed Systems
Distribution Middleware
Simulation of Large Asynchronous Systems
  Internal Pages
  Group Calendar

    SICS
    Box 1263
    SE-16429 Kista
    Sweden

    +46 8 633 1500
    +46 8 751 7230 (fax)
space

Fully Decentralized Distributed Systems

Background

Distributed systems vary in their degree of centralization. Traditionally distributed systems tend to be fairly centralized. The server/client paradigm is the prime example. All essential computation and all services are placed on the server; the client is little more than input/output device, providing little more than the GUI. Server clusters typically rely on centralized coordination, the fact that all machines reside within one administrative domain, uniformity of connection (e.g. the same latency between all participants), centralized maintenance (e.g. code upgrading), monitoring (e.g. failure-detection) and control (e.g. the ability to stop a runaway processor).

The contrast to the traditional centralized approach is fully decentralized systems, recently popularized as peer-to-peer systems. Such distributed systems have a very large potential. Potentially they are

  1. more scalable - centralized solutions introduce a bottleneck at the centralized control point that limits scalability.
  2. more robust - this property is related to the first in that most mechanisms of fault-tolerance rely on redundancy and more machines allow for greater levels of redundancy.
  3. suitable for applications where there is no complete trust - centralized solutions require users to trust the central point. User must trust that the server does not behave incorrectly or maliciously as well as providing the necessary QoS.
  4. suitable for many applications where the centralized solution while technically feasible is not financially feasible.

As peer-to-peer systems can be composed of ordinary machines (PCs) at the edge of the network there is the idea that scalability is more or less for free. For instance, in peer-to-peer collaborative applications each additional user will increase the demand for resources, but unlike server-centric solutions the user is at the same time adding a resource, his machine. If the additional resource demand does not exceed the additional resource supply then the application scales naturally. This also changes the relationship between investment and return. Services may then be sold as a piece of software only; the users themselves will provide the computation and memory resources. This is in stark contrast with the traditional solution where the service provider must in addition to the software also provide a continuously operating server park.

In the area of fully decentralized distributed systems there are numerous research problems that need to be addressed if the full potential is to be realized. We believe that this research will have a wide impact. This will not be limited to pure peer-to-peer applications in new application areas (beyond the file-sharing characteristic of peer-to-peer systems of today) but will have a major impact on all service providers. When the technology for dealing with fully decentralized systems improves this also opens the door for various hybrid solutions where the simplicity of centralized control is optimally mixed with the scalability of decentralization.

Algorithms

Over the years numerous distributed algorithms have been developed and deployed. The problems tackled range from distributed agreement to election and consistency. Much of this work has been targeted to systems with a small number of nodes or a known number of nodes. Various limiting assumptions about the communication network is made such as uniform latencies and the ability to distinguish between permanent and transient failure. These results do not apply at all, or apply badly to large decentralized systems. The prime example of such a scenario is a peer-to-peer application involving millions of machines over the Internet, where new users/machines continually leave and enter the system.

There is a need to re-examine the traditional algorithms for distributed computing, adapting them where this is possible. Where adaptation does not work or works badly there is a need to develop new algorithms. There are three areas where the need for new and better algorithms is already fairly well established and where the need is acute.

1) Decentralized naming services

One crucial service of any multi-user system is the directory or naming service. This service maps names to addresses, and keys to data items. In centralized systems such a service is easy to implement as a database lookup at the central point. For fully decentralized, dynamic and large systems this does not work. The size of the system may not allow each machine to hold information about all names. Also often the system is so dynamic with data items being moved about as machines enter and leave the system that even if size were not a limiting factor it would not be possible to maintain such tables without flooding the network. Essentially there is a trade-off between table sizes, the time it takes to find an item, and the messages necessary to maintain the system. One research strand is concerned with trying to find the most optimal decentralized naming service. We want an algorithm that can be tuned according to the system parameters (frequency of membership changes, lookups, system size) for optimality.

2) Fault-tolerance

Algorithms for fault-tolerance in the decentralized and dynamic setting need to be able to cope with a continuous changing virtual network of connected machines. We have to assume that we are unable to distinguish between slow connections and crashes. Traditional redundancy algorithms need to be reexamined and adapted if possible. Exactly how this should be done and indeed if suitable adaptations can be made is not known. It is also of interest to look at the other category of algorithms for fault-tolerance, stabilization algorithms. These algorithms are inherently more robust to change in dynamic networks.

3) Consistency models

Traditional distributed computing works with exact consistency models and the algorithms necessary to achieve this (e.g. three-phase commit for distributed transactions). These 'strict' consistency models are of course necessary and do have their place in peer-to-peer systems. However, such algorithms do not scale well, and complicate fault-tolerance. The contrast is ad-hoc 'best effort' systems in real-world applications like virtual reality. We see a need to formalize a framework for consistency models and the algorithms to achieve them. This framework should capture applications like virtual reality where we have the requirement that knowledge of things that are close should be more accurate and up-to-date than things that are far away. We should not be limited to 'best effort' but need mechanisms for quality of service in so far as this is possible and quality awareness where not.