Online Testing of Deployed Federated and Heterogeneous Distributed Systems

October 7, 2011

Dejan Kostic



Time:   11:00am
Location:   IMDEA conference room

It is notoriously difficult to make distributed systems reliable. It is harder still for widely deployed systems that are heterogeneous (multiple implementations) and federated (multiple administrative entities). The set of routers in charge of the Internet’s inter-domain routing is a prime example of such a system.

We argue that a key step in making these systems reliable is to automatically explore system behavior and check for potential faults. In this talk, I will describe the design and implementation of DiCE, a system for online testing of heterogeneous and federated distributed systems. DiCE runs concurrently with the production system by leveraging distributed checkpoints and isolated communication channels. It orchestrates the exploration of relevant system states by controlling the inputs that drive system actions. While respecting privacy among the different administrative entities, DiCE detects faults by checking for violations of properties that capture the desired system behavior. We demonstrate the ease of integrating DiCE with a BGP router and a DNS server, the building blocks of two vital Internet services. Our testbed evaluation shows that DiCE quickly and successfully detects three important classes of faults: those resulting from configuration mistakes, policy conflicts, and programming errors.
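
The explore-and-check idea described above can be illustrated with a very small sketch. This is not DiCE's implementation: it runs in a single process, uses `copy.deepcopy` as a stand-in for distributed checkpoints, and every name in it (`ToyNode`, `origin_unchanged`, `explore`, the prefixes and AS labels) is a hypothetical chosen for illustration only.

```python
import copy


class ToyNode:
    """Hypothetical stand-in for a router: maps a prefix to its announcing origin."""

    def __init__(self):
        self.routes = {}

    def handle(self, message):
        prefix, origin = message
        # Deliberately naive policy (an assumption): accept the latest announcement.
        self.routes[prefix] = origin


def origin_unchanged(before, after):
    """Example property: no already-known prefix may change its origin."""
    return all(after.routes.get(p) == o for p, o in before.routes.items())


def explore(live_node, candidate_inputs, prop):
    """Replay each candidate input on a checkpoint copy and collect violations."""
    violations = []
    for message in candidate_inputs:
        clone = copy.deepcopy(live_node)  # checkpoint: the live node is never modified
        clone.handle(message)             # isolated execution on the copy
        if not prop(live_node, clone):
            violations.append(message)
    return violations


live = ToyNode()
live.handle(("10.0.0.0/8", "AS65001"))

inputs = [
    ("10.0.0.0/8", "AS65001"),      # same origin: harmless
    ("10.0.0.0/8", "AS65002"),      # origin change: violates the property
    ("192.168.0.0/16", "AS65003"),  # new prefix: harmless
]
violations = explore(live, inputs, origin_unchanged)
print(violations)
```

Only the second input is reported, and the live node's routing table is left as it was, which is the point of exploring on a copy rather than on the running system.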

Joint work with Marco Canini, Vojin Jovanovic, Daniele Venzano, Gautam Kumar, Dejan Novakovic, Boris Spasojevic, and Olivier Crameri.