SICS
Nornir - a quality assurance platform for distributed software

 
 
  CONTENTS
  Overview
  Motivation
  Design
  Publications
  Download
  Project funding
  People
  Links

Overview

Nornir is a platform for testing complex distributed and parallel software. It is based on a complete system simulator, Simics, which provides a deterministic execution environment. The simulator allows application executions to be reproduced and observed without intrusion, thereby enabling debugging of intermittent and timing-dependent errors. On its own, however, a simulator only lets a user probe for low-level information (e.g. physical memory contents, disk blocks), which is not immediately useful for debugging high-level software. Nornir therefore complements the simulator with a translation framework that maps low-level simulator data to high-level, application-related data, and presents it to the user in a modified version of the GNU debugger (GDB).

Nornir also provides the user with a programmable environment, where a user or tool builder can probe the state of an entire distributed application, at multiple abstraction layers, and also control the simulated environment and input model. These capabilities provide the means to construct tools that are very hard to build on real machines, for example testing/debugging tools for race conditions or performance profilers for real-time and distributed software. Nornir is also designed to support white-box testing of complex distributed applications, allowing developers to write application monitoring routines that inspect the integrity of global application state.

Our current research mostly involves two areas: test methods for intermittent errors (aka race conditions) in concurrent and distributed applications, and constructing programmable, robust testing environments for networked embedded systems. Some information on these areas can be found below, under Publications. More information will be provided here as we publish results in academic fora. SICS's industrial contacts may obtain articles with more information upon request. Send a mail to lalle@sics.se.

Motivation

Creating high quality computer programs is difficult. Computer programming is still an immature craft, and the tools at our disposal are few and simple. Creating programs that are asynchronous, concurrent, distributed, or have real-time requirements is even more difficult, since their execution depends on the timing and interleaving of events in a system, which we usually cannot control. Unfortunately, these troublesome types of programs are rapidly becoming more common, due to a number of technology changes occurring right now:

  • The processor vendors can no longer increase the speed of single-threaded execution cores, and are instead putting multiple computing cores on each chip. This forces software developers who care about performance to parallelise their software. It is difficult to develop parallel programs, and errors in parallel programs are often intermittent. Unfortunately, the traditional test and debug cycle, which is our primary software quality assurance method, is inefficient for finding intermittent concurrency errors.
  • Distributed applications, for example web services and peer-to-peer applications, are increasing in numbers faster than monolithic applications. Distributed software is also susceptible to concurrency errors, similarly to parallel software, but the intermittent behaviour depends on the order of message delivery rather than thread scheduling.
  • Many new distributed applications involve heterogeneous platforms, for example sensor networks and mobile applications. The few existing methods that specifically address quality assurance of distributed applications tend to require homogeneous environments, and are not applicable to heterogeneous systems. Furthermore, nodes involved in such applications often have limited resources, and provide few debugging capabilities.
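
To illustrate why such errors are intermittent, the following minimal sketch enumerates every interleaving of two threads that each increment a shared counter in two steps (read, then write). The thread names and the two-step increment are assumptions made for illustration; they are not taken from Nornir.

```python
from itertools import permutations


def run(schedule):
    """Execute one interleaving. Each thread reads the counter, then writes counter + 1."""
    counter = 0
    local = {}                    # value each thread read in its first step
    step = {"A": 0, "B": 0}       # how many steps each thread has executed
    for thread in schedule:
        if step[thread] == 0:
            local[thread] = counter        # step 1: read shared counter
        else:
            counter = local[thread] + 1    # step 2: write back incremented value
        step[thread] += 1
    return counter


def interleavings():
    """All distinct schedules in which threads A and B each take two steps."""
    return sorted(set(permutations("AABB")))


# Map each schedule (e.g. "ABAB") to the final counter value it produces.
results = {"".join(schedule): run(schedule) for schedule in interleavings()}
lost_updates = [s for s, value in results.items() if value != 2]

if __name__ == "__main__":
    for schedule, value in results.items():
        print(schedule, value)
    print("schedules losing an update:", lost_updates)
```

Only the two schedules in which one thread finishes before the other starts give the intended result. On a real machine the scheduler picks the interleaving, so a test run may pass many times before the faulty ordering appears.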

In comparison to other areas of computer science, practical software quality assurance has received very little attention, and academic research has not kept up with the needs of modern software. As a result, we have no good practical methods and tools for aiding developers in improving the quality of distributed, multithreaded, reactive programs with soft real-time requirements. Instead, the tools available, such as profilers, debuggers, and testing tools, were developed more than 30 years ago, and are tailored for long-running batch programs with predictable input, and without user interactions or expectations of short response times.

Using the tools available today to address the quality of distributed software can be a frustrating experience. Testing tools fail to catch concurrency errors: since such errors are often not reproducible, a program that seems to work during testing may fail when put in production. Debugging tools affect the execution, preventing intermittent errors from being debugged efficiently. Profiling tools typically provide information on time or resource consumption in a single process, aggregated over the whole execution, whereas developers of complex software would be more interested in why an application sometimes fails to respond promptly, and which processes are causing the problem.

Let's assume for a while that we could get three wishes fulfilled for improving quality assurance of distributed applications. No unrealistic things, such as magic wands that point out all errors, but three things achievable with technology and no more unrealistic than a spacecraft. We have a suggestion for these wishes:

  • A machine or environment that provides reproducible executions, allowing even intermittent and timing-dependent errors to be tested and reproduced. We must of course test the same software that we ship, including third-party components, so the execution environment has to run unmodified software in binary form, and support heterogeneous, networked applications running on a mix of hardware and operating systems.
  • There is little point in reproducing test failures unless we can debug the erroneous behaviour, so we also need to be able to observe test execution. We will want to observe low-level details, such as bits in packets and registers, but also high-level information, such as variable and database table contents. In order to preserve reproducibility, observation cannot affect the execution, and it must therefore not rely on services in the software under test, which traditional observation tools do. In order to perform performance analysis of distributed systems, we need to observe multiple processes, potentially running in different types of run-time environments, and we also need to observe time flow and causality between distributed components.
  • Our software should work properly under the harsh conditions that it is expected to handle when put in production, not only the conditions that usually appear on our test machines. In order to test it under such conditions, we need a hostile environment that exhibits all types of unusual, but possible, erratic behaviour that we can expect from the real world: unusual interleavings of events, timings of the services the software uses, hardware and software component faults, communication errors, etc.

If we could have these wishes fulfilled, we would really have the means for testing complex distributed software, even for errors that are very hard and costly to find with existing methods. Naturally, we would not be able to achieve any significant coverage of test cases and erratic environment behaviour by manual testing, so a testing environment that supports the wishes above would have to be fully programmable, and allow testers to write automated test runners, routines that check application integrity, and chaos injectors that attempt to provoke intermittent errors.
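
The three kinds of routines mentioned above can be sketched in a few lines. This is a toy model, not Nornir code: the message set `OPS`, the function names, and the use of a seeded random generator as the chaos source are all assumptions made for illustration.

```python
import random

# Hypothetical update messages sent to a replica; they do not commute.
OPS = [("add", 5), ("mul", 2), ("add", 1)]


def apply_ops(ops):
    """Apply update messages to a replica whose value starts at 1."""
    value = 1
    for kind, arg in ops:
        value = value + arg if kind == "add" else value * arg
    return value


def chaotic_delivery(ops, seed):
    """Chaos injector: deliver the messages in an order determined only by the seed."""
    rng = random.Random(seed)
    shuffled = list(ops)
    rng.shuffle(shuffled)
    return shuffled


def replicas_agree(seed):
    """Integrity check: a replica fed reordered messages must match the reference."""
    return apply_ops(chaotic_delivery(OPS, seed)) == apply_ops(OPS)


def find_failing_seed(max_seeds=100):
    """Automated test runner: try seeds until the integrity check fails."""
    for seed in range(max_seeds):
        if not replicas_agree(seed):
            return seed
    return None


if __name__ == "__main__":
    seed = find_failing_seed()
    print("failing seed:", seed)
    print("delivery order:", chaotic_delivery(OPS, seed))
```

Because the injected disorder depends only on the seed, a failure found by the runner can be replayed exactly by reusing that seed, which is the property a deterministic execution environment provides for whole systems.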

Nornir is an attempt to fulfil these wishes, or actually the last two. The first one already exists - it is called a complete system simulator (or sometimes full system simulator). It is essentially a binary compatible computer implemented in software, capable of running unmodified general-purpose operating systems and applications. The concept was first introduced as early as 1984, later reinvented in the SimOS project, and Simics, a commercial implementation, is now available. A complete system simulator that is designed to be deterministic always produces identical executions for a given initial state and simulation model. It also provides the means to fulfil the other two wishes: the simulator's state can be probed without intrusion, the machine model allows timing to be changed and supports fault injections, and the primary simulator services can be accessed through a programming interface.
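
The determinism property can be illustrated with a toy event-driven machine that runs on virtual time only. This is a sketch under stated assumptions, not the Simics programming interface: the `simulate` function, the node names, and the latency table standing in for the timing model are all hypothetical.

```python
import heapq


def simulate(latency):
    """Run a toy machine model.

    `latency` maps a node name to its message delay in virtual ticks and
    plays the role of the timing model. No wall-clock time is consulted.
    """
    queue = []   # entries are (virtual_time, sequence_number, event)
    seq = 0      # sequence number breaks ties deterministically
    for node in ("node_a", "node_b"):
        for i in range(2):
            send_time = i * 10
            arrival = send_time + latency[node]
            heapq.heappush(queue, (arrival, seq, f"{node}:msg{i}"))
            seq += 1

    trace = []
    while queue:
        time, _, event = heapq.heappop(queue)
        trace.append((time, event))
    return trace


if __name__ == "__main__":
    model = {"node_a": 3, "node_b": 5}
    print(simulate(model))                         # same model, same trace, every run
    print(simulate({"node_a": 7, "node_b": 5}))    # changed timing, changed interleaving
```

Running the same model twice yields identical traces, while changing one latency value in the model changes the order in which messages arrive. This is the mechanism that allows both reproduction of a failure and deliberate exploration of other timings.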

Design

Although a complete system simulator can be probed for all system state that is visible to software, the information retrieved is raw, binary information that has been transformed by compilers, virtual machines, and operating systems, and is no longer easily comprehensible to humans. In order to make this information useful for a programmer, it must be translated back to the abstraction level the programmer deals with, i.e. to variables and types in the programming languages used in the application. Nornir therefore includes virtual machine translators, platform-specific modules that translate low-level data obtained from the simulator to data corresponding to the virtual machine of an application process. This virtual machine data is presented in standard symbolic debuggers, such as GDB. Today, Nornir supports non-intrusive debugging of Linux applications through a specialised version of GDB, but the design allows virtual machine translators to be stacked and scaled to higher abstraction layers, for example to Java, interpreted languages, web application builders, or database applications.
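
The idea of translating raw simulator data back to program variables can be sketched as follows. The memory image, the symbol table layout, and the function names are assumptions made for this example; Nornir's actual translators and the information they take from debug symbols are not shown here.

```python
import struct

# Hypothetical raw memory image, as a simulator might expose it.
MEMORY = bytearray(64)
struct.pack_into("<i", MEMORY, 16, 42)            # int request_count
struct.pack_into("<d", MEMORY, 24, 0.75)          # double load_average
struct.pack_into("<8s", MEMORY, 32, b"worker1")   # char name[8], NUL padded

# Hypothetical symbol table: variable name -> (address, struct format).
SYMBOLS = {
    "request_count": (16, "<i"),
    "load_average": (24, "<d"),
    "name": (32, "<8s"),
}


def read_raw(address, size):
    """Lowest layer: copy raw bytes out of the image without modifying it."""
    return bytes(MEMORY[address:address + size])


def read_variable(name):
    """Translation layer: turn raw bytes into a typed, named value."""
    address, fmt = SYMBOLS[name]
    raw = read_raw(address, struct.calcsize(fmt))
    (value,) = struct.unpack(fmt, raw)
    if isinstance(value, bytes):
        # Present C-style strings as text, dropping the NUL padding.
        value = value.rstrip(b"\x00").decode("ascii")
    return value


if __name__ == "__main__":
    for symbol in SYMBOLS:
        print(symbol, "=", read_variable(symbol))
```

A further translator could be stacked on top of `read_variable` in the same way, for example to present the objects of an interpreted language in terms of the interpreter's own data structures.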

The main frontend of Nornir is the debugger shepherd, a common debugging interface for distributed applications. The shepherd controls multiple debuggers, and provides access to all software state in a system. It is programmable, and allows users to write routines for checking application integrity and for monitoring distributed sequences of events. It also supports the creation of user-defined debugging abstractions, for example breakpoints on events other than code execution, such as packet contents, database queries, or elapsed time.
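
A rough sketch of how such a programmable frontend might be used is given below. The `NodeDebugger` and `Shepherd` classes, their methods, and the leader-election scenario are all hypothetical and do not reflect the actual shepherd interface. In a real setup the simulator, not the shepherd, would drive the state changes.

```python
class NodeDebugger:
    """Stand-in for one per-process debugger session (hypothetical)."""

    def __init__(self, name, variables):
        self.name = name
        self.variables = dict(variables)

    def read(self, variable):
        return self.variables[variable]


class Shepherd:
    """Toy frontend giving one view of several debuggers (hypothetical)."""

    def __init__(self, debuggers):
        self.debuggers = {d.name: d for d in debuggers}
        self.breakpoints = []   # list of (label, predicate(shepherd, time))

    def read(self, node, variable):
        return self.debuggers[node].read(variable)

    def break_when(self, label, predicate):
        """Register a user-defined breakpoint on global state or virtual time."""
        self.breakpoints.append((label, predicate))

    def run(self, events):
        """Replay simulated state changes; stop at the first breakpoint that triggers."""
        for time, node, variable, value in events:
            self.debuggers[node].variables[variable] = value
            for label, predicate in self.breakpoints:
                if predicate(self, time):
                    return label, time
        return None


def leader_count(shepherd):
    """Integrity routine over global state: how many nodes believe they lead."""
    return sum(d.read("is_leader") for d in shepherd.debuggers.values())


shepherd = Shepherd([
    NodeDebugger("n1", {"is_leader": False}),
    NodeDebugger("n2", {"is_leader": False}),
])
shepherd.break_when("two leaders", lambda s, t: leader_count(s) > 1)
shepherd.break_when("timeout", lambda s, t: t >= 5)

# Hypothetical event stream: (virtual time, node, variable, new value).
hit = shepherd.run([
    (1, "n1", "is_leader", True),
    (4, "n2", "is_leader", True),
    (6, "n1", "is_leader", False),
])

if __name__ == "__main__":
    print(hit)
```

The first breakpoint is a condition on the state of two processes at once, and the second is a condition on elapsed virtual time; neither corresponds to a code location in any single process.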

Although Nornir can be used as an interactive debugger for distributed software, its main purpose is to provide a programmable platform for construction of new types of tools that are hard or impossible to build today: performance profilers for soft real-time and distributed applications, tools that allow testing for all types of concurrency-related errors, testing tools for fault-tolerant software, etc. It is also designed to be a platform for building white-box testing infrastructure for distributed software.

By using complete system simulation as a base, we avoid many of the limitations and drawbacks of alternative approaches. Our aim is to create a new quality assurance method that, unlike most research methods, places no requirements on the application under test, such as hardware/operating system/programming language homogeneity, complete access to source code, etc. Moreover, our method does not require any changes to the software development process and allows for incremental adoption, since Nornir can run applications and existing test suites without modification. The only major disadvantage of simulation as a method is the performance overhead, but for many applications, we believe that the benefits will outweigh this overhead.

Publications

Please see the copyright information for terms of use.

Lars Albertsson. Entropy injection. SICS technical report T2007-02.
[ pdf | ps.Z | Abstract ]

Lars Albertsson. Holistic debugging. SICS technical report T2006-14. This report is the full version of the paper presented at MASCOTS 2006. This is also the version that was actually reviewed and accepted for publication at the conference. If you are curious as to why the full paper was not published in the conference proceedings, read here.
[ bib | pdf | ps.Z | Abstract ]

Lars Albertsson. Holistic debugging - enabling instruction set simulation for software quality assurance. In Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Monterey, California, September 2006.
[ bib | pdf | ps | ps.gz | Abstract ]

Lars Albertsson. Temporal debugging and profiling of multimedia applications. In Martin G. Kienzle and Prashant J. Shenoy, editors, Multimedia Computing and Networking 2002, volume 4673 of Proceedings of SPIE, pages 196-207, January 2002.
[ bib | pdf | ps | ps.gz | Abstract ]

Lars Albertsson. Simulation-based debugging of soft real-time applications. In Proceedings of the Real-Time Application Symposium. IEEE Computer Society, IEEE Computer Society Press, May 2001.
[ bib | pdf | ps | ps.gz | Abstract ]

Lars Albertsson and Peter S. Magnusson. Simulation-based temporal debugging of Linux. In Proceedings of the Second Real-Time Linux Workshop, December 2000.
[ bib | pdf | ps | ps.gz | Abstract ]

Lars Albertsson and Peter S. Magnusson. Using complete system simulation for temporal debugging of general purpose operating systems and workloads. In Proceedings of MASCOTS 2000. IEEE Computer Society, IEEE Computer Society Press, August 2000.
[ bib | pdf | ps | ps.gz | Abstract ]

Download

Snapshots of Nornir can be downloaded from the distribution directory. Nornir is distributed under a combination of open source licences. Most of the code developed by SICS is distributed under a BSD licence. The distribution also includes a number of open source components distributed under various licences. In order to run Nornir, you also need to obtain a Simics distribution. You should be aware, however, that Nornir is experimental research software, and unlikely to solve problems in industrial software in its current state.

Project funding

The Nornir project is funded by the EU sixth framework project RUNES.

Development of the Nornir environment has previously been funded by the ARTES research program and by Vinnova, through the Time Bending project.

People

Lars Albertsson

Anders Wallberg

Fredrik Österlind

Links

SICS

Computer and Systems Laboratory

Simics

RUNES

ARTES

Last updated: $Date: 2004/08/13 14:47:01 $ (CEST)