Checkpointing is a technique for inserting fault tolerance into computing systems. It basically consists of storing a snapshot of the current application state, and later on, uses it for restarting the execution in case of failure.
There are many different points of view and techniques for achieving application checkpointing. Depending on the specific implementation, a tool can be classified as having several properties:
- Amount of state saved: This property refers to the abstraction level used by the technique to analyze an application. It can range from seeing each application as a black box, hence storing all application data, to selecting specific relevant cores of data in order to achieve a more efficient and portable operation.
- Automatization level: Depending on the effort needed to achieve fault tolerance through the use of a specific checkpointing solution.
- Portability: Whether or not the saved state can be used on different machines to restart the application.
- System architecture: How is the checkpointing technique implemented: inside a library, by the compiler or at operating system level.
Each design decision made affects the properties and efficiency of the final product. For instance, deciding to store the entire application state will allow for a more straightforward implementation, since no analysis of the application will be needed, but it will deny the portability of the generated state files, due to a number of non-portable structures (such as application stack or heap) being stored along with application data.
In distributed shared memory, checkpointing is a technique that helps tolerate the errors leading to losing the effect of work of long-running applications. The main property which should be induced by checkpointing techniques in such systems is in preserving system consistency in case of failure. There are two main approaches to checkpointing in such systems: coordinated checkpointing, in which all cooperating processes work together to establish coherent checkpoint; and communication induced (called also dependency induced) independent checkpointing.
It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. Even if we postulate the existence of global clock, the checkpoints made by different processes still may not form a consistent state. The need for establishing a consistent state may force other process to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect).
In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit protocol algorithm. In communication induced checkpointing, each process checkpoints its own state independently whenever this state is exposed to other processes (that is, for example whenever a remote process reads the page written to by the local process).
The system state may be saved either locally, in stable storage, or in a distant node's memory.
Practical implementations for Linux/Unix
A number of practical checkpointing packages have been developed for the Linux/Unix family of operating systems. These checkpointing packages may be divided into two classes, those which operate in user space, examples of which include the checkpointing package used by Condor and the portable checkpointing library developed by The University of Tennessee. User space checkpointing packages are highly portable and can typically be compiled and run on any modern Unix (e.g. Linux, FreeBSD, OpenBSD, Darwin etc.). In contrast, kernel based checkpointing packages such as Chpox and the checkpointing algorithms developed for the MOSIX cluster computing environment tend to be highly operating system dependent.
Modern checkpointing packages such as Cryopid are capable of checkpointing a "process pod", that is a parent process and all its associated children, and of dealing with file system abstractions such as sockets and pipes (FIFO's) in addition to regular files. In the case of Cryopid, there is also provision to roll all dynamic libraries, open files, sockets and FIFO's associated with the process into the checkpoint. This is very useful when the checkpointed process is to be restarted in a heterogeneous environment (e.g. the machine on which the checkpoint is restarted has libraries and file system which differ from the host on which the process was checkpointed).
Cryopid is now maintained under the SourceForge project Cryopid2. This version of Cryopid will compile on all Linux kernels up to 2.6.27 for 32-bit kernels. Work is in hand to get Cryopid2 working on 64-bit kernels. The cryopid2 package extends Bernard Blackham's original Cryopid package in a number of significant ways. For example, it allows the state of Linux real time signals to be preserved when a checkpoint is taken and also is capable of inter-operating with SSH via a "portal daemon" in order to implement full process migration (and of any associated pod processes) between Linux hosts. Cryopid2 also has the capability to roll up its environment (e.g. the bodies of open files) into the checkpoints it produces. This facilitates the migration of processes onto foreign hosts which present an arbitrary file system environment to an inbound migrating process. Pipes are also preserved in a similar manner: their contents are sucked into the migrating process prior to migration (so they form part of the checkpoint) and spat out into the kernel of the new host when execution is resumed. Cryopid2 is inter-operable with the P3 Organic computing environment which uses its services for both persistence and process migration
DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It does not modify the user's program or the operating system.
Among the applications supported by DMTCP are Open MPI, Python, Perl, and many programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and mmap/mprotect (including mmap-based shared memory). See the QUICK-START file of the distribution for further details.
DMTCP is also the basis for URDB, the Universal Reversible Debugger. URDB is still experimental. Nevertheless, it currently adds reversibility to gdb, Python (pdb), and Perl (perl -d). It also supports reverse expression watchpoints, a form of temporal search within a process lifetime.
OpenVZ kernel has an ability to checkpoint and restart a virtual private server (VPS), i.e. a set of processes and all the data structures associated with those processes (opened files, sockets, IPC objects, network connections, etc.). The primary use of checkpointing is "live migration", a move of a VPS from one physical server to another without a need to shut down and restart it. OpenVZ supports checkpointing on x86, x86-64 and IA-64 architectures.
CRIU is a software tool for Linux operating systems. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to run the application from the point it was frozen at. The main peculiarity of the CRIU project is that it is mainly implemented in user space. The project is currently under development.
Berkeley Lab Checkpoint/Restart (BLCR)
The Future Technologies Group at the Lawrence National Laboratories are developing a hybrid kernel/user implementation of checkpoint/restart called BLCR
Their goal is to provide a robust, production quality implementation that checkpoints a wide range of applications, without requiring changes to be made to application code. This work focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. This work is broken down into 4 main areas:
- Checkpoint/Restart for Linux (CR)
- Checkpointable MPI Libraries
- Resource Management Interface to Checkpoint/Restart
- Development of Process Management Interfaces
- E.N. Elnozahy, L. Alvisi, Y-M. Wang, and D.B. Johnson, "A survey of rollback-recovery protocols in message-passing systems", ACM Comput. Surv., vol. 34, no. 3, pp. 375–408, 2002.
- The Home of Checkpointing Packages
- Yibei Ling, Jie Mi, Xiaola Lin: A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Trans. Computers 50(7): 699-708 (2001)
- R.E. Ahmed, R.C. Frazier, and P.N. Marinos, " Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems", IEEE 20th International Symposium on Fault-Tolerant Computing (FTCS-20), Newcastle upon Tyne, UK, June 26–28, 1990, pp. 82–88.