Checkpointing is a technique to add fault tolerance into computing systems. It basically consists on saving a snapshot of the application's state, so that it can restart from that point in case of failure. This is particularly important for long running application that are executed in vulnerable computing system.
- 1 Checkpointing in distributed systems
- 2 Technique properties
- 3 Implementations for parallel and distributed applications
- 4 Implementations for single nodes
- 5 Implementation for embedded and ASIC devices
- 6 References
- 7 External links
- 8 Further reading
Checkpointing in distributed systems
In distributed computing, checkpointing is a technique that helps tolerate failures that otherwise would force long-running application to restart from the beginning. The most basic way to implement checkpointing, is to stop the application, copy all the required data from the memory to reliable storage (e.g., Parallel file system) and then continue with the execution. Checkpointing implementations should preserve system consistency. There are two main approaches for checkpointing in such systems: coordinated checkpointing and uncoordinated checkpointing. In the coordinated checkpointing approach, processes must ensure that their checkpoints are consistent. This is usually achieved by some kind of two-phase commit protocol algorithm. In uncoordinated checkpointing, each process checkpoints its own state independently. It must be stressed that simply forcing processes to checkpoint their state at fixed time intervals is not sufficient to ensure global consistency. The need for establishing a consistent state (i.e., no missing messages or duplicated messages) may force other processes to roll back to their checkpoints, which in turn may cause other processes to roll back to even earlier checkpoints, which in the most extreme case may mean that the only consistent state found is the initial state (the so-called domino effect).
There are many different points of view and techniques for achieving application checkpointing. Depending on the specific implementation, a tool can be classified as having several properties:
- Amount of state saved: This property refers to the abstraction level used by the technique to analyze an application. It can range from seeing each application as a black box, hence storing all application data, to selecting specific relevant cores of data in order to achieve a more efficient and portable operation.
- Automatization level: Depending on the effort needed to achieve fault tolerance through the use of a specific checkpointing solution.
- Portability: Whether or not the saved state can be used on different machines to restart the application.
- System architecture: How is the checkpointing technique implemented: inside a library, by the compiler or at operating system level.
Each design decision made affects the properties and efficiency of the final product. For instance, deciding to store the entire application state will allow for a more straightforward implementation, since no analysis of the application will be needed, but it will deny the portability of the generated state files, due to a number of non-portable structures (such as application stack or heap) being stored along with application data.
Implementations for parallel and distributed applications
Fault Tolerance Interface (FTI)
FTI is a library that aims to provide computational scientitsts with an easy way to perform checkpoint/restart in a scalable fashion. FTI leverages local storage plus multiple replications and erasures techniques to provide several levels of reliability and performance. FTI provides application-level checkpointing that allows users to select which data needs to be protected, in order to improve efficiency and avoid space, time and energy waste. It offers a direct data interface so that users do not need to deal with files and/or directory names. All metadata is managed by FTI in a transparent fashion for the user. If desired, users can dedicate one process per node to overlap fault tolerance workload and scientific computation, so that post-checkpoint tasks are executed asynchronously.
Berkeley Lab Checkpoint/Restart (BLCR)
The Future Technologies Group at the Lawrence National Laboratories are developing a hybrid kernel/user implementation of checkpoint/restart called BLCR. Their goal is to provide a robust, production quality implementation that checkpoints a wide range of applications, without requiring changes to be made to application code. BLCR focuses on checkpointing parallel applications that communicate through MPI, and on compatibility with the software suite produced by the SciDAC Scalable Systems Software ISIC. Its work is broken down into 4 main areas: Checkpoint/Restart for Linux (CR), Checkpointable MPI Libraries, Resource Management Interface to Checkpoint/Restart and Development of Process Management Interfaces.
DMTCP (Distributed MultiThreaded Checkpointing) is a tool for transparently checkpointing the state of an arbitrary group of programs spread across many machines and connected by sockets. It does not modify the user's program or the operating system. Among the applications supported by DMTCP are Open MPI, Python, Perl, and many programming languages and shell scripting languages. With the use of TightVNC, it can also checkpoint and restart X Window applications, as long as they do not use extensions (e.g. no OpenGL or video). Among the Linux features supported by DMTCP are open file descriptors, pipes, sockets, signal handlers, process id and thread id virtualization (ensure old pids and tids continue to work upon restart), ptys, fifos, process group ids, session ids, terminal attributes, and mmap/mprotect (including mmap-based shared memory). DMTCP supports the OFED API for InfiniBand on an experimental basis.
Implementations for single nodes
A number of checkpointing packages have been developed for the Linux/Unix family of operating systems. These checkpointing packages may be divided into two classes, those which operate in user space, examples of which include the checkpointing package used by Condor and the portable checkpointing library developed by The University of Tennessee. User space checkpointing packages are highly portable and can typically be compiled and run on any modern Unix (e.g. Linux, FreeBSD, OpenBSD, Darwin etc.). In contrast, kernel based checkpointing packages such as Chpox and the checkpointing algorithms developed for the MOSIX cluster computing environment tend to be highly operating system dependent.
Modern checkpointing packages such as Cryopid are capable of checkpointing a "process pod", that is a parent process and all its associated children, and of dealing with file system abstractions such as sockets and pipes (FIFO's) in addition to regular files. In the case of Cryopid, there is also provision to roll all dynamic libraries, open files, sockets and FIFO's associated with the process into the checkpoint. This is very useful when the checkpointed process is to be restarted in a heterogeneous environment (e.g. the machine on which the checkpoint is restarted has libraries and file system which differ from the host on which the process was checkpointed). Cryopid is now maintained under the SourceForge project Cryopid2. This version of Cryopid will compile on all Linux kernels up to 2.6.27 for 32-bit kernels. Work is in hand to get Cryopid2 working on 64-bit kernels.
CRIU is a software tool for Linux operating systems. Using this tool, you can freeze a running application (or part of it) and checkpoint it to a hard drive as a collection of files. You can then use the files to run the application from the point it was frozen at. The main peculiarity of the CRIU project is that it is mainly implemented in user space. The project is currently under development. CRIU is a project of OpenVZ. OpenVZ kernel has an ability to checkpoint and restart a virtual private server (VPS), i.e. a set of processes and all the data structures associated with those processes (opened files, sockets, IPC objects, network connections, etc.). The primary use of checkpointing is "live migration", a move of a VPS from one physical server to another without a need to shut down and restart it. OpenVZ supports checkpointing on x86, x86-64 and IA-64 architectures.
Implementation for embedded and ASIC devices
Mementos is a software system that transform general-purpose tasks into interruptible program for platforms with frequent power outages. It has been designed for batteryless embedded devices such as RFID tags and smart cards which rely on harvesting energy from ambient background. Mementos frequently senses the available energy in the system, and decides to checkpoint the program or continue the computation. In case of checkpointing, data will be stored in a non-volatile memory. When the energy become sufficient for reboot, the data will be retrieved from the memory, and the program continues from the stored state. Mementos has been implemented on the MSP430 family of microcontrollers. Mementos is named after Christopher Nolan's Memento.
Idetic is a set of automatic tools which helps Application-specific integrated circuit (ASIC) developers to automatically embeds checkpoints in their designs. It targets high-level synthesis tools and adds the checkpoints at the register-transfer level (Verilog code). It uses a dynamic programming approach to locate low overhead points in the state machine of the design. Since the checkpointing in hardware level involves sending the data of dependent registers to a non-volatile memory, the optimum points requires to have minimum number of registers to store. Idetic is deployed and evaluated on energy harvesting RFID tag device.
- Plank, J. S., Beck, M., Kingsley, G., & Li, K. (1994). Libckpt: Transparent checkpointing under unix. Computer Science Department.
- Bouteiller, B., Lemarinier, P., Krawezik, K., & Capello, F. (2003, December). Coordinated checkpoint versus message log for fault tolerant MPI. In Cluster Computing, 2003. Proceedings. 2003 IEEE International Conference on (pp. 242-250). IEEE.
- Elnozahy, E. N., Alvisi, L., Wang, Y. M., & Johnson, D. B. (2002). A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR), 34(3), 375-408.
- Bautista-Gomez, L., Tsuboi, S., Komatitsch, D., Cappello, F., Maruyama, N., & Matsuoka, S. (2011, November). FTI: high performance fault tolerance interface for hybrid systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (p. 32). ACM.
- Hargrove, P. H., & Duell, J. C. (2006, September). Berkeley lab checkpoint/restart (blcr) for linux clusters. In Journal of Physics: Conference Series (Vol. 46, No. 1, p. 494). IOP Publishing.
- Ansel, J., Arya, K., & Cooperman, G. (2009, May). DMTCP: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on (pp. 1-12). IEEE.
- Benjamin Ransford, Jacob Sorber, and Kevin Fu. 2011. Mementos: system support for long-running computation on RFID-scale devices. SIGPLAN Not. 47, 4 (March 2011), 159-170. DOI=10.1145/2248487.1950386 http://doi.acm.org/10.1145/2248487.1950386
- Mirhoseini, A.; Songhori, E.M.; Koushanfar, F., "Idetic: A high-level synthesis approach for enabling long computations on transiently-powered ASICs," Pervasive Computing and Communications (PerCom), 2013 IEEE International Conference on , vol., no., pp.216,224, 18–22 March 2013 URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6526735&isnumber=6526701
- Yibei Ling, Jie Mi, Xiaola Lin: A Variational Calculus Approach to Optimal Checkpoint Placement. IEEE Trans. Computers 50(7): 699-708 (2001)
- R.E. Ahmed, R.C. Frazier, and P.N. Marinos, " Cache-Aided Rollback Error Recovery (CARER) Algorithms for Shared-Memory Multiprocessor Systems", IEEE 20th International Symposium on Fault-Tolerant Computing (FTCS-20), Newcastle upon Tyne, UK, June 26–28, 1990, pp. 82–88.