Scheduling Saves in Fault-Tolerant Computations.
01 May 1993
Computer users with very long computations run the risk of losing work because of machine failures. Such losses can often be reduced by scheduling saves on secure storage devices of work successfully done. In the model studied here, the user leaves the computation unattended for extended periods of time, after which he or she returns to check whether a machine failure occurs. When a check reveals a failure, the user resets the computation so that it resumes from the point of the last successful save. Saves are themselves time consuming, so that any strategy for scheduling saves must strike a balance between the computing time lost during saves and the computing time that is occasionally lost, because of a failure since the last successful save. For a given time to the next check and given constant save times, this paper computes schedules that maximize the expected amount of work successfully done before the next check, under the uniform and exponential failure laws. Explicit formulas are obtained for the uniform law. A recurrence leads to routine numerical calculations for the more difficult system with an exponential failure law.