Optimal Control of Storage Regeneration with Repair Codes
11 December 2017
High availability of containerized applications requires robust storage of application state. Since basic replication is extremely costly at scale, storage space requirements can be reduced by means of erasure codes and/or repair codes. In this paper we address storage regeneration using repair codes, a robust distributed storage technique that does not require restoring the whole state after a failure: only the lost servers need to be replaced. To do so, new clean-slate storage units are made operational, incurring a cost for activating the new storage servers and a cost for transferring the repair data. Our target is to guarantee maximal availability of the containers' state files: when a fault occurs at a subset of the storage servers, we aim to ensure that they are repaired by a given deadline. We introduce a controlled fluid model to quantify the optimal activation policy of replacement servers in order to repair such correlated faults. The solution concept is the optimal control of regeneration using the Pontryagin minimum principle. We characterize feasibility conditions and prove that the optimal policy is of threshold type. Numerical results describe how to apply the model for system dimensioning and show the tradeoff between activation of servers and transfer of information.