How to optimally allocate resources for coded distributed computing?

31 July 2017


To execute cloud computing tasks over a data center hosting hundreds of thousands of server nodes, it is natural to distribute computations across the nodes to take advantage of parallel processing. However, as we allocate more computing resources and further distribute the computations, a large amount of intermediate data must be moved between consecutive computation stages among the nodes, causing the communication load to become the bottleneck.

In this paper, we study the optimal resource allocation in distributed computing, in order to minimize the total execution time accounting for the durations of both computation and communication phases. In particular, we consider a general MapReduce-type framework and focus on a recently proposed Coded Distributed Computing approach. For all values of the problem parameters, we characterize the optimal number of servers that should be used for computing, provide the optimal placements of the Map and Reduce tasks, and propose an optimal coded data shuffling scheme. To prove the optimality of the proposed scheme, we first derive a matching information-theoretic converse on the execution time, and then show that among all resource allocation schemes that achieve the minimum execution time, our proposed scheme uses exactly the minimum number of servers.
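To make the coded shuffling idea concrete, the following is a minimal illustrative sketch, not the paper's general construction: K = 3 servers, 3 input files, and each Map task replicated at r = 2 servers. Each intermediate value is split into two halves, and each server multicasts a single XOR of two halves needed by the other two servers; every receiver cancels the half it can compute locally from its own Map outputs. All names (`files_at`, `iv`, `half`) and the use of hash-derived bytes as stand-ins for intermediate values are assumptions made for this toy example.

```python
import hashlib

FILES = {1, 2, 3}
# Hypothetical Map placement: each file is mapped at r = 2 servers.
files_at = {1: {1, 2}, 2: {2, 3}, 3: {1, 3}}
# Reduce task k runs at server k, so server k is missing one file's Map output.
missing = {k: (FILES - files_at[k]).pop() for k in files_at}

def iv(q, n):
    """Stand-in for the intermediate value v_{q,n}: the Map output of file n
    needed by Reduce task q, modeled as a deterministic 2-byte blob."""
    return hashlib.sha256(f"v_{q},{n}".encode()).digest()[:2]

def computable_by(n):
    """Servers that mapped file n and can therefore compute v_{.,n}."""
    return sorted(k for k in files_at if n in files_at[k])

def half(q, n, s):
    """The half (one byte) of v_{q,n} associated with server s."""
    return iv(q, n)[computable_by(n).index(s)]

# Shuffle phase: each server multicasts ONE coded byte (the XOR of the two
# halves it holds that are needed by the other two servers).
multicasts = {}
for s in files_at:
    a, b = sorted(set(files_at) - {s})
    multicasts[s] = half(a, missing[a], s) ^ half(b, missing[b], s)

# Decoding: each server cancels the half it can compute itself as side
# information, then reassembles its missing intermediate value.
recovered = {}
for k in files_at:
    parts = {}
    for s in set(files_at) - {k}:
        o = (set(files_at) - {k, s}).pop()      # the third server
        side_info = half(o, missing[o], s)      # k mapped file missing[o] itself
        parts[s] = multicasts[s] ^ side_info
    recovered[k] = bytes(parts[s] for s in computable_by(missing[k]))

assert all(recovered[k] == iv(k, missing[k]) for k in files_at)
print("all servers decoded their missing intermediate values")
```

In this toy instance the coded shuffle moves 3 half-sized transmissions in total, versus 3 full-sized intermediate values for uncoded unicast, i.e. the communication load drops by the Map replication factor r = 2, which is the kind of gain the Coded Distributed Computing approach trades extra computation for.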