Straggler Mitigation in Distributed Matrix Multiplication: Fundamental Limits and Optimal Coding
17 June 2018
Consider massive matrix multiplication, a problem that underlies many data analytic applications, in a large-scale distributed system comprising a group of workers. We target the stragglers' delay performance bottleneck, which is due to the unpredictable latency in waiting for slowest nodes (or stragglers) to finish their tasks. We propose a novel coding strategy, named entangled polynomial code, designing intermediate computations at the workers in order to minimize the recovery threshold (i.e., the number of workers that we need to wait for in order to compute the final output). We prove the optimality of entangled polynomial code in several cases, and show that it provides order-wise improvement over the conventional schemes for straggler mitigation. Furthermore, we characterize the optimal recovery threshold among all linear coding strategies within a factor of 2 using bilinear complexity, by developing an improved version of the entangled polynomial code.