QUASI BLACK HOLE EFFECT OF GRADIENT DESCENT IN LARGE DIMENSION: CONSEQUENCE ON NEURAL NETWORK LEARNING

01 January 2019


Gradient descent to a local minimum is the key ingredient of deep neural network learning techniques. We consider a function L_m(.) in dimension n with a random set of m absolute minima. When log m = o(n), we show that a gradient descent from a random initial point almost always ends at a unique local minimum located approximately at the centroid of the absolute minima. This fake minimum acts like an absorbing node, but the value of L_m(.) there can be far above the values of L_m(.) at the absolute minima, sometimes giving very bad coefficients for the neural network. Fortunately, in most cases the fake minimum leads to a neural network with not-so-bad predictions, with an error rate of order n^{-1/4}. The only way to escape the fake minimum is to start a new gradient descent from a new random point, and we show that finding a good initial point takes an average time at least proportional to e^{bn}/(mn^2) for some b > 0.
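The abstract does not specify L_m, so the following is only a qualitative sketch, not the paper's construction: it builds a smooth stand-in loss (a soft-min of the squared distances to m random points, with an illustrative temperature beta) and runs plain gradient descent from a random start. All names here (W, beta, grad) are assumptions made for illustration. The point it demonstrates: in high dimension, the squared distances from a random point to the m minima concentrate around a common value, so the softmax weights stay nearly uniform and the descent is pulled toward the centroid of the minima rather than toward any one of them.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 200, 50                   # dimension n, number of absolute minima m (log m << n)
W = rng.standard_normal((m, n))  # hypothetical random absolute minima w_1, ..., w_m
beta = 0.01                      # smoothing temperature (illustrative choice, roughly 2/n)

def grad(x):
    """Gradient of the soft-min loss L(x) = -(1/beta) log mean_i exp(-beta ||x - w_i||^2)."""
    d2 = np.sum((W - x) ** 2, axis=1)   # squared distance to each minimum
    p = np.exp(-beta * (d2 - d2.min()))
    p /= p.sum()                        # softmax weights over the m minima
    return 2.0 * (x - p @ W)            # weighted pull toward each minimum

x = rng.standard_normal(n)              # random initial point
for _ in range(5000):
    x -= 1e-2 * grad(x)                 # plain gradient descent

centroid = W.mean(axis=0)
print("distance from endpoint to centroid:",
      np.linalg.norm(x - centroid))
print("distance from endpoint to nearest absolute minimum:",
      np.min(np.linalg.norm(W - x, axis=1)))
```

Running this prints an endpoint far closer to the centroid than to any absolute minimum, mirroring the absorbing fake minimum described above. Note the limits of the sketch: with this heavy smoothing the centroid is genuinely the surrogate's minimum, so it illustrates only the centroid attraction, not the escape-by-restart dynamics or the e^{bn}/(mn^2) restart cost quantified in the paper.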