- Aishwarya Pothula

# Efficient Exact Gradient Update for Training Deep Networks with Very Large Sparse Targets

The paper proposes an algorithmic approach that can calculate the exact loss for a wide range of loss functions, gradient updates for output weights and gradient for back-propagation in neural networks with large output spaces and sparse high dimensional inputs. In general, these calculations have been handled using approximating functions such as hierarchical softmax or sampling-based approximations. The proposed method, apart from calculating exact values, does the calculations O(d2) in contrast to O(Dd) without having to compute the D-dimensional output.

Sampling approximations compute only a fraction of the output’s dimensions sampled randomly and Hierarchical softmax makes use of a heuristically defined hierarchical tree structure to calculate the normalized probability of the target class. When compared to the full dimensions of the output, these are crude approximations. Whereas, the proposed method efficiently computes the exact gradient update equivalent to considering all output dimensions without actually computing the D outputs.

Handling sparse inputs is trivial, however, having to compute high dimensional outputs from high dimensional sparse inputs is highly expensive. Let us explore the problem statement formally. To represent a K-sparse vector in a D dimensional space, storing only the indices and values of non-zero elements is sufficient. Multiplication of such a vector by a dense dxD matrix is O(Kd) instead of O(Dd). Consider a network with high dimensional inputs D but K-sparse with only d dimensional intermediate layers such that K<<d<<D, trained with square loss. In forward propagation, the input-hidden weight matrix is O(Dd) or O(Kd) and the hidden layers weight matrices are O(d2 ). The output is O(Dd) but since the hidden representation is dense, computing the final matrix is still prohibitively expensive. It is a similar case for back-propagation too. However, the paper shows that it is possible to compute the loss L, gradient and gradient update in O(d2 ) time.

The proposed algorithm works by, first, keeping an up-to-data dxd dimension weight matrix Q = WTW where W is the output matrix. Doing so allows for the computation of loss without computing the output in O(d2) time. Second, W updates us with a non-sparse rank-one update, which updates all of W, but can be explicitly represented as a factorization into DxD and dxd. dxd is updated completely but only K rows are updated in the Dxd matrix, given that its inverse (O(d2)) is kept up-to-date. These steps allow for a two to three orders of magnitude improvement in computational benefit as well as reduced memory access. These update modifications are applicable to the spherical loss family but not to regular softmax. The complexity of the proposed algorithm is independent of the target size, allowing it to be applied to large practical problems.