- Aishwarya Pothula

# Collaborative Learning for Deep Neural Networks

Training deep neural networks means solving non-convex optimization problems, and the local gradient methods in common use do not guarantee convergence to a global minimum. A widely accepted remedy is ensemble learning: training multiple instances of a classification model with different seed values, which is known to give better predictions than a single trained instance. However, this method is computationally expensive at inference time. Previous work on the non-convex optimization problem includes auxiliary training, multi-task learning, knowledge distillation and a few of its variations, and general label smoothing. However, these approaches also suffer from expensive computational requirements.

The goal of this paper is to perform collaborative learning for deep neural networks. The proposed collaborative learning trains multiple classifier heads of the same network simultaneously on the same training data. This is expected to improve generalization and robustness to label noise at no extra inference cost.

The proposed model for collaborative learning benefits mainly from two mechanisms. First, training multiple classifier heads on the same training data provides supplementary and regularization information to each classifier. Second, *intermediate-level representation* (ILR) sharing reduces computational complexity, and backpropagation rescaling improves performance by aggregating the gradients from all classifier heads in a balanced way. Additionally, ILR sharing reduces the number of ‘dead’ filter weights that vanishing gradients cause in the bottom layers of the network, thereby enlarging the network capacity.
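
The balanced gradient aggregation can be illustrated with a minimal sketch. It assumes that the gradients arriving at the shared layers from the heads are summed and then scaled by 1/H, where H is the number of heads. The function name `aggregate_head_gradients` and the 1/H factor are illustrative assumptions, not details taken from the summary above.

```python
def aggregate_head_gradients(head_gradients, rescale=True):
    """Combine the per-head gradients that flow into the shared layers.

    head_gradients: list of equal-length lists, one gradient vector per head.
    rescale: if True, divide the summed gradient by the number of heads
             (assumed rescaling factor; hypothetical helper for illustration).
    """
    num_heads = len(head_gradients)
    if num_heads == 0:
        raise ValueError("need at least one classifier head")
    dim = len(head_gradients[0])
    # Without rescaling, the shared layers simply receive the sum of all gradients.
    summed = [sum(g[i] for g in head_gradients) for i in range(dim)]
    if not rescale:
        return summed
    # With rescaling, the magnitude stays comparable to a single-head network.
    return [s / num_heads for s in summed]


# Three heads that happen to send the same gradient to the shared layers.
single_head_gradient = [0.5, -1.0, 0.25]
heads = [single_head_gradient] * 3

print(aggregate_head_gradients(heads, rescale=False))  # grows with the number of heads
print(aggregate_head_gradients(heads))                 # same scale as a single head
```

Under this assumption the shared layers see a gradient of roughly the same scale as in single-head training, which is why a learning rate tuned for individual learning could plausibly be reused.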

Collaborative learning works in three steps: 1) generating a population of classifier heads in the training graph, 2) formulating a learning objective, and 3) optimizing the group of classifiers collaboratively. As in auxiliary training, a set of new classifier heads is added to the original network graph at training time. Unlike auxiliary training, however, each classifier head has a graph structure identical to the original one, so no additional networks need to be designed for auxiliary classifiers. The structural symmetry of the heads also keeps the aggregated backpropagation information well balanced without additional weights on the individual loss functions. The learning objective is designed so that each classifier head learns not only from the ground-truth labels but also from the whole population during training. For multi-class classification, the objective function includes a measure of the distance between the average prediction of the population and the prediction of each classifier head. The optimization keeps hyperparameters such as the type of SGD, the learning rate and the regularization the same as in individual learning, so that collaborative learning can simply be applied on top of individual learning without an unnecessary hyperparameter search in practice. This is achieved through backpropagation rescaling.
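
The learning objective described above can be sketched for a single training example as follows. This is a minimal illustration under stated assumptions: cross-entropy is used as the distance, the consensus is the average of the other heads' temperature-softened predictions, and a weight `beta` balances the ground-truth term against the consensus term. The names `collaborative_loss`, `beta` and `temperature` are hypothetical and not taken from the summary.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


def collaborative_loss(head_logits, label, beta=0.5, temperature=2.0):
    """Return one loss per classifier head for a single example.

    head_logits: list of logit vectors, one per head (at least two heads).
    label: index of the ground-truth class.
    beta, temperature: assumed hyperparameters, chosen for illustration only.
    """
    num_heads = len(head_logits)
    if num_heads < 2:
        raise ValueError("collaborative learning needs at least two heads")
    num_classes = len(head_logits[0])
    one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
    soft_preds = [softmax(z, temperature) for z in head_logits]

    losses = []
    for h in range(num_heads):
        # Consensus target: average softened prediction of the other heads.
        others = [soft_preds[j] for j in range(num_heads) if j != h]
        consensus = [sum(p[c] for p in others) / len(others)
                     for c in range(num_classes)]
        hard = cross_entropy(one_hot, softmax(head_logits[h]))  # ground-truth term
        soft = cross_entropy(consensus, soft_preds[h])          # population term
        losses.append(beta * hard + (1.0 - beta) * soft)
    return losses


# Two heads agree on class 0; the third head disagrees and is penalized more.
print(collaborative_loss([[3.0, 0.0], [3.0, 0.0], [0.0, 3.0]], label=0))
```

In a real training graph the consensus target would normally be treated as a constant when differentiating each head's loss; this sketch only computes loss values and leaves gradient handling to the framework in use.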

Finally, when collaborative learning is applied to the CIFAR-10, CIFAR-100 and ImageNet (ILSVRC 2012) datasets and compared with individual learning, it performs much better, reducing the generalization error while keeping the computational complexity close to that of individual learning.