- Aishwarya Pothula

# Group Sparse Coding

Text, image, and video processing often use bag-of-words representations. A bag-of-words representation encodes a document as a vector of occurrence counts of each descriptor, or word, in the document. Building dictionaries for visual data is harder than for text: no simple mapping exists from a document to descriptor counts. Visual descriptors such as color, texture, angles, and shapes must be mined from the data, and they must be measured at appropriate locations such as grid points, special interest points, and multiple scales.
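As a toy illustration of the text case, a bag-of-words vector is just the count of each vocabulary word in a document (the vocabulary and document below are made up for illustration):

```python
from collections import Counter

# Toy vocabulary and document; a bag-of-words representation
# counts how often each vocabulary word occurs in the document.
vocabulary = ["cat", "dog", "fish"]
document = "cat dog cat cat fish".split()

counts = Counter(document)
bow = [counts[w] for w in vocabulary]
print(bow)  # [3, 1, 1]
```

For visual data there is no such direct lookup: the "words" themselves must first be constructed from raw descriptors, which is what the dictionary-building methods below address.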

Traditionally, unsupervised vector quantization (VQ) techniques such as k-means clustering were used to build dictionaries for visual data. VQ techniques represent an image by a vector indexed by codewords (dictionary elements): the entry for a codeword d counts the number of descriptors in the image that are closest to d. Because each descriptor activates exactly one codeword, VQ representations are maximally sparse per descriptor occurrence; however, there is no guarantee that they are sparse for the whole image, and they are not robust to descriptor variability. Another alternative for building visual dictionaries is L1-regularized optimization, in which each visual descriptor is represented as a weighted sum of dictionary elements. This method, too, does not guarantee a sparse encoding of the whole image.
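A minimal sketch of hard vector quantization, assuming a fixed set of codewords (e.g. k-means centroids) and synthetic descriptors: each descriptor is assigned to its nearest codeword, and the image is summarized by the histogram of assignments.

```python
import numpy as np

# Hypothetical toy setup: three 2-D codewords (as if from k-means)
# and five local descriptors drawn near the second codeword.
rng = np.random.default_rng(0)
codewords = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
descriptors = rng.normal(loc=[1.0, 1.0], scale=0.1, size=(5, 2))

# Hard VQ: each descriptor activates exactly one codeword (maximally
# sparse per descriptor), but the summed histogram over the image
# need not be sparse over the dictionary.
dists = np.linalg.norm(descriptors[:, None, :] - codewords[None, :, :], axis=2)
assignments = dists.argmin(axis=1)
histogram = np.bincount(assignments, minlength=len(codewords))
print(histogram)
```

Note that a small perturbation of a descriptor near a codeword boundary flips its assignment entirely, which is the robustness problem mentioned above.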

To achieve sparsity both at the level of individual descriptors and for the whole image, the paper proposes mixed-norm regularizers, which can be applied to both image encoding and dictionary learning. Image encoding is expressed as a convex optimization problem, and regularizing the dictionary directly yields a smaller dictionary. Mixed-norm regularizers take into account the structure of the bags of visual descriptors present in images, as well as sets of images from a given category.

The main goal is to encode groups of image instances in terms of dictionary codewords. The first subgoal, given a dictionary D and a group of instances G, is to minimize a tradeoff between the compactness of the encoding and the reconstruction error; this tradeoff is controlled by a parameter λ. The second subgoal is to estimate a good dictionary D given a set of training groups. To encode a group G with a fixed dictionary D, the following convex optimization problem is solved:

$$\min_{A \ge 0} \; \frac{1}{2}\sum_{x \in G}\Big\|x - \sum_{j=1}^{|D|} \alpha_j^{x}\, d_j\Big\|_2^2 \;+\; \lambda \sum_{j=1}^{|D|} \|\alpha_j\|_2$$

where A contains the set of non-negative vectors representing the contribution of d_j per instance: α_j^x is the weight of codeword d_j for instance x, and α_j collects those weights across all instances in G.
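This encoding objective can be sketched as follows; the matrix shapes and the helper name `group_encoding_objective` are illustrative choices, not from the paper. Rows of A collect one codeword's weights across the group, so the mixed norm (ℓ2 over instances, summed over codewords) is a sum of row norms.

```python
import numpy as np

def group_encoding_objective(X, D, A, lam):
    """Sketch of the group encoding objective:
    0.5 * sum_x ||x - sum_j A[j, x] * d_j||^2 + lam * sum_j ||A[j, :]||_2.
    X: (dim, n) instances of one group; D: (dim, k) codewords;
    A: (k, n) non-negative reconstruction weights."""
    recon = 0.5 * np.sum((X - D @ A) ** 2)     # reconstruction error
    mixed = np.sum(np.linalg.norm(A, axis=1))  # l2 across instances, summed over codewords
    return recon + lam * mixed

# Toy check: perfect reconstruction, two active codewords.
X = np.eye(2)
D = np.eye(2)
A = np.eye(2)
obj = group_encoding_objective(X, D, A, lam=0.5)
print(obj)  # 0 reconstruction error + 0.5 * (1 + 1) = 1.0
```

Because an entire row of A must shrink together for its norm to vanish, the penalty encourages whole codewords to be dropped for the group, which is what produces image-level sparsity.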

The dictionary is learned by alternating between i) finding reconstruction weights that minimize the sum of the encoding objectives over all groups, and ii) finding a dictionary that minimizes the tradeoff between the sum of the group encoding objectives and the mixed norm of the dictionary.
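The alternating scheme might be sketched with simple projected-gradient steps on synthetic data. This is a simplification: the paper's actual solver and the dictionary regularizer are omitted here, and only the reconstruction term is minimized.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 6))          # one group: six 4-dim instances
D = rng.normal(size=(4, 3))          # dictionary of three codewords
A = np.abs(rng.normal(size=(3, 6)))  # non-negative reconstruction weights

err0 = 0.5 * np.sum((X - D @ A) ** 2)  # initial reconstruction error

step = 0.01
for _ in range(500):
    # i) weight step: gradient of 0.5*||X - DA||^2 w.r.t. A, projected onto A >= 0
    A = np.maximum(A - step * (D.T @ (D @ A - X)), 0.0)
    # ii) dictionary step: gradient of the same objective w.r.t. D
    D -= step * ((D @ A - X) @ A.T)

err = 0.5 * np.sum((X - D @ A) ** 2)
print(err0, err)  # the reconstruction error decreases across iterations
```

Adding the mixed-norm penalties on A and D to each step would recover the sparsity-inducing behavior described above.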

When the proposed model was applied to the PASCAL VOC dataset, mixed-norm regularizers performed better overall, especially under resource constraints such as time and space. They may be particularly helpful when a tradeoff between raw performance and resource usage is needed, as in many real-world applications.