- Aishwarya Pothula

# Self-Taught Learning: Transfer Learning from Unlabelled Data

The paper aims to show that unlabelled data can be used to improve supervised classification. The unlabelled data is not assumed to share the class labels or the generative distribution of the labelled data.

Obtaining labelled data for supervised machine learning is a difficult and expensive process. In this context, using unlabelled data for supervised learning holds promise in terms of expanding the applicability of learning methods.

The approach learns compact, higher-level features from the unlabelled data, which can then be used for the supervised classification task. The approach applies to many modalities, such as images, audio, and text; however, the labelled and unlabelled data used for self-taught learning are expected to be of the same modality.

The approach is summarized as the following sparse-coding optimization problem:

$$
\min_{b,\,a} \; \sum_{i} \left\| x^{(i)} - \sum_{j} a^{(i)}_j b_j \right\|_2^2 \;+\; \beta \sum_{i} \left\| a^{(i)} \right\|_1 \quad \text{subject to } \| b_j \|_2 \le 1
$$

The optimization variables are the activations $a = \{a^{(1)}, \ldots, a^{(k)}\}$ and the basis vectors $b = \{b_1, \ldots, b_s\}$: each $b_j$ is a basis vector, and $a^{(i)}$ is the activation vector that reconstructs input $x^{(i)}$ from the basis. The number of basis vectors $s$ can exceed the input dimensionality.
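As a minimal sketch of this optimization, scikit-learn's `DictionaryLearning` can stand in for the paper's sparse-coding solver; the data, dimensions, and hyperparameters below are illustrative, not from the paper.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X_unlabeled = rng.standard_normal((200, 16))  # 200 unlabelled inputs, 16 dims

# Learn s = 32 basis vectors (overcomplete: s > 16), with an L1 penalty
# encouraging sparse activations, mirroring the objective above.
dl = DictionaryLearning(n_components=32, alpha=1.0,
                        transform_algorithm="lasso_lars",
                        max_iter=20, random_state=0)
A = dl.fit_transform(X_unlabeled)  # activations a^(i), shape (200, 32)
B = dl.components_                 # basis vectors b_j,  shape (32, 16)
```

Each row of `A` is the sparse activation vector for one input; only a few of the 32 bases are active per input.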

Self-taught learning first obtains the basis vectors from the unlabelled data, then computes features for the labelled inputs by solving the optimization problem with the bases fixed, yielding a new labelled training set. Next, a classifier C is learnt by applying a supervised learning algorithm, such as an SVM, to this new training set. The output of the approach is the learned classifier, which can then be used to classify new inputs represented in the same feature space.
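The pipeline above can be sketched end to end; this is an illustrative implementation assuming synthetic data and scikit-learn's `DictionaryLearning` and `LinearSVC` as stand-ins for the paper's sparse-coding solver and SVM.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)

# Hypothetical data: an unlabelled pool plus a small labelled set of the
# same modality (here, 16-dimensional vectors with two classes).
X_unlabeled = rng.standard_normal((300, 16))
X_train = rng.standard_normal((40, 16))
y_train = (X_train[:, 0] > 0).astype(int)

# Step 1: learn basis vectors from the unlabelled data.
dl = DictionaryLearning(n_components=32, alpha=1.0,
                        transform_algorithm="lasso_lars",
                        max_iter=15, random_state=1).fit(X_unlabeled)

# Step 2: re-represent the labelled inputs as sparse activations.
A_train = dl.transform(X_train)

# Step 3: learn classifier C on the new labelled training set.
clf = LinearSVC(max_iter=5000).fit(A_train, y_train)

# New inputs are classified after the same feature transformation.
pred = clf.predict(dl.transform(X_train[:5]))
```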

Any algorithm designed for self-taught learning must be able to detect and model high-level features at some level of abstraction. Many unsupervised learning algorithms, such as PCA, have been designed to do the same. However, PCA has two main limitations compared to sparse coding as a method for self-taught learning. First, PCA only extracts features that are a linear function of the input. Second, since PCA requires the basis vectors to be orthogonal, the number of PCA features cannot exceed the input dimensionality. Sparse coding has neither restriction: by learning more basis vectors than the input has dimensions but activating only a few of them for any particular input, it yields a higher-level representation of the input.

Across the several self-taught learning tasks evaluated, sparse coding outperforms both PCA and raw features on most of the domains compared. As more labelled data becomes available, the performance of sparse coding applied to the labelled data becomes comparable to that of sparse coding on unlabelled data.

A fundamental problem encountered in supervised learning is defining a similarity function between input instances. In this paper, the Fisher kernel is used to measure the similarity between new inputs.
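The Fisher kernel compares inputs through the gradient of a generative model's log-likelihood, $K(x, y) = U_x^\top F^{-1} U_y$. A minimal sketch for a univariate Gaussian model (not the paper's sparse-coding model; `mu` and `sigma` are assumed fitted values):

```python
import numpy as np

mu, sigma = 0.0, 1.0  # assumed fitted parameters of p(x | mu, sigma^2)

def fisher_score(x):
    # U_x: gradient of log p(x | mu, sigma^2) w.r.t. (mu, sigma).
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return np.array([d_mu, d_sigma])

# Fisher information of N(mu, sigma^2) w.r.t. (mu, sigma).
F_inv = np.linalg.inv(np.diag([1 / sigma**2, 2 / sigma**2]))

def fisher_kernel(x, y):
    # K(x, y) = U_x^T F^{-1} U_y: similarity in the model's gradient space.
    return fisher_score(x) @ F_inv @ fisher_score(y)
```

Two inputs are similar under this kernel when they pull the model's parameters in the same direction.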