• Aishwarya Pothula

SVM-RFE Peak Selection for Cancer Classification with Mass Spectrometry Data

The paper applies SVM-RFE to peak selection on mass spectrometry data for cancer classification. Mass spectrometry records, for ion mixtures in the gas phase under vacuum, the mass-to-charge (m/z) ratio and the number of ions present at each m/z value. The peaks selected from the data are used as input variables to the classifier, which in this paper is a linear SVM. SVM-RFE peak selection is compared with T-statistics peak selection, and the quality of the selected peaks is judged by how well the classifier performs when only those peaks are used as input variables. SVM-RFE is also compared with an SVM without peak selection to determine how much peak selection matters.

Protein samples from cancer patients and non-cancer patients are analysed with mass spectrometry instruments, and the resulting patterns are used to build a diagnostic classifier. However, the data needs some pre-processing, such as peak selection and normalization, before it can be used for classification.

When the number of input variables far exceeds the number of input instances, as in the datasets used for this paper, there is a high chance that spurious correlations appear between the data and the phenotypes. To guard against such chance correlations, feature selection is important. Here, that means selecting a good subset of peaks to use as input to the classifier.

SVM-RFE is a linear SVM that recursively eliminates features. It was first proposed for selecting a subset of genes for cancer classification. This paper explores how it can be used for peak selection in mass spectrometry data. SVM-RFE starts with all the peaks in the spectrometry data as input to the classifier. Training the classifier on this data yields a weight vector, and the weight associated with each input determines its rank. After training, the feature with the smallest weight is eliminated, and the procedure is repeated on the remaining features. Eliminating the lowest-ranked feature corresponds to removing the input variable that least affects the objective function of the SVM.
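The elimination loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: `train_linear_svm` is a hypothetical stand-in trainer (plain subgradient descent on the usual soft-margin objective), the ranking uses the squared weight as in the standard SVM-RFE formulation, and the toy data is made up so that only the first feature separates the two classes.

```python
def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Stand-in trainer (assumption): full-batch subgradient descent on
    0.5 * ||w||^2 + C * sum(hinge loss). A real study would use a proper solver."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the 0.5 * ||w||^2 term
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # this sample violates the margin
                for j, xj in enumerate(xi):
                    grad_w[j] -= C * yi * xj
                grad_b -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, n_keep=1):
    """Return (kept feature indices, eliminated indices in removal order)."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > n_keep:
        # Retrain on the features that are still in play.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Ranking criterion (standard SVM-RFE assumption): squared weight.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return remaining, eliminated


# Toy data (made up): feature 0 carries the class signal, the others are noise.
X_toy = [[2.0, 0.1, -0.1], [3.0, -0.1, 0.1], [-2.0, 0.1, 0.1], [-3.0, -0.1, -0.1]]
y_toy = [1, 1, -1, -1]
kept, removed = svm_rfe(X_toy, y_toy, n_keep=1)
print("kept:", kept, "removed (first to last):", removed)
```

The important design point is that the classifier is retrained after every elimination, so each feature's weight is always judged relative to the features that are still present.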

If the number of input variables is large, more than one feature can be eliminated at a time. With r the number of features eliminated per iteration and m the total number of remaining inputs, the schedule is r = 1000 if m > 100000, r = 100 if 10000 < m ≤ 100000, r = 10 if 1000 < m ≤ 10000, and r = 1 if m ≤ 1000.
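As a small illustration, the schedule above maps directly onto a lookup function. The function names and the `count_rounds` helper are hypothetical, added here only to show how the step size shrinks as features are removed.

```python
def elimination_step(m):
    """Number of features r to remove when m features remain (schedule from the text)."""
    if m > 100_000:
        return 1000
    if m > 10_000:
        return 100
    if m > 1_000:
        return 10
    return 1


def count_rounds(m, n_keep=1):
    """Hypothetical helper: how many training rounds until n_keep features remain."""
    rounds = 0
    while m > n_keep:
        # Never remove more features than are left above the target (an added safeguard).
        m -= min(elimination_step(m), m - n_keep)
        rounds += 1
    return rounds


for m in (150_000, 15_000, 1_500, 500):
    print(m, "features -> remove", elimination_step(m), "per round")
```

The coarse steps keep the number of retraining rounds manageable while there are many peaks, and the single-feature steps give a fine-grained ranking once 1000 or fewer remain.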

The performance of SVM-RFE is compared with that of T-statistics peak selection on lung cancer and ovarian cancer mass spectrometry data. The results show that SVM-RFE performed much better than both SVM without peak selection and T-statistics peak selection. They also show that peak selection improves not only the performance of the classification algorithm but also its prediction accuracy.
