• Aishwarya Pothula

Learning Social Networks from Web Documents Using Support Vector Classifiers

Extracting pairwise relations between individuals is essential for automatically generating a social network. The goal of this paper is to learn a social network from partial relationship data. Assuming that only a small subset of the relationships is known, the paper recasts social network learning as a text classification problem. Each pairwise relationship is modelled by merging the document vectors of the two individuals, and the known relations serve as class labels in the training data. A text classifier such as an SVM is then trained on these labelled pairs and used to predict the unknown relations.
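
As a rough illustration of the pair-modelling step, the sketch below builds one row per pair of actors and separates the labelled pairs from the ones still to be predicted. The merge operation (an element-wise sum), the function names and the toy vectors are all assumptions made for this example; the summary does not say which merge the paper uses.

```python
from itertools import combinations


def merge_vectors(u, v):
    """Merge two actor vectors. Element-wise sum is an assumed choice."""
    return [a + b for a, b in zip(u, v)]


def build_pair_dataset(actor_vectors, known_relations):
    """Split all actor pairs into labelled training rows and unlabelled rows.

    known_relations maps an alphabetically ordered (actor, actor) tuple
    to +1 (tie) or -1 (no tie). Pairs missing from it are the unknowns.
    """
    train_X, train_y, unknown = [], [], []
    for a, b in combinations(sorted(actor_vectors), 2):
        pair_vector = merge_vectors(actor_vectors[a], actor_vectors[b])
        if (a, b) in known_relations:
            train_X.append(pair_vector)
            train_y.append(known_relations[(a, b)])
        else:
            unknown.append(((a, b), pair_vector))
    return train_X, train_y, unknown


# Hypothetical toy data: three actors, two known relations.
actor_vectors = {"alice": [1, 2, 0, 0], "bob": [1, 0, 1, 1], "carol": [0, 1, 1, 1]}
known_relations = {("alice", "bob"): 1, ("alice", "carol"): -1}
train_X, train_y, unknown = build_pair_dataset(actor_vectors, known_relations)
```

Here the pair (bob, carol) has no label, so it ends up in `unknown` and would be passed to the trained classifier for prediction.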

Social networks are represented either by graphs or by adjacency matrices, also known as sociomatrices. Generating the social network involves three steps: first, modelling the actors in the network; second, modelling the relations between the actors; and third, training a classifier to learn the network. Each actor is represented by web documents such as a home page, blog or CV. All documents associated with an individual are merged to build a single document vector, so each actor corresponds to exactly one vector. The actor-document corpus is represented by a matrix called the Actor-Term matrix. Its dimensionality is reduced using techniques such as stemming, stop word removal and removal of words whose document frequency falls below a certain threshold.
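
To make the actor-modelling step concrete, here is a minimal sketch of building an actor-term matrix from raw text. The stop word list, the `min_df` parameter name, the tokenizer and the sample documents are assumptions for illustration, and stemming is left out to keep the example short.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "on", "for", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def build_actor_term_matrix(actor_docs, min_df=2):
    """Merge each actor's documents and count terms.

    Terms that occur in fewer than min_df actor documents are removed.
    Returns (actors, vocabulary, matrix) with one row per actor.
    """
    counts = {actor: Counter(tokenize(" ".join(docs)))
              for actor, docs in actor_docs.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())
    vocab = sorted(t for t, n in doc_freq.items() if n >= min_df)
    actors = sorted(counts)
    matrix = [[counts[a][t] for t in vocab] for a in actors]
    return actors, vocab, matrix


# Hypothetical documents for three actors.
actor_docs = {
    "alice": ["Alice works on machine learning", "CV: machine learning and graphs"],
    "bob": ["Bob studies graphs and social networks"],
    "carol": ["Carol likes learning about social networks"],
}
actors, vocab, matrix = build_actor_term_matrix(actor_docs)
```

Each row of `matrix` is the single document vector for one actor, restricted to the terms that survive the document frequency threshold.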

Sparsity is a metric of social networks defined as S = 1 - 2r / (n(n - 1)), where ‘r’ is the number of relations and ‘n’ is the number of actors. Since n(n - 1)/2 is the number of possible pairs of actors, S is the fraction of possible relations that are absent. If the sparsity is very high, the network breaks up into isolated actors and subsets. In contrast, if the sparsity is very low, almost every actor is related to every other and the network offers little benefit. The inherent sparsity of social networks is linked to class imbalance in the training data: in a sparse network, pairs with a tie are far fewer than pairs without one. This imbalance is handled by up-sampling the minority class and down-sampling the majority class.
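
The sparsity formula and the resampling idea can be sketched as follows. The resampling scheme (random sampling to a fixed number of rows per class), the function names and the `per_class` and `seed` parameters are assumptions for this example; the summary does not describe the exact procedure used in the paper.

```python
import random


def sparsity(num_relations, num_actors):
    """S = 1 - 2r / (n(n - 1)) for an undirected network with n >= 2 actors."""
    possible_pairs = num_actors * (num_actors - 1) / 2
    return 1 - num_relations / possible_pairs


def rebalance(X, y, per_class, seed=0):
    """Return a dataset with exactly per_class rows for every label.

    Larger classes are down-sampled without replacement; smaller classes
    are up-sampled by repeating randomly chosen rows.
    """
    rng = random.Random(seed)
    out_X, out_y = [], []
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        if len(rows) >= per_class:
            chosen = rng.sample(rows, per_class)
        else:
            extra = [rng.choice(rows) for _ in range(per_class - len(rows))]
            chosen = rows + extra
        out_X.extend(chosen)
        out_y.extend([label] * per_class)
    return out_X, out_y


# Hypothetical example: 5 actors with 2 relations, so 8 of 10 pairs have no tie.
s = sparsity(2, 5)
X = [[i] for i in range(10)]
y = [1, 1] + [-1] * 8
balanced_X, balanced_y = rebalance(X, y, per_class=4)
```

With 5 actors and 2 relations there are 10 possible pairs, so the sparsity is 0.8 and the positive class makes up only 2 of 10 training rows before rebalancing.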

Learning a social network is framed as a binary classification problem with two classes, positive and negative. The positive class indicates a tie and the negative class indicates no tie or a broken tie. The paper uses an SVM with a linear kernel for this classification.
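
For readers who want to see what a linear SVM does on such data, below is a minimal sketch that trains one by sub-gradient descent on the regularised hinge loss. This is a simple stand-in and not the solver used in the paper; the learning rate, regularisation strength, epoch count and the two-dimensional toy vectors are all assumed values.

```python
def decision_value(w, b, x):
    """Linear decision function w . x + b."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lr=0.1, lam=0.01, epochs=50):
    """Sub-gradient descent on hinge loss with L2 regularisation.

    Labels must be +1 (tie) or -1 (no tie).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * decision_value(w, b, xi) < 1:
                # Inside the margin or misclassified: move towards the sample.
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:
                # Correct with enough margin: apply only the weight decay.
                w = [wj - lr * lam * wj for wj in w]
    return w, b


def predict(w, b, x):
    """Return +1 for a predicted tie, -1 otherwise."""
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical two-dimensional pair vectors standing in for merged term vectors.
X = [[3, 0], [4, 1], [0, 3], [1, 4]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [predict(w, b, x) for x in X]
```

In practice one would more likely use an established SVM library; the loop above is only meant to show the role of the margin and the two class labels.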

The model's performance is evaluated with precision, recall and F-measure. Precision is the proportion of predicted positives that are actually positive. Recall is the proportion of actual positives that are correctly identified as positive. The F-measure is a weighted harmonic mean of precision and recall, combining the two into a single score.
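
The three measures can be computed directly from the true and predicted labels, as in the sketch below. The function name, the `beta` weighting parameter (with `beta=1` giving the usual balanced F1) and the sample labels are assumptions for illustration.

```python
def precision_recall_f(y_true, y_pred, beta=1.0, positive=1):
    """Return (precision, recall, F-beta) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta ** 2 * precision + recall
    f_score = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Hypothetical labels: 3 true ties, 4 predicted ties, 2 of them correct.
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, 1, -1, -1, -1]
precision, recall, f1 = precision_recall_f(y_true, y_pred)
```

In this example two of the four predicted ties are real (precision 1/2) and two of the three real ties are found (recall 2/3), so the balanced F-measure is 4/7.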

Experiments are conducted on a real Friend-of-a-Friend (FOAF) database containing 210,611 RDF triples. Two types of information are queried from the dataset: web resource addresses and URLs related to the individuals, and relations between individuals. The relations form the true social network, or ground truth, and are used both as training data and for evaluation. The web resource information is used to construct actor-term matrices and to model actor-actor relationships by merging document vectors. The missing relations are then predicted by the proposed SVM-based model, and the resulting network is regarded as a predictive social network.
