SVM-Tutorial using R (e1071-package)


by Hannes Planatscher and Janko Dietzsch.

R is a nice tool, but like any other programming language it has its pitfalls. The main goal of this tutorial is to play around with support vector machines, so do not hesitate to ask whenever you have questions about R or get stuck working through the different tasks.

0) Installation/Import of the package

R provides easy-to-use package management. See below.

You are a *nix user with root rights, or you are using Windows®:

install.packages('e1071')
library('e1071')

You do not have root rights on your *nix box, but you have rw-rights in a directory (~/...) somewhere on the system:

install.packages('e1071', lib='path')
library('e1071', lib.loc='path')

1) Task: linearly separable data

Prepare the training set (data, class).

data1 <- seq(1,10,by=2)

# svm() expects a factor (or integer) response in classification mode
classes1 <- factor(c('a','a','a','b','b'))


These test data will be used to validate the model.

test1 <- seq(1,10,by=2) + 1

Look in the manual and try to find out more about svm(). Which parameters seem to be the most important to you? Build the model.

model1 <- svm(data1,classes1,type='C',kernel='linear')

View the model.

print(model1)
summary(model1)

Check if the model is able to classify your *training* data.

predict(model1,data1)
pred <- fitted(model1)
table(pred, classes1)

Validate the model on the test data. Does the prediction meet your expectations?

predict(model1,test1)

Try to find out where svm() placed the class border (hint: use predict() to do that).
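
One possible way to follow the hint is sketched below: predict on a fine grid of input values and look for the position where the predicted label changes. The grid step of 0.01 and the helper names (grid, label, flip, border) are arbitrary choices made for this illustration, and the classes are wrapped in factor() as an assumption about what svm() accepts.

```r
library(e1071)

# Same toy data as in this task; classes as a factor.
data1    <- seq(1, 10, by = 2)
classes1 <- factor(c('a', 'a', 'a', 'b', 'b'))
model1   <- svm(data1, classes1, type = 'C', kernel = 'linear')

# Probe the input range on a fine grid (the step size is an arbitrary choice).
grid  <- seq(1, 9, by = 0.01)
label <- as.character(predict(model1, grid))

# Positions where the predicted label differs from the next grid point.
flip   <- which(label[-1] != label[-length(label)])

# Midpoint between the two neighbouring grid values around each label change.
border <- (grid[flip] + grid[flip + 1]) / 2
print(border)
```

A finer grid gives a more precise estimate of the border, at the cost of more calls to the model.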

2) Task: non-linearly separable data

Prepare the training set (data, class).

data2 <- seq(1,10)

# svm() expects a factor (or integer) response in classification mode
classes2 <- factor(c('b','b','b','a','a','a','a','b','b','b'))

Build the model with a linear kernel.

model2 <- svm(data2,classes2,type='C',kernel='linear')

Check if the model is able to classify your training data. If not: Why not?

predict(model2,data2)
table(predict(model2,data2), classes2)

Try out the non-linear kernels (hint: ?svm) and check their performance. Does the choice of kernel matter?
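
A compact way to compare kernels is to fit one model per kernel and record the fraction of training points it classifies correctly. The following is only a sketch: it uses the default hyperparameters of each kernel, and the names kernels and train_acc are introduced here for illustration.

```r
library(e1071)

# Same toy data as in this task; classes as a factor.
data2    <- seq(1, 10)
classes2 <- factor(c('b','b','b','a','a','a','a','b','b','b'))

# The four kernel names accepted by svm().
kernels <- c('linear', 'polynomial', 'radial', 'sigmoid')

# Fit one model per kernel and compute its accuracy on the training data.
train_acc <- sapply(kernels, function(k) {
  m <- svm(data2, classes2, type = 'C', kernel = k)
  mean(predict(m, data2) == classes2)
})
print(train_acc)
```

Keep in mind that training accuracy only tells you whether a kernel can fit the data at all; it says nothing about how well the model generalizes.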

3) Task: Example from the manual (iris-data)

Look in the manual and run the example on the classification of the iris data set. Is it possible to get more information about a model and its *attributes*?
Get as much information as you can about the model!
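
As a starting point, a fitted svm object is a list whose components can be inspected directly. The sketch below fits the iris model with default settings (an assumption; the manual example may use other options) and looks at a few components.

```r
library(e1071)

data(iris)
model <- svm(Species ~ ., data = iris)

names(model)        # all components stored in the fitted object
model$levels        # class labels the model knows about
model$nSV           # number of support vectors per class
model$tot.nSV       # total number of support vectors
dim(model$SV)       # the support vectors themselves (scaled), one per row
head(model$index)   # row numbers of the support vectors in the training data
model$rho           # constants of the decision functions
```

str(model) prints the same information in one go, which can be handy for exploring.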

4) Task: Breast-cancer data

In this task you should build a model that can discriminate between benign ('gutartig') and malignant ('boesartig') tumors using morphological data. Your database consists of 699 correctly classified observations.

Download the data (http://www.potschi.de/svmtut/breast-cancer-wisconsin.data), and save it in the directory from which you started R.

Read the data from the CSV-file.

bcdata <- read.csv('breast-cancer-wisconsin.data',head=TRUE)

Take a look at the column titles. Which feature does *not* contain information useful for classification?

names(bcdata)

Separate the data from the classification. The subset() function lets you select or drop columns.

databcall <- subset(bcdata,select=c(-Samplecodenumber,-Class))
classesbcall <- subset(bcdata,select=Class)
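
To see what the two subset() calls do, here is a small sketch on a stand-in data frame. Only the column names Samplecodenumber and Class come from this tutorial; the other columns and all values are made up for illustration.

```r
# Stand-in data frame; Thickness, Cellsize and all values are invented.
toy <- data.frame(Samplecodenumber = c(101, 102, 103),
                  Thickness        = c(5, 3, 8),
                  Cellsize         = c(1, 1, 7),
                  Class            = c('benign', 'benign', 'malignant'))

# A minus sign drops a column, a bare name keeps it.
features <- subset(toy, select = c(-Samplecodenumber, -Class))
labels   <- subset(toy, select = Class)

names(features)      # the remaining feature columns
class(labels)        # still a data.frame with one column
class(labels[1:2, ]) # row-indexing a one-column data.frame yields a plain vector
```

The last line is why indexing the class column by rows, as done in the next steps, gives a vector rather than a data frame.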

Take a subset of your data to use as training data.

databctrain <- databcall[1:400,]
classesbctrain <- classesbcall[1:400,]

Take a subset of your data to use as test data.

databctest <- databcall[401:699,]
classesbctest <- classesbcall[401:699,]
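
Taking the first 400 rows for training is fine if the rows are in no particular order. If you are not sure about that, a random split is a common alternative. The sketch below only builds the row indices; the seed and the names train_idx and test_idx are arbitrary choices for this illustration.

```r
set.seed(1)          # arbitrary seed, only to make the split reproducible

n       <- 699       # number of observations
n_train <- 400       # size of the training set

# Draw 400 distinct row numbers for training; the rest is used for testing.
train_idx <- sample(n, n_train)
test_idx  <- setdiff(seq_len(n), train_idx)

# With the objects from this task, the split could then be written as:
#   databctrain    <- databcall[train_idx, ]
#   classesbctrain <- classesbcall[train_idx, ]
#   databctest     <- databcall[test_idx, ]
#   classesbctest  <- classesbcall[test_idx, ]
```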

Build the model.

model <- svm(databctrain, classesbctrain)
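
Note that svm() chooses its mode from the type of the class vector: a factor leads to classification, a numeric vector to regression. The sketch below shows the difference on made-up data; the numeric codes 2 and 4 are hypothetical and say nothing about how the classes are coded in your file. If your class column is not a factor, wrapping it in factor() before training is one way to get a classifier.

```r
library(e1071)

set.seed(1)
x     <- matrix(rnorm(40), ncol = 2)   # 20 made-up observations, 2 features
y_num <- rep(c(2, 4), each = 10)       # hypothetical numeric class codes

m_num <- svm(x, y_num)                 # numeric response: regression
m_fac <- svm(x, factor(y_num))         # factor response: classification

summary(m_num)                         # SVM-Type: eps-regression
summary(m_fac)                         # SVM-Type: C-classification

is.factor(predict(m_fac, x))           # predictions are class labels
is.numeric(predict(m_num, x))          # predictions are real numbers
```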

Validate the model. Does it work?

pred <- predict(model, databctest)
table(pred,t(classesbctest))
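
The table above is a confusion matrix: the diagonal counts the correct predictions, everything else is an error. A single accuracy figure can be derived from it as sketched below, using made-up labels (truth_toy and pred_toy are names invented for this example).

```r
# Made-up true and predicted labels for six observations.
truth_toy <- factor(c('a', 'a', 'b', 'b', 'b', 'a'))
pred_toy  <- factor(c('a', 'b', 'b', 'b', 'a', 'a'), levels = levels(truth_toy))

# Rows: predicted class, columns: true class.
tab <- table(pred_toy, truth_toy)
print(tab)

# Correct predictions sit on the diagonal of the table.
accuracy   <- sum(diag(tab)) / sum(tab)
error_rate <- 1 - accuracy
print(accuracy)
```

This only works if both vectors share the same set of levels in the same order, otherwise the diagonal does not line up.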

Improving the performance of your models

Some kernels have so-called hyperparameters. These parameters affect the performance of the kernel and, ultimately, your classification!
The e1071 package provides a tool that performs a grid search (what's that?) and reports an estimate of the best parameters. Get familiar with tune(), and use it to improve your model!

tune(svm, train.x=databctrain, train.y=classesbctrain, validation.x=databctest, validation.y=classesbctest, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)), tunecontrol = tune.control(sampling = "fix"))
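
To make the idea of a grid search concrete: every combination of candidate values is tried, a model is trained for each, and the combination with the lowest error on held-out data wins. The sketch below does this by hand on the iris data with a fixed validation set. It is only meant to illustrate the principle, not to reproduce the internals of tune(); the split size, the seed and the variable names are arbitrary choices.

```r
library(e1071)

data(iris)
set.seed(1)                                # arbitrary seed
idx   <- sample(nrow(iris), 100)           # 100 rows for training
train <- iris[idx, ]
valid <- iris[-idx, ]                      # remaining 50 rows for validation

# All combinations of the candidate values (3 x 3 = 9 rows).
grid <- expand.grid(gamma = 2^(-1:1), cost = 2^(2:4))
grid$error <- NA_real_

# Train one model per combination and record its validation error.
for (i in seq_len(nrow(grid))) {
  m <- svm(Species ~ ., data = train,
           gamma = grid$gamma[i], cost = grid$cost[i])
  grid$error[i] <- mean(predict(m, valid) != valid$Species)
}

print(grid)
best <- grid[which.min(grid$error), ]      # first combination with the lowest error
print(best)
```

The cost of a grid search grows with the product of the number of candidate values, so coarse grids are usually tried first and refined afterwards.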

5) Task: Custom classification

Get some interesting data you want to classify. Hint: search the WWW for "classification benchmarks". Make sure you can get the data in a format that is easy to read (ideal: CSV with column titles). Check whether the data sets are complete; if not, think about how to deal with that. Hint: common symbols for missing measurements are '?', 'N' and '*'. Please write down where you got the data from. Build, test and tune a model until the classification performance satisfies you. Save the R commands you used in a file classification-datafilename.r (in the correct order ;)). Report the results.

If you give me permission, I'll put your results on this tutorial's homepage. Others may learn from your examples.
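
One simple way to deal with missing measurements is to tell read.csv() which symbols mean "missing" and then drop the incomplete rows. The sketch below reads a tiny made-up CSV from a string instead of a file; the column names and values are invented for illustration.

```r
# Made-up CSV content with two missing entries ('?' and '*').
raw <- "id,size,class
1,5,a
2,?,b
3,7,*
"

# na.strings lists the symbols that should be read as NA.
d <- read.csv(text = raw, na.strings = c('?', 'N', '*'))

colSums(is.na(d))        # number of missing values per column
complete <- na.omit(d)   # keep only rows without any missing value
print(complete)
```

Be careful with a symbol like 'N' if it can also occur as a legitimate value in your data. Dropping rows is the simplest strategy, but it throws information away; whether that is acceptable depends on how many rows are affected.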

Appendix A - Data sources

Breast-cancer data:

Sources:

Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Madison, Wisconsin
USA

Donor: Olvi Mangasarian (mangasarian@cs.wisc.edu)

Received by David W. Aha (aha@cs.jhu.edu)

Past work:

O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 & 18.
                                                                               
William H. Wolberg and O.L. Mangasarian: "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology",
Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
December 1990, pp 9193-9196.
                                                                               
O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition
via linear programming: Theory and application to medical diagnosis",
in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
                                                                               
K. P. Bennett & O. L. Mangasarian: "Robust linear programming
discrimination of two linearly inseparable sets", Optimization Methods
and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).