SVM-Tutorial using R (e1071-package)
by Hannes Planatscher and Janko Dietzsch.
R is a nice tool, but like any other programming language it has its pitfalls.
The main goal of this tutorial is to play around with support vector machines,
so do not hesitate to ask whenever you have questions about R or get stuck
working through the different tasks.
0) Installation/Import of the package
R provides easy-to-use package management. See below.
If you are a *nix user with root rights, or you are using Windows®:
If you do not have root rights on your *nix box, but you have rw-rights in a
directory (~/...) somewhere on the system:
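The commands below sketch both cases; the private library path ~/Rlibs is only an example, adjust it to your system.

```r
# Case 1: system-wide installation (root on *nix, or Windows):
install.packages('e1071')

# Case 2: no root rights -- install into a directory you can write to
# (~/Rlibs is a hypothetical path):
install.packages('e1071', lib = '~/Rlibs')

# Load the package; lib.loc is only needed for a private library:
library(e1071)                          # after a system-wide install
# library(e1071, lib.loc = '~/Rlibs')   # after a private install
```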
1) Task: linear separable data
Prepare the training set (data, classes).
data1 <- seq(1,10,by=2)
classes1 <- c('a','a','a','b','b')
This test data will be used to validate the model.
test1 <- seq(1,10,by=2) + 1
Look in the manual and try to find out more about svm(). Which parameters
seem to be the most important to you? Build the model.
model1 <- svm(data1,classes1,type='C',kernel='linear')
View the model.
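The generic print() and summary() functions work on svm objects:

```r
print(model1)     # kernel, cost, number of support vectors
summary(model1)   # the same, plus per-class details
```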
Check if the model is able to classify your *training* data.
pred <- fitted(model1)
Validate the model on the test data. Does the prediction meet your expectations?
Try to find out where the class border was set by svm() (hint: use predict() to
classify additional points between the training values).
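One way to locate the border, sketched here with my own variable names: classify a fine grid of points and look where the predicted label flips.

```r
# Classify many closely spaced points between 1 and 10:
grid <- seq(1, 10, by = 0.1)
gridpred <- predict(model1, grid)
# The border lies between consecutive points whose labels differ:
grid[which(gridpred[-1] != gridpred[-length(gridpred)])]
```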
2) Task: non-linearly separable data
Prepare the training set (data, classes).
data2 <- seq(1,10)
classes2 <- c('b','b','b','a','a','a','a','b','b','b')
Build the model with a linear kernel.
model2 <- svm(data2,classes2,type='C',kernel='linear')
Check if the model is able to classify your training data. If not: Why not?
Try out the non-linear kernels (hint: ?svm) and check the performance.
Does the choice of kernel matter?
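For comparison, a sketch using the radial (RBF) kernel next to the linear one (model2rbf is my own name):

```r
model2rbf <- svm(data2, classes2, type = 'C', kernel = 'radial')
# Training performance of both kernels as confusion matrices:
table(fitted(model2),    classes2)   # linear kernel
table(fitted(model2rbf), classes2)   # radial kernel
```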
3) Task: Example from the manual (iris-data)
Look in the manual and run the example on the classification of the iris
data set. It is possible to get much more information about a model and its
properties than the default output shows.
Get as much information as you can about the model!
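Roughly, the iris example looks like the following sketch (see ?svm for the full version from the manual):

```r
data(iris)
# Fit with the formula interface; Species is the class column:
irismodel <- svm(Species ~ ., data = iris)
summary(irismodel)                     # kernel, cost, support vectors, ...
irismodel$SV                           # the (scaled) support vectors
# Confusion matrix on the training data:
table(fitted(irismodel), iris$Species)
```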
4) Task: Breast-cancer data
In this task you should build a model able to discriminate between benign
and malignant tumors, using morphological data. Your
database consists of 699 correctly classified observations.
Download the data (http://www.potschi.de/svmtut/breast-cancer-wisconsin.data)
and save it in the directory from which you started your R session.
Read the data from the CSV-file.
bcdata <- read.csv('breast-cancer-wisconsin.data', header=TRUE)
Take a look at the column titles. Which feature does *not* contain information
useful for classification?
Separate the data from the class labels. The subset() function allows you
to (un)select the right columns.
databcall <- subset(bcdata,select=c(-Samplecodenumber,-Class))
classesbcall <- subset(bcdata,select=Class)
Take a subset of your data that you will use as training data.
databctrain <- databcall[1:400,]
classesbctrain <- classesbcall[1:400,]
Take a subset of your data that you will use as test data.
databctest <- databcall[401:699,]
classesbctest <- classesbcall[401:699,]
Build the model.
model <- svm(databctrain, factor(classesbctrain), type='C')
Validate the model. Does it work?
pred <- predict(model, databctest)
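A sketch of the validation, assuming the model was fitted for classification (class labels as a factor): compare the predictions with the true test labels.

```r
table(pred, classesbctest)    # confusion matrix
mean(pred == classesbctest)   # fraction of correct predictions
```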
Improving the performance of your models
Some kernels have so-called hyperparameters. These parameters affect the
performance of the kernel, and ultimately your classification!
The e1071 package provides a tool that does a grid search (What's that?)
and reports estimates for good parameter values. Get familiar with tune(), and
use it to improve your model!
tune(svm, train.x=databctrain, train.y=classesbctrain, validation.x=databctest,
validation.y=classesbctest, ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)),
control = tune.control(sampling = "fix"))
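The result of tune() is worth assigning to a variable; a sketch of capturing and using it (tuned is my own name, and factor() ensures classification rather than regression):

```r
tuned <- tune(svm, train.x = databctrain, train.y = factor(classesbctrain),
              validation.x = databctest, validation.y = factor(classesbctest),
              ranges = list(gamma = 2^(-1:1), cost = 2^(2:4)),
              control = tune.control(sampling = "fix"))
summary(tuned)           # error for every parameter combination
tuned$best.parameters    # the gamma/cost pair with the lowest error
tuned$best.model         # an svm model fitted with those parameters
```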
5) Task: Custom classification
Get some interesting data you want to classify. Hint: search the WWW for
"classification benchmarks". Be sure that you can get the data in a format
that is easy to read (ideal: CSV with column titles). Check if the datasets
are complete; if not, think about how to deal with that. Hint: common symbols
for missing measurements are '?', 'N', '*'. Please write down where you
got the data from. Build, test and tune a model until the classification
performance satisfies you. Save the R commands you used in a file
classification-datafilename.r (in the correct order ;)). Report the results.
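A sketch for reading such a dataset; the filename and the set of missing-value symbols are assumptions, adapt them to your data.

```r
# na.strings turns the listed symbols into NA on import:
mydata <- read.csv('mybenchmark.csv', header = TRUE,
                   na.strings = c('?', 'N', '*'))
# One simple way to deal with missing values: drop incomplete rows.
mydata <- na.omit(mydata)
```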
If you give me permission, I'll
put your results on this tutorial's homepage. Others may learn from your examples.
Appendix A - Data sources
Dr. William H. Wolberg (physician)
University of Wisconsin Hospitals
Donor: Olvi Mangasarian (email@example.com)
Received by David W. Aha (firstname.lastname@example.org)
O. L. Mangasarian and W. H. Wolberg: "Cancer diagnosis via linear
programming", SIAM News, Volume 23, Number 5, September 1990, pp 1 &
William H. Wolberg and O.L. Mangasarian: "Multisurface method of
pattern separation for medical diagnosis applied to breast cytology",
Proceedings of the National Academy of Sciences, U.S.A., Volume 87,
December 1990, pp 9193-9196.
O. L. Mangasarian, R. Setiono, and W.H. Wolberg: "Pattern recognition
via linear programming: Theory and application to medical diagnosis",
in: "Large-scale numerical optimization", Thomas F. Coleman and Yuying
Li, editors, SIAM Publications, Philadelphia 1990, pp 22-30.
K. P. Bennett & O. L. Mangasarian: "Robust linear programming
discrimination of two linearly inseparable sets", Optimization Methods
and Software 1, 1992, 23-34 (Gordon & Breach Science Publishers).