This tutorial will walk through data visualization, model building (training), feature selection, and finally testing, using a machine learning package called “caret” that serves as a wrapper for just about any algorithm you can think of.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
First let’s read in our data. These are ADOS2 Module 3 questions to predict autism diagnosis.
setwd("/home/vanessa/Documents/Dropbox/Research/Papers/MachineLearning_Weka/ADOS2_m3")
SSC <- read.csv(file="SSCmod3_ADOS2_AutNot_fillMissing.csv", head=TRUE, sep=",")
VIP <- read.csv(file="SVIPmod3_updated.csv", head=TRUE, sep=",")
AC <- read.csv(file="ACmod3_AutNot_edit.csv", head=TRUE, sep=",")
NDAR <- read.csv(file='ndarMod3_AutNot.csv', head=TRUE, sep=",")
setwd("/home/vanessa/Documents/Dropbox/Research/Papers/MachineLearning_Weka/ADOS2_m3")
df = read.csv(file='AGREmod3.csv')
# Here are all the questions
questions = c("A1","A2","A3","A4","A5","A6","A7","A8","A9","B1","B2","B3","B4","B5","B6","B7","B8","B9","B10","C1","D1","D2","D3","D4","D5","E1","E2","E3")
df <- subset(df,select = c(questions,"Class"))
# We need to make labels just autism, non-spectrum
tmp = as.character(df$Class)
tmp[tmp == "Autism Spectrum"] = "Autism"
df$Class = as.factor(tmp)
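A quick sanity check that the relabeling collapsed everything into just two levels:
# frequency of each diagnosis label after collapsing "Autism Spectrum" into "Autism"
table(df$Class)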
Let’s have a little fun with our data before we try anything. This will give us a sense of the distributions and which questions will be useful.
library(AppliedPredictiveModeling)
featurePlot(x = df[,-29], y = df$Class, plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            adjust = 1.5,
            pch = "|",
            auto.key = list(columns = 2))
featurePlot(x = df[,-29],
            y = df$Class,
            plot = "box",
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)),
            auto.key = list(columns = 2))
Do we have any variables with little to no variance?
nzv = nearZeroVar(df)
hist(df[,23],main="This is the variable with piddley variance",xlab="question response")
filtered = df[, -nzv]
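If you are curious why a question gets flagged, nearZeroVar can return the metrics it thresholds on; a small sketch:
# freqRatio and percentUnique are the quantities nearZeroVar actually checks
nzv.metrics = nearZeroVar(df, saveMetrics = TRUE)
head(nzv.metrics)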
Correlated variables?
corrs = cor(filtered[,-28])
highcorr = findCorrelation(corrs, cutoff = 0.75)
filtered = filtered[-highcorr]
Linear dependencies? (e.g., A1 + A2 = A3)
combos = findLinearCombos(filtered[,-27])
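findLinearCombos returns the offending column sets in $linearCombos and suggested drops in $remove. Nothing gets dropped in this walkthrough (the column indices below assume the same columns), but if something were flagged you could remove it like this, a sketch:
combos$remove
# filtered = filtered[, -combos$remove]   # only run this if $remove is non-empty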
Now let’s prepare the training data for our algorithms. First we should center and scale so each question has zero mean and unit variance.
preprocess = preProcess(filtered[,-27], method = c("center", "scale"))
training = predict(preprocess, filtered[,-27])
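A nice property of the preProcess object is that it remembers the training means and SDs, so the exact same transform can be reapplied later; a sketch (new.data is hypothetical):
# sanity check: after centering/scaling the question means should be ~0
round(colMeans(training[,1:5]), 3)
# the same transform can be reused on new rows with, e.g.
# predict(preprocess, new.data[,questions])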
Take another look. The awesome thing about these plots is that they make the selected questions SO obvious! This tells us that it doesn’t matter so much what method we use for feature selection - the signal is real in the data and that’s why this classifier works so well.
featurePlot(x = training,
            y = filtered$Class,
            plot = "box",
            scales = list(y = list(relation = "free"),
                          x = list(rot = 90)),
            auto.key = list(columns = 2))
Take a look at all the models we can choose from!
http://caret.r-forge.r-project.org/modelList.html (150+!) http://caret.r-forge.r-project.org/bytag.html
The great thing about caret is that it’s just a wrapper for most Machine Learning libraries in R. We could even write a function to test all of them - but let’s start simple and just do a few.
We already have our training data defined in “training”; we just need to add the Class labels back:
training = cbind(training,Class = df$Class)
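In this walkthrough we train on all of AGRE and lean on cross-validation plus the external datasets for testing, but if you wanted a proper held-out split, createDataPartition keeps the class proportions; a sketch with made-up object names:
set.seed(825)                                    # any seed will do
inTrain <- createDataPartition(training$Class, p = 0.8, list = FALSE)
holdout <- training[-inTrain,]                   # 20% held out
training.sub <- training[inTrain,]               # 80% for training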
Let’s first set up the kind of resampling we want to do. There are options for bootstrap, LOOCV, and k-fold CV.
Here is a built-in function, twoClassSummary, that calculates sensitivity, specificity, and AUC:
head(twoClassSummary)
##
## 1 function (data, lev = NULL, model = NULL)
## 2 {
## 3 require(pROC)
## 4 if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))
## 5 stop("levels of observed and predicted data do not match")
## 6 rocObject <- try(pROC::roc(data$obs, data[, lev[1]]), silent = TRUE)
We can give this function to “trainControl” to tell it to give us those accuracy metrics.
Here is 10-fold CV, repeated 10 times. The defaults are just Accuracy and Kappa, but we’ve changed it to output ROC metrics. This function uses the trapezoidal rule to calculate the area under the ROC curve.
fitControl <- trainControl(method = "repeatedcv",number = 10,
repeats = 10,
classProbs = TRUE,
summaryFunction = twoClassSummary)
# methods can be "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob".
# number is number of folds in CV
If you want just accuracy, change summaryFunction to defaultSummary, and specify “Accuracy” in metric below. Or just specify nothing, because the default for classification models is to use accuracy and Kappa for validation.
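For completeness, here is what that Accuracy-only setup would look like; a minimal sketch (fitControl.acc is just a name I made up):
fitControl.acc <- trainControl(method = "repeatedcv",
                               number = 10,
                               repeats = 10,
                               summaryFunction = defaultSummary)
# then: train(..., trControl = fitControl.acc, metric = "Accuracy")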
Now let’s train our data with stochastic gradient boosting. Just a heads up: this is going to spit out a warning about the class labels (the “-” in “Non-spectrum” gets converted to “.”). You should rename the levels when you do this for realsies, but it doesn’t hurt anything here since caret handles the conversion internally.
# install.packages('pROC')
fit.gbm <- train(Class ~ ., data = training,
method = "gbm",
trControl = fitControl,
verbose = FALSE,
metric="ROC")
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
##
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
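By default caret builds a small tuning grid for gbm on its own; if you want control over the grid, you can hand train() one via tuneGrid. A sketch (the parameter values are just examples, and very old caret versions may not tune n.minobsinnode):
gbmGrid <- expand.grid(interaction.depth = c(1, 3, 5),
                       n.trees = c(50, 100, 150),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
# fit.gbm <- train(Class ~ ., data = training, method = "gbm",
#                  trControl = fitControl, tuneGrid = gbmGrid,
#                  verbose = FALSE, metric = "ROC")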
Caret makes it easy to plot how parameter selection influences accuracy.
trellis.par.set(caretTheme())
ggplot(fit.gbm)
We can also select to plot a particular metric, like sensitivity.
plot(fit.gbm,metric=c("Sens"))
Or how about a density plot?
densityplot(fit.gbm)
By default, the final model is chosen as the one with the best performance, but you can write your own rule by specifying the “selectionFunction” in the trainControl function.
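For example, caret ships with "oneSE" and "tolerance" rules; a sketch that picks the simplest model within one standard error of the best (same resampling setup as above, fitControl.1se is a made-up name):
fitControl.1se <- trainControl(method = "repeatedcv", number = 10,
                               repeats = 10,
                               classProbs = TRUE,
                               summaryFunction = twoClassSummary,
                               selectionFunction = "oneSE")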
Here is how to look at variable importance. The function automatically scales importance between 0 and 100; set scale = FALSE to turn that off.
imp <- varImp(fit.gbm, scale = FALSE)
plot(imp)
When we write up a model, we can report the AUC for each predictor and see which ones are important for each class. SO COOL.
rocimp <- filterVarImp(x = training[, -ncol(training)], y = training$Class)
head(rocimp)
## Autism Non.spectrum
## A1 0.3850 0.3850
## A2 0.7903 0.7903
## A3 0.4097 0.4097
## A4 0.7278 0.7278
## A5 0.7446 0.7446
## A6 0.3177 0.3177
Now if we want the actual predictions, you can specify type = “raw” (the default, which gives class labels) or type = “prob” (class probabilities).
prediction = predict(fit.gbm, newdata = training[,-27])
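And to get the class probabilities instead of the labels; a small sketch (the column names will be the converted level names, Autism and Non.spectrum):
probabilities = predict(fit.gbm, newdata = training[,-27], type = "prob")
head(probabilities)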
Now let’s try models that we care about. Here is an SVM with a radial kernel:
fit.svm <- train(Class ~ ., data = training,
method = "svmRadial",
trControl = fitControl,
tuneLength = 8, # how many candidate values of the tuning parameter (cost) to evaluate
metric = "ROC")
## Loading required package: kernlab
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
# Linear discriminant analysis
fit.lda = train(Class ~ ., data = training,
method = "lda",
metric = "ROC",
trControl = fitControl)
## Loading required package: MASS
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
We would of course do this for our test data, but let’s be lazy and do it for the training set right now.
postResample(prediction, training$Class)
## Accuracy Kappa
## 1 1
sensitivity(prediction, training$Class)
## [1] 1
confusionMatrix(prediction, training$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Autism Non-spectrum
## Autism 510 0
## Non-spectrum 0 93
##
## Accuracy : 1
## 95% CI : (0.994, 1)
## No Information Rate : 0.846
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.000
## Specificity : 1.000
## Pos Pred Value : 1.000
## Neg Pred Value : 1.000
## Prevalence : 0.846
## Detection Rate : 0.846
## Detection Prevalence : 0.846
## Balanced Accuracy : 1.000
##
## 'Positive' Class : Autism
##
# Shweet!
Or just get training performance
getTrainPerf(fit.gbm)
## TrainROC TrainSens TrainSpec method
## 1 0.9976 0.9943 0.8846 gbm
We can use the resamples function to compare performance BETWEEN models, and plot the results.
resamps <- resamples(list(GBM = fit.gbm,
SVM = fit.svm,
LDA = fit.lda))
resamps
##
## Call:
## resamples.default(x = list(GBM = fit.gbm, SVM = fit.svm, LDA = fit.lda))
##
## Models: GBM, SVM, LDA
## Number of resamples: 100
## Performance metrics: ROC, Sens, Spec
## Time estimates for: everything, final model fit
summary(resamps)
##
## Call:
## summary.resamples(object = resamps)
##
## Models: GBM, SVM, LDA
## Number of resamples: 100
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.985 0.996 1.000 0.998 1.000 1 0
## SVM 0.991 1.000 1.000 0.999 1.000 1 0
## LDA 0.939 0.984 0.991 0.988 0.997 1 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.961 0.98 1 0.994 1 1 0
## SVM 0.961 1.00 1 0.997 1 1 0
## LDA 0.922 0.98 1 0.990 1 1 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## GBM 0.556 0.800 0.889 0.885 1.000 1 0
## SVM 0.667 0.889 1.000 0.943 1.000 1 0
## LDA 0.400 0.758 0.778 0.790 0.889 1 0
We can also produce some nice summary plots.
bwplot(resamps, layout = c(3, 1))
trellis.par.set(caretTheme())
dotplot(resamps, metric = "ROC")
splom(resamps) #scatterplot matrix
We can also calculate differences in performance
differences = diff(resamps)
differences
##
## Call:
## diff.resamples(x = resamps)
##
## Models: GBM, SVM, LDA
## Metrics: ROC, Sens, Spec
## Number of differences: 3
## p-value adjustment: bonferroni
bwplot(differences, layout = c(3, 1))
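If you want the numbers behind that plot, summary() on the differences object prints the paired comparisons and their (Bonferroni-adjusted) p-values:
summary(differences)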
Actually, most of these models have built-in feature selection, see http://caret.r-forge.r-project.org/featureselection.html. This means that the feature selection is matched to the model, very nice.
Here is how to (manually) extract other information from the models:
# str(fit.gbm) # What is there?
attr(fit.gbm$terms,"variables") # How do I get an attribute?
## list(Class, A1, A2, A3, A4, A5, A6, A7, A8, A9, B1, B2, B4, B5,
## B6, B7, B8, B9, B10, C1, D1, D2, D4, D5, E1, E2, E3)
Here is how we access the features used by the best performing model:
predictors(fit.svm)
## [1] "A1" "A2" "A3" "A4" "A5" "A6" "A7" "A8" "A9" "B1" "B2"
## [12] "B4" "B5" "B6" "B7" "B8" "B9" "B10" "C1" "D1" "D2" "D4"
## [23] "D5" "E1" "E2" "E3"
What if we want more fine-tuned control? For example, Jack wants to do backward feature selection with 10-fold CV using SVM. We can use the function rfe, recursive feature elimination.
subsets <- 1:(ncol(training) - 1)  # candidate subset sizes
Change the “functions” variable to tweak the model used inside the feature selection (there is a sketch after this block showing how to swap in caretFuncs).
ctrl <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
number = 10,
verbose = FALSE)
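One note: rfFuncs fits random forests inside the RFE loop, which is why the output below mentions randomForest. If you want the inner model to actually be an SVM, caret's caretFuncs passes extra arguments on to train(); a sketch, not run here:
ctrl.svm <- rfeControl(functions = caretFuncs,
                       method = "repeatedcv",
                       number = 10,
                       verbose = FALSE)
# svmProfile <- rfe(training[,-27], y = training$Class,
#                   sizes = 1:(ncol(training) - 1),
#                   rfeControl = ctrl.svm,
#                   method = "svmRadial")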
Now run the backward feature selection with 10-fold CV (using the rfFuncs control above). If you want to repeat it 5 times, add repeats = 5 to rfeControl.
svmProfile = rfe(training[,-27], y = training$Class,
sizes = 1:(ncol(training) - 1),
rfeControl = ctrl,
method = "svmRadial")
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
svmProfile
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 0 0.912 0.671 0.0247 0.0812
## 1 0.912 0.671 0.0247 0.0812
## 2 0.906 0.601 0.0220 0.1169
## 3 0.934 0.731 0.0332 0.1387
## 4 0.935 0.745 0.0325 0.1243
## 5 0.942 0.768 0.0352 0.1382
## 6 0.950 0.794 0.0393 0.1678
## 7 0.957 0.823 0.0275 0.1198
## 8 0.970 0.879 0.0172 0.0751
## 9 0.972 0.889 0.0235 0.0923
## 10 0.970 0.879 0.0188 0.0785
## 11 0.973 0.891 0.0160 0.0691
## 12 0.975 0.898 0.0179 0.0765
## 13 0.970 0.881 0.0172 0.0701
## 14 0.975 0.897 0.0142 0.0645
## 15 0.972 0.886 0.0136 0.0574
## 16 0.970 0.882 0.0218 0.0862
## 17 0.967 0.869 0.0191 0.0751
## 18 0.970 0.879 0.0232 0.0972
## 19 0.972 0.885 0.0175 0.0758
## 20 0.968 0.872 0.0165 0.0691
## 21 0.975 0.899 0.0161 0.0676 *
## 22 0.973 0.894 0.0193 0.0786
## 23 0.973 0.893 0.0139 0.0593
## 24 0.967 0.865 0.0175 0.0756
## 25 0.970 0.879 0.0104 0.0441
## 26 0.970 0.881 0.0152 0.0603
##
## The top 5 variables (out of 21):
## B7, B8, B1, A7, D1
We can get features used in the same way as before
predictors(svmProfile)
## [1] "B7" "B8" "B1" "A7" "D1" "B9" "A8" "A4" "B2" "B10" "B5"
## [12] "D2" "B4" "A9" "D4" "A3" "A5" "A2" "A1" "B6" "E2"
Or in more detail:
svmProfile$fit
##
## Call:
## randomForest(x = x, y = y, importance = first, method = "svmRadial")
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 2.65%
## Confusion matrix:
## Autism Non-spectrum class.error
## Autism 506 4 0.007843
## Non-spectrum 12 81 0.129032
Here is the plot to show us how many features we should choose:
plot(svmProfile, type = c("g", "o"))
I’d say we should choose something closer to 12; the additional accuracy is probably overfitting! Let’s look again at variable importance:
varImp(svmProfile, scale = FALSE)
## Overall
## B7 22.042
## B8 12.917
## B1 12.284
## A7 9.518
## D1 9.235
## B9 9.141
## A8 8.506
## A4 7.547
## B2 6.577
## B10 5.189
## B5 4.335
## D2 3.987
## B4 3.951
## A9 3.435
## D4 3.223
## A1 3.047
## A2 2.975
## B6 2.952
## A3 2.904
## A5 2.854
## C1 2.590
## E2 2.477
## D5 2.468
## E1 1.291
Actually, after looking at this, I would suspect that a model closer to 8-10 features would have higher bias / lower variance, and extend better to test data. But for now, let’s filter down to 12:
features = c(predictors(svmProfile)[1:12],"Class")
filtered = df[,features]
# (kept for reference; the final model below is fit on the raw responses,
#  so this preprocess object isn't actually applied)
preprocess = preProcess(filtered[,-13], method = c("center","scale"))
And build the final model
fit.svm <- train(Class ~ ., data = filtered,
method = "svmRadial",
trControl = fitControl,
tuneLength = 8,
metric = "ROC")
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
Now extend to the test data. I first thought that we would need to filter and normalize the test sets ourselves, but we don’t; predict() picks out the columns the model needs, and since the final model was fit on the raw responses no scaling is required.
prediction.SSC = predict(fit.svm, newdata = SSC)
prediction.VIP = predict(fit.svm, newdata = VIP)
prediction.NDAR = predict(fit.svm, newdata = NDAR)
prediction.AC = predict(fit.svm, newdata = AC)
And performance metrics:
# Fix VIP labels
VIP$Class = as.character(VIP$Class)
VIP$Class[which(VIP$Class == "Nonspectrum")] = "Non-spectrum"
VIP$Class = as.factor(VIP$Class)
confusionMatrix(prediction.VIP, VIP$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Autism Non-spectrum
## Autism 28 2
## Non-spectrum 0 99
##
## Accuracy : 0.984
## 95% CI : (0.945, 0.998)
## No Information Rate : 0.783
## P-Value [Acc > NIR] : 1.31e-11
##
## Kappa : 0.956
## Mcnemar's Test P-Value : 0.48
##
## Sensitivity : 1.000
## Specificity : 0.980
## Pos Pred Value : 0.933
## Neg Pred Value : 1.000
## Prevalence : 0.217
## Detection Rate : 0.217
## Detection Prevalence : 0.233
## Balanced Accuracy : 0.990
##
## 'Positive' Class : Autism
##
confusionMatrix(prediction.NDAR, NDAR$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Autism Non-spectrum
## Autism 127 1
## Non-spectrum 3 26
##
## Accuracy : 0.975
## 95% CI : (0.936, 0.993)
## No Information Rate : 0.828
## P-Value [Acc > NIR] : 7e-09
##
## Kappa : 0.913
## Mcnemar's Test P-Value : 0.617
##
## Sensitivity : 0.977
## Specificity : 0.963
## Pos Pred Value : 0.992
## Neg Pred Value : 0.897
## Prevalence : 0.828
## Detection Rate : 0.809
## Detection Prevalence : 0.815
## Balanced Accuracy : 0.970
##
## 'Positive' Class : Autism
##
confusionMatrix(prediction.AC, AC$Class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Autism Non-spectrum
## Autism 186 6
## Non-spectrum 11 54
##
## Accuracy : 0.934
## 95% CI : (0.896, 0.961)
## No Information Rate : 0.767
## P-Value [Acc > NIR] : 6.96e-13
##
## Kappa : 0.82
## Mcnemar's Test P-Value : 0.332
##
## Sensitivity : 0.944
## Specificity : 0.900
## Pos Pred Value : 0.969
## Neg Pred Value : 0.831
## Prevalence : 0.767
## Detection Rate : 0.724
## Detection Prevalence : 0.747
## Balanced Accuracy : 0.922
##
## 'Positive' Class : Autism
##
It won’t work for SSC because we only have one class there, so here is the sensitivity (which, with a single class, is the same as the accuracy):
sensitivity(prediction.SSC, SSC$Class)
## [1] 0.9732
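Since every SSC case is labeled Autism, that sensitivity is also the overall accuracy; a quick cross-check (coercing to character so the differing factor levels don't trip up the comparison):
mean(as.character(prediction.SSC) == as.character(SSC$Class))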
Boom! Machine learning has never been easier!