This tutorial will walk through data visualization, model building (training), feature selection, and finally testing, using a machine learning package called “caret”, which serves as a wrapper for pretty much any algorithm you can think of.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

First let’s read in our data. These are ADOS2 Module 3 questions to predict autism diagnosis.

Testing Data

setwd("/home/vanessa/Documents/Dropbox/Research/Papers/MachineLearning_Weka/ADOS2_m3")
SSC <- read.csv(file="SSCmod3_ADOS2_AutNot_fillMissing.csv", header=TRUE, sep=",")
VIP <- read.csv(file="SVIPmod3_updated.csv", header=TRUE, sep=",")
AC <- read.csv(file="ACmod3_AutNot_edit.csv", header=TRUE, sep=",")
NDAR <- read.csv(file='ndarMod3_AutNot.csv', header=TRUE, sep=",")

Training Data

setwd("/home/vanessa/Documents/Dropbox/Research/Papers/MachineLearning_Weka/ADOS2_m3")
df = read.csv(file='AGREmod3.csv')

# Here are all the questions
questions = c("A1","A2","A3","A4","A5","A6","A7","A8","A9","B1","B2","B3","B4","B5","B6","B7","B8","B9","B10","C1","D1","D2","D3","D4","D5","E1","E2","E3")
df <- subset(df,select = c(questions,"Class"))

# We need to make labels just autism, non-spectrum
tmp = as.character(df$Class)
tmp[tmp == "Autism Spectrum"] = "Autism"
df$Class = as.factor(tmp)

VISUALIZATION

Let’s have a little fun with our data before we try anything. This will give us a sense of the distributions and a first idea of which questions will be useful.

library(AppliedPredictiveModeling)
featurePlot(x=df[,-29],y=df$Class,plot="density",
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1.5,
            pch = "|",
            auto.key = list(columns = 2))

[plot: density of each question, by class]

featurePlot(x = df[,-29],
            y = df$Class,
            plot = "box",
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),
            auto.key = list(columns = 2))            

[plot: box plots of each question, by class]

PREPROCESSING

Do we have any variables with little to no variance?

nzv = nearZeroVar(df)
hist(df[,23],main="This is the variable with piddly variance",xlab="question response")

[plot: histogram of the near-zero-variance question]

filtered = df[, -nzv]

Correlated Vars?

corrs = cor(filtered[,-28])
highcorr = findCorrelation(corrs, cutoff = 0.75)
filtered = filtered[, -highcorr]

Linear dependencies? (e.g., A1 + A2 = A3)

combos = findLinearCombos(filtered[,-27])
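findLinearCombos returns the columns it suggests dropping in its $remove element; here is a minimal sketch of acting on that result (there may well be nothing to drop):

# Drop any columns flagged as exact linear combinations of others
# (indices refer to the predictor columns; Class is last, so positions line up)
if (length(combos$remove) > 0) {
  filtered = filtered[, -combos$remove]
}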

Now let’s prepare the training data for the algorithms. First we should center to zero mean and scale to unit variance.

preprocess = preProcess(filtered[,-27], method = c("center", "scale"))
training = predict(preprocess, filtered[,-27])

Take another look. The awesome thing about these plots is that they make the informative questions SO obvious! This tells us that it doesn’t matter so much what method we use for feature selection: the signal is real in the data, and that’s why this classifier works so well.

featurePlot(x = training,
            y = filtered$Class,
            plot = "box",
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),
            auto.key = list(columns = 2))            

[plot: box plots of the centered and scaled questions, by class]

TRAINING

Take a look at all the models we can choose from!

http://caret.r-forge.r-project.org/modelList.html (150+!)
http://caret.r-forge.r-project.org/bytag.html

The great thing about caret is that it’s just a wrapper for most Machine Learning libraries in R. We could even write a function to test all of them - but let’s start simple and just do a few.
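As a rough sketch of that idea (assuming the fitControl object and the training data frame, with Class attached, that we build below, and an illustrative vector of method names), looping over a few models could look like this:

# Sketch: fit several caret methods with the same resampling setup
methods_to_try = c("svmRadial", "lda", "rpart")
fits = lapply(methods_to_try, function(m) {
  train(Class ~ ., data = training, method = m,
        trControl = fitControl, metric = "ROC")
})
names(fits) = methods_to_try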

We already have our training data defined (“training”); now add the class labels back on:

training = cbind(training,Class = df$Class)

Let’s first set up the kind of resampling we want to do. There are bootstrap, LOOCV, and k-fold CV, among others.
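For reference, a couple of alternative resampling setups look roughly like this (a sketch; we use repeated CV below):

# Bootstrap with 25 resamples (caret's default resampling method)
bootControl = trainControl(method = "boot", number = 25)
# Leave-one-out cross-validation
loocvControl = trainControl(method = "LOOCV")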

Here is a built-in function to calculate sensitivity, specificity, and AUC:

head(twoClassSummary)
##                                                                           
## 1 function (data, lev = NULL, model = NULL)                               
## 2 {                                                                       
## 3     require(pROC)                                                       
## 4     if (!all(levels(data[, "pred"]) == levels(data[, "obs"])))          
## 5         stop("levels of observed and predicted data do not match")      
## 6     rocObject <- try(pROC::roc(data$obs, data[, lev[1]]), silent = TRUE)

We can give this function to “trainControl” to tell it to give us those accuracy metrics.

Here is 10-fold CV, repeated 10 times. The defaults are just accuracy and Kappa, but we’ve changed it to output ROC metrics. twoClassSummary uses the trapezoidal rule to calculate the area under the ROC curve.

fitControl <- trainControl(method = "repeatedcv",number = 10, 
                           repeats = 10,
                           classProbs = TRUE,
                           summaryFunction = twoClassSummary)
# methods can be "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". 
# number is number of folds in CV

If you want just accuracy, change summaryFunction to defaultSummary, and specify “Accuracy” in metric below. Or just specify nothing, because the default for classification models is to use accuracy and Kappa for validation.
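A minimal sketch of that accuracy-only setup (not used in this tutorial):

# Accuracy and Kappa only; defaultSummary is the default, so summaryFunction can be omitted
accControl = trainControl(method = "repeatedcv", number = 10, repeats = 10)
# fit.acc = train(Class ~ ., data = training, method = "gbm",
#                 trControl = accControl, metric = "Accuracy", verbose = FALSE)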

Now let’s train our data with stochastic gradient boosting. Just a heads up: this is going to spit out a warning about the class labels (the “-” in “Non-spectrum” gets converted to “.”). You should fix the levels when you do this for realsies, but nothing breaks here.
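If you do want to clean the levels up front, one sketch uses make.names on a copy of the data (we keep the original labels here, so the warning and the “Non-spectrum” label persist in the output below):

# Sketch: sanitize factor levels into valid R variable names
# ("Non-spectrum" becomes "Non.spectrum")
cleaned = training
levels(cleaned$Class) = make.names(levels(cleaned$Class))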

# install.packages('pROC')
fit.gbm <- train(Class ~ ., data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 metric="ROC")
## Loading required package: gbm
## Loading required package: survival
## Loading required package: splines
## 
## Attaching package: 'survival'
## 
## The following object is masked from 'package:caret':
## 
##     cluster
## 
## Loading required package: parallel
## Loaded gbm 2.1
## Loading required package: plyr
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
## Loading required package: pROC
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

Caret makes it easy to plot how parameter selection influences performance.

trellis.par.set(caretTheme())
ggplot(fit.gbm)

[plot: ROC across the GBM tuning grid]

We can also select to plot a particular metric, like sensitivity.

plot(fit.gbm,metric=c("Sens"))

[plot: sensitivity across the GBM tuning grid]

Or how about a density plot?

densityplot(fit.gbm)

[plot: density of resampled performance for the GBM]

By default, the final model is the one with the best performance, but you can change the rule (or write your own function) via the “selectionFunction” argument of trainControl.
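For example, caret’s built-in “oneSE” rule picks the simplest model within one standard error of the numerically best one; a sketch:

# Same resampling as before, but pick the simplest model within one SE of the best
oneSEControl = trainControl(method = "repeatedcv", number = 10, repeats = 10,
                            classProbs = TRUE,
                            summaryFunction = twoClassSummary,
                            selectionFunction = "oneSE")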

Here is how to look at variable importance. The function automatically scales importance to between 0 and 100; set scale = FALSE to turn that off.

imp <- varImp(fit.gbm, scale = FALSE)
plot(imp)

[plot: GBM variable importance]

When we write up a model, we can report the AUC for each predictor and see which ones are important for each class. SO COOL.

rocimp <- filterVarImp(x = training[, -ncol(training)], y = training$Class)
head(rocimp)
##    Autism Non.spectrum
## A1 0.3850       0.3850
## A2 0.7903       0.7903
## A3 0.4097       0.4097
## A4 0.7278       0.7278
## A5 0.7446       0.7446
## A6 0.3177       0.3177

Now, to get the actual predictions, you can specify type = “raw” (the default, predicted classes) or type = “prob” (class probabilities).

prediction = predict(fit.gbm, newdata = training[,-27])
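And a sketch of asking for class probabilities instead (with the “-” still in the class level, this is exactly the situation the earlier warning is about, so sanitize the levels first if you need it):

# Class probabilities instead of hard labels; one column per class
# probs = predict(fit.gbm, newdata = training[,-27], type = "prob")
# head(probs)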

Now let’s try some other models we care about; here is an SVM.

fit.svm <- train(Class ~ ., data = training,
                method = "svmRadial",
                trControl = fitControl,
                tuneLength = 8, # number of tuning parameter values to try
                metric = "ROC")
## Loading required package: kernlab
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum
# Linear discriminant analysis
fit.lda =  train(Class ~ ., data = training,
                            method = "lda",
                            metric = "ROC",
                            trControl = fitControl)
## Loading required package: MASS
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum

ASSESSING PERFORMANCE

We would of course do this for our test data, but let’s be lazy and do it on the training data for now.

postResample(prediction, training$Class)
## Accuracy    Kappa 
##        1        1
sensitivity(prediction, training$Class)
## [1] 1
confusionMatrix(prediction, training$Class)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autism Non-spectrum
##   Autism          510            0
##   Non-spectrum      0           93
##                                     
##                Accuracy : 1         
##                  95% CI : (0.994, 1)
##     No Information Rate : 0.846     
##     P-Value [Acc > NIR] : <2e-16    
##                                     
##                   Kappa : 1         
##  Mcnemar's Test P-Value : NA        
##                                     
##             Sensitivity : 1.000     
##             Specificity : 1.000     
##          Pos Pred Value : 1.000     
##          Neg Pred Value : 1.000     
##              Prevalence : 0.846     
##          Detection Rate : 0.846     
##    Detection Prevalence : 0.846     
##       Balanced Accuracy : 1.000     
##                                     
##        'Positive' Class : Autism    
## 
# Shweet!

Or just get training performance

getTrainPerf(fit.gbm)
##   TrainROC TrainSens TrainSpec method
## 1   0.9976    0.9943    0.8846    gbm

We can use the resamples function to compare performance BETWEEN models, and plot the results.

resamps <- resamples(list(GBM = fit.gbm,
                          SVM = fit.svm,
                          LDA = fit.lda))
resamps
## 
## Call:
## resamples.default(x = list(GBM = fit.gbm, SVM = fit.svm, LDA = fit.lda))
## 
## Models: GBM, SVM, LDA 
## Number of resamples: 100 
## Performance metrics: ROC, Sens, Spec 
## Time estimates for: everything, final model fit
summary(resamps)
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: GBM, SVM, LDA 
## Number of resamples: 100 
## 
## ROC 
##      Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## GBM 0.985   0.996  1.000 0.998   1.000    1    0
## SVM 0.991   1.000  1.000 0.999   1.000    1    0
## LDA 0.939   0.984  0.991 0.988   0.997    1    0
## 
## Sens 
##      Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## GBM 0.961    0.98      1 0.994       1    1    0
## SVM 0.961    1.00      1 0.997       1    1    0
## LDA 0.922    0.98      1 0.990       1    1    0
## 
## Spec 
##      Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
## GBM 0.556   0.800  0.889 0.885   1.000    1    0
## SVM 0.667   0.889  1.000 0.943   1.000    1    0
## LDA 0.400   0.758  0.778 0.790   0.889    1    0

We can also produce some nice summary plots.

bwplot(resamps, layout = c(3, 1))

[plot: box-and-whisker plots of resampled ROC, Sens, and Spec for each model]

trellis.par.set(caretTheme())
dotplot(resamps, metric = "ROC")

[plot: dot plot of resampled ROC for each model]

splom(resamps) #scatterplot matrix

[plot: scatterplot matrix of resampled performance across models]

We can also calculate differences in performance

differences = diff(resamps)
differences
## 
## Call:
## diff.resamples(x = resamps)
## 
## Models: GBM, SVM, LDA 
## Metrics: ROC, Sens, Spec 
## Number of differences: 3 
## p-value adjustment: bonferroni
bwplot(differences, layout = c(3, 1))

[plot: box-and-whisker plots of pairwise model differences]

FEATURE SELECTION

Actually, most of these models have built-in feature selection; see http://caret.r-forge.r-project.org/featureselection.html. This means the feature selection approach is matched to the model, which is very nice.

Here is how to (manually) extract other information from the models:

# str(fit.gbm) # What is there?
attr(fit.gbm$terms,"variables") # How do I get an attribute?
## list(Class, A1, A2, A3, A4, A5, A6, A7, A8, A9, B1, B2, B4, B5, 
##     B6, B7, B8, B9, B10, C1, D1, D2, D4, D5, E1, E2, E3)

Here is how we access the features used by the best performing model:

predictors(fit.svm)
##  [1] "A1"  "A2"  "A3"  "A4"  "A5"  "A6"  "A7"  "A8"  "A9"  "B1"  "B2" 
## [12] "B4"  "B5"  "B6"  "B7"  "B8"  "B9"  "B10" "C1"  "D1"  "D2"  "D4" 
## [23] "D5"  "E1"  "E2"  "E3"

What if we want more fine-tuned control? For example, say Jack wants to do backward feature selection with 10-fold CV using an SVM. We can use the function rfe (recursive feature elimination):

subsets <- 1:(ncol(training) - 1)

Change the “functions” variable to tweak the model used for feature selection.

ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   number = 10,
                   verbose = FALSE)
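Note that rfFuncs ranks features with a random forest under the hood (you can see that in the model summary further down). A sketch of a control object that would instead refit whatever train() method you hand to rfe, with the CV repeated 5 times:

# Sketch: caretFuncs makes rfe wrap a caret train() model (e.g. svmRadial),
# and repeats = 5 repeats the 10-fold CV five times
svmCtrl = rfeControl(functions = caretFuncs,
                     method = "repeatedcv",
                     number = 10,
                     repeats = 5,
                     verbose = FALSE)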

For example, here is backward feature selection with 10-fold CV. If you want to repeat it 5 times, add a “repeats” argument equal to 5, as in the sketch above. (With rfFuncs the model doing the ranking is a random forest; the method = "svmRadial" argument below just gets passed along and is effectively ignored, as the model summary further down shows.)

svmProfile = rfe(training[,-27], y = training$Class, 
                 sizes = 1:(ncol(training)-1), 
                 rfeControl = ctrl,
                 method = "svmRadial")
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
svmProfile
## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold, repeated 1 times) 
## 
## Resampling performance over subset size:
## 
##  Variables Accuracy Kappa AccuracySD KappaSD Selected
##          0    0.912 0.671     0.0247  0.0812         
##          1    0.912 0.671     0.0247  0.0812         
##          2    0.906 0.601     0.0220  0.1169         
##          3    0.934 0.731     0.0332  0.1387         
##          4    0.935 0.745     0.0325  0.1243         
##          5    0.942 0.768     0.0352  0.1382         
##          6    0.950 0.794     0.0393  0.1678         
##          7    0.957 0.823     0.0275  0.1198         
##          8    0.970 0.879     0.0172  0.0751         
##          9    0.972 0.889     0.0235  0.0923         
##         10    0.970 0.879     0.0188  0.0785         
##         11    0.973 0.891     0.0160  0.0691         
##         12    0.975 0.898     0.0179  0.0765         
##         13    0.970 0.881     0.0172  0.0701         
##         14    0.975 0.897     0.0142  0.0645         
##         15    0.972 0.886     0.0136  0.0574         
##         16    0.970 0.882     0.0218  0.0862         
##         17    0.967 0.869     0.0191  0.0751         
##         18    0.970 0.879     0.0232  0.0972         
##         19    0.972 0.885     0.0175  0.0758         
##         20    0.968 0.872     0.0165  0.0691         
##         21    0.975 0.899     0.0161  0.0676        *
##         22    0.973 0.894     0.0193  0.0786         
##         23    0.973 0.893     0.0139  0.0593         
##         24    0.967 0.865     0.0175  0.0756         
##         25    0.970 0.879     0.0104  0.0441         
##         26    0.970 0.881     0.0152  0.0603         
## 
## The top 5 variables (out of 21):
##    B7, B8, B1, A7, D1

We can get features used in the same way as before

predictors(svmProfile)
##  [1] "B7"  "B8"  "B1"  "A7"  "D1"  "B9"  "A8"  "A4"  "B2"  "B10" "B5" 
## [12] "D2"  "B4"  "A9"  "D4"  "A3"  "A5"  "A2"  "A1"  "B6"  "E2"

Or in more detail:

svmProfile$fit
## 
## Call:
##  randomForest(x = x, y = y, importance = first, method = "svmRadial") 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 2.65%
## Confusion matrix:
##              Autism Non-spectrum class.error
## Autism          506            4    0.007843
## Non-spectrum     12           81    0.129032

Here is the plot that shows us how many features we should choose:

plot(svmProfile, type = c("g", "o"))

[plot: accuracy versus number of selected variables]

I’d say we should choose something closer to 12; the additional accuracy is probably overfitting! Let’s look again at variable importance:

varImp(svmProfile, scale = FALSE)
##     Overall
## B7   22.042
## B8   12.917
## B1   12.284
## A7    9.518
## D1    9.235
## B9    9.141
## A8    8.506
## A4    7.547
## B2    6.577
## B10   5.189
## B5    4.335
## D2    3.987
## B4    3.951
## A9    3.435
## D4    3.223
## A1    3.047
## A2    2.975
## B6    2.952
## A3    2.904
## A5    2.854
## C1    2.590
## E2    2.477
## D5    2.468
## E1    1.291

Actually, after looking at this, I would suspect that a model with closer to 8-10 features would have higher bias / lower variance, and extend better to test data. But for now, let’s filter down to 12:

features = c(predictors(svmProfile)[1:12],"Class")
filtered = df[,features]
preprocess = preProcess(filtered[,-13],method=c("center","scale"))
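Note that the preProcess object only stores the transformation; applying it is a separate predict step. A sketch of that step (the fit below actually uses the raw responses, and train can also do this internally via its own preProcess argument):

# Apply the stored centering/scaling, then reattach the class labels
filtered.scaled = predict(preprocess, filtered[,-13])
filtered.scaled = cbind(filtered.scaled, Class = filtered$Class)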

And build the final model

fit.svm <- train(Class ~ ., data = filtered,
                 method = "svmRadial",
                 trControl = fitControl,
                 tuneLength = 8,
                 metric = "ROC")
## Warning: At least one of the class levels are not valid R variables names;
## This may cause errors if class probabilities are generated because the
## variables names will be converted to: Autism, Non.spectrum

Extend to the test data. I first thought we would need to filter and normalize the test sets too, but actually we don’t; the predict functions seem to take care of it.

prediction.SSC = predict(fit.svm, newdata = SSC)
prediction.VIP = predict(fit.svm, newdata = VIP)
prediction.NDAR = predict(fit.svm, newdata = NDAR)
prediction.AC = predict(fit.svm, newdata = AC)

And Performance Metrics

# Fix VIP labels
VIP$Class = as.character(VIP$Class)
VIP$Class[which(VIP$Class == "Nonspectrum")] = "Non-spectrum"
VIP$Class = as.factor(VIP$Class)

confusionMatrix(prediction.VIP, VIP$Class)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autism Non-spectrum
##   Autism           28            2
##   Non-spectrum      0           99
##                                         
##                Accuracy : 0.984         
##                  95% CI : (0.945, 0.998)
##     No Information Rate : 0.783         
##     P-Value [Acc > NIR] : 1.31e-11      
##                                         
##                   Kappa : 0.956         
##  Mcnemar's Test P-Value : 0.48          
##                                         
##             Sensitivity : 1.000         
##             Specificity : 0.980         
##          Pos Pred Value : 0.933         
##          Neg Pred Value : 1.000         
##              Prevalence : 0.217         
##          Detection Rate : 0.217         
##    Detection Prevalence : 0.233         
##       Balanced Accuracy : 0.990         
##                                         
##        'Positive' Class : Autism        
## 
confusionMatrix(prediction.NDAR, NDAR$Class)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autism Non-spectrum
##   Autism          127            1
##   Non-spectrum      3           26
##                                         
##                Accuracy : 0.975         
##                  95% CI : (0.936, 0.993)
##     No Information Rate : 0.828         
##     P-Value [Acc > NIR] : 7e-09         
##                                         
##                   Kappa : 0.913         
##  Mcnemar's Test P-Value : 0.617         
##                                         
##             Sensitivity : 0.977         
##             Specificity : 0.963         
##          Pos Pred Value : 0.992         
##          Neg Pred Value : 0.897         
##              Prevalence : 0.828         
##          Detection Rate : 0.809         
##    Detection Prevalence : 0.815         
##       Balanced Accuracy : 0.970         
##                                         
##        'Positive' Class : Autism        
## 
confusionMatrix(prediction.AC, AC$Class)
## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autism Non-spectrum
##   Autism          186            6
##   Non-spectrum     11           54
##                                         
##                Accuracy : 0.934         
##                  95% CI : (0.896, 0.961)
##     No Information Rate : 0.767         
##     P-Value [Acc > NIR] : 6.96e-13      
##                                         
##                   Kappa : 0.82          
##  Mcnemar's Test P-Value : 0.332         
##                                         
##             Sensitivity : 0.944         
##             Specificity : 0.900         
##          Pos Pred Value : 0.969         
##          Neg Pred Value : 0.831         
##              Prevalence : 0.767         
##          Detection Rate : 0.724         
##    Detection Prevalence : 0.747         
##       Balanced Accuracy : 0.922         
##                                         
##        'Positive' Class : Autism        
## 

confusionMatrix won’t work for SSC because we only have one class there, so here is the sensitivity:

sensitivity(prediction.SSC, SSC$Class)
## [1] 0.9732
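Since every SSC case is labeled Autism, plain accuracy is just the proportion of correct predictions and equals the sensitivity above; a quick base R sketch:

# Fraction of SSC cases predicted correctly (all are Autism here)
mean(as.character(prediction.SSC) == as.character(SSC$Class))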

Boum! Machine Learning has never been easier!