Background

From the paper "Using Resistin, glucose, age and BMI to predict the presence of breast cancer": https://bmccancer.biomedcentral.com/track/pdf/10.1186/s12885-017-3877-1

The goal of this exploratory study was to develop and assess a prediction model which can potentially be used as a biomarker for breast cancer, based on anthropometric data and parameters which can be gathered in routine blood analysis.

For each of the 166 participants, several clinical features were observed or measured, including:

  • Age
  • BMI (body mass index)
  • Glucose
  • Insulin
  • HOMA (the homeostasis model assessment, based on plasma levels of fasting glucose and insulin, has been widely validated and applied for quantifying insulin resistance and β-cell function)
  • Leptin (a hormone predominantly made by adipose cells that helps to regulate energy balance by inhibiting hunger)
  • Adiponectin (a protein hormone involved in regulating glucose levels as well as fatty acid breakdown)
  • Resistin (an adipocyte-secreted hormone (adipokine) linked to obesity and insulin resistance in rodents)
  • MCP-1 (monocyte chemoattractant protein-1 is a potent chemoattractant that draws monocytes and macrophages to areas of inflammation)
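The HOMA entry above describes an index derived from fasting glucose and insulin. As a minimal sketch, assuming the commonly used HOMA-IR approximation and assuming glucose is recorded in mg/dL and insulin in µU/mL (the source does not state the units or the exact formula behind the HOMA column), the index could be computed as follows. The function name `homa_ir` is hypothetical.

```r
# HOMA-IR from fasting glucose (assumed mg/dL) and fasting insulin (assumed uU/mL).
# 405 = 22.5 * 18, where 18 converts glucose from mg/dL to mmol/L.
homa_ir <- function(glucose_mg_dl, insulin_uU_ml) {
  glucose_mg_dl * insulin_uU_ml / 405
}

homa_ir(90, 9)  # 90 * 9 / 405 = 2
```

Because HOMA is a deterministic function of Glucose and Insulin under this assumption, the three columns carry overlapping information, which matters when looking for correlated features later on.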

Data Overview

Skim summary statistics
n obs: 116
n variables: 11

Variable type: character

variable missing complete n min max empty n_unique
Classification 0 116 116 1 1 0 2

Variable type: factor

variable missing complete n n_unique top_counts ordered
Disease_Status 0 116 116 2 dis: 64, hea: 52, NA: 0 FALSE

Variable type: numeric

variable missing complete n mean sd p0 p25 p50 p75 p100 hist
Adiponectin 0 116 116 10.18 6.84 1.66 5.47 8.35 11.82 38.04 ▆▇▂▁▂▁▁▁
Age 0 116 116 57.3 16.11 24 45 56 71 89 ▂▃▇▅▃▇▅▃
BMI 0 116 116 27.58 5.02 18.37 22.97 27.66 31.24 38.58 ▃▇▃▆▆▅▃▂
Glucose 0 116 116 97.79 22.53 60 85.75 92 102 201 ▂▇▅▁▁▁▁▁
HOMA 0 116 116 2.69 3.64 0.47 0.92 1.38 2.86 25.05 ▇▁▁▁▁▁▁▁
Insulin 0 116 116 10.01 10.07 2.43 4.36 5.92 11.19 58.46 ▇▂▁▁▁▁▁▁
Leptin 0 116 116 26.62 19.18 4.31 12.31 20.27 37.38 90.28 ▇▆▃▃▂▁▁▁
MCP.1 0 116 116 534.65 345.91 45.84 269.98 471.32 700.09 1698.44 ▆▇▆▃▂▁▁▁
Resistin 0 116 116 14.73 12.39 3.21 6.88 10.83 17.76 82.1 ▇▃▂▁▁▁▁▁

Data Exploration

  • Any correlated variables/features?
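One way to answer the correlation question is to list every pair of numeric variables whose absolute Pearson correlation exceeds a cutoff. The sketch below is illustrative only: `correlated_pairs` is a hypothetical helper, the 0.7 cutoff is an arbitrary assumption, and the `toy` data frame is made up so the block runs on its own.

```r
# Hypothetical helper: return variable pairs with |Pearson r| >= threshold.
correlated_pairs <- function(df, threshold = 0.7) {
  num <- df[vapply(df, is.numeric, logical(1))]       # keep numeric columns only
  cm <- cor(num, use = "pairwise.complete.obs")
  # Look at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]
}

# Made-up data: b is a linear function of a, c alternates in sign
toy <- data.frame(a = 1:10, b = 2 * (1:10) + 1, c = rep(c(1, -1), 5))
correlated_pairs(toy)  # reports the single pair (a, b)
```

Applied to the clinical data, a call of this shape would be expected to flag HOMA together with Insulin and Glucose, since HOMA is derived from them.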

Summary by Disease

Variable disease (N=64) healthy (N=52) Total (N=116) p value
Age 0.642¹
   Mean (SD) 56.672 (13.493) 58.077 (18.958) 57.302 (16.113)
   Range 34.000 - 86.000 24.000 - 89.000 24.000 - 89.000
BMI 0.156¹
   Mean (SD) 26.985 (4.620) 28.317 (5.427) 27.582 (5.020)
   Range 18.370 - 37.109 18.670 - 38.579 18.370 - 38.579
Glucose < 0.001¹
   Mean (SD) 105.562 (26.557) 88.231 (10.192) 97.793 (22.525)
   Range 70.000 - 201.000 60.000 - 118.000 60.000 - 201.000
Insulin 0.003¹
   Mean (SD) 12.513 (12.318) 6.934 (4.860) 10.012 (10.068)
   Range 2.432 - 58.460 2.707 - 26.211 2.432 - 58.460
HOMA 0.002¹
   Mean (SD) 3.623 (4.589) 1.552 (1.218) 2.695 (3.642)
   Range 0.508 - 25.050 0.467 - 7.112 0.467 - 25.050
Leptin 0.991¹
   Mean (SD) 26.597 (19.212) 26.638 (19.335) 26.615 (19.183)
   Range 6.334 - 90.280 4.311 - 83.482 4.311 - 90.280
Adiponectin 0.835¹
   Mean (SD) 10.061 (6.189) 10.328 (7.631) 10.181 (6.843)
   Range 1.656 - 33.750 2.194 - 38.040 1.656 - 38.040
Resistin 0.014¹
   Mean (SD) 17.254 (12.637) 11.615 (11.447) 14.726 (12.391)
   Range 3.210 - 55.215 3.292 - 82.100 3.210 - 82.100
MCP.1 0.329¹
   Mean (SD) 563.016 (384.002) 499.731 (292.242) 534.647 (345.913)
   Range 90.090 - 1698.440 45.843 - 1256.083 45.843 - 1698.440
  ¹ Linear Model ANOVA
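The p values in the table come from a linear model ANOVA. With two groups, that F test can be reproduced from the group means, standard deviations, and sizes alone. The sketch below does this for the Age and Glucose rows using the figures printed in the table; `p_from_summary` is a hypothetical helper, not a function from the original analysis.

```r
# One-way ANOVA p value for two groups, from summary statistics only.
# With two groups the F statistic equals the squared pooled-variance t statistic.
p_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  df_resid <- n1 + n2 - 2
  pooled_var <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df_resid
  f_stat <- (m1 - m2)^2 / (pooled_var * (1 / n1 + 1 / n2))
  pf(f_stat, df1 = 1, df2 = df_resid, lower.tail = FALSE)
}

# Values copied from the table: disease (N=64) vs healthy (N=52)
p_age     <- p_from_summary(56.672, 13.493, 64, 58.077, 18.958, 52)
p_glucose <- p_from_summary(105.562, 26.557, 64, 88.231, 10.192, 52)
round(p_age, 2)     # about 0.64, in line with the Age row
p_glucose < 0.001   # TRUE, in line with the Glucose row
```

On the raw data, the equivalent call would typically take the form `anova(lm(Age ~ Disease_Status, data = d))`, where `d` stands for a data frame holding these columns.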

Data Analysis

Clustering

method coefficient rank
ward 0.9108062 1
complete 0.8633043 2
average 0.8113225 3
single 0.7419399 4
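The source does not show how the table above was produced. A plausible reading, stated here as an assumption, is that each coefficient is the agglomerative coefficient reported by `cluster::agnes()` for a given linkage method, where values closer to 1 indicate stronger clustering structure. A minimal sketch on the built-in `USArrests` data (a stand-in, not the breast cancer data):

```r
library(cluster)  # agnes(); a recommended package shipped with standard R installs

x <- scale(USArrests)  # stand-in data, standardised so no variable dominates
linkages <- c("ward", "complete", "average", "single")

# Agglomerative coefficient for each linkage method
ac <- vapply(linkages, function(m) agnes(x, method = m)$ac, numeric(1))

ranking <- data.frame(method = linkages,
                      coefficient = unname(ac),
                      rank = unname(rank(-ac)))
ranking[order(ranking$rank), ]
```

The linkage with the highest coefficient is then the natural choice for the hierarchical clustering step.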

  • Explore data clusters
  • Candidate techniques include silhouette analysis
  • Silhouette analysis measures the separation distance between the resulting clusters
  • The silhouette plot shows how close each point in one cluster is to points in the neighboring clusters, which gives a visual way to assess parameters such as the number of clusters
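The silhouette idea described in the bullets can be sketched as follows: compute the average silhouette width for several candidate numbers of clusters and prefer the value with the widest average. The example assumes k-means as the clustering method and uses the built-in `USArrests` data as a stand-in; neither choice is taken from the original analysis.

```r
library(cluster)  # silhouette()

x <- scale(USArrests)  # stand-in data, not the breast cancer dataset
d <- dist(x)
candidate_k <- 2:6

# Average silhouette width for each candidate number of clusters
avg_sil <- sapply(candidate_k, function(k) {
  set.seed(1234)
  km <- kmeans(x, centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidate_k

avg_sil
best_k <- candidate_k[which.max(avg_sil)]
```

Silhouette widths range from -1 to 1: values near 1 mean a point sits well inside its cluster, values near 0 mean it lies between clusters, and negative values suggest it may be assigned to the wrong cluster.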

Optimal clusters

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 8 proposed 2 as the best number of clusters 
## * 1 proposed 3 as the best number of clusters 
## * 6 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 6 as the best number of clusters 
## * 2 proposed 8 as the best number of clusters 
## * 4 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************
## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 8 proposed  2 as the best number of clusters
## * 1 proposed  3 as the best number of clusters
## * 6 proposed  4 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 2 proposed  6 as the best number of clusters
## * 2 proposed  8 as the best number of clusters
## * 4 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .
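The majority rule in this output is a plain vote count across indices. Tallying the counts printed above makes it easy to see how decisive the vote was. The numbers below are copied from the summary; nothing is recomputed from the data.

```r
# Names are the proposed number of clusters, values are how many indices proposed it
votes <- c("0" = 2, "2" = 8, "3" = 1, "4" = 6, "5" = 1, "6" = 2, "8" = 2, "10" = 4)

best_k <- as.integer(names(votes)[which.max(votes)])
best_k                     # 2
sum(votes)                 # 26 indices in total
votes[["2"]] / sum(votes)  # share of indices backing the winner
```

The winner is a plurality rather than an outright majority: 8 of 26 indices propose two clusters and 6 propose four, so a four-cluster solution may also be worth inspecting.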

cluster_group Freq
1 63
2 53

PCA

  • PCA plot
  • Identify contributing variables
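Both PCA steps can be sketched in base R. The example below assumes `prcomp()` on standardised variables and uses the built-in `USArrests` data as a stand-in. A variable's contribution to a component is taken here as its squared loading expressed as a percentage, which is one common convention and not necessarily the one used in the original plots.

```r
# PCA on standardised variables (stand-in data)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Percentage contribution of each variable to PC1:
# loadings have unit length, so squared loadings sum to 1
contrib_pc1 <- 100 * pca$rotation[, 1]^2
sort(contrib_pc1, decreasing = TRUE)

# Scores on the first two components, the usual input for a PCA plot
scores <- pca$x[, 1:2]
```

Variables with a contribution above the uniform share (100 divided by the number of variables) are the ones that drive a component most.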

Machine Learning

  • Explore various machine learning algorithms
  • GLM
  • RandomForest
  • Split data into test and train

Test set:

Disease_Status n
disease 19
healthy 15

Training set:

Disease_Status n
disease 45
healthy 37
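The source does not show how the split was made. The class counts are consistent with a stratified 70/30 split that rounds the training share up within each class, and the sketch below reproduces them under that assumption. `stratified_split` is a hypothetical helper, and the label vector is rebuilt from the class totals (64 disease, 52 healthy) rather than read from the data.

```r
# Hypothetical helper: sample a proportion p of the indices within each class
stratified_split <- function(y, p = 0.7) {
  idx <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), ceiling(p * length(i)))]
  })
  sort(unname(unlist(idx)))
}

set.seed(1234)
y <- factor(rep(c("disease", "healthy"), times = c(64, 52)))
train_idx <- stratified_split(y, p = 0.7)

table(y[train_idx])   # disease 45, healthy 37
table(y[-train_idx])  # disease 19, healthy 15
```

Stratifying keeps the disease to healthy ratio nearly identical in both sets, which matters with only 116 observations.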

Explore various ML algorithms

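library(caret)          # trainControl(), createResample(), resamples(), varImp()
library(caretEnsemble)  # caretList(), caretEnsemble()

# Model selection metric; twoClassSummary reports ROC, Sens and Spec
metric <- "ROC"

# One seed vector per resample (10, as fixed by `index` below) plus one for the final model
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- rep(1234, ncol(train_data)-1)
# for the last model
seeds[[11]] <- rep(1234, 1)

# `index` supplies 10 bootstrap resamples of the training set, which is why the
# summaries report 10 resamples of size 82 even though `method`, `number` and
# `repeats` describe repeated cross-validation
ctrl <- trainControl(method = "repeatedcv", 
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$Disease_Status, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)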

algorithms <- c('adaboost','glmnet','lda','knn','nb','parRF','rpart','svmRadialWeights')

models <- caretList(Disease_Status ~ .,
                    data = train_data,
                    metric = metric,
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    methodList = algorithms)
results <- resamples(models)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights 
## Number of resamples: 10 
## 
## ROC 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.6736842 0.7396825 0.7719322 0.7773017 0.8267740
## glmnet           0.5178571 0.7383333 0.7655075 0.7642146 0.8430241
## lda              0.5937500 0.6997863 0.7440351 0.7408308 0.7768398
## knn              0.6706349 0.6865132 0.7516869 0.7648407 0.8273507
## nb               0.7556561 0.7744949 0.8084821 0.8073260 0.8344643
## parRF            0.7302632 0.7667411 0.7986111 0.8112877 0.8677885
## rpart            0.5000000 0.6683532 0.7094406 0.7012338 0.7580196
## svmRadialWeights 0.7368421 0.7556548 0.7871120 0.8062080 0.8473011
##                       Max. NA's
## adaboost         0.9000000    0
## glmnet           0.8787879    0
## lda              0.8687783    0
## knn              0.9272727    0
## nb               0.8684211    0
## parRF            0.8891403    0
## rpart            0.8433333    0
## svmRadialWeights 0.9200000    0
## 
## Sens 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.5000000 0.6439394 0.7032967 0.7113942 0.7611336
## glmnet           0.5625000 0.6531955 0.6923077 0.6973651 0.7359649
## lda              0.5000000 0.6078947 0.6675824 0.6887805 0.7944444
## knn              0.4210526 0.4867788 0.6523810 0.6605633 0.7923077
## nb               0.5789474 0.6177885 0.6602871 0.6778123 0.7285714
## parRF            0.5625000 0.6379870 0.6754386 0.7054451 0.7692308
## rpart            0.3571429 0.5538278 0.6791667 0.6642982 0.8173077
## svmRadialWeights 0.6111111 0.6343985 0.6923077 0.7092611 0.7318182
##                       Max. NA's
## adaboost         0.9333333    0
## glmnet           0.8181818    0
## lda              0.8461538    0
## knn              1.0000000    0
## nb               0.8461538    0
## parRF            0.9444444    0
## rpart            0.9333333    0
## svmRadialWeights 1.0000000    0
## 
## Spec 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.4285714 0.6833333 0.7823529 0.7217087 0.8250000
## glmnet           0.2857143 0.6375000 0.7750000 0.7297339 0.8176471
## lda              0.4000000 0.5952381 0.7083333 0.7149720 0.8308824
## knn              0.5000000 0.6166667 0.7142857 0.7077591 0.8083333
## nb               0.6666667 0.7589286 0.8000000 0.8016387 0.8308824
## parRF            0.5833333 0.6750000 0.7166667 0.7256863 0.7875000
## rpart            0.4285714 0.5083333 0.6666667 0.6646359 0.7764706
## svmRadialWeights 0.5833333 0.6666667 0.7238095 0.7178291 0.7833333
##                       Max. NA's
## adaboost         0.8571429    0
## glmnet           1.0000000    0
## lda              1.0000000    0
## knn              0.8823529    0
## nb               1.0000000    0
## parRF            0.8571429    0
## rpart            0.9285714    0
## svmRadialWeights 0.8571429    0
dotplot(results)
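The three blocks in the summary above (ROC, Sens, Spec) can be illustrated in base R. The sketch assumes that ROC denotes the area under the ROC curve and that the first class level ("disease") is treated as the positive class. The scores and labels are made up for illustration.

```r
# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative case
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Sensitivity: share of positives predicted positive
# Specificity: share of negatives predicted negative
sens_spec <- function(pred_positive, is_positive) {
  c(Sens = mean(pred_positive[is_positive]),
    Spec = mean(!pred_positive[!is_positive]))
}

score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)         # made-up predicted probabilities
truth <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)  # TRUE = positive class

auc_rank(score, truth)         # 8/9: 8 of the 9 positive/negative pairs are ordered correctly
sens_spec(score > 0.5, truth)  # Sens = 1, Spec = 2/3 at a 0.5 cutoff
```

AUC does not depend on a cutoff, whereas Sens and Spec do, so two models with similar ROC values can still show quite different Sens and Spec.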

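# Correlation between the models' resampled performance; weakly correlated
# models tend to be the better candidates for ensembling
library(ggcorrplot)
model_cor <- modelCor(results)
ggcorrplot(model_cor, 
           hc.order = TRUE, 
           type = "lower",
           title = "All by All Correlation of Models",
           outline.col = "white",
           lab = TRUE)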

plot(varImp(models$glmnet), main = "GLMnet - Variable Importance Plot")

plot(varImp(models$parRF), main = "Parallel Random Forest - Variable Importance Plot")

Ensemble method

greedy_ensemble <- caretEnsemble(
  models, 
  metric = metric,
  trControl = trainControl(
    number = length(algorithms),
    summaryFunction = twoClassSummary,
    classProbs = TRUE
    ))
summary(greedy_ensemble)
## The following models were ensembled: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights 
## They were weighted: 
## 2.6276 0.7084 -0.4894 -0.4315 -0.9677 -1.472 -2.9882 0.7363 -0.8185
## The resulting ROC is: 0.8338
## The fit for each individual model on the ROC is: 
##            method       ROC      ROCSD
##          adaboost 0.7773017 0.07058755
##            glmnet 0.7642146 0.10548102
##               lda 0.7408308 0.08478397
##               knn 0.7648407 0.08699957
##                nb 0.8073260 0.03991026
##             parRF 0.8112877 0.05722305
##             rpart 0.7012338 0.09631080
##  svmRadialWeights 0.8062080 0.06157617
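The weights line in the summary above lists nine numbers for eight models, which is consistent with a linear blend that has an intercept plus one coefficient per base model. The sketch below illustrates that idea with a logistic regression fitted on the predicted probabilities of two made-up base learners. It is a toy illustration of stacking under that assumption, not a reproduction of what `caretEnsemble` does internally.

```r
set.seed(1234)
n <- 200
y <- rbinom(n, 1, 0.5)  # made-up binary outcome

# Two hypothetical base learners: noisy probability scores that track y
p1 <- plogis(1.5 * (y - 0.5) + rnorm(n))
p2 <- plogis(1.0 * (y - 0.5) + rnorm(n))

# Stack them: a logistic regression on the base-model probabilities
stack <- glm(y ~ p1 + p2, family = binomial)
coef(stack)  # intercept plus one weight per base model

# Blended probability for each observation
ens_prob <- predict(stack, type = "response")
```

In practice the blend should be fitted on out-of-fold predictions rather than in-sample ones, otherwise base models that overfit receive too much weight.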
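# Predicted probability of the "disease" class from each base model on the held-out test set
model_preds <- lapply(models, predict, newdata = test_data, type = "prob")
model_preds <- lapply(model_preds, function(x) x[,"disease"])
model_preds <- data.frame(model_preds)
# Add the ensemble's predicted probabilities as one more column
ens_preds <- predict(greedy_ensemble, newdata = test_data, type = "prob")
model_preds$ensemble <- ens_preds
# Test-set AUC for every column (each base model and the ensemble)
caTools::colAUC(model_preds, test_data$Disease_Status)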
##                      adaboost    glmnet       lda       knn        nb
## disease vs. healthy 0.8175439 0.7403509 0.7719298 0.8017544 0.5789474
##                         parRF    rpart svmRadialWeights  ensemble
## disease vs. healthy 0.7736842 0.622807        0.8035088 0.7649123
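Putting the resampled ROC means next to the test-set AUCs shows which models held up on unseen data. The values below are copied from the two outputs above; only the comparison itself is new.

```r
# Mean resampled ROC per model (from the ensemble summary)
cv_roc <- c(adaboost = 0.7773017, glmnet = 0.7642146, lda = 0.7408308,
            knn = 0.7648407, nb = 0.8073260, parRF = 0.8112877,
            rpart = 0.7012338, svmRadialWeights = 0.8062080)

# Test-set AUC per model (from the colAUC output)
test_auc <- c(adaboost = 0.8175439, glmnet = 0.7403509, lda = 0.7719298,
              knn = 0.8017544, nb = 0.5789474, parRF = 0.7736842,
              rpart = 0.622807, svmRadialWeights = 0.8035088)

gap <- test_auc - cv_roc[names(test_auc)]
comparison <- data.frame(model = names(test_auc),
                         cv_roc = unname(cv_roc[names(test_auc)]),
                         test_auc = unname(test_auc),
                         gap = unname(gap))
comparison[order(comparison$gap), ]  # largest drop first
```

Naive Bayes shows the largest drop (about 0.81 resampled versus 0.58 on the test set), while adaboost has the highest test AUC and the ensemble (0.765) does not beat the best single models. With only 34 test observations these differences are noisy, so they are better read as a caution than as a ranking.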

Random Forest

  • Search for optimal parameters in RF
## 
## Call:
## summary.resamples(object = results)
## 
## Models: 50, 100, 150, 200, 250 
## Number of resamples: 10 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 50  0.6228070 0.7395833 0.8163012 0.7956873 0.8433333 0.9393939    0
## 100 0.6732456 0.7819314 0.8133242 0.8081363 0.8475792 0.9030303    0
## 150 0.7127193 0.7547249 0.8265110 0.8110996 0.8511905 0.9466667    0
## 200 0.6666667 0.7775493 0.8350275 0.8156990 0.8713774 0.8833333    0
## 250 0.6754386 0.7764333 0.8333333 0.8168656 0.8625090 0.8933333    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 50  0.5625000 0.6327751 0.6880952 0.6839202 0.7359649 0.7777778    0
## 100 0.5454545 0.6467611 0.7032967 0.7053054 0.7359649 0.9444444    0
## 150 0.6000000 0.6379870 0.6899038 0.7152236 0.7611336 0.8888889    0
## 200 0.5789474 0.6488095 0.6882591 0.7142048 0.7587413 0.8888889    0
## 250 0.5789474 0.6327751 0.6547619 0.7075431 0.8076923 0.8888889    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## 50  0.4166667 0.7190476 0.7750000 0.7439496 0.8250000 0.8823529    0
## 100 0.4166667 0.6439076 0.7000000 0.7092297 0.8214286 0.9000000    0
## 150 0.5000000 0.6041667 0.7000000 0.6763025 0.7389706 0.8000000    0
## 200 0.6428571 0.6750000 0.7166667 0.7549020 0.8511905 0.8823529    0
## 250 0.6428571 0.6750000 0.7166667 0.7549020 0.8511905 0.8823529    0

## Parallel Random Forest 
## 
## 82 samples
##  9 predictor
##  2 classes: 'disease', 'healthy' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 82, 82, 82, 82, 82, 82, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   1     0.7231732  0.6641894  0.6179692
##   2     0.7787381  0.7150147  0.6818768
##   3     0.8021109  0.7104917  0.6779692
##   4     0.7926048  0.7089348  0.6740896
##   5     0.8041746  0.7337621  0.7102101
##   6     0.8066544  0.7233326  0.6686835
##   7     0.8133463  0.7276502  0.7001401
##   8     0.8168656  0.7075431  0.7549020
##   9     0.8057697  0.7103176  0.7387115
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
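# Tuning grid for svmRadialWeights: 3 costs x 3 kernel widths x 2 class weights = 18 combinations
my_grid <- expand.grid(C = c(.25, .5, 1),sigma = c(.01,.05,.1), Weight = 1:2)
model_list <- list()

# One seed vector per resample (10, as fixed by `index` below) plus one for the final model;
# each vector holds 27 seeds, enough to cover the 18 grid rows
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- sample.int(1000, 27)
# for the last model
seeds[[11]] <- rep(1234, 1)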

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$Disease_Status, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)

svm_model <- train(
  Disease_Status ~ .,
  data = train_data,
  method = "svmRadialWeights",
  trControl = ctrl,
  metric = metric,
  preProcess = c('center', 'scale'),
  tuneGrid = my_grid
  )

svm_model
## Support Vector Machines with Class Weights 
## 
## 82 samples
##  9 predictor
##  2 classes: 'disease', 'healthy' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 82, 82, 82, 82, 82, 82, ... 
## Resampling results across tuning parameters:
## 
##   C     sigma  Weight  ROC        Sens       Spec       
##   0.25  0.01   1       0.7226020  0.8105263  0.190000000
##   0.25  0.01   2       0.7217008  1.0000000  0.000000000
##   0.25  0.05   1       0.7330365  0.8296992  0.249027778
##   0.25  0.05   2       0.7668246  1.0000000  0.000000000
##   0.25  0.10   1       0.7445300  0.8110349  0.355000000
##   0.25  0.10   2       0.7829816  0.9947368  0.006666667
##   0.50  0.01   1       0.7260932  0.8315789  0.180000000
##   0.50  0.01   2       0.7225552  1.0000000  0.000000000
##   0.50  0.05   1       0.7402500  0.7434765  0.508995726
##   0.50  0.05   2       0.7673054  0.9718045  0.116666667
##   0.50  0.10   1       0.7578265  0.7682749  0.584626068
##   0.50  0.10   2       0.7843698  0.9115613  0.265138889
##   1.00  0.01   1       0.7264733  0.7918620  0.319305556
##   1.00  0.01   2       0.7224263  0.9947368  0.021111111
##   1.00  0.05   1       0.7747591  0.7226547  0.649209402
##   1.00  0.05   2       0.7787393  0.8861300  0.373579060
##   1.00  0.10   1       0.8011886  0.7345167  0.666709402
##   1.00  0.10   2       0.8028658  0.8384107  0.534690171
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1, C = 1 and Weight = 2.

Breast Cancer Wisconsin (Diagnostic) Data Set

Data Summary

## Skim summary statistics  
##  n obs: 569    
##  n variables: 32    
## 
## Variable type: character
## 
##  variable     missing    complete     n     min    max    empty    n_unique 
## -----------  ---------  ----------  -----  -----  -----  -------  ----------
##  diagnosis       0         569       569     6      9       0         2     
## 
## Variable type: integer
## 
##  variable     missing    complete     n     mean       sd        p0      p25       p50        p75       p100        hist   
## -----------  ---------  ----------  -----  -------  ---------  ------  --------  --------  ---------  ---------  ----------
##  id_number       0         569       569    3e+07    1.3e+08    8670    869218    906024    8813129    9.1e+08    ▇▁▁▁▁▁▁▁ 
## 
## Variable type: numeric
## 
##         variable            missing    complete     n      mean       sd        p0        p25       p50       p75      p100       hist   
## -------------------------  ---------  ----------  -----  --------  --------  ---------  --------  --------  --------  -------  ----------
##         area_mean              0         569       569    654.89    351.91     143.5     420.3     551.1     782.7     2501     ▅▇▂▂▁▁▁▁ 
##          area_se               0         569       569    40.34     45.49       6.8      17.85     24.53     45.19     542.2    ▇▁▁▁▁▁▁▁ 
##        area_worst              0         569       569    880.58    569.36     185.2     515.3     686.5      1084     4254     ▇▅▂▁▁▁▁▁ 
##     compactness_mean           0         569       569     0.1      0.053      0.019     0.065     0.093      0.13     0.35     ▅▇▆▃▁▁▁▁ 
##      compactness_se            0         569       569    0.025     0.018     0.0023     0.013      0.02     0.032     0.14     ▇▆▂▁▁▁▁▁ 
##     compactness_worst          0         569       569     0.25      0.16      0.027      0.15      0.21      0.34     1.06     ▆▇▅▂▁▁▁▁ 
##    concave_points_mean         0         569       569    0.049     0.039        0        0.02     0.034     0.074      0.2     ▇▆▃▃▁▁▁▁ 
##     concave_points_se          0         569       569    0.012     0.0062       0       0.0076    0.011     0.015     0.053    ▃▇▅▁▁▁▁▁ 
##   concave_points_worst         0         569       569     0.11     0.066        0       0.065      0.1       0.16     0.29     ▃▇▇▅▅▃▂▁ 
##      concavity_mean            0         569       569    0.089      0.08        0        0.03     0.062      0.13     0.43     ▇▃▂▂▁▁▁▁ 
##       concavity_se             0         569       569    0.032      0.03        0       0.015     0.026     0.042      0.4     ▇▂▁▁▁▁▁▁ 
##      concavity_worst           0         569       569     0.27      0.21        0        0.11      0.23      0.38     1.25     ▇▆▅▂▁▁▁▁ 
##  fractal_dimension_mean        0         569       569    0.063     0.0071     0.05      0.058     0.062     0.066     0.097    ▃▇▆▂▁▁▁▁ 
##   fractal_dimension_se         0         569       569    0.0038    0.0026    0.00089    0.0022    0.0032    0.0046    0.03     ▇▂▁▁▁▁▁▁ 
##  fractal_dimension_worst       0         569       569    0.084     0.018      0.055     0.071      0.08     0.092     0.21     ▆▇▃▁▁▁▁▁ 
##      perimeter_mean            0         569       569    91.97      24.3      43.79     75.17     86.24     104.1     188.5    ▂▇▇▃▂▁▁▁ 
##       perimeter_se             0         569       569     2.87      2.02      0.76       1.61      2.29      3.36     21.98    ▇▂▁▁▁▁▁▁ 
##      perimeter_worst           0         569       569    107.26     33.6      50.41     84.11     97.66     125.4     251.2    ▂▇▃▂▂▁▁▁ 
##        radius_mean             0         569       569    14.13      3.52      6.98       11.7     13.37     15.78     28.11    ▁▆▇▃▂▁▁▁ 
##         radius_se              0         569       569     0.41      0.28      0.11       0.23      0.32      0.48     2.87     ▇▂▁▁▁▁▁▁ 
##       radius_worst             0         569       569    16.27      4.83      7.93      13.01     14.97     18.79     36.04    ▂▇▅▂▂▁▁▁ 
##      smoothness_mean           0         569       569    0.096     0.014      0.053     0.086     0.096      0.11     0.16     ▁▂▇▇▃▁▁▁ 
##       smoothness_se            0         569       569    0.007     0.003     0.0017     0.0052    0.0064    0.0081    0.031    ▅▇▂▁▁▁▁▁ 
##     smoothness_worst           0         569       569     0.13     0.023      0.071      0.12      0.13      0.15     0.22     ▁▃▆▇▃▁▁▁ 
##       symmetry_mean            0         569       569     0.18     0.027      0.11       0.16      0.18      0.2       0.3     ▁▃▇▇▂▁▁▁ 
##        symmetry_se             0         569       569    0.021     0.0083    0.0079     0.015     0.019     0.023     0.079    ▇▇▃▁▁▁▁▁ 
##      symmetry_worst            0         569       569     0.29     0.062      0.16       0.25      0.28      0.32     0.66     ▁▇▆▂▁▁▁▁ 
##       texture_mean             0         569       569    19.29      4.3       9.71      16.17     18.84      21.8     39.28    ▂▆▇▅▂▁▁▁ 
##        texture_se              0         569       569     1.22      0.55      0.36       0.83      1.11      1.47     4.88     ▆▇▃▁▁▁▁▁ 
##       texture_worst            0         569       569    25.68      6.15      12.02     21.08     25.41     29.72     49.54    ▂▆▇▆▅▁▁▁

Correlation of Cell Features

PCA

Test set:

diagnosis n
benign 107
malignant 63

Training set:

diagnosis n
benign 250
malignant 149

# One seed vector per resample (10, as fixed by `index` below) plus one for the final model
seeds <- vector(mode = "list", length = 11)
for(i in 1:10) seeds[[i]] <- rep(1234, ncol(train_data)-1)
# for the last model
seeds[[11]] <- rep(1234, 1)

ctrl <- trainControl(method = "repeatedcv", 
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$diagnosis, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)

algorithms <- c('adaboost','glmnet','lda','knn','nb','parRF','rpart','svmRadialWeights')

models <- caretList(diagnosis ~ .,
                    data = train_data,
                    metric = metric,
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    methodList = algorithms)
results <- resamples(models)
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights 
## Number of resamples: 10 
## 
## ROC 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.9749175 0.9762854 0.9889150 0.9875071 0.9981244
## glmnet           0.9778292 0.9839645 0.9900009 0.9898677 0.9972840
## lda              0.9686973 0.9804039 0.9885307 0.9856467 0.9934591
## knn              0.9679005 0.9721636 0.9793102 0.9826747 0.9959849
## nb               0.9689769 0.9789179 0.9863277 0.9858391 0.9932019
## parRF            0.9668867 0.9824051 0.9896213 0.9868628 0.9955792
## rpart            0.8789879 0.9179112 0.9236871 0.9301738 0.9579732
## svmRadialWeights 0.9814922 0.9833709 0.9883414 0.9900293 0.9977839
##                       Max. NA's
## adaboost         0.9995745    0
## glmnet           0.9997821    0
## lda              0.9973856    0
## knn              0.9973046    0
## nb               0.9971641    0
## parRF            0.9973046    0
## rpart            0.9818083    0
## svmRadialWeights 0.9991285    0
## 
## Sens 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.9518072 0.9652877 0.9747929 0.9767549 0.9892740
## glmnet           0.9787234 0.9891809 0.9948980 0.9936232 1.0000000
## lda              0.9680851 0.9816818 0.9890110 0.9890121 1.0000000
## knn              0.9690722 0.9762459 0.9780220 0.9826186 0.9890801
## nb               0.9278351 0.9520907 0.9582200 0.9606564 0.9780220
## parRF            0.9397590 0.9537169 0.9686650 0.9666717 0.9783607
## rpart            0.9230769 0.9315786 0.9570636 0.9535953 0.9746999
## svmRadialWeights 0.9603960 0.9789405 0.9882941 0.9863012 0.9974227
##                       Max. NA's
## adaboost         1.0000000    0
## glmnet           1.0000000    0
## lda              1.0000000    0
## knn              1.0000000    0
## nb               0.9795918    0
## parRF            1.0000000    0
## rpart            0.9801980    0
## svmRadialWeights 1.0000000    0
## 
## Spec 
##                       Min.   1st Qu.    Median      Mean   3rd Qu.
## adaboost         0.8222222 0.8962520 0.9212635 0.9266883 0.9813941
## glmnet           0.8524590 0.9028708 0.9382126 0.9286659 0.9616981
## lda              0.8032787 0.8469697 0.8770416 0.8857121 0.9233962
## knn              0.8000000 0.8748844 0.9015949 0.8985928 0.9244444
## nb               0.7555556 0.8558214 0.8722034 0.8731597 0.9167889
## parRF            0.8444444 0.8826156 0.9261017 0.9169727 0.9441824
## rpart            0.7777778 0.8529806 0.8809384 0.8767192 0.8974130
## svmRadialWeights 0.8444444 0.8954545 0.9137675 0.9231452 0.9579032
##                       Max. NA's
## adaboost         1.0000000    0
## glmnet           0.9838710    0
## lda              0.9814815    0
## knn              0.9622642    0
## nb               0.9433962    0
## parRF            0.9677419    0
## rpart            0.9629630    0
## svmRadialWeights 0.9814815    0
dotplot(results)

model_cor <- modelCor(results)
ggcorrplot(model_cor, 
           hc.order = TRUE, 
           type = "lower",
           title = "All by All Correlation of Models",
           outline.col = "white",
           lab = TRUE)

plot(varImp(models$glmnet), main = "GLMnet - Variable Importance Plot")

plot(varImp(models$parRF), main = "Parallel Random Forest - Variable Importance Plot")