From the paper *Using Resistin, glucose, age and BMI to predict the presence of breast cancer* (BMC Cancer): https://bmccancer.biomedcentral.com/track/pdf/10.1186/s12885-017-3877-1
The goal of this exploratory study was to develop and assess a prediction model that could potentially be used as a biomarker for breast cancer, based on anthropometric data and on parameters that can be gathered in routine blood analysis.
For each of the 116 participants several clinical features were observed or measured, including:

* Age
* BMI (Body Mass Index)
* Glucose
* Insulin
* HOMA (the homeostasis model assessment, based on plasma levels of fasting glucose and insulin, which has been widely validated and applied for quantifying insulin resistance and β-cell function; see the sketch just after this list)
* Leptin (a hormone predominantly made by adipose cells that helps to regulate energy balance by inhibiting hunger)
* Adiponectin (a protein hormone involved in regulating glucose levels as well as fatty acid breakdown)
* Resistin (an adipocyte-secreted hormone (adipokine) linked to obesity and insulin resistance in rodents)
* MCP-1 (monocyte chemoattractant protein-1, a potent chemoattractant for monocytes and macrophages to areas of inflammation)
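Since HOMA is derived from the fasting glucose and insulin columns, it can be reconstructed directly. A minimal sketch, assuming glucose is recorded in mg/dL and insulin in µU/mL (the standard HOMA-IR formula; `bc_data` is a placeholder name for the data frame of 116 observations):

homa_check <- function(glucose_mg_dl, insulin_uU_ml) {
  # HOMA-IR = (fasting glucose [mg/dL] * fasting insulin [uU/mL]) / 405
  # (equivalently, glucose in mmol/L times insulin, divided by 22.5)
  (glucose_mg_dl * insulin_uU_ml) / 405
}

homa_check(92, 5.92)   # ~1.34 at the median glucose/insulin, close to the median HOMA of 1.38 below
# all.equal(homa_check(bc_data$Glucose, bc_data$Insulin), bc_data$HOMA)  # hypothetical check against the recorded column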
Skim summary statistics
n obs: 116
n variables: 11
Variable type: character
variable | missing | complete | n | min | max | empty | n_unique |
---|---|---|---|---|---|---|---|
Classification | 0 | 116 | 116 | 1 | 1 | 0 | 2 |
Variable type: factor
variable | missing | complete | n | n_unique | top_counts | ordered |
---|---|---|---|---|---|---|
Disease_Status | 0 | 116 | 116 | 2 | dis: 64, hea: 52, NA: 0 | FALSE |
Variable type: numeric
variable | missing | complete | n | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|---|
Adiponectin | 0 | 116 | 116 | 10.18 | 6.84 | 1.66 | 5.47 | 8.35 | 11.82 | 38.04 | ▆▇▂▁▂▁▁▁ |
Age | 0 | 116 | 116 | 57.3 | 16.11 | 24 | 45 | 56 | 71 | 89 | ▂▃▇▅▃▇▅▃ |
BMI | 0 | 116 | 116 | 27.58 | 5.02 | 18.37 | 22.97 | 27.66 | 31.24 | 38.58 | ▃▇▃▆▆▅▃▂ |
Glucose | 0 | 116 | 116 | 97.79 | 22.53 | 60 | 85.75 | 92 | 102 | 201 | ▂▇▅▁▁▁▁▁ |
HOMA | 0 | 116 | 116 | 2.69 | 3.64 | 0.47 | 0.92 | 1.38 | 2.86 | 25.05 | ▇▁▁▁▁▁▁▁ |
Insulin | 0 | 116 | 116 | 10.01 | 10.07 | 2.43 | 4.36 | 5.92 | 11.19 | 58.46 | ▇▂▁▁▁▁▁▁ |
Leptin | 0 | 116 | 116 | 26.62 | 19.18 | 4.31 | 12.31 | 20.27 | 37.38 | 90.28 | ▇▆▃▃▂▁▁▁ |
MCP.1 | 0 | 116 | 116 | 534.65 | 345.91 | 45.84 | 269.98 | 471.32 | 700.09 | 1698.44 | ▆▇▆▃▂▁▁▁ |
Resistin | 0 | 116 | 116 | 14.73 | 12.39 | 3.21 | 6.88 | 10.83 | 17.76 | 82.1 | ▇▃▂▁▁▁▁▁ |
| | disease (N=64) | healthy (N=52) | Total (N=116) | p value |
|---|---|---|---|---|
| Age | | | | 0.642 |
| Mean (SD) | 56.672 (13.493) | 58.077 (18.958) | 57.302 (16.113) | |
| Range | 34.000 - 86.000 | 24.000 - 89.000 | 24.000 - 89.000 | |
| BMI | | | | 0.156 |
| Mean (SD) | 26.985 (4.620) | 28.317 (5.427) | 27.582 (5.020) | |
| Range | 18.370 - 37.109 | 18.670 - 38.579 | 18.370 - 38.579 | |
| Glucose | | | | < 0.001 |
| Mean (SD) | 105.562 (26.557) | 88.231 (10.192) | 97.793 (22.525) | |
| Range | 70.000 - 201.000 | 60.000 - 118.000 | 60.000 - 201.000 | |
| Insulin | | | | 0.003 |
| Mean (SD) | 12.513 (12.318) | 6.934 (4.860) | 10.012 (10.068) | |
| Range | 2.432 - 58.460 | 2.707 - 26.211 | 2.432 - 58.460 | |
| HOMA | | | | 0.002 |
| Mean (SD) | 3.623 (4.589) | 1.552 (1.218) | 2.695 (3.642) | |
| Range | 0.508 - 25.050 | 0.467 - 7.112 | 0.467 - 25.050 | |
| Leptin | | | | 0.991 |
| Mean (SD) | 26.597 (19.212) | 26.638 (19.335) | 26.615 (19.183) | |
| Range | 6.334 - 90.280 | 4.311 - 83.482 | 4.311 - 90.280 | |
| Adiponectin | | | | 0.835 |
| Mean (SD) | 10.061 (6.189) | 10.328 (7.631) | 10.181 (6.843) | |
| Range | 1.656 - 33.750 | 2.194 - 38.040 | 1.656 - 38.040 | |
| Resistin | | | | 0.014 |
| Mean (SD) | 17.254 (12.637) | 11.615 (11.447) | 14.726 (12.391) | |
| Range | 3.210 - 55.215 | 3.292 - 82.100 | 3.210 - 82.100 | |
| MCP.1 | | | | 0.329 |
| Mean (SD) | 563.016 (384.002) | 499.731 (292.242) | 534.647 (345.913) | |
| Range | 90.090 - 1698.440 | 45.843 - 1256.083 | 45.843 - 1698.440 | |
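The tests behind the p values above are not identified in this extract. A minimal sketch of how a per-variable comparison of the two groups could be reproduced, here with Wilcoxon rank-sum tests as an assumption (not necessarily the test used for the table), again using the placeholder name `bc_data`:

num_vars <- c("Age", "BMI", "Glucose", "Insulin", "HOMA",
              "Leptin", "Adiponectin", "Resistin", "MCP.1")
# two-group comparison (disease vs. healthy) for each numeric variable
sapply(num_vars, function(v) {
  wilcox.test(bc_data[[v]] ~ bc_data$Disease_Status)$p.value
})

Whatever the exact test, the pattern matches the table: Glucose, Insulin, HOMA and Resistin differ between the groups, while Age, BMI, Leptin, Adiponectin and MCP.1 do not.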
method | coefficient | rank |
---|---|---|
ward | 0.9108062 | 1 |
complete | 0.8633043 | 2 |
average | 0.8113225 | 3 |
single | 0.7419399 | 4 |
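The values above appear to be agglomerative coefficients for four linkage methods, with values near 1 indicating stronger clustering structure; Ward linkage scores highest and is the natural choice for the hierarchical clustering that follows. A minimal sketch of how such coefficients can be computed with cluster::agnes(), assuming the numeric features have been scaled into a matrix called scaled_data (a placeholder name):

library(cluster)

linkages <- c("ward", "complete", "average", "single")
# agglomerative coefficient for each linkage method
sapply(linkages, function(m) agnes(scaled_data, method = m)$ac)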
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 8 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 8 proposed 2 as the best number of clusters
## * 1 proposed 3 as the best number of clusters
## * 6 proposed 4 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 2 proposed 6 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 4 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
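The two blocks above are the standard console output of NbClust, which evaluates roughly 30 cluster-validity indices and takes a majority vote; the second tally looks like the same vote re-printed (for example by factoextra::fviz_nbclust() applied to the NbClust result). A minimal sketch of a call that produces this kind of output; the distance metric and linkage method are assumptions, not taken from the original code:

library(NbClust)

nb <- NbClust(data = scaled_data,          # scaled numeric features (placeholder name)
              distance = "euclidean",
              min.nc = 2, max.nc = 10,
              method = "ward.D2",
              index = "all")
# factoextra::fviz_nbclust(nb) would re-print the vote tally and plot it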
cluster_group | Freq |
---|---|
1 | 63 |
2 | 53 |
Disease_Status | n |
---|---|
disease | 19 |
healthy | 15 |
Disease_Status | n |
---|---|
disease | 45 |
healthy | 37 |
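The two Disease_Status tables above (34 and 82 observations) are consistent with a stratified train/test split of the 116 participants: the models below report 82 training samples, and the held-out AUCs further down are computed on the remaining 34. A minimal sketch of such a split with caret; the proportion and seed are assumptions, and `bc_data` is again a placeholder name:

library(caret)
library(caretEnsemble)

set.seed(42)   # placeholder seed, not taken from the original analysis
in_train   <- createDataPartition(bc_data$Disease_Status, p = 0.7, list = FALSE)
train_data <- bc_data[in_train, ]
test_data  <- bc_data[-in_train, ]

table(train_data$Disease_Status)
table(test_data$Disease_Status)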
# One seed vector per resample, plus one for the final model
# (10 resamples from createResample() below, so length 10 + 1 = 11).
seeds <- vector(mode = "list", length = 11)
# each vector must hold at least as many seeds as tuning-parameter combinations per model
for (i in 1:10) seeds[[i]] <- rep(1234, ncol(train_data) - 1)
# seed for the final model fit
seeds[[11]] <- rep(1234, 1)

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$Disease_Status, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)

algorithms <- c('adaboost', 'glmnet', 'lda', 'knn', 'nb', 'parRF', 'rpart', 'svmRadialWeights')

# Fit all base learners with identical resampling so they can be compared and ensembled
models <- caretList(Disease_Status ~ .,
                    data = train_data,
                    metric = metric,   # "ROC", as reported by twoClassSummary
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    methodList = algorithms)
results <- resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.6736842 0.7396825 0.7719322 0.7773017 0.8267740
## glmnet 0.5178571 0.7383333 0.7655075 0.7642146 0.8430241
## lda 0.5937500 0.6997863 0.7440351 0.7408308 0.7768398
## knn 0.6706349 0.6865132 0.7516869 0.7648407 0.8273507
## nb 0.7556561 0.7744949 0.8084821 0.8073260 0.8344643
## parRF 0.7302632 0.7667411 0.7986111 0.8112877 0.8677885
## rpart 0.5000000 0.6683532 0.7094406 0.7012338 0.7580196
## svmRadialWeights 0.7368421 0.7556548 0.7871120 0.8062080 0.8473011
## Max. NA's
## adaboost 0.9000000 0
## glmnet 0.8787879 0
## lda 0.8687783 0
## knn 0.9272727 0
## nb 0.8684211 0
## parRF 0.8891403 0
## rpart 0.8433333 0
## svmRadialWeights 0.9200000 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.5000000 0.6439394 0.7032967 0.7113942 0.7611336
## glmnet 0.5625000 0.6531955 0.6923077 0.6973651 0.7359649
## lda 0.5000000 0.6078947 0.6675824 0.6887805 0.7944444
## knn 0.4210526 0.4867788 0.6523810 0.6605633 0.7923077
## nb 0.5789474 0.6177885 0.6602871 0.6778123 0.7285714
## parRF 0.5625000 0.6379870 0.6754386 0.7054451 0.7692308
## rpart 0.3571429 0.5538278 0.6791667 0.6642982 0.8173077
## svmRadialWeights 0.6111111 0.6343985 0.6923077 0.7092611 0.7318182
## Max. NA's
## adaboost 0.9333333 0
## glmnet 0.8181818 0
## lda 0.8461538 0
## knn 1.0000000 0
## nb 0.8461538 0
## parRF 0.9444444 0
## rpart 0.9333333 0
## svmRadialWeights 1.0000000 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.4285714 0.6833333 0.7823529 0.7217087 0.8250000
## glmnet 0.2857143 0.6375000 0.7750000 0.7297339 0.8176471
## lda 0.4000000 0.5952381 0.7083333 0.7149720 0.8308824
## knn 0.5000000 0.6166667 0.7142857 0.7077591 0.8083333
## nb 0.6666667 0.7589286 0.8000000 0.8016387 0.8308824
## parRF 0.5833333 0.6750000 0.7166667 0.7256863 0.7875000
## rpart 0.4285714 0.5083333 0.6666667 0.6646359 0.7764706
## svmRadialWeights 0.5833333 0.6666667 0.7238095 0.7178291 0.7833333
## Max. NA's
## adaboost 0.8571429 0
## glmnet 1.0000000 0
## lda 1.0000000 0
## knn 0.8823529 0
## nb 1.0000000 0
## parRF 0.8571429 0
## rpart 0.9285714 0
## svmRadialWeights 0.8571429 0
# Compare resampled ROC / sensitivity / specificity across the base learners
dotplot(results)

# Pairwise correlation of the models' resampled performance;
# weakly correlated strong models are what make an ensemble worthwhile
model_cor <- modelCor(results)
ggcorrplot(model_cor,
           hc.order = TRUE,
           type = "lower",
           title = "All by All Correlation of Models",
           outline.col = "white",
           lab = TRUE)
plot(varImp(models$glmnet), main = "GLMnet - Variable Importance Plot")
plot(varImp(models$parRF), main = "Parallel Random Forest - Variable Importance Plot")
# Blend the base learners into a single ensemble; caretEnsemble fits a simple
# linear (GLM) combination of the models' predictions, optimised for ROC
greedy_ensemble <- caretEnsemble(
  models,
  metric = metric,
  trControl = trainControl(
    number = length(algorithms),
    summaryFunction = twoClassSummary,
    classProbs = TRUE
  ))
summary(greedy_ensemble)
## The following models were ensembled: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights
## They were weighted:
## 2.6276 0.7084 -0.4894 -0.4315 -0.9677 -1.472 -2.9882 0.7363 -0.8185
## The resulting ROC is: 0.8338
## The fit for each individual model on the ROC is:
## method ROC ROCSD
## adaboost 0.7773017 0.07058755
## glmnet 0.7642146 0.10548102
## lda 0.7408308 0.08478397
## knn 0.7648407 0.08699957
## nb 0.8073260 0.03991026
## parRF 0.8112877 0.05722305
## rpart 0.7012338 0.09631080
## svmRadialWeights 0.8062080 0.06157617
# Class probabilities for the "disease" class from each base model on the held-out test set
model_preds <- lapply(models, predict, newdata = test_data, type = "prob")
model_preds <- lapply(model_preds, function(x) x[, "disease"])
model_preds <- data.frame(model_preds)
# Ensemble probabilities on the same test set
ens_preds <- predict(greedy_ensemble, newdata = test_data, type = "prob")
model_preds$ensemble <- ens_preds
# Test-set AUC for every model and for the ensemble
caTools::colAUC(model_preds, test_data$Disease_Status)
## adaboost glmnet lda knn nb
## disease vs. healthy 0.8175439 0.7403509 0.7719298 0.8017544 0.5789474
## parRF rpart svmRadialWeights ensemble
## disease vs. healthy 0.7736842 0.622807 0.8035088 0.7649123
##
## Call:
## summary.resamples(object = results)
##
## Models: 50, 100, 150, 200, 250
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50 0.6228070 0.7395833 0.8163012 0.7956873 0.8433333 0.9393939 0
## 100 0.6732456 0.7819314 0.8133242 0.8081363 0.8475792 0.9030303 0
## 150 0.7127193 0.7547249 0.8265110 0.8110996 0.8511905 0.9466667 0
## 200 0.6666667 0.7775493 0.8350275 0.8156990 0.8713774 0.8833333 0
## 250 0.6754386 0.7764333 0.8333333 0.8168656 0.8625090 0.8933333 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50 0.5625000 0.6327751 0.6880952 0.6839202 0.7359649 0.7777778 0
## 100 0.5454545 0.6467611 0.7032967 0.7053054 0.7359649 0.9444444 0
## 150 0.6000000 0.6379870 0.6899038 0.7152236 0.7611336 0.8888889 0
## 200 0.5789474 0.6488095 0.6882591 0.7142048 0.7587413 0.8888889 0
## 250 0.5789474 0.6327751 0.6547619 0.7075431 0.8076923 0.8888889 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 50 0.4166667 0.7190476 0.7750000 0.7439496 0.8250000 0.8823529 0
## 100 0.4166667 0.6439076 0.7000000 0.7092297 0.8214286 0.9000000 0
## 150 0.5000000 0.6041667 0.7000000 0.6763025 0.7389706 0.8000000 0
## 200 0.6428571 0.6750000 0.7166667 0.7549020 0.8511905 0.8823529 0
## 250 0.6428571 0.6750000 0.7166667 0.7549020 0.8511905 0.8823529 0
## $`250`
## Parallel Random Forest
##
## 82 samples
## 9 predictor
## 2 classes: 'disease', 'healthy'
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 82, 82, 82, 82, 82, 82, ...
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 1 0.7231732 0.6641894 0.6179692
## 2 0.7787381 0.7150147 0.6818768
## 3 0.8021109 0.7104917 0.6779692
## 4 0.7926048 0.7089348 0.6740896
## 5 0.8041746 0.7337621 0.7102101
## 6 0.8066544 0.7233326 0.6686835
## 7 0.8133463 0.7276502 0.7001401
## 8 0.8168656 0.7075431 0.7549020
## 9 0.8057697 0.7103176 0.7387115
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 8.
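The resampling comparison over models named 50–250 and the parRF printout above suggest that the random forest was refit for several ntree values while tuning mtry from 1 to 9, with the ntree = 250 fit being the one printed. The code for that step is not included in this extract; a minimal sketch of how it could be done with caret, reusing ctrl and metric from earlier (names and details are assumptions):

rf_list <- list()
for (nt in c(50, 100, 150, 200, 250)) {
  rf_list[[as.character(nt)]] <- train(Disease_Status ~ .,
                                       data = train_data,
                                       method = "parRF",
                                       ntree = nt,                       # forwarded to the underlying random forest
                                       tuneGrid = data.frame(mtry = 1:9),
                                       metric = metric,
                                       trControl = ctrl,
                                       preProcess = c("center", "scale"))
}

summary(resamples(rf_list))   # the ROC/Sens/Spec comparison shown above
rf_list[["250"]]              # prints a "Parallel Random Forest" summary like the one above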
# Tuning grid: 3 cost values x 3 kernel widths x 2 class weights = 18 combinations
my_grid <- expand.grid(C = c(0.25, 0.5, 1), sigma = c(0.01, 0.05, 0.1), Weight = 1:2)
model_list <- list()

# One seed vector per resample, plus one for the final model (10 resamples + 1 = 11);
# each vector must hold at least as many seeds as tuning combinations (27 >= 18)
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) seeds[[i]] <- sample.int(1000, 27)
# seed for the final model fit
seeds[[11]] <- rep(1234, 1)

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$Disease_Status, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)

svm_model <- train(
  Disease_Status ~ .,
  data = train_data,
  method = "svmRadialWeights",
  trControl = ctrl,
  metric = metric,
  preProcess = c('center', 'scale'),
  tuneGrid = my_grid
)
svm_model
## Support Vector Machines with Class Weights
##
## 82 samples
## 9 predictor
## 2 classes: 'disease', 'healthy'
##
## Pre-processing: centered (9), scaled (9)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 82, 82, 82, 82, 82, 82, ...
## Resampling results across tuning parameters:
##
## C sigma Weight ROC Sens Spec
## 0.25 0.01 1 0.7226020 0.8105263 0.190000000
## 0.25 0.01 2 0.7217008 1.0000000 0.000000000
## 0.25 0.05 1 0.7330365 0.8296992 0.249027778
## 0.25 0.05 2 0.7668246 1.0000000 0.000000000
## 0.25 0.10 1 0.7445300 0.8110349 0.355000000
## 0.25 0.10 2 0.7829816 0.9947368 0.006666667
## 0.50 0.01 1 0.7260932 0.8315789 0.180000000
## 0.50 0.01 2 0.7225552 1.0000000 0.000000000
## 0.50 0.05 1 0.7402500 0.7434765 0.508995726
## 0.50 0.05 2 0.7673054 0.9718045 0.116666667
## 0.50 0.10 1 0.7578265 0.7682749 0.584626068
## 0.50 0.10 2 0.7843698 0.9115613 0.265138889
## 1.00 0.01 1 0.7264733 0.7918620 0.319305556
## 1.00 0.01 2 0.7224263 0.9947368 0.021111111
## 1.00 0.05 1 0.7747591 0.7226547 0.649209402
## 1.00 0.05 2 0.7787393 0.8861300 0.373579060
## 1.00 0.10 1 0.8011886 0.7345167 0.666709402
## 1.00 0.10 2 0.8028658 0.8384107 0.534690171
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1, C = 1 and Weight = 2.
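The selected configuration (sigma = 0.1, C = 1, Weight = 2) trades specificity for sensitivity through the class weight, as the tuning table shows (Weight = 2 rows have higher sensitivity and lower specificity). As a possible next step, not shown in the output above, the tuned SVM could be scored on the held-out test set:

svm_probs <- predict(svm_model, newdata = test_data, type = "prob")[, "disease"]
caTools::colAUC(svm_probs, test_data$Disease_Status)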
## Skim summary statistics
## n obs: 569
## n variables: 32
##
## Variable type: character
##
## variable missing complete n min max empty n_unique
## ----------- --------- ---------- ----- ----- ----- ------- ----------
## diagnosis 0 569 569 6 9 0 2
##
## Variable type: integer
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ----------- --------- ---------- ----- ------- --------- ------ -------- -------- --------- --------- ----------
## id_number 0 569 569 3e+07 1.3e+08 8670 869218 906024 8813129 9.1e+08 ▇▁▁▁▁▁▁▁
##
## Variable type: numeric
##
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## ------------------------- --------- ---------- ----- -------- -------- --------- -------- -------- -------- ------- ----------
## area_mean 0 569 569 654.89 351.91 143.5 420.3 551.1 782.7 2501 ▅▇▂▂▁▁▁▁
## area_se 0 569 569 40.34 45.49 6.8 17.85 24.53 45.19 542.2 ▇▁▁▁▁▁▁▁
## area_worst 0 569 569 880.58 569.36 185.2 515.3 686.5 1084 4254 ▇▅▂▁▁▁▁▁
## compactness_mean 0 569 569 0.1 0.053 0.019 0.065 0.093 0.13 0.35 ▅▇▆▃▁▁▁▁
## compactness_se 0 569 569 0.025 0.018 0.0023 0.013 0.02 0.032 0.14 ▇▆▂▁▁▁▁▁
## compactness_worst 0 569 569 0.25 0.16 0.027 0.15 0.21 0.34 1.06 ▆▇▅▂▁▁▁▁
## concave_points_mean 0 569 569 0.049 0.039 0 0.02 0.034 0.074 0.2 ▇▆▃▃▁▁▁▁
## concave_points_se 0 569 569 0.012 0.0062 0 0.0076 0.011 0.015 0.053 ▃▇▅▁▁▁▁▁
## concave_points_worst 0 569 569 0.11 0.066 0 0.065 0.1 0.16 0.29 ▃▇▇▅▅▃▂▁
## concavity_mean 0 569 569 0.089 0.08 0 0.03 0.062 0.13 0.43 ▇▃▂▂▁▁▁▁
## concavity_se 0 569 569 0.032 0.03 0 0.015 0.026 0.042 0.4 ▇▂▁▁▁▁▁▁
## concavity_worst 0 569 569 0.27 0.21 0 0.11 0.23 0.38 1.25 ▇▆▅▂▁▁▁▁
## fractal_dimension_mean 0 569 569 0.063 0.0071 0.05 0.058 0.062 0.066 0.097 ▃▇▆▂▁▁▁▁
## fractal_dimension_se 0 569 569 0.0038 0.0026 0.00089 0.0022 0.0032 0.0046 0.03 ▇▂▁▁▁▁▁▁
## fractal_dimension_worst 0 569 569 0.084 0.018 0.055 0.071 0.08 0.092 0.21 ▆▇▃▁▁▁▁▁
## perimeter_mean 0 569 569 91.97 24.3 43.79 75.17 86.24 104.1 188.5 ▂▇▇▃▂▁▁▁
## perimeter_se 0 569 569 2.87 2.02 0.76 1.61 2.29 3.36 21.98 ▇▂▁▁▁▁▁▁
## perimeter_worst 0 569 569 107.26 33.6 50.41 84.11 97.66 125.4 251.2 ▂▇▃▂▂▁▁▁
## radius_mean 0 569 569 14.13 3.52 6.98 11.7 13.37 15.78 28.11 ▁▆▇▃▂▁▁▁
## radius_se 0 569 569 0.41 0.28 0.11 0.23 0.32 0.48 2.87 ▇▂▁▁▁▁▁▁
## radius_worst 0 569 569 16.27 4.83 7.93 13.01 14.97 18.79 36.04 ▂▇▅▂▂▁▁▁
## smoothness_mean 0 569 569 0.096 0.014 0.053 0.086 0.096 0.11 0.16 ▁▂▇▇▃▁▁▁
## smoothness_se 0 569 569 0.007 0.003 0.0017 0.0052 0.0064 0.0081 0.031 ▅▇▂▁▁▁▁▁
## smoothness_worst 0 569 569 0.13 0.023 0.071 0.12 0.13 0.15 0.22 ▁▃▆▇▃▁▁▁
## symmetry_mean 0 569 569 0.18 0.027 0.11 0.16 0.18 0.2 0.3 ▁▃▇▇▂▁▁▁
## symmetry_se 0 569 569 0.021 0.0083 0.0079 0.015 0.019 0.023 0.079 ▇▇▃▁▁▁▁▁
## symmetry_worst 0 569 569 0.29 0.062 0.16 0.25 0.28 0.32 0.66 ▁▇▆▂▁▁▁▁
## texture_mean 0 569 569 19.29 4.3 9.71 16.17 18.84 21.8 39.28 ▂▆▇▅▂▁▁▁
## texture_se 0 569 569 1.22 0.55 0.36 0.83 1.11 1.47 4.88 ▆▇▃▁▁▁▁▁
## texture_worst 0 569 569 25.68 6.15 12.02 21.08 25.41 29.72 49.54 ▂▆▇▆▅▁▁▁
diagnosis | n |
---|---|
benign | 107 |
malignant | 63 |
diagnosis | n |
---|---|
benign | 250 |
malignant | 149 |
# One seed vector per resample, plus one for the final model (10 resamples + 1 = 11);
# each vector must hold at least as many seeds as tuning-parameter combinations per model
seeds <- vector(mode = "list", length = 11)
for (i in 1:10) seeds[[i]] <- rep(1234, ncol(train_data) - 1)
# seed for the final model fit
seeds[[11]] <- rep(1234, 1)

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     index = createResample(train_data$diagnosis, 10),
                     classProbs = TRUE,
                     seeds = seeds,
                     summaryFunction = twoClassSummary,
                     savePredictions = 'final',
                     allowParallel = TRUE)

algorithms <- c('adaboost', 'glmnet', 'lda', 'knn', 'nb', 'parRF', 'rpart', 'svmRadialWeights')

# Same set of base learners, now predicting benign vs. malignant diagnosis
models <- caretList(diagnosis ~ .,
                    data = train_data,
                    metric = metric,   # "ROC"
                    trControl = ctrl,
                    preProcess = c("center", "scale"),
                    methodList = algorithms)
results <- resamples(models)
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: adaboost, glmnet, lda, knn, nb, parRF, rpart, svmRadialWeights
## Number of resamples: 10
##
## ROC
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.9749175 0.9762854 0.9889150 0.9875071 0.9981244
## glmnet 0.9778292 0.9839645 0.9900009 0.9898677 0.9972840
## lda 0.9686973 0.9804039 0.9885307 0.9856467 0.9934591
## knn 0.9679005 0.9721636 0.9793102 0.9826747 0.9959849
## nb 0.9689769 0.9789179 0.9863277 0.9858391 0.9932019
## parRF 0.9668867 0.9824051 0.9896213 0.9868628 0.9955792
## rpart 0.8789879 0.9179112 0.9236871 0.9301738 0.9579732
## svmRadialWeights 0.9814922 0.9833709 0.9883414 0.9900293 0.9977839
## Max. NA's
## adaboost 0.9995745 0
## glmnet 0.9997821 0
## lda 0.9973856 0
## knn 0.9973046 0
## nb 0.9971641 0
## parRF 0.9973046 0
## rpart 0.9818083 0
## svmRadialWeights 0.9991285 0
##
## Sens
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.9518072 0.9652877 0.9747929 0.9767549 0.9892740
## glmnet 0.9787234 0.9891809 0.9948980 0.9936232 1.0000000
## lda 0.9680851 0.9816818 0.9890110 0.9890121 1.0000000
## knn 0.9690722 0.9762459 0.9780220 0.9826186 0.9890801
## nb 0.9278351 0.9520907 0.9582200 0.9606564 0.9780220
## parRF 0.9397590 0.9537169 0.9686650 0.9666717 0.9783607
## rpart 0.9230769 0.9315786 0.9570636 0.9535953 0.9746999
## svmRadialWeights 0.9603960 0.9789405 0.9882941 0.9863012 0.9974227
## Max. NA's
## adaboost 1.0000000 0
## glmnet 1.0000000 0
## lda 1.0000000 0
## knn 1.0000000 0
## nb 0.9795918 0
## parRF 1.0000000 0
## rpart 0.9801980 0
## svmRadialWeights 1.0000000 0
##
## Spec
## Min. 1st Qu. Median Mean 3rd Qu.
## adaboost 0.8222222 0.8962520 0.9212635 0.9266883 0.9813941
## glmnet 0.8524590 0.9028708 0.9382126 0.9286659 0.9616981
## lda 0.8032787 0.8469697 0.8770416 0.8857121 0.9233962
## knn 0.8000000 0.8748844 0.9015949 0.8985928 0.9244444
## nb 0.7555556 0.8558214 0.8722034 0.8731597 0.9167889
## parRF 0.8444444 0.8826156 0.9261017 0.9169727 0.9441824
## rpart 0.7777778 0.8529806 0.8809384 0.8767192 0.8974130
## svmRadialWeights 0.8444444 0.8954545 0.9137675 0.9231452 0.9579032
## Max. NA's
## adaboost 1.0000000 0
## glmnet 0.9838710 0
## lda 0.9814815 0
## knn 0.9622642 0
## nb 0.9433962 0
## parRF 0.9677419 0
## rpart 0.9629630 0
## svmRadialWeights 0.9814815 0
dotplot(results)
model_cor <- modelCor(results)
ggcorrplot(model_cor,
           hc.order = TRUE,
           type = "lower",
           title = "All by All Correlation of Models",
           outline.col = "white",
           lab = TRUE)
plot(varImp(models$glmnet), main = "GLMnet - Variable Importance Plot")
plot(varImp(models$parRF), main = "Parallel Random Forest - Variable Importance Plot")
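As with the first data set, a natural follow-up (not shown above) is to score the models on the held-out portion of the Wisconsin data, mirroring the earlier evaluation; test_data here is assumed to be the corresponding held-out split with the diagnosis column:

model_preds <- lapply(models, predict, newdata = test_data, type = "prob")
model_preds <- data.frame(lapply(model_preds, function(x) x[, "malignant"]))
# held-out AUC for each model, benign vs. malignant
caTools::colAUC(model_preds, test_data$diagnosis)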