Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to predict the manner in which the exercise was performed (the classe variable) using data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: correctly (classe = A), throwing the elbows to the front (classe = B), lifting the dumbbell only halfway (classe = C), lowering the dumbbell only halfway (classe = D), and throwing the hips to the front (classe = E). More information is available from the website http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data (used to generate training and testing datasets, see below) for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data (perhaps more correctly designated as the validation data, see below) are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har
The procedure used may be summarized in the following steps: (1) loading data; (2) cleaning data; (3) pre-processing of data; (4) generating models using training data with cross validation; (5) predictions using testing data; (6) determining accuracy/out-sample error; and (7) validating the best model. Details will be given below in text and in commentary in code chunk sections.
Data were loaded from the training CSV file provided. The raw data had been examined offline in spreadsheet form, and 52 predictors plus the outcome (classe) were selected; predictors with incomplete data were omitted. data4 represents the “cleaned” dataset. See the code chunk that follows.
#Project for Coursera Practical Machine Learning Class - Weight Lifting Data
#Using caret package
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#Loading Data
data1<-read.csv("pml-training.csv")
#Raw data examined offline and 52 predictors selected plus outcome labeled "classe"
#This effectively creates the "cleaned" dataset
roll<-grep("^roll",(names(data1)))
pitch<-grep("^pitch",(names(data1)))
yaw<-grep("^yaw",(names(data1)))
total<-grep("^total",(names(data1)))
gyros<-grep("^gyros",(names(data1)))
accel<-grep("^accel",(names(data1)))
magnet<-grep("^magnet",(names(data1)))
classe<-grep("^classe",(names(data1)))
data2<-c(roll,pitch,yaw,total,gyros,accel,magnet,classe)
data3<-sort(data2)
#The "cleaned" dataset
data4<-data1[,data3]
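As an optional sanity check on the offline column selection, one can confirm programmatically that the retained columns contain no missing or blank values; a minimal sketch assuming data4 as built above:
#Optional check: every retained column should be free of NA and blank entries
all(colSums(is.na(data4) | data4 == "") == 0)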
The cleaned data were then pre-processed by removing highly correlated predictors, using a correlation cutoff of 0.5. This reduced the number of predictors from 52 to 21, which makes the modeling much faster and the resulting models simpler and easier to handle.
Next, predictors for modeling were selected from those remaining using the rfe (recursive feature elimination) function. This function estimates model accuracy as a function of the number of variables/predictors used, successively eliminating predictors from lowest to highest importance. It was determined that using the 15 most important predictors was best - this reduction from 21 also reduces the time required to generate models. A figure is provided illustrating accuracy versus number of predictors. Note that, when rounded to two decimal places, the accuracy for 15 variables is the same as for 17 (which has the highest accuracy), and 15 is the smallest number of variables for which this is true, so only 15 variables were used.
The dataset was adjusted accordingly - the result is data7. See the code chunk that follows.
#Pre-processing the data
#Finding and removing highly correlated predictors
correlationMatrix <- cor(data4[,1:52])
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
hc<-sort(highlyCorrelated)
nhc<-c(1:53)
nhc<-nhc[-hc]
data5<-data4[,nhc]
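#Optional check (a minimal sketch, assuming data5 as created above): largest remaining
#absolute pairwise correlation among predictors after applying the 0.5 cutoff;
#classe is the last column of data5 and is excluded here
remCor<-cor(data5[,-ncol(data5)])
max(abs(remCor[upper.tri(remCor)]))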
#Generating small training set for predictor selection (after removal of correlated predictors)
set.seed(1)
intrain<-createDataPartition(y=data5$classe,p=0.05,list=FALSE)
training<-data5[intrain,]
testing<-data5[-intrain,]
dim(training)
## [1] 983 22
#Predictor selection
set.seed(7)
library(mlbench)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
control <- rfeControl(functions=rfFuncs, method="cv", number=3)
results <- rfe(training[,1:21], training[,22], sizes=c(1:21), rfeControl=control)
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (3 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.3235 0.1276 0.008458 0.013912
## 2 0.4588 0.3089 0.014978 0.022329
## 3 0.6022 0.4943 0.017530 0.024614
## 4 0.6460 0.5505 0.045736 0.058469
## 5 0.7080 0.6298 0.003295 0.003952
## 6 0.7396 0.6702 0.022870 0.028977
## 7 0.7548 0.6898 0.015500 0.019028
## 8 0.7650 0.7026 0.014304 0.018104
## 9 0.7904 0.7348 0.011028 0.013886
## 10 0.7904 0.7344 0.028568 0.036030
## 11 0.7853 0.7282 0.023507 0.028969
## 12 0.7894 0.7333 0.019437 0.023955
## 13 0.7894 0.7333 0.031346 0.039188
## 14 0.7945 0.7399 0.031562 0.039111
## 15 0.7965 0.7424 0.031692 0.039413
## 16 0.7924 0.7374 0.031341 0.038862
## 17 0.8026 0.7501 0.037654 0.046857 *
## 18 0.7965 0.7424 0.028727 0.035438
## 19 0.7955 0.7411 0.026423 0.032643
## 20 0.7843 0.7271 0.025074 0.030380
## 21 0.7955 0.7412 0.026377 0.032533
##
## The top 5 variables (out of 17):
## magnet_belt_y, pitch_forearm, roll_dumbbell, pitch_dumbbell, roll_forearm
predictors(results)
## [1] "magnet_belt_y" "pitch_forearm" "roll_dumbbell"
## [4] "pitch_dumbbell" "roll_forearm" "roll_arm"
## [7] "gyros_belt_z" "yaw_arm" "accel_forearm_z"
## [10] "magnet_forearm_y" "magnet_forearm_x" "gyros_arm_y"
## [13] "gyros_belt_x" "yaw_forearm" "magnet_arm_z"
## [16] "gyros_forearm_y" "total_accel_arm"
plot(results, type=c("g", "o"), main ="RFE: Predictor Selection for Model", xlab="Number of Predictors (after correlated predictors removed)")
#Removing "non-selected" predictors from dataset
data7<-data5[,predictors(results)[1:15]]
data7$classe<-data5$classe
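For reference, the rounding comparison described earlier can be read directly from the resampling table stored in the rfe results object; a minimal sketch assuming the results object from the chunk above:
#Accuracy by subset size, rounded to two decimal places, to compare 15 variables
#against the selected 17
cbind(Variables=results$results$Variables, Accuracy=round(results$results$Accuracy,2))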
The adjusted dataset (data7) was then partitioned into a training dataset for modeling and a testing dataset. Random Forest (RF) and GBM Boost modeling were used - these methods tend to perform well in classification problems such as this one.
The Random Forest modeling function took a considerable amount of time to run, even with predictor reduction and limited cross validation. It was found that a training set of only 25% of the dataset gave an excellent model (about 94% accuracy in predicting on testing data and 19/20 correct predictions in the validation step). Increasing the training set to 50% of the dataset took much longer to run and made only a small improvement (about 97% accuracy on testing data and 20/20 in validation); to save time in preparing this report, only the model trained on 25% of the dataset is presented in the code chunks below - changing p=0.25 to p=0.5 in the createDataPartition line of the Random Forest chunk is all that is needed to get the slightly better model.
The GBM Boost modeling function ran much faster than the one for Random Forest. The same dataset was used, but the training dataset partition was a more usual 70%. The GBM model did not perform as well as the Random Forest model, so the latter was chosen for the validation step.
In both modeling efforts, cross validation was used: k-fold without repeats, with k = 3. Repeated cross validation and/or larger values of k did not appear to give much benefit and slowed the modeling functions down considerably when tested; those runs are not shown in the chunks below (an illustrative repeated cross validation control appears, commented out, in the Random Forest chunk).
See the next code chunk below for details.
The testing data from the partition of data7 (not the test data used for the validation discussed below) were used to predict activity (classe) with the models generated.
See the code chunk that follows for details.
Please note that here, fraction error = 1 - fraction accuracy. Errors determined by checking a model against the training data used to generate it are really in-sample errors, though some might say that if cross validation was used this gives an “estimated” out-sample error. A better accuracy/out-sample error value is the one determined by checking the model against testing data (the more representative the testing data, the better the error estimate, of course); this is not the test data used for the validation discussed below, but the testing data from partitioned data7. The testing data (from data7) are not used in generating the models, nor is the test data used for validation. Accuracy and out-sample error based on the validation are given in the section on validation; since they are based on fewer cases, they are probably not as reliable as those based on the testing data from data7.
Relevant values are reported below, and figures are provided to illustrate them. The figures showing out-sample error are based on the testing data (from data7, not the validation data); the red line on these figures represents the overall out-sample error across all activities based on that same data.
See the code chunk that follows for details.
#Generating datasets for Random Forest modeling
set.seed(2)
intrain<-createDataPartition(y=data7$classe,p=0.25,list=FALSE)
training<-data7[intrain,]
testing<-data7[-intrain,]
#Random Forest Modeling and Prediction Testing
cntrl<-trainControl(method="cv", number=3)
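#Alternative resampling control that was tested but not retained (see discussion
#above); the settings shown here are illustrative only:
#cntrl<-trainControl(method="repeatedcv", number=10, repeats=3)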
set.seed(8)
modfit2<-train(classe~.,method="rf",data=training, prox=TRUE,trControl=cntrl)
modfit2
## Random Forest
##
## 4907 samples
## 15 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 3272, 3270, 3272
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9213334 0.9004584 0.007447223 0.009315407
## 8 0.9123672 0.8891556 0.006535183 0.008225308
## 15 0.8972878 0.8700916 0.004074965 0.005093638
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
modfit2$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.87%
## Confusion matrix:
## A B C D E class.error
## A 1359 22 7 4 3 0.02580645
## B 45 858 36 8 3 0.09684211
## C 8 26 797 21 4 0.06892523
## D 6 3 53 736 6 0.08457711
## E 1 8 13 11 869 0.03658537
pred<-predict(modfit2,newdata=testing)
table(pred,testing$classe)
##
## pred A B C D E
## A 4108 125 14 15 7
## B 40 2529 111 17 37
## C 20 122 2368 157 60
## D 15 45 67 2207 35
## E 2 26 6 16 2566
confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4108 125 14 15 7
## B 40 2529 111 17 37
## C 20 122 2368 157 60
## D 15 45 67 2207 35
## E 2 26 6 16 2566
##
## Overall Statistics
##
## Accuracy : 0.9363
## 95% CI : (0.9323, 0.9402)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9194
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9816 0.8883 0.9228 0.9150 0.9486
## Specificity 0.9847 0.9827 0.9705 0.9868 0.9958
## Pos Pred Value 0.9623 0.9250 0.8684 0.9316 0.9809
## Neg Pred Value 0.9926 0.9735 0.9835 0.9834 0.9885
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2792 0.1719 0.1609 0.1500 0.1744
## Detection Prevalence 0.2901 0.1858 0.1853 0.1610 0.1778
## Balanced Accuracy 0.9832 0.9355 0.9466 0.9509 0.9722
#Generating points and Plotting of Random Forest model performance
tf<-as.data.frame(table(pred,testing$classe))
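#as.data.frame(table(...)) varies pred fastest, so rows 1-5 of tf correspond to
#actual class A, rows 6-10 to class B, and so on; rows 1, 7, 13, 19 and 25 hold
#the diagonal (correctly predicted) counts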
aa<-1-(tf[1,3]/sum(tf[1:5,3]))
bb<-1-(tf[7,3]/sum(tf[6:10,3]))
cc<-1-(tf[13,3]/sum(tf[11:15,3]))
dd<-1-(tf[19,3]/sum(tf[16:20,3]))
ee<-1-(tf[25,3]/sum(tf[21:25,3]))
ov<-1-(sum(pred==testing$classe)/nrow(testing))
print("Random Forest Model: Fraction Out-Sample Error for Testing Data Predictions =")
## [1] "Random Forest Model: Fraction Out-Sample Error for Testing Data Predictions ="
ov
## [1] 0.06367652
tfx<-vector(mode="character", length=5)
tfx<-tf[1:5,1]
tfy<-c(aa,bb,cc,dd,ee)
plot(tfy, type="o", col="blue", xaxt="n",main="RF Model: Testing Dataset Predictions", ylab="Fraction Out-Sample Error",xlab="Actual Activity (Classe)")
abline(h=ov, col="red",lty=2)
axis(1, at=1:5, lab=(tfx))
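To make the in-sample versus out-sample distinction above concrete, here is a minimal sketch of the in-sample (resubstitution) error for the Random Forest model, assuming modfit2 and the RF training dataset as above:
#In-sample error: predictions on the same data used to fit the model; compare with
#the OOB error estimate reported by the finalModel and the out-sample error above
predtrain<-predict(modfit2,newdata=training)
1-(sum(predtrain==training$classe)/nrow(training))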
#Generating datasets for GBM Boost modeling
set.seed(3)
intrain<-createDataPartition(y=data7$classe,p=0.70,list=FALSE)
training<-data7[intrain,]
testing<-data7[-intrain,]
#GBM Boost Modeling and Prediction Testing
cntrl<-trainControl(method="cv", number=3)
set.seed(9)
modfit3<-train(classe~.,method="gbm",data=training,verbose=FALSE,trControl=cntrl)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
modfit3
## Stochastic Gradient Boosting
##
## 13737 samples
## 15 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 9158, 9158, 9158
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##   1                   50      0.6427895  0.5435266  0.019734054  0.026325890
##   1                  100      0.6906166  0.6068652  0.005242835  0.006743810
##   1                  150      0.7164592  0.6403566  0.001831513  0.002208237
##   2                   50      0.7334207  0.6615579  0.006353310  0.007748223
##   2                  100      0.7871442  0.7303739  0.005696266  0.007158529
##   2                  150      0.8214312  0.7740326  0.006571032  0.008300423
##   3                   50      0.7797190  0.7208023  0.007387638  0.009396124
##   3                  100      0.8359176  0.7923464  0.006975902  0.008852925
##   3                  150      0.8644537  0.8284946  0.008213065  0.010372466
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
modfit3$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 15 predictors of which 15 had non-zero influence.
pred<-predict(modfit3,newdata=testing)
table(pred,testing$classe)
##
## pred A B C D E
## A 1574 130 15 4 9
## B 34 879 101 27 75
## C 23 71 860 85 45
## D 31 42 36 838 33
## E 12 17 14 10 920
confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1574 130 15 4 9
## B 34 879 101 27 75
## C 23 71 860 85 45
## D 31 42 36 838 33
## E 12 17 14 10 920
##
## Overall Statistics
##
## Accuracy : 0.8617
## 95% CI : (0.8526, 0.8704)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8249
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9403 0.7717 0.8382 0.8693 0.8503
## Specificity 0.9625 0.9501 0.9539 0.9711 0.9890
## Pos Pred Value 0.9088 0.7876 0.7934 0.8551 0.9455
## Neg Pred Value 0.9759 0.9455 0.9654 0.9743 0.9670
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2675 0.1494 0.1461 0.1424 0.1563
## Detection Prevalence 0.2943 0.1896 0.1842 0.1665 0.1653
## Balanced Accuracy 0.9514 0.8609 0.8961 0.9202 0.9196
#Generating points and Plotting of GBM Boost model performance
tf<-as.data.frame(table(pred,testing$classe))
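#The layout of tf is the same as in the Random Forest plotting chunk above:
#rows 1, 7, 13, 19 and 25 are the diagonal (correctly predicted) counts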
aa<-1-(tf[1,3]/sum(tf[1:5,3]))
bb<-1-(tf[7,3]/sum(tf[6:10,3]))
cc<-1-(tf[13,3]/sum(tf[11:15,3]))
dd<-1-(tf[19,3]/sum(tf[16:20,3]))
ee<-1-(tf[25,3]/sum(tf[21:25,3]))
ov<-1-(sum(pred==testing$classe)/nrow(testing))
print("GBM Boost Model: Fraction Out-Sample Error for Testing Data Predictions=")
## [1] "GBM Boost Model: Fraction Out-Sample Error for Testing Data Predictions="
ov
## [1] 0.1383178
tfx<-vector(mode="character", length=5)
tfx<-tf[1:5,1]
tfy<-c(aa,bb,cc,dd,ee)
plot(tfy, type="o", col="blue", xaxt="n",main="GBM Boost Model: Testing Dataset Predictions", ylab="Fraction Out-Sample Error",xlab="Actual Activity (Classe)")
abline(h=ov, col="red",lty=2)
axis(1, at=1:5, lab=(tfx))
From the accuracy and out-sample errors, it was determined that the Random Forest model was better than the GBM Boost model. The Random Forest model (RF) was checked using the test data provided for validation (20 cases) - this is not the testing data generated from the partition of data7.
As noted above, two RF models were actually tested in the validation step. One got 19/20 correct and the other 20/20, corresponding to 0.95 fraction accuracy/0.05 fraction out-sample error and 1.0 fraction accuracy/0.0 fraction out-sample error, respectively. Only the former is detailed in this report, as explained above.
See the code chunk that follows. Validation itself was performed through the Coursera course website, so this part of the code is not run here.
#Generating Files for Model Validation - code commented out since not needed for report
#data100<-read.csv("pml-testing.csv")
#answers<-predict(modfit2,newdata=data100)
#answers
# pml_write_files = function(x){
# n = length(x)
# for(i in 1:n){
# filename = paste0("problem_id_",i,".txt")
# write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
# }
# }
# pml_write_files(answers)
The Random Forest modeling resulted in a model with high accuracy and low out-sample error, even though shortcuts were taken to avoid lengthy processing time.