Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to predict the manner in which the exercise was performed (the classe variable) using data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: correctly (classe = A), throwing the elbows to the front (classe = B), lifting the dumbbell only halfway (classe = C), lowering the dumbbell only halfway (classe = D), and throwing the hips to the front (classe = E). More information is available from the website http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data (used to generate training and testing datasets, see below) for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data (perhaps more correctly designated as the validation data, see below) are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har
The procedure used may be summarized in the following steps: (1) loading data; (2) cleaning data; (3) pre-processing of data; (4) generating models using training data with cross validation; (5) predictions using testing data; (6) determining accuracy/out-sample error; and (7) validating the best model. Details will be given below in text and in commentary in code chunk sections.
Data were loaded from the training CSV file provided. The raw data had been examined offline in spreadsheet form, and 52 predictors plus the outcome (classe) were selected; predictors with incomplete data were omitted. data4 represents the “cleaned” dataset. See the code chunk that follows.
#Project for Coursera Practical Machine Learning Class - Weight Lifting Data
#Using caret package
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
#Loading Data
data1<-read.csv("pml-training.csv")
#Raw data examined offline and 52 predictors selected plus outcome labeled "classe"
#This effectively creates the "cleaned" dataset
roll<-grep("^roll",(names(data1)))
pitch<-grep("^pitch",(names(data1)))
yaw<-grep("^yaw",(names(data1)))
total<-grep("^total",(names(data1)))
gyros<-grep("^gyros",(names(data1)))
accel<-grep("^accel",(names(data1)))
magnet<-grep("^magnet",(names(data1)))
classe<-grep("^classe",(names(data1)))
data2<-c(roll,pitch,yaw,total,gyros,accel,magnet,classe)
data3<-sort(data2)
#The "cleaned" dataset
data4<-data1[,data3]
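As an optional sanity check on the offline column selection, one can confirm programmatically that the retained columns contain no missing or blank values; a minimal sketch assuming data4 as built above:
#Optional check: every retained column should be free of NA and blank entries
all(colSums(is.na(data4) | data4 == "") == 0)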
The cleaned data were then pre-processed by removing highly correlated predictors, using a correlation cutoff of 0.5. This reduced the number of predictors from 52 to 21, which makes the modeling much faster and the resulting models simpler and easier to handle.
Next, predictors for modeling were selected from those remaining using the rfe (recursive feature elimination) function. This function estimates model accuracy as a function of the number of variables/predictors used, successively eliminating predictors from lowest to highest importance. It was determined that using the 15 most important predictors was best - this reduction from 21 also reduces the time required to generate models. A figure is provided illustrating accuracy versus number of predictors. Note that, when rounded to two decimal places, the accuracy for 15 variables is the same as for 17 (which has the highest accuracy), and 15 is the smallest number of variables for which this is true, so only 15 variables were used.
The dataset was adjusted accordingly - the result is data7. See the code chunk that follows.
#Pre-processing the data
#Finding and removing highly correlated predictors
correlationMatrix <- cor(data4[,1:52])
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.5)
hc<-sort(highlyCorrelated)
nhc<-c(1:53)
nhc<-nhc[-hc]
data5<-data4[,nhc]
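#Optional check (a minimal sketch, assuming data5 as created above): largest remaining
#absolute pairwise correlation among predictors after applying the 0.5 cutoff;
#classe is the last column of data5 and is excluded here
remCor<-cor(data5[,-ncol(data5)])
max(abs(remCor[upper.tri(remCor)]))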
#Generating small training set for predictor selection (after removal of correlated predictors)
set.seed(1)
intrain<-createDataPartition(y=data5$classe,p=0.05,list=FALSE)
training<-data5[intrain,]
testing<-data5[-intrain,]
dim(training)
## [1] 983 22
#Predictor selection
set.seed(7)
library(mlbench)
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
control <- rfeControl(functions=rfFuncs, method="cv", number=3)
results <- rfe(training[,1:21], training[,22], sizes=c(1:21), rfeControl=control)
print(results)
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (3 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 1 0.3235 0.1276 0.008458 0.013912
## 2 0.4588 0.3089 0.014978 0.022329
## 3 0.6022 0.4943 0.017530 0.024614
## 4 0.6460 0.5505 0.045736 0.058469
## 5 0.7080 0.6298 0.003295 0.003952
## 6 0.7396 0.6702 0.022870 0.028977
## 7 0.7548 0.6898 0.015500 0.019028
## 8 0.7650 0.7026 0.014304 0.018104
## 9 0.7904 0.7348 0.011028 0.013886
## 10 0.7904 0.7344 0.028568 0.036030
## 11 0.7853 0.7282 0.023507 0.028969
## 12 0.7894 0.7333 0.019437 0.023955
## 13 0.7894 0.7333 0.031346 0.039188
## 14 0.7945 0.7399 0.031562 0.039111
## 15 0.7965 0.7424 0.031692 0.039413
## 16 0.7924 0.7374 0.031341 0.038862
## 17 0.8026 0.7501 0.037654 0.046857 *
## 18 0.7965 0.7424 0.028727 0.035438
## 19 0.7955 0.7411 0.026423 0.032643
## 20 0.7843 0.7271 0.025074 0.030380
## 21 0.7955 0.7412 0.026377 0.032533
##
## The top 5 variables (out of 17):
## magnet_belt_y, pitch_forearm, roll_dumbbell, pitch_dumbbell, roll_forearm
predictors(results)
## [1] "magnet_belt_y" "pitch_forearm" "roll_dumbbell"
## [4] "pitch_dumbbell" "roll_forearm" "roll_arm"
## [7] "gyros_belt_z" "yaw_arm" "accel_forearm_z"
## [10] "magnet_forearm_y" "magnet_forearm_x" "gyros_arm_y"
## [13] "gyros_belt_x" "yaw_forearm" "magnet_arm_z"
## [16] "gyros_forearm_y" "total_accel_arm"
plot(results, type=c("g", "o"), main ="RFE: Predictor Selection for Model", xlab="Number of Predictors (after correlated predictors removed)")
#Removing "non-selected" predictors from dataset
data7<-data5[,predictors(results)[1:15]]
data7$classe<-data5$classe
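For reference, the rounding comparison described earlier can be read directly from the resampling table stored in the rfe results object; a minimal sketch assuming the results object from the chunk above:
#Accuracy by subset size, rounded to two decimal places, to compare 15 variables
#against the selected 17
cbind(Variables=results$results$Variables, Accuracy=round(results$results$Accuracy,2))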
The adjusted dataset (data7) was then partitioned into a training dataset for modeling and a testing dataset. Random Forest (RF) and GBM Boost modeling were used - these methods tend to perform well in classification problems such as this one.
The Random Forest modeling function took a considerable amount of time to run, even with predictor reduction and limited cross validation. It was found that a training set of only 25% of the dataset gave an excellent model (about 94% accuracy in predicting on testing data and 19/20 correct predictions in the validation step). Increasing the training set to 50% of the dataset took much longer to run and made only a small improvement (about 97% accuracy on testing data and 20/20 in validation); to save time in preparing this report, only the model trained on 25% of the dataset is presented in the code chunks below - changing p=0.25 to p=0.5 in the createDataPartition line of the Random Forest chunk is all that is needed to get the slightly better model.
The GBM Boost modeling function ran much faster than the one for Random Forest. The same dataset was used, but the training dataset partition was a more usual 70%. The GBM model did not perform as well as the Random Forest model, so the latter was chosen for the validation step.
In both modeling efforts, cross validation was used: k-fold without repeats, with k = 3. Repeated cross validation and/or larger values of k did not appear to give much benefit and slowed the modeling functions down considerably when tested; those runs are not shown in the chunks below (an illustrative repeated cross validation control appears, commented out, in the Random Forest chunk).
See the next code chunk below for details.
The testing data from the partition of data7 (not the test data used for the validation discussed below) were used to predict activity (classe) with the models generated.
See the code chunk that follows for details.
Please note that here, fraction error = 1 - fraction accuracy. Errors determined by checking a model against the training data used to generate it are really in-sample errors, though some might say that if cross validation was used this gives an “estimated” out-sample error. A better accuracy/out-sample error value is the one determined by checking the model against testing data (the more representative the testing data, the better the error estimate, of course); this is not the test data used for the validation discussed below, but the testing data from partitioned data7. The testing data (from data7) are not used in generating the models, nor is the test data used for validation. Accuracy and out-sample error based on the validation are given in the section on validation; since they are based on fewer cases, they are probably not as reliable as those based on the testing data from data7.
Relevant values are reported below, and figures are provided to illustrate them. The figures showing out-sample error are based on the testing data (from data7, not the validation data); the red line on these figures represents the overall out-sample error across all activities based on that same data.
See the code chunk that follows for details.
#Generating datasets for Random Forest modeling
set.seed(2)
intrain<-createDataPartition(y=data7$classe,p=0.25,list=FALSE)
training<-data7[intrain,]
testing<-data7[-intrain,]
#Random Forest Modeling and Prediction Testing
cntrl<-trainControl(method="cv", number=3)
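#Alternative resampling control that was tested but not retained (see discussion
#above); the settings shown here are illustrative only:
#cntrl<-trainControl(method="repeatedcv", number=10, repeats=3)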
set.seed(8)
modfit2<-train(classe~.,method="rf",data=training, prox=TRUE,trControl=cntrl)
modfit2
## Random Forest
##
## 4907 samples
## 15 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 3272, 3270, 3272
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9213334 0.9004584 0.007447223 0.009315407
## 8 0.9123672 0.8891556 0.006535183 0.008225308
## 15 0.8972878 0.8700916 0.004074965 0.005093638
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
modfit2$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 5.87%
## Confusion matrix:
## A B C D E class.error
## A 1359 22 7 4 3 0.02580645
## B 45 858 36 8 3 0.09684211
## C 8 26 797 21 4 0.06892523
## D 6 3 53 736 6 0.08457711
## E 1 8 13 11 869 0.03658537
pred<-predict(modfit2,newdata=testing)
table(pred,testing$classe)
##
## pred A B C D E
## A 4108 125 14 15 7
## B 40 2529 111 17 37
## C 20 122 2368 157 60
## D 15 45 67 2207 35
## E 2 26 6 16 2566
confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4108 125 14 15 7
## B 40 2529 111 17 37
## C 20 122 2368 157 60
## D 15 45 67 2207 35
## E 2 26 6 16 2566
##
## Overall Statistics
##
## Accuracy : 0.9363
## 95% CI : (0.9323, 0.9402)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9194
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9816 0.8883 0.9228 0.9150 0.9486
## Specificity 0.9847 0.9827 0.9705 0.9868 0.9958
## Pos Pred Value 0.9623 0.9250 0.8684 0.9316 0.9809
## Neg Pred Value 0.9926 0.9735 0.9835 0.9834 0.9885
## Prevalence 0.2844 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2792 0.1719 0.1609 0.1500 0.1744
## Detection Prevalence 0.2901 0.1858 0.1853 0.1610 0.1778
## Balanced Accuracy 0.9832 0.9355 0.9466 0.9509 0.9722
#Generating points and Plotting of Random Forest model performance
tf<-as.data.frame(table(pred,testing$classe))
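#as.data.frame(table(...)) varies pred fastest, so rows 1-5 of tf correspond to
#actual class A, rows 6-10 to class B, and so on; rows 1, 7, 13, 19 and 25 hold
#the diagonal (correctly predicted) counts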
aa<-1-(tf[1,3]/sum(tf[1:5,3]))
bb<-1-(tf[7,3]/sum(tf[6:10,3]))
cc<-1-(tf[13,3]/sum(tf[11:15,3]))
dd<-1-(tf[19,3]/sum(tf[16:20,3]))
ee<-1-(tf[25,3]/sum(tf[21:25,3]))
ov<-1-(sum(pred==testing$classe)/nrow(testing))
print("Random Forest Model: Fraction Out-Sample Error for Testing Data Predictions =")
## [1] "Random Forest Model: Fraction Out-Sample Error for Testing Data Predictions ="
ov
## [1] 0.06367652
tfx<-vector(mode="character", length=5)
tfx<-tf[1:5,1]
tfy<-c(aa,bb,cc,dd,ee)
plot(tfy, type="o", col="blue", xaxt="n",main="RF Model: Testing Dataset Predictions", ylab="Fraction Out-Sample Error",xlab="Actual Activity (Classe)")
abline(h=ov, col="red",lty=2)
axis(1, at=1:5, lab=(tfx))
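To make the in-sample versus out-sample distinction above concrete, here is a minimal sketch of the in-sample (resubstitution) error for the Random Forest model, assuming modfit2 and the RF training dataset as above:
#In-sample error: predictions on the same data used to fit the model; compare with
#the OOB error estimate reported by the finalModel and the out-sample error above
predtrain<-predict(modfit2,newdata=training)
1-(sum(predtrain==training$classe)/nrow(training))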
#Generating datasets for GBM Boost modeling
set.seed(3)
intrain<-createDataPartition(y=data7$classe,p=0.70,list=FALSE)
training<-data7[intrain,]
testing<-data7[-intrain,]
#GBM Boost Modeling and Prediction Testing
cntrl<-trainControl(method="cv", number=3)
set.seed(9)
modfit3<-train(classe~.,method="gbm",data=training,verbose=FALSE,trControl=cntrl)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
modfit3
## Stochastic Gradient Boosting
##
## 13737 samples
## 15 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 9158, 9158, 9158
## Resampling results across tuning parameters:
##
##   interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD
##   1                   50      0.6427895  0.5435266  0.019734054  0.026325890
##   1                  100      0.6906166  0.6068652  0.005242835  0.006743810
##   1                  150      0.7164592  0.6403566  0.001831513  0.002208237
##   2                   50      0.7334207  0.6615579  0.006353310  0.007748223
##   2                  100      0.7871442  0.7303739  0.005696266  0.007158529
##   2                  150      0.8214312  0.7740326  0.006571032  0.008300423
##   3                   50      0.7797190  0.7208023  0.007387638  0.009396124
##   3                  100      0.8359176  0.7923464  0.006975902  0.008852925
##   3                  150      0.8644537  0.8284946  0.008213065  0.010372466
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
modfit3$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 15 predictors of which 15 had non-zero influence.
pred<-predict(modfit3,newdata=testing)
table(pred,testing$classe)
##
## pred A B C D E
## A 1574 130 15 4 9
## B 34 879 101 27 75
## C 23 71 860 85 45
## D 31 42 36 838 33
## E 12 17 14 10 920
confusionMatrix(pred,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1574 130 15 4 9
## B 34 879 101 27 75
## C 23 71 860 85 45
## D 31 42 36 838 33
## E 12 17 14 10 920
##
## Overall Statistics
##
## Accuracy : 0.8617
## 95% CI : (0.8526, 0.8704)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8249
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9403 0.7717 0.8382 0.8693 0.8503
## Specificity 0.9625 0.9501 0.9539 0.9711 0.9890
## Pos Pred Value 0.9088 0.7876 0.7934 0.8551 0.9455
## Neg Pred Value 0.9759 0.9455 0.9654 0.9743 0.9670
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2675 0.1494 0.1461 0.1424 0.1563
## Detection Prevalence 0.2943 0.1896 0.1842 0.1665 0.1653
## Balanced Accuracy 0.9514 0.8609 0.8961 0.9202 0.9196
#Generating points and Plotting of GBM Boost model performance
tf<-as.data.frame(table(pred,testing$classe))
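#The layout of tf is the same as in the Random Forest plotting chunk above:
#rows 1, 7, 13, 19 and 25 are the diagonal (correctly predicted) counts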
aa<-1-(tf[1,3]/sum(tf[1:5,3]))
bb<-1-(tf[7,3]/sum(tf[6:10,3]))
cc<-1-(tf[13,3]/sum(tf[11:15,3]))
dd<-1-(tf[19,3]/sum(tf[16:20,3]))
ee<-1-(tf[25,3]/sum(tf[21:25,3]))
ov<-1-(sum(pred==testing$classe)/nrow(testing))
print("GBM Boost Model: Fraction Out-Sample Error for Testing Data Predictions=")
## [1] "GBM Boost Model: Fraction Out-Sample Error for Testing Data Predictions="
ov
## [1] 0.1383178
tfx<-vector(mode="character", length=5)
tfx<-tf[1:5,1]
tfy<-c(aa,bb,cc,dd,ee)
plot(tfy, type="o", col="blue", xaxt="n",main="GBM Boost Model: Testing Dataset Predictions", ylab="Fraction Out-Sample Error",xlab="Actual Activity (Classe)")
abline(h=ov, col="red",lty=2)
axis(1, at=1:5, lab=(tfx))
From the accuracy and out-sample errors, it was determined that the Random Forest model was better than the GBM Boost model. The Random Forest model (RF) was checked using the test data provided for validation (20 cases) - this is not the testing data generated from the partition of data7.
As noted above, two RF models were actually tested in the validation step. One got 19/20 correct and the other 20/20, corresponding to 0.95 fraction accuracy/0.05 fraction out-sample error and 1.0 fraction accuracy/0.0 fraction out-sample error, respectively. Only the former is detailed in this report, as explained above.
See the code chunk that follows. Validation itself was performed through the Coursera course website, so this part of the code is not run here.
#Generating Files for Model Validation - code commented out since not needed for report
#data100<-read.csv("pml-testing.csv")
#answers<-predict(modfit2,newdata=data100)
#answers
# pml_write_files = function(x){
# n = length(x)
# for(i in 1:n){
# filename = paste0("problem_id_",i,".txt")
# write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
# }
# }
# pml_write_files(answers)
The Random Forest modeling resulted in a model with high accuracy and low out-sample error, even though shortcuts were taken to avoid lengthy processing time.