Practical Machine Learning

Prediction Assignment Writeup

Juan Carlos Bretal Fernandez, December 2014

This is my analysis and solution to the problem. Random Forest is the best-performing method for this large dataset with many predictors, and some preprocessing is necessary. The `pml-testing.csv` file does not contain all of the columns, so I chose to remove the "unused" columns from both sets. Training is slow but very accurate; see the confusion matrix of the validation set at the bottom of this file.


# load data and libraries
library(caret)
library(doMC)
registerDoMC(cores = 2)
set.seed(2)

datos <- read.csv("~/Descargas/pml-training.csv")
training <- subset(datos,select=-c(X,
                                user_name,
                                raw_timestamp_part_1,
                                raw_timestamp_part_2,
                                cvtd_timestamp))

test <- read.csv("~/Descargas/pml-testing.csv")
testing <- subset(test,select=-c(X,
                                user_name,
                                raw_timestamp_part_1,
                                raw_timestamp_part_2,
                                cvtd_timestamp,
                                problem_id))

# compare near-zero-variance (unused) columns in the training vs. test set
nzvTrain <- nearZeroVar(training)
nzvTest <- nearZeroVar(testing)
nzvTest %in% nzvTrain
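To make the near-zero-variance idea concrete, here is a minimal base-R sketch of what `caret::nearZeroVar` checks; the helper `near_zero` and its thresholds are hypothetical simplifications, not caret's exact implementation:

```r
# Hypothetical simplified version of caret::nearZeroVar: flag columns whose
# most common value dominates (high frequency ratio, few unique values).
near_zero <- function(df, freqCut = 95/19, uniqueCut = 10) {
  flag <- sapply(df, function(x) {
    tab <- sort(table(x), decreasing = TRUE)
    freqRatio <- if (length(tab) > 1) tab[1] / tab[2] else Inf
    pctUnique <- 100 * length(unique(x)) / length(x)
    freqRatio > freqCut && pctUnique < uniqueCut
  })
  which(flag)
}

d <- data.frame(constant = rep(1, 100),        # zero variance
                mostly   = c(rep(0, 99), 1),   # near-zero variance
                varied   = seq_len(100))       # informative
near_zero(d)  # flags 'constant' and 'mostly', keeps 'varied'
```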

filteredTraining <- training[, -nzvTest]  # remove columns unused in the test set
dim(filteredTraining)

set.seed(2)

# partition the training set to create a validation set
inTrain <- createDataPartition(filteredTraining$classe, p = 0.6)[[1]]
trainFit <- filteredTraining[inTrain, ]
testFit  <- filteredTraining[-inTrain, ]
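What `createDataPartition` does, sketched in base R under my own simplified assumptions (the helper `stratified_split` is hypothetical): it samples within each class so the 60/40 split preserves the class proportions.

```r
# Hypothetical base-R version of a stratified split: sample p of the
# indices within each class, so class balance is preserved.
stratified_split <- function(y, p = 0.6) {
  idx <- unlist(lapply(split(seq_along(y), y),
                       function(i) sample(i, round(length(i) * p))))
  sort(idx)
}

set.seed(2)
y <- rep(c("A", "B"), times = c(60, 40))
train_idx <- stratified_split(y, 0.6)
table(y[train_idx])  # 36 A and 24 B: the 60/40 class ratio is preserved
```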

# fit to random forest with preprocessing
modFit <- train(classe ~ ., method = "rf",
                allowParallel = TRUE,
                data = trainFit,
                preProcess = c("center", "scale"))
print(modFit$finalModel)

Call:
 randomForest(x = x, y = y, mtry = param$mtry, allowParallel = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.38%
Confusion matrix:
     A    B    C    D    E  class.error
A 3347    0    0    0    1 0.0002986858
B    9 2267    3    0    0 0.0052654673
C    0   11 2043    0    0 0.0053554041
D    0    0   13 1915    2 0.0077720207
E    0    0    0    6 2159 0.0027713626
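As a sanity check, the reported 0.38% OOB estimate is simply the off-diagonal share of the confusion matrix above, which can be verified in base R:

```r
# Rebuild the OOB confusion matrix printed above and recompute the error rate:
# errors are the off-diagonal counts, so error = 1 - sum(diag) / sum(all).
cm <- matrix(c(3347,    0,    0,    0,    1,
                  9, 2267,    3,    0,    0,
                  0,   11, 2043,    0,    0,
                  0,    0,   13, 1915,    2,
                  0,    0,    0,    6, 2159),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))
oob <- 1 - sum(diag(cm)) / sum(cm)
round(100 * oob, 2)  # 0.38
```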


This is very good accuracy. I will use this model to predict the test set in the file pml-testing.csv.


filteredTest <- testing[, -nzvTest]  # remove the same unused columns
# note: no classe column is needed, since predict() only uses the predictors

# prediction over the test set pml-testing.csv
prediction <- predict(modFit, newdata=filteredTest)

______________________________________________
prediction
 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
______________________________________________

Prediction over the held-out partition of the training set (40%) for additional cross-validation. Total accuracy is 99.5%, which is an excellent value.


# to calculate accuracy and confusion matrix
predict1 <- predict(modFit, testFit)
cm <- confusionMatrix(predict1, testFit$classe)

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 2232    8    0    0    0
         B    0 1508   11    0    0
         C    0    2 1357    9    0
         D    0    0    0 1277    7
         E    0    0    0    0 1435

Overall Statistics
               Accuracy : 0.9953
                 95% CI : (0.9935, 0.9967)
    No Information Rate : 0.2845
    P-Value [Acc > NIR] : < 2.2e-16
                  Kappa : 0.994
 Mcnemar's Test P-Value : NA

Statistics by Class:
                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9934   0.9920   0.9930   0.9951
Specificity            0.9986   0.9983   0.9983   0.9989   1.0000
Pos Pred Value         0.9964   0.9928   0.9920   0.9945   1.0000
Neg Pred Value         1.0000   0.9984   0.9983   0.9986   0.9989
Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
Detection Rate         0.2845   0.1922   0.1730   0.1628   0.1829
Detection Prevalence   0.2855   0.1936   0.1744   0.1637   0.1829
Balanced Accuracy      0.9993   0.9958   0.9951   0.9960   0.9976
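The overall accuracy and the per-class sensitivities reported by `confusionMatrix` can be reproduced directly from the counts with base R (accuracy = diagonal / total; sensitivity = diagonal / column total, since columns are the reference classes):

```r
# Validation-set confusion matrix from above (rows = prediction, cols = reference).
cm2 <- matrix(c(2232,    8,    0,    0,    0,
                   0, 1508,   11,    0,    0,
                   0,    2, 1357,    9,    0,
                   0,    0,    0, 1277,    7,
                   0,    0,    0,    0, 1435),
              nrow = 5, byrow = TRUE,
              dimnames = list(LETTERS[1:5], LETTERS[1:5]))

acc  <- sum(diag(cm2)) / sum(cm2)   # correct predictions / all predictions
sens <- diag(cm2) / colSums(cm2)    # per-class true-positive rate

round(acc, 4)   # 0.9953
round(sens, 4)  # 1.0000 0.9934 0.9920 0.9930 0.9951
```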

Conclusion: Random Forest with preprocessing is the most accurate method, but training on 60% of the training set takes too long to run daily; there may be a faster approach. One option is a drastic reduction of the predictors to only the most important ones, which would also shrink the training set well below its current size of about 11 MB.
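One possible sketch of that reduction, assuming the fitted `modFit` and `trainFit` from above are still in the workspace (the cutoff of 20 predictors and the object names `top`/`modFitSmall` are my own illustrative choices, not tested here):

```r
# Keep only the most important predictors by caret's varImp ranking,
# then refit on the smaller data frame (trading some accuracy for speed).
imp <- varImp(modFit)$importance
top <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]

smallTrain  <- trainFit[, c(top, "classe")]
modFitSmall <- train(classe ~ ., method = "rf",
                     allowParallel = TRUE,
                     data = smallTrain)
```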

Here are some plots generated during the analysis.

The importance of the variables.

Accuracy vs. number of predictors.

Error vs. number of trees.