
    Data Analysis Assignment 2

    Introduction

    During the last few years, there has been an enormous and exciting development of Activity-Based Computing and Human Activity Recognition. This has been enabled by the advent of miniaturized sensing technology that can be worn directly by individuals. These devices measure variables that capture movement and allow researchers to predict the type of activity being performed. As recent research has shown, human activity can be predicted using a single tri-axial accelerometer [1]. In particular, smartphones equipped with sensors such as gyroscopes and accelerometers have been used to experimentally measure changes in movement parameters and correlate them with activities, which makes it possible to predict human activity accurately. The relevance of this kind of research lies in the possibility of developing smartphones that anticipate the services their users require.

    In this paper we provide a predictive model for human activity, using the Human Activity Recognition Using Smartphones Dataset [2]. The data were collected in an experiment carried out with a group of 30 volunteers aged 19-48, during which each person performed six activities: a) walking, b) walking upstairs, c) walking downstairs, d) sitting, e) standing, and f) laying. We used a random forest to detect relevant variables and then built a predictive tree. We also used pruning to obtain a smaller model that is easier to interpret.

    Methods

    First, we renamed the variables in order to avoid name duplications. Then we partitioned the data into a Train Set and a Test Set. Our Train Set included the data collected for subjects 1, 3, 5, 6 and 7, and our Test Set included the data collected for subjects 27, 28, 29 and 30. We also used a Validation Set with the data for subjects 8, 9, 10 and 11 (a minimal sketch of this preparation step is given below). We used a combination of predictive methods: random forest, predictive tree and pruning. The random forest, developed by Leo Breiman and Adele Cutler [3], is a very efficient algorithm that uses model aggregation ideas and ensemble methods for both classification and regression problems. As Genuer et al. explain, the principle of random forests is to combine many binary decision trees built using several bootstrap samples drawn from the learning sample L, choosing randomly at each node a subset of explanatory variables X [4].
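
    The partitioning can be sketched as follows. This is a minimal illustration, assuming the observations have already been assembled into one data frame (here called har, with an activity label and a subject identifier); the object names are ours rather than the original script's:

    library(randomForest)   # random forests (Breiman and Cutler)
    library(tree)           # classification trees, pruning and snip.tree

    # Rename the 561 feature columns to V1 ... V561 to avoid duplicated names
    names(har)[1:561] <- paste0("V", 1:561)

    # Partition by subject, as described above (the subject column should be
    # dropped before the models are fitted)
    train.set <- subset(har, subject %in% c(1, 3, 5, 6, 7))
    valid.set <- subset(har, subject %in% c(8, 9, 10, 11))
    test.set  <- subset(har, subject %in% c(27, 28, 29, 30))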

    Results

    As a first step to train the predictive model, we used a Random Forest on our Training Set, allowing all variables as predictors (a sketch of this call follows the output below). The results were:

    randomForest (formula = as.factor(activity) ~ ., data = train.set, proximity = TRUE)

    Type of random forest: classification

    Number of trees: 500

    No. of variables tried at each split: 23

    OOB estimate of error rate: 1.29%
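
    This fit can be reproduced along the following lines; a sketch only, since the original seed is not reported:

    set.seed(1234)   # arbitrary seed; the seed used in the original run is not reported
    rf <- randomForest(as.factor(activity) ~ ., data = train.set, proximity = TRUE)
    print(rf)        # shows the number of trees, mtry and the OOB error estimate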


    The accuracy of the random forest classification is high, but the resulting model is complicated (500 trees, with 23 variables tried at each split) and also carries the risk of overfitting. Therefore, we decided to use this tool only to detect relevant variables, following the advice of the literature [4], and to use them in a simpler tree, with fewer variables and easier to interpret. The random forest gives a measure of importance, called MeanDecreaseGini, for each variable in the prediction trees it builds. We used this measure to select the most relevant variables: we kept the variables with an importance value of 11% or more, which resulted in 23 variables being included in our predictive model. With these selected variables we then fitted a new tree (a sketch of this selection and refitting step follows the output below). The results were:

    Classification tree using the Random Forest to select relevant variables:

    tree(formula = as.factor(activity) ~ V42 + V57 + V560 + V41 + V51 + V54 + V53 + V559 + V50 + V58 + V10 +

    V382 + V505 + V394 + V4 + V390 + V348 + V228 + V232 + V70 + V97 + V97, data = train.set)

    Variables actually used in tree construction: [1] "V382" "V57" "V51" "V505" "V70" "V53" "V58"

    Number of terminal nodes: 9

    Residual mean deviance: 0.3687 = 595.2 / 1614

    Misclassification error rate: 0.06654 = 108 / 1623
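
    A sketch of this selection and refitting step is given below; reading the 11% rule as a threshold relative to the largest importance score is our interpretation, not a quote of the original code:

    # Rank variables by MeanDecreaseGini and keep those above the threshold
    imp      <- importance(rf)[, "MeanDecreaseGini"]
    selected <- names(imp)[imp >= 0.11 * max(imp)]

    # Refit a single classification tree on the selected variables
    form     <- as.formula(paste("as.factor(activity) ~",
                                 paste(selected, collapse = " + ")))
    har.tree <- tree(form, data = train.set)
    summary(har.tree)   # terminal nodes, residual mean deviance, error rate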

    This tree showed a low misclassification error rate (0.066). After this, we fitted a tree with only the variables that were actually used (seven variables) and applied pruning (with best = 6) to get the smallest model that still fits our data, the goal being a model that is easier to use and interpret (a sketch of this step is given below).
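
    A sketch of this pruning step, using the seven variables listed in the tree output above (object names are ours):

    used  <- c("V382", "V57", "V51", "V505", "V70", "V53", "V58")
    form2 <- as.formula(paste("as.factor(activity) ~",
                              paste(used, collapse = " + ")))
    full.tree   <- tree(form2, data = train.set)
    pruned.tree <- prune.misclass(full.tree, best = 6)   # keep 6 terminal nodes
    summary(pruned.tree)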

    Our predictive model is as follows:

    E(HA) = V382 + V57 + V51 + V505 + V70

    E(HA) is the Expected Human Activity

    V382 is the Body Acceleration Jerk bands Energy (1,8)

    V57 is the Gravity Acceleration Energy in X

    V51 is the Gravity Acceleration Maximum in Z

    V505 is Body Acceleration Magnitude Median Absolute Deviation

    V70 is the Gravity Acceleration Autoregression Coefficient in Y


    Figure 1. Predictive Tree for Human Activity Recognition Using Smartphones

    As the figure of the tree shows, the variable of Body Acceleration Jerk allows us to differentiate two clusters of activities: on one side we get standing, sitting and laying, and on the other side we get walking, walking up and walking down. Within the left cluster, the variable Gravity Acceleration Energy (V57) allows us to separate laying from standing and sitting, and these latter two are separated through the variable Gravity Acceleration Maximum in Z. Within the right cluster, the variable that measures the Median Absolute Deviation of Body Acceleration Magnitude differentiates walking up and walking from walking down. Then walking up is separated from walking by the variable Gravity Acceleration Autoregression Coefficient in Y.
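
    A plot like Figure 1 can be drawn directly from the pruned tree object; for instance:

    plot(pruned.tree)                # draw the tree structure
    text(pruned.tree, pretty = 0)    # label the splits and the leaf classes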

    Once we had established our predictive model, we validated it using our Validation Set.

    Classification tree:
    snip.tree(tree = ValidTree, nodes = 12L)

    Number of terminal nodes: 6

    Residual mean deviance: 0.289 = 41.33 / 143

    Misclassification error rate: 0.04027 = 6 / 149


    Table 1. Confusion matrix using Validation Set

    Laying Sitting Standing Walk Walkdown Walkup

    Laying 28 0 0 0 0 0

    Sitting 0 23 0 0 0 0

    Standing 0 0 26 0 0 0

    Walk 0 0 0 23 1 3

    Walkdown 0 0 0 1 19 2

    Walkup 0 0 0 0 0 24
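
    A confusion matrix such as Table 1 can be produced by predicting on the held-out subjects; a minimal sketch, assuming the activity label is a factor with the six activity names and reading the rows as predicted classes:

    valid.pred <- predict(pruned.tree, newdata = valid.set, type = "class")
    conf.valid <- table(predicted = valid.pred, actual = valid.set$activity)
    conf.valid                                      # confusion matrix
    1 - sum(diag(conf.valid)) / sum(conf.valid)     # misclassification error rate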

    The misclassification error rate was very low (0.04), so we considered our model validated and proceeded to test it on the actual Test Set.

    Classification tree:

    snip.tree(tree = TestTree, nodes = c(12L, 11L, 7L))

    Variables actually used in tree construction:

    [1] "V382" "V57" "V505" "V70"

    Number of terminal nodes: 6

    Residual mean deviance: 0.5751 = 850.5 / 1479

    Misclassification error rate: 0.1051 = 156 / 1485

    Table 2. Confusion matrix for Test Set using Predictive Model

    Laying Sitting Standing Walk Walkdown Walkup

    Laying 293 0 0 0 0 0

    Sitting 0 204 60 0 0 0

    Standing 0 0 283 0 0 0

    Walk 0 0 0 209 0 20

    Walkdown 0 0 0 3 189 8

    Walkup 0 0 0 3 62 151
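
    The reported test error rate can be recovered directly from Table 2 (reading the rows as predicted classes is our assumption):

    conf.test <- matrix(c(293,   0,   0,   0,   0,   0,
                            0, 204,  60,   0,   0,   0,
                            0,   0, 283,   0,   0,   0,
                            0,   0,   0, 209,   0,  20,
                            0,   0,   0,   3, 189,   8,
                            0,   0,   0,   3,  62, 151),
                        nrow = 6, byrow = TRUE,
                        dimnames = list(c("Laying", "Sitting", "Standing",
                                          "Walk", "Walkdown", "Walkup"),
                                        c("Laying", "Sitting", "Standing",
                                          "Walk", "Walkdown", "Walkup")))
    1 - sum(diag(conf.test)) / sum(conf.test)   # 156 / 1485 = 0.105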

    The error rate we got in the Test Set (0.105) was higher than the error rate obtained on our validation data. Nevertheless, the error rate is still low, and the model is quite simple, easy to interpret and computationally fast.

    Conclusions

    We were able to construct a predictive model for human activity recognition using only six variables measuring movement, obtained from miniaturized sensing technology located in a smartphone. The methods used involved a random forest for variable selection, trees for prediction and pruning to lower the number of variables. A limitation of our model is that we did not explore the problem of high correlation between the predictors used.


    References

    1. Khan, Adil Mehmood, Human Activity Recognition Using a Single Tri-axial Accelerometer, PhD Thesis, Kyung Hee University, South Korea, 2011.

    2. Anguita, Davide; Ghio, Alessandro; Oneto, Luca; Parra, Xavier and Reyes-Ortiz, Jorge L., Human Activity Recognition on Smartphones Using a Multiclass Hardware-Friendly Support Vector Machine, in International Workshop of Ambient Assisted Living (IWAAL 2012), Vitoria-Gasteiz, Spain, Dec. 2012.

    3. Breiman, Leo, Random Forests, in Machine Learning 45 (1), pp. 5-32, 2001.

    4. Genuer, Robin; Poggi, Jean-Michel and Tuleau-Malot, Christine, Variable Selection Using Random Forests, in Pattern Recognition Letters 31 (14), pp. 2225-2236, 2010.