
    Data Analysis Assignment 2

    Introduction

    During the last few years, there has been an enormous and exciting development of Activity-Based Computing and Human Activity Recognition. This has been enabled by the advent of miniaturized sensing technology that can be worn directly by individuals. These devices measure variables that capture movement and allow researchers to predict the type of activity being performed. As recent research has shown, human activity can be predicted using a single tri-axial accelerometer [1]. In particular, smartphones equipped with sensors such as gyroscopes and accelerometers have been used to experimentally measure changes in movement parameters and correlate them with activities, which makes it possible to predict human activity accurately. The relevance of this kind of research lies in the possibility of developing smartphones that anticipate the services their users require.

    In this paper we provide a predictive model for human activity, using the Human Activity Recognition Using Smartphones Dataset [2]. The data were collected in an experiment carried out with a group of 30 volunteers aged 19-48, during which each person performed six activities: a) walking, b) walking upstairs, c) walking downstairs, d) sitting, e) standing, and f) laying. We used a random forest to detect relevant variables and then built a predictive tree. We also used pruning to obtain a smaller model that is easier to interpret.

    Methods

    First, we renamed the variables in order to avoid name duplications. Then we partitioned the data into a Train Set and a Test Set. Our Train Set included the data collected for subjects 1, 3, 5, 6 and 7, and our Test Set included the data collected for subjects 27, 28, 29 and 30. We also used a Validation Set with the data for subjects 8, 9, 10 and 11 (a minimal sketch of this preparation step is given below). We used a combination of predictive methods: random forest, predictive tree and pruning. The random forest, developed by Leo Breiman and Adele Cutler [3], is a very efficient algorithm that uses model aggregation ideas and ensemble methods for both classification and regression problems. As Genuer et al. explain, the principle of random forests is to combine many binary decision trees built using several bootstrap samples drawn from the learning sample L, choosing randomly at each node a subset of explanatory variables X [4].
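
    The partitioning can be sketched as follows. This is a minimal illustration, assuming the observations have already been assembled into one data frame (here called har, with an activity label and a subject identifier); the object names are ours rather than the original script's:

    library(randomForest)   # random forests (Breiman and Cutler)
    library(tree)           # classification trees, pruning and snip.tree

    # Rename the 561 feature columns to V1 ... V561 to avoid duplicated names
    names(har)[1:561] <- paste0("V", 1:561)

    # Partition by subject, as described above (the subject column should be
    # dropped before the models are fitted)
    train.set <- subset(har, subject %in% c(1, 3, 5, 6, 7))
    valid.set <- subset(har, subject %in% c(8, 9, 10, 11))
    test.set  <- subset(har, subject %in% c(27, 28, 29, 30))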

    Results

    As a first step to train the predictive model, we used a Random Forest on our Training Set, allowing all variables as predictors (a sketch of this call follows the output below). The results were:

    randomForest (formula = as.factor(activity) ~ ., data = train.set, proximity = TRUE)

    Type of random forest: classification

    Number of trees: 500

    No. of variables tried at each split: 23

    OOB estimate of error rate: 1.29%
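
    This fit can be reproduced along the following lines; a sketch only, since the original seed is not reported:

    set.seed(1234)   # arbitrary seed; the seed used in the original run is not reported
    rf <- randomForest(as.factor(activity) ~ ., data = train.set, proximity = TRUE)
    print(rf)        # shows the number of trees, mtry and the OOB error estimate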


    The accuracy of the random forest classification is high, but the resulting model is complicated (500 trees, with 23 variables tried at each split) and also carries the risk of overfitting. Therefore, we decided to use this tool only to detect relevant variables, following the advice of the literature [4], and to use them in a simpler tree, with fewer variables and easier to interpret. The random forest gives a measure of importance, called MeanDecreaseGini, for each variable in the prediction trees it builds. We used this measure to select the most relevant variables: we kept the variables with an importance value of 11% or more, which resulted in 23 variables being included in our predictive model. With these selected variables we then fitted a new tree (a sketch of this selection and refitting step follows the output below). The results were:

    Classification tree using the Random Forest to select relevant variables:

    tree(formula = as.factor(activity) ~ V42 + V57 + V560 + V41 + V51 + V54 + V53 + V559 + V50 + V58 + V10 +

    V382 + V505 + V394 + V4 + V390 + V348 + V228 + V232 + V70 + V97 + V97, data = train.set)

    Variables actually used in tree construction: [1] "V382" "V57" "V51" "V505" "V70" "V53" "V58"

    Number of terminal nodes: 9

    Residual mean deviance: 0.3687 = 595.2 / 1614

    Misclassification error rate: 0.06654 = 108 / 1623
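
    A sketch of this selection and refitting step is given below; reading the 11% rule as a threshold relative to the largest importance score is our interpretation, not a quote of the original code:

    # Rank variables by MeanDecreaseGini and keep those above the threshold
    imp      <- importance(rf)[, "MeanDecreaseGini"]
    selected <- names(imp)[imp >= 0.11 * max(imp)]

    # Refit a single classification tree on the selected variables
    form     <- as.formula(paste("as.factor(activity) ~",
                                 paste(selected, collapse = " + ")))
    har.tree <- tree(form, data = train.set)
    summary(har.tree)   # terminal nodes, residual mean deviance, error rate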

    This tree showed a low misclassification error rate (0.066). After this, we fitted a tree with only the variables that were actually used (seven variables) and applied pruning (with best = 6) to get the smallest model that still fits our data, the goal being a model that is easier to use and interpret (a sketch of this step is given below).
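
    A sketch of this pruning step, using the seven variables listed in the tree output above (object names are ours):

    used  <- c("V382", "V57", "V51", "V505", "V70", "V53", "V58")
    form2 <- as.formula(paste("as.factor(activity) ~",
                              paste(used, collapse = " + ")))
    full.tree   <- tree(form2, data = train.set)
    pruned.tree <- prune.misclass(full.tree, best = 6)   # keep 6 terminal nodes
    summary(pruned.tree)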

    Our predictive model is as follows:

    E(HA) = V382 + V57 + V51 + V505 + V70

    E(HA) is the Expected Human Activity

    V382 is the Body Acceleration Jerk bands Energy (1,8)

    V57 is the Gravity Acceleration Energy in X

    V51 is the Gravity Acceleration Maximum in Z

    V505 is Body Acceleration Magnitude Median Absolute Deviation

    V70 is the Gravity Acceleration Autoregression Coefficient in Y


    Figure 1. Predictive Tree for Human Activity Recognition Using Smartphones

    As the figure of the tree shows, the variable of Body Acceleration Jerk allows us to differentiate two clusters of activities: on one side we get standing, sitting and laying, and on the other side we get walking, walking up and walking down. Within the left cluster, the variable Gravity Acceleration Energy (V57) allows us to separate laying from standing and sitting, and these latter two are separated through the variable Gravity Acceleration Maximum in Z. Within the right cluster, the variable that measures the Median Absolute Deviation of Body Acceleration Magnitude differentiates walking up and walking from walking down. Then walking up is separated from walking by the variable Gravity Acceleration Autoregression Coefficient in Y.
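
    A plot like Figure 1 can be drawn directly from the pruned tree object; for instance:

    plot(pruned.tree)                # draw the tree structure
    text(pruned.tree, pretty = 0)    # label the splits and the leaf classes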

    Once we had established our predictive model, we validated it using our Validation Set.

    Classification tree:
    snip.tree(tree = ValidTree, nodes = 12L)

    Number of terminal nodes: 6

    Residual mean deviance: 0.289 = 41.33 / 143

    Misclassification error rate: 0.04027 = 6 / 149


    Table 1. Confusion matrix using Validation Set

    Laying Sitting Standing Walk Walkdown Walkup

    Laying 28 0 0 0 0 0

    Sitting 0 23 0 0 0 0

    Standing 0 0 26 0 0 0

    Walk 0 0 0 23 1 3

    Walkdown 0 0 0 1 19 2

    Walkup 0 0 0 0 0 24
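
    A confusion matrix such as Table 1 can be produced by predicting on the held-out subjects; a minimal sketch, assuming the activity label is a factor with the six activity names and reading the rows as predicted classes:

    valid.pred <- predict(pruned.tree, newdata = valid.set, type = "class")
    conf.valid <- table(predicted = valid.pred, actual = valid.set$activity)
    conf.valid                                      # confusion matrix
    1 - sum(diag(conf.valid)) / sum(conf.valid)     # misclassification error rate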

    The misclassification error rate was very low (0.04), so we considered our model validated and proceeded to test it on the actual Test Set.

    Classification tree:

    snip.tree(tree = TestTree, nodes = c(12L, 11L, 7L))

    Variables actually used in tree construction:

    [1] "V382" "V57" "V505" "V70"

    Number of terminal nodes: 6

    Residual mean deviance: 0.5751 = 850.5 / 1479

    Misclassification error rate: 0.1051 = 156 / 1485

    Table 2. Confusion matrix for Test Set using Predictive Model

    Laying Sitting Standing Walk Walkdown Walkup

    Laying 293 0 0 0 0 0

    Sitting 0 204 60 0 0 0

    Standing 0 0 283 0 0 0

    Walk 0 0 0 209 0 20

    Walkdown 0 0 0 3 189 8

    Walkup 0 0 0 3 62 151
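
    The reported test error rate can be recovered directly from Table 2 (reading the rows as predicted classes is our assumption):

    conf.test <- matrix(c(293,   0,   0,   0,   0,   0,
                            0, 204,  60,   0,   0,   0,
                            0,   0, 283,   0,   0,   0,
                            0,   0,   0, 209,   0,  20,
                            0,   0,   0,   3, 189,   8,
                            0,   0,   0,   3,  62, 151),
                        nrow = 6, byrow = TRUE,
                        dimnames = list(c("Laying", "Sitting", "Standing",
                                          "Walk", "Walkdown", "Walkup"),
                                        c("Laying", "Sitting", "Standing",
                                          "Walk", "Walkdown", "Walkup")))
    1 - sum(diag(conf.test)) / sum(conf.test)   # 156 / 1485 = 0.105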

    The error rate we got in the Test Set (0.105) was higher than the error rate obtained on our validation data. Nevertheless, the error rate is still low, and the model is quite simple, easy to interpret and computationally fast.

    Conclusions

    We were able to construct a predictive model for human activity recognition using only six variables measuring movement, obtained from miniaturized sensing technology located in a smartphone. The methods used involved a random forest for variable selection, trees for prediction and pruning to lower the number of variables. A limitation of our model is that we did not explore the problem of high correlation between the predictors used.


    References

    1. Khan, Adil Mehmood, Human Activity Recognition Using a Single Tri-axial Accelerometer, PhD Thesis, Kyung Hee University, South Korea, 2011.

    2. Anguita, Davide; Ghio, Alessandro; Oneto, Luca; Parra, Xavier and Reyes-Ortiz, Jorge L., Human Activity Recognition on Smartphones Using a Multiclass Hardware-Friendly Support Vector Machine, in International Workshop of Ambient Assisted Living (IWAAL 2012), Vitoria-Gasteiz, Spain, Dec. 2012.

    3. Breiman, Leo, Random Forests, in Machine Learning 45 (1), pp. 5-32, 2001.

    4. Genuer, Robin; Poggi, Jean-Michel and Tuleau-Malot, Christine, Variable Selection Using Random Forests, in Pattern Recognition Letters 31 (14), pp. 2225-2236, 2010.