automatic feature engineering with rapidminer auto model...2019/02/06  · rapidminer auto model’s...

Post on 16-Sep-2020


Automatic Feature Engineering with RapidMiner Auto Model:

Rapidly identifying alcoholics from their EEGs with ease, precision and accuracy.

By

Dr Gwin NYAKUENGAMA

DatAnalytics

Email: DatAnalytics@iinet.com.au

Webpage: https://dat-analytics.net/

AIM

To use RapidMiner Auto Model’s automatic feature engineering to identify alcoholics from their electroencephalograms (EEGs).

KEYWORDS

Electroencephalogram (EEG); Alcoholics; RapidMiner Auto Model; Automatic Feature Engineering; Machine Learning; Classification; Deep Learning; Decision Tree; Random Forest; Gradient Boosted Tree; Support Vector Machine

ACKNOWLEDGEMENTS

We are grateful to:

• UCI for their EEG dataset and their image used in the main title;

• Previous scholars cited in this study;

• RapidMiner, Microsoft and Stata Corp for their software; and

• Our friends for their support and encouragement.

INTRODUCTION

Increasingly, both supervised and unsupervised machine learning (ML) are being used to study the adverse medical effects of alcohol on the human brain (see the literature reviews by Rangaswamy and Porjesz, 2014 and Priya et al., 2018).

In ML, features are individual measurable characteristics or dimensions that best represent the data under study. Features may be numeric values, strings or other variables. Feature engineering is the science of generating and selecting optimal features for model building and validation. Conventional least-squares statistical methods are inapplicable here on account of the autocorrelation (non-independence) of these features.

Careful feature selection is the linchpin of optimal ML model performance in terms of accuracy, precision and recall. Under- and over-fitted models carry dangers such as long machine processing times and poor model performance (see Nyakuengama 2019).
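For readers less familiar with these performance measures, the short Python sketch below shows how accuracy, precision and recall are conventionally derived from confusion-matrix counts, treating "alcoholic" as the positive class. The function name and the counts are hypothetical and for illustration only; they are not results from this study, and RapidMiner Auto Model computes these measures internally.

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, precision and recall from confusion-matrix counts.

    tp / fp: alcoholic predictions that were right / wrong
    tn / fn: control predictions that were right / wrong
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Hypothetical counts, for illustration only (not results from this study).
example = classification_metrics(tp=40, fp=10, tn=45, fn=5)

# A model with no false positives and no false negatives scores 1.0 (100%)
# on all three measures.
perfect = classification_metrics(tp=50, fp=0, tn=50, fn=0)
```

A model can score well on one measure and poorly on another, which is why the three are usually reported together.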

In this study, we successfully built five models in RapidMiner Auto Model, namely Deep Learning, Decision Tree, Random Forest, Gradient Boosted Tree and Support Vector Machine. These models made use of RapidMiner’s Feature Engineering engine.


METHOD

The UCI EEG dataset was described previously (Wang et al., 2014; Zhu et al., 2014). Variables comprised:

• Subject ID, Category (Control / Alcoholic), Time, Trial number, Sensor Position, Matching and Sensor Value (EEG).

Data preparation and preliminary data visualization of the dataset were carried out in R, Stata and MS EXCEL:

• EEG signal data from SMNI_CMI_TRAIN.tar.gz was used to both train and validate models (see below). Data comprised eight control and eight alcoholic subjects.

• All 64 channels were used.

• A minimum amount of data (14 % of the original dataset) was kept to minimize PC processing time:

o Kept a subset of data with time equal to or greater than 0.79. This value was selected following inspection of the original EEG 3D images (see the figure from the original study), which showed a large visual difference between the control and alcoholic subjects for time equal to or greater than 0.79.

o Kept data for only 44 (of the 64) sensor positions which showed a significant visual difference between the control and alcoholic subjects (see the two figures on sensor locations and selected sensor positions).

Data was imported into RapidMiner Auto Model and ML experiments were undertaken as described previously (Nyakuengama 2018).
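The data-reduction step described above (keeping rows with time of at least 0.79 and only the selected sensor positions) could be sketched in Python as follows. This is a minimal illustration only: the study did this preparation in R, Stata and MS EXCEL, and the field names, the toy rows and the shortened sensor set below are assumptions, not the actual file layout or data.

```python
# Threshold taken from the text; field names below are hypothetical.
TIME_THRESHOLD = 0.79

# Shortened, illustrative subset of the selected sensor positions;
# the full list is given in the "Selected sensor positions" table.
SELECTED_SENSORS = {"AF1", "CP2", "FP2", "P5"}


def reduce_rows(rows, selected=SELECTED_SENSORS, min_time=TIME_THRESHOLD):
    """Keep rows at or after min_time whose sensor position is selected."""
    return [
        row
        for row in rows
        if row["time"] >= min_time and row["sensor_position"] in selected
    ]


# Made-up rows, for illustration only (not values from the UCI dataset).
rows = [
    {"subject_id": "s1", "category": "alcoholic", "time": 0.50,
     "sensor_position": "AF1", "sensor_value": 1.2},
    {"subject_id": "s1", "category": "alcoholic", "time": 0.79,
     "sensor_position": "AF1", "sensor_value": -0.4},
    {"subject_id": "s2", "category": "control", "time": 0.90,
     "sensor_position": "FZ", "sensor_value": 2.1},
    {"subject_id": "s2", "category": "control", "time": 0.95,
     "sensor_position": "P5", "sensor_value": 0.7},
]

reduced = reduce_rows(rows)
kept_fraction = len(reduced) / len(rows)  # share of rows retained
```

Applying both filters together is what shrinks the dataset; in the study the retained share was 14 % of the original data.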


Visualization of EEG readings in control and alcoholic subjects (from original study)


Visualization of EEG readings in control and alcoholic subjects

Locations on head (after Wang et al., 2014)

Current study (summarized using Stata, plotted in MS EXCEL)

Selected sensor positions

Locations on head (after Wang et al., 2014)

Used in current study:

AF1   CP2   FP2   P5
AF2   CP3   FPZ   P6
AF7   CP4   O2    P7
AF8   CP5   P01   P8
AFZ   CP6   P02   PZ
C1    CPZ   P07   T7
C2    F1    P08   T8
C3    F2    P1    TP7
C4    F3    P2    TP8
CP1   F7    P3    X
      F8    P4    Y

DATA IMPORTATION INTO RAPIDMINER AUTO MODEL


VARIABLE SELECTION


MODEL SPECIFICATION IN RAPIDMINER AUTO MODEL


RESULTS


MODEL APPLICATION


DISCUSSION

The best models in RapidMiner Auto Model all had high (100%) and similar performance parameters (i.e. Accuracy, Precision, ROC, AUC and Recall). These were Random Forest (shortest run-time), Gradient Boosted Trees, Decision Tree and Deep Learning (longest run-time). Support Vector Machine performed worst by comparison. We note that the Naïve Bayes model failed completely; experiments using this model on large datasets with many features often fail (see references cited by Nyakuengama 2019).

We would choose the Random Forest model among the best models on account of its shortest run-time. However, if run-time were not an issue, any of the best models could be chosen.

Wang et al. (2014) previously found that:

• Some sensor positions were most useful in identifying alcoholic EEG signals. All of these sensors were also included in this study;

• Using a technique called K-Nearest Neighbour achieved a great accuracy of 95%;

• A data reduction technique called PCA-GE achieved comparable accuracy but using only a third of the data and in only a third of the run-time of other techniques; and

• Using only a third of the data and 19 channels yielded a good accuracy of around 92%.

While the current study is not directly comparable to that of Wang et al. (2014), it is noteworthy that current experiments in RapidMiner Auto Model achieved the superior performance of 100% in each of accuracy, precision and recall with just 14% of the data and 44 channels.


RapidMiner Auto Model’s Feature Engineering engine automatically identified the most useful ML features (based on their weights) as sensor values, times, matching, channels and sensor position. It is unsurprising that the usefulness of the features reflected the very experimental design of the study – the observed EEG sensor values of each individual were measured at set times, after controlling for matching, channels and the sensor positions in what is known in statistical terms as a ‘nested’ experimental design.

In this study we successfully used the Random Forest model to identify alcoholics in a new EEG signals dataset (SMNI_CMI_TEST.tar.gz) that had not previously been used to develop the model. In our RapidMiner Auto Model experiment this step is called model application (see the self-titled figure).


CONCLUSION

Our study showed that RapidMiner is a machine learning tool-of-choice when investigating changes in the human brain’s EEG signals arising from alcoholism, primarily because of:

• Availability of an ensemble of machine learning models;

• A state-of-the-art, automatic Feature Engineering engine and its ability to handle large datasets with several features;

• Its rapid, accurate and precise results; and

• Transparent processes around data cleansing, model building, model validation and model application on new data.

The reader is welcome to contact the author to discuss any aspect of this study (DatAnalytics@iinet.com.au).


BIBLIOGRAPHY

Nyakuengama, J.G. 2018: Use of RapidMiner - Auto Model To Predict Customer Churn: https://dat-analytics.net/2018/07/28/use-of-rapidminer-auto-model-to-predict-customer-churn/

Nyakuengama, J.G. 2019: Part I: Automatic Machine Learning Document Classification – An Introduction: https://provalisresearch.com/blog/automatic-machine-learning-document-classification/

Priya, A.; Yadav, P.; Jain, S.; Bajaj, V. 2018: Efficient method for classification of alcoholic and normal EEG signals using EMD. The Journal of Engineering. Vol. 2018, Issue 3, pp. 166–172.

Rangaswamy, M.; Porjesz, B. 2014: Chapter 3 - Understanding alcohol use disorders with neuroelectrophysiology. Handbook of Clinical Neurology, Vol. 125 (3rd series). Alcohol and the Nervous System. E.V. Sullivan and A. Pfefferbaum, Editors.

Zhu, G.; Li, Y.; Wen, P.P. 2014: Analysis of alcoholic EEG signals based on horizontal visibility graph entropy. Brain Inform. 2014 Dec; 1(1-4): 19–25.

Wang, S.; Li, Y.; Wen, P.P.; Lai, D. 2014: Data selection in EEG Signals Classification: https://eprints.usq.edu.au/28810/1/Data%20Selection%20in%20EEG%20Signals%20Classification.pdf
