student alcohol consumption

Running head: CLASSIFICATION AND PREDICTION ANALYSIS OF STUDENT ALCOHOL CONSUMPTION

CLASSIFICATION AND PREDICTION ANALYSIS OF

STUDENT ALCOHOL CONSUMPTION

by

Venkadesan Gandhi Rajan

CS 7720 Data Mining Course Project

TABLE OF CONTENTS

ABSTRACT
INTRODUCTION
BACKGROUND OF THE PROBLEM
STATEMENT OF THE PROBLEM
PURPOSE OF THE STUDY
SIGNIFICANCE OF THE STUDY
NATURE OF THE STUDY
RESEARCH QUESTIONS
HYPOTHESES
THEORETICAL FRAMEWORK
DATA ANALYSIS
CONCLUSION
SUMMARY
REFERENCES
APPENDIX A
APPENDIX B
APPENDIX C


ABSTRACT

This project applies Data Mining (DM) techniques in WEKA (Srivastava, 2014) to the problem of
alcohol addiction among secondary school students. The results show that excellent predictive
accuracy can be achieved, provided that alcohol addiction influences student performance.
Several attributes are the main contributors to a student's drinking, and the data are
classified with algorithms such as Bagging, k-nearest neighbor (IBk), Random Forest, and
Decision Tree (Kabakchieva, 2013) to compare predictive accuracy. Other attributes have a
positive effect on a student, and they describe habits that every student can follow to
support personal growth. With the project's outcome, tools for classification and prediction
analysis (Cortez & Silva, 2008) can be developed to give students more care and to show how
seriously alcohol affects a student's life.

Keywords: student alcohol consumption, decision tree, prediction, accuracy, classification

INTRODUCTION

Consuming alcohol: is it good or bad? Some people drink to get away from stress, and some
drink excessively while dwelling on the problems they face in life. Alcohol has many bad
effects on our lives. Consuming alcohol on a single occasion or over a longer period is
injurious to health (Benton & Downey, 2006). If a student consumes alcohol, the damage to
health grows day by day and leads to a shorter life span. Alcohol has many short-term and
long-term health effects. Drinking alcohol in the teenage years lessens a child's
psychological and physical capabilities, which in turn affects decision-making and
management skills and later ends in anxiety. As the level of drinking rises, brain function
changes drastically, including the control of breathing (Engs, 1977). People who drink are
more likely than non-drinkers to get hurt or have accidents, and can be injured as a
passenger in a drink-driving incident.

The legal age for alcohol consumption differs from country to country. According to an
online source (Toomey, Rosenfeld & Wagenaar, 1996), the minimum age is 21 years in the USA,
18 years in China and 20 years in Japan. Drinking alcohol is illegal in Saudi Arabia, and
anyone found drinking is subject to imprisonment. In light of the above, alcohol consumption
is a threat to students in every country. A student's development is affected by failures in
life and by free time spent with friends. The group most affected by alcohol consumption is
male students (Schaeffer & Parsons, 1986). When a child drinks alcohol, their judgement is
impaired. While excessive drinking by teenagers is a problem in its own right, it is at times
related to other harmful behaviours such as taking illegal drugs.

In this project, a student alcohol dataset is taken and classification algorithms are

tested to predict the behavior of a student consuming alcohol based on several attributes

(Cortez & Silva, 2008).

BACKGROUND OF THE PROBLEM

Data Mining is widely used in education to improve the understanding of the educational
process and to focus on identifying, extracting and evaluating variables related to the
students' learning process. A recent study of alcohol consumption (Pham, Clough, Nguyen, Kim
& Buettner, 2009) suggests that many of the students involved are affected in their personal
development and by short-term and long-term health effects. Children are often drawn to
alcohol during puberty, when they have problems with family or school friends, or when they
want to prove that they are grown up. Even for a child who is heavily focused on studies,
alcohol can seem an attractive way to relieve stress. Underage drinkers (Bernstein, Wood &
Colby, 2016) often suffer from a range of health problems, including weight gain or loss, bad
skin, disturbed sleep and pain. During childhood and adolescence, the brain is still
developing. Adding alcohol to that process is asking for trouble: it can affect memory,
reactions, learning ability and attention span, all of which are especially important during
the school years. Drinking can harm a child's performance at school and prevent them from
reaching their full potential. Young people who drink excessively are also more likely to
have mental health problems, and even to self-harm. Every parent wants their child to do
their best, and doing well at school plays a big part in that. The statistics show that
underage drinking makes that less likely.

STATEMENT OF THE PROBLEM

According to the research (Pham, Clough, Nguyen, Kim & Buettner, 2009), alcohol consumption
leads a secondary school student to poorer grades, to missing school and, in the worst case,
to being expelled from school. With less self-control, the brain struggles to recognize
warning signs, which in turn leads to violence and fights. The risk of being involved in
violence and serious vandalism rises in line with alcohol consumption and can lead to arrest
and a criminal record. In this project, a student's alcohol consumption is predicted based on
the alcohol and absence attributes in the dataset. It is classified and predicted using
popular Data Mining algorithms; several techniques are used, and finding the best-suited
algorithm yields good accuracy (Fielding, 2007). In the dataset taken (Lichman, 2013), the
attributes play a major role in predicting student behavior.


PURPOSE OF THE STUDY

The main purpose of the study is to predict alcohol consumption by secondary school students
by adding an attribute “Alco” to the dataset, and to find the classification algorithm
(Kumar & Crawford, 2001) with the best accuracy. The other purpose is to reduce the number of
student drinkers by acting on the predicted results. The students' drinking is predicted from
the attributes in the dataset. The study could be extended to list the students who consume
alcohol based on certain factors. Specifically, these factors contribute to an effective
result for the project, and student behaviour is found by analysing attributes such as age,
sex, free time, absences and failures. Data Mining approaches are applied to this dataset
using the WEKA tool (Srivastava, 2014) to pre-process, predict, classify, cluster and
visualize the data and yield an effective result.

SIGNIFICANCE OF THE STUDY

The research study (Biesiada & Duch, 2007) provides some important Data Mining techniques for
computing correlations and applying a filter to the dataset. A supervised filter, one of the
pre-processing methods, is applied to the attributes and instances of the dataset. In
supervised learning (Rokach & Maimon, 2009), both normal objects and anomalies are modelled,
and any outlier that is detected is collected. The normal objects are modelled so that a
classifier can learn from them. Moreover, the steps of data pre-processing, data
classification and data clustering are taken to analyse student behaviour related to alcohol
consumption. The clustering-based approach (Berkhin, 2006) is another outlier detection
method: it finds the large clusters and the small clusters and collects them according to the
requirements of the dataset.


NATURE OF THE STUDY

This project uses Data Mining techniques to study a student database with several attributes,
despite the difficulties involved, and to measure how student performance changes with or
without alcohol consumption. The WEKA tool (Srivastava, 2014) contains several classification
algorithms, and the prediction accuracy of each algorithm can differ greatly depending on the
dataset applied. The attributes are cleansed and then used for classification and clustering
in order to find the best predictive model for the dataset with respect to the factors in the
secondary school and the student's performance in the school. The study further aims to
reduce the number of secondary school student drinkers.

RESEARCH QUESTIONS

Student drinking behavior has some undesirable effects on the educational and community
outcomes of colleges and universities. Students are divided into two sub-groups: those who
consume alcohol and those who abstain. The research questions are:

(1) How accurately do prediction and classification algorithms work on student alcohol
consumption?

(2) Are the difficulties more noticeable for students from a certain sub-group? How do the
results for alcohol consumption compare between male and female students? Do abstainers
suffer more than consumers?

(3) Does exposure to the effects of drinking have the same consequences for grades and
achievement for all students, or does it vary between abstainers and consumers?


HYPOTHESES

College students who drink alcohol are affected in their grades due to certain factors such
as free time and absence from school.

(1) Drinking behavior will diminish the satisfaction of attending a university or college to
a greater extent for students who abstain from alcohol than for those who drink.

(2) Students who abstain from alcohol are more likely to be anxious about their academic
grades than students who drink.

THEORETICAL FRAMEWORK

Data model

The data model has three different types: star schema, snowflake schema and fact
constellation (Cabibbo & Torlone, 1998). The star schema is used for this dataset, which
means that it consists of a fact table with multiple dimension tables. The student dataset
has several attributes, some of which are key attributes.

OLAP Operations

OLAP (Online Analytical Processing) is used with a multi-dimensional model (Codd & Salley,
1993) in which each dimension contains multiple levels of abstraction defined by concept
hierarchies. It has several operations, including drill-down, roll-up, slice and dice, pivot
and drill-through. These operations are performed to view the data in multiple dimensions.
Once the OLAP operations are processed, a bitmap index is also computed, as it is the most
popular indexing method in OLAP products and is used to search for values in the data cube.
The base table holds the attribute values as letters, and the index table holds 0s and 1s
that locate each value in the data cube.
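As a rough illustration of the bitmap index described above, the following Python sketch builds one bit vector per distinct value of a column and combines two vectors with a bitwise AND. The column contents and the names `build_bitmap_index` and `rows_matching` are hypothetical and are not taken from the project's data or from any OLAP product.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Return {value: bit list}, with a 1 at every row that holds that value."""
    index = defaultdict(lambda: [0] * len(column))
    for row, value in enumerate(column):
        index[value][row] = 1
    return dict(index)


def rows_matching(bitmap):
    """Return the row numbers whose bit is set."""
    return [row for row, bit in enumerate(bitmap) if bit]


# Hypothetical base table columns for six students (illustrative values only).
sex = ["F", "M", "F", "F", "M", "M"]
school = ["GP", "GP", "MS", "GP", "MS", "GP"]

sex_index = build_bitmap_index(sex)
school_index = build_bitmap_index(school)

# A query on two conditions is answered by ANDing the two bit vectors.
female_at_gp = [a & b for a, b in zip(sex_index["F"], school_index["GP"])]
print(sex_index["F"])
print(rows_matching(female_at_gp))
```

Because each condition is a plain bit vector, multi-attribute lookups reduce to cheap bitwise operations, which is the usual reason given for the popularity of this index.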


Classification methods

Classification (Bauer & Kohavi, 1999) is one of the most important and frequently used Data
Mining and Machine Learning methods. It classifies a given dataset and predicts the value of
one attribute from the other attributes in the dataset. There are numerous classification
methods that help in building a predictive model, including Bayes classification, Random
Forest, Decision Tree, Support Vector Machine (SVM) and many more. In this project, the
accuracy of each algorithm is calculated by predicting the alcohol and absence attributes
from the other attributes. Based on the outcome, the classifier yields an accurate model for
certain attributes. The best algorithm is the one that fits the dataset best, that is, the
one whose output has the most correctly classified instances.

Bayes classification (Jordan, 2002) depends on conditional probability: the likelihood of a
class is computed under the assumption that the attributes in a dataset are independent of
each other. Random Forest is one of the popular data mining classifiers. It is an ensemble of
unpruned decision trees. Each tree is built on a random selection of attributes from the
training samples, and the forest's prediction is obtained by averaging the outputs of the
trees (for class labels, by majority vote). The Decision Tree (DT) is a branching model that
represents a set of rules, distinguishing values in a hierarchical form. This representation
can be translated into a set of IF-THEN rules, which are easy for people to understand. Among
decision tree methods, the C4.5 algorithm (Kretschmann, Fleischmann & Apweiler, 2001) is an
effective one for classification. It is also known as a statistical classifier. The Support
Vector Machine (SVM) is a method that maps the original data into a higher-dimensional space.
In this project, these algorithms are run in the WEKA tool (Srivastava, 2014), and the
results are generated with good predictive accuracy.
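To make the conditional-probability idea behind Bayes classification concrete, here is a minimal naive Bayes sketch for nominal attributes in Python. It is not WEKA's implementation; the six training rows, the attribute meanings and the use of add-one (Laplace) smoothing are assumptions made only for this illustration.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and, per class, the frequency of each attribute value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, row, n_values):
    """Return the class maximizing P(class) * product of P(value | class).

    n_values[i] is the number of distinct values of attribute i; it is used
    for add-one smoothing so unseen values do not give a zero probability.
    """
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for i, value in enumerate(row):
            score *= (value_counts[(i, label)][value] + 1) / (count + n_values[i])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data: (sex, how often the student goes out) -> drinker?
rows = [("M", "high"), ("M", "high"), ("F", "low"),
        ("F", "high"), ("M", "low"), ("F", "low")]
labels = ["yes", "yes", "no", "no", "yes", "no"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("M", "high"), n_values=[2, 2]))
print(predict(model, ("F", "low"), n_values=[2, 2]))
```

The independence assumption shows up in the inner loop: each attribute contributes its own factor to the score, with no interaction terms between attributes.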

Linear Regression and Outliers


With regard to regression, the plan was to represent the instances in the leaves of the
decision tree using a linear function. This topic is especially interesting for the data
modelled here.

Analysts (Witten & Frank, 2005) have tackled the problems caused by noisy data in linear
regression by checking the data for outliers and removing them manually. In the case of
linear regression, outliers can be identified visually. Outliers dramatically affect ordinary
least-squares regression because the squared distance measure accentuates the influence of
points far from the regression line. Statistical methods (Witten & Frank, 2005) that address
the problem of outliers are called robust. Regression can be made robust in the following
ways:

1. Use an absolute-value distance measure instead of the usual squared one. This weakens the
effect of outliers.

2. Identify outliers automatically and remove them from consideration. Example: form a
regression line and remove from consideration the 10% of the points that lie furthest from
the line.

3. Minimize the median (instead of the mean) of the squares of the deviations from the
regression line. This technique copes with outliers in both the X-direction and the
Y-direction.

There is a serious weakness to this approach: high computational cost, which can make it
infeasible for practical problems. In this project, unfortunately, there is no linear
regression on the leaves. Later in this report, the numeric attributes are converted to
nominal ones.
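The second of the robust options listed above (fit a line, discard the 10% of points furthest from it, and fit again) can be sketched in a few lines of Python. The data points are synthetic and the function names `fit_line` and `fit_trimmed` are hypothetical; this is an illustration of the idea, not a procedure that was run in this project.

```python
def fit_line(points):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_trimmed(points, drop_fraction=0.10):
    """Fit once, drop the points with the largest absolute residuals, refit."""
    a, b = fit_line(points)
    by_residual = sorted(points, key=lambda p: abs(p[1] - (a + b * p[0])))
    keep = len(points) - int(round(drop_fraction * len(points)))
    return fit_line(by_residual[:keep])


# Synthetic data on the line y = 1 + 2x, with one gross outlier at x = 9.
points = [(x, 2.0 * x + 1.0) for x in range(10)]
points[9] = (9, 60.0)

print(fit_line(points))     # slope pulled upward by the outlier
print(fit_trimmed(points))  # close to intercept 1, slope 2
```

The comparison of the two fits shows why the squared distance measure is sensitive: a single distant point changes the slope noticeably until it is removed.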

Learning and Prediction

In data mining (Han, Kamber & Pei, 2011), classification methods play an important role in
building a model and making predictions. The WEKA tool (Srivastava, 2014) learns from the
training data and is evaluated on the test data; both are obtained with a percentage split,
in which one part of the data is used for training and the remainder for testing. Several
attributes are taken, each one is treated in turn as the class attribute, and the different
classification algorithms are used to build a predictive model.
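A percentage split and the resulting accuracy figure can be illustrated with the short Python sketch below. The 66% training share, the fixed random seed and the function names are assumptions chosen for the example; the split actually used in the project is not stated here.

```python
import random


def percentage_split(rows, train_percent=66, seed=1):
    """Shuffle a copy of the rows and cut it into training and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_percent / 100)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Percentage of test instances whose predicted class equals the actual class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Hypothetical instance identifiers standing in for dataset rows.
train, test = percentage_split(list(range(100)))
print(len(train), len(test))
print(accuracy(["yes", "no", "yes"], ["yes", "no", "no"]))
```

A model would be fitted on `train` only, and `accuracy` would then be computed from its predictions on `test`, so that the reported figure reflects instances the model has not seen.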

DATA ANALYSIS

Dataset description

The data set (Lichman, 2013) describes Portuguese students and was compiled by Paulo Cortez
and Alice Silva, University of Minho, Portugal. In Portugal, secondary education consists of
3 years of schooling, preceded by 9 years of basic education and followed by higher
education. Most students join the public and free education system. The study (Cortez &
Silva, 2008) contains data collected during 2005-2006 from two public schools in the Alentejo
region of Portugal. Questionnaires given to the students covered several attributes, such as
sex, age, mother's education, family income, social/emotional activities and class failures,
that were expected to affect student performance. Finally, the data was integrated (Ceruti &
Kamel, 1999) into two datasets related to Mathematics (395 instances) and Portuguese language
(649 instances) classes.

Data to be mined

In this project, the student alcohol consumption dataset is taken from the UCI Machine
Learning Repository (Lichman, 2013). Data mining can work with several types of data. Here a
flat file is used: the data comes as a CSV file, which is converted into an ARFF file for use
in WEKA. The flat file is then mined using various data mining classification algorithms.
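The CSV-to-ARFF conversion mentioned above could be done in several ways. The Python sketch below is one simplified possibility, not the procedure used in this project: it declares a column numeric when every value parses as a number and nominal otherwise. The sample rows and the function name `csv_to_arff` are hypothetical, and values containing spaces or commas would need quoting that this sketch does not handle.

```python
import csv
import io


def is_number(text):
    """True if the text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def csv_to_arff(csv_text, relation, delimiter=","):
    """Convert CSV text with a header row into a minimal ARFF document."""
    rows = [row for row in csv.reader(io.StringIO(csv_text), delimiter=delimiter) if row]
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    for i, name in enumerate(header):
        values = [row[i] for row in data]
        if all(is_number(v) for v in values):
            lines.append("@attribute %s numeric" % name)
        else:
            # Nominal attribute: list the distinct values in sorted order.
            lines.append("@attribute %s {%s}" % (name, ",".join(sorted(set(values)))))
    lines += ["", "@data"] + [",".join(row) for row in data]
    return "\n".join(lines)


# Hypothetical two-row extract, for illustration only.
sample = "sex,age,Dalco\nF,18,1\nM,17,2\n"
arff = csv_to_arff(sample, "student")
print(arff)
```

The `delimiter` parameter is included because CSV exports do not always use a comma as the field separator.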

Data Preprocessing


Real-life databases are highly prone to inconsistent, missing and noisy data (Garcia, Luengo
& Herrera, 2015) due to the abundance of information in the recent era (datasets may be
several gigabytes or more) and their likely origin from multiple, heterogeneous sources. The
quality of the data is very important for the mining results. Low-quality data should be
filtered efficiently using a supervised filter (Rokach & Maimon, 2009). Data pre-processing
methods include data cleansing, which removes noise and corrects inconsistencies in the data,
and data integration, which combines two different datasets.

There are many possible reasons for inaccurate data (i.e., having incorrect attribute
values). The data collection instruments may be faulty. There may be human or computer errors
at data entry. Sometimes students may intentionally submit incorrect values for mandatory
fields when they do not wish to give personal information (e.g., choosing the default value
“January 1” for birthday). This is known as disguised missing data. In this project, the
dataset (Lichman, 2013) was collected by a Portuguese professor; it was tested for missing
values, and none were found.

During this stage (Garcia, Luengo & Herrera, 2015), some attributes were discarded due to
their lack of discriminative value. For example, few students answered the question about
family income (perhaps due to privacy concerns), whereas nearly 100% of the students live
with their parents and have a personal computer at home.

Attribute selection

The aim is to find secondary school students who drink alcohol. In this dataset, there are
two additional attributes alongside the original ones. The first is alcohol consumption on
weekdays (Dalco) and the second is alcohol consumption at the weekend (Walco). The predictive
analysis, however, is limited to a single attribute (Alco) that describes a student's alcohol
consumption over a whole week. Merging the two attributes accordingly yields the following
equation:


$$\text{Alco} = \frac{\text{Walco} \times 2 + \text{Dalco} \times 5}{7}$$

The other important attribute is the student's absence rate at school. A student who is
frequently absent from school is expected to drink more alcohol. This attribute (Fabs)
therefore becomes a binary value: 0 if the student is frequently absent (over 10 days), and 1
otherwise. Prediction of the absence rate is done using Weka. To perform this data reduction,
a concatenation component merges the two datasets (Lichman, 2013), student-mat and
student-por, into a new dataset, and the binary values of Alco and Fabs are converted to
nominal values in order to compute correlations.
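The two derived attributes can be illustrated with a small Python sketch. It reads the merging equation as a seven-day weighted average, with the weekend level counted for 2 days and the weekday level for 5 days; that reading, the function names and the sample values are assumptions of this example. The step that turns Alco into a binary value is not specified in the text, so it is not shown.

```python
def alco(walco, dalco):
    """Weekly alcohol level: weekend level weighted by 2 days, weekday level by 5."""
    return (walco * 2 + dalco * 5) / 7


def fabs_flag(absences, threshold=10):
    """Fabs as described in the text: 0 if absent more than `threshold` days, else 1."""
    return 0 if absences > threshold else 1


# Hypothetical students: (Walco, Dalco, number of absences).
students = [(4, 1, 12), (3, 3, 10), (1, 1, 0)]
for walco, dalco, absences in students:
    print(round(alco(walco, dalco), 3), fabs_flag(absences))
```

A student with the same level on weekdays and weekends keeps that level unchanged, which is a quick sanity check on the weighting.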

Functionalities of data

Data mining has several important functionalities (Han, Kamber & Pei, 2011), such as
characterization, classification, prediction, clustering and outlier analysis. Although many
concepts and classes can be derived for characterization and discrimination, in this dataset
the classes are the schools and the concepts are Drinkers and Non-drinkers (Wiers, Hoogeveen,
Sergeant & Gunning, 1997). Characterization summarizes the data of the class under study,
whereas discrimination compares the data of one class with another.

A linear correlation is computed, and attributes with low correlation are filtered out, using
a correlation threshold of 0.35. This filtered out four column attributes (Fed; G1; G2; G2)
that are not correlated with Alco. Backward feature elimination is then performed in a loop
with cross-validation (Kohavi, 1995). Cross-validation is used to test the data with Random
Forest. This procedure is used again to perform prediction and test the result. The last step
of data preprocessing (Ceruti & Kamel, 1999) is to apply the backward feature elimination
filter to reduce the original data. Here 11 column attributes that are not correlated with
the Alco attribute are filtered out.
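The correlation filter described above can be sketched in Python as follows. The sketch assumes Pearson's correlation coefficient and compares its absolute value with the 0.35 threshold; the column names and values are invented for illustration, and a column with zero variance would need special handling that is omitted here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)


def weakly_correlated(columns, target, threshold=0.35):
    """Names of columns whose absolute correlation with the target is below the threshold."""
    return [name for name, values in columns.items()
            if name != target and abs(pearson(values, columns[target])) < threshold]


# Hypothetical numeric columns for five students.
columns = {
    "Alco": [1, 2, 3, 4, 5],
    "goout": [1, 2, 2, 4, 5],
    "G1": [10, 12, 9, 11, 10],
}
print(weakly_correlated(columns, "Alco"))
```

Columns returned by `weakly_correlated` are the candidates for removal; the remaining columns would then go on to the backward elimination step.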

Statistical presentation


The statistical description of the data covers the measures of central tendency (Wilcox &
Keselman, 2003), the range, the interquartile range (IQR) and the variance. The mean is the
average of the values in the dataset. The median is the middle value of the sorted data and
separates it into two halves. The mode is the most frequently occurring value in the dataset.
The range is the difference between the maximum and minimum values in the dataset. The IQR is
the difference between the third quartile and the first quartile. The variance is a measure
of data dispersion; it is the square of the standard deviation. The dataset (Lichman, 2013)
has several attributes, each with either numeric or nominal values. The statistical summary
of the numeric attributes in WEKA is shown in the Appendix.
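The measures defined above can be computed with Python's standard `statistics` module, as in the sketch below. The list of ages is made up for illustration, the population forms of variance and standard deviation are an arbitrary choice for this example, and quartile values depend on the interpolation method used, so they may differ slightly from those reported by WEKA or R.

```python
import statistics


def summarize(values):
    """Central tendency and dispersion measures for one numeric attribute."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "variance": statistics.pvariance(values),
        "std_dev": statistics.pstdev(values),
    }


# Hypothetical values of a numeric attribute such as age.
ages = [15, 16, 16, 17, 17, 17, 18, 19, 22]
summary = summarize(ages)
for name, value in summary.items():
    print(name, value)
```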

List of attributes

An attribute represents a characteristic of an object. There are several types of attributes
(Han, Kamber & Pei, 2011), depending on the values in the dataset: nominal, numeric, binary
and ordinal. Nominal attributes are names of things; their values may also be symbols.
Numeric attributes have integer or real values. Binary attributes have only two states (0 or
1). Ordinal attributes have values with a meaningful order. In this project, the student
alcohol consumption dataset has numeric, nominal and binary attributes. Using the WEKA tool
(Srivastava, 2014), a class attribute is selected, and the algorithms are run to classify and
predict that attribute with good accuracy.

Visualization of data

The dataset (Lichman, 2013) has 1044 instances and 33 attributes. The data has no redundant
information. The data is visualized in WEKA, where the 33 attributes are normalized, clusters
are formed over the attributes, and the classifier is applied for prediction. Using R,
boxplots and histograms are drawn to show that the data is of good quality, with no missing
values and no duplicate information. Data Mining (Han, Kamber & Pei, 2011) is the process of
discovering new patterns and insights in data. As the volume of data collected and stored in
files grows, there is a growing need to provide data summarization, identify important
patterns and trends, and act upon the findings.

CONCLUSION

Education is a vital component of society. Business Intelligence (BI)/Data Mining (DM)
techniques, which allow a high-level extraction of knowledge from raw data, offer interesting
possibilities for the education domain. In particular, several studies have used BI/DM
techniques (Han, Kamber & Pei, 2011) to reduce the rate of alcohol addiction among young
people and to improve children's way of life. In this work, the prediction of alcohol
addiction in young people from past school records and from demographic, family and other
related data was studied. Several DM goals and DM techniques were tested. There are some
limitations: the number of folds used was only 10 for all classification algorithms, since a
larger number of folds takes more time and needs a more capable computer to process the data.
Additionally, only 25 models were tested, since the value of some models was very low. The
results obtained show that it is possible to keep children away from alcohol. This confirms
the finding that a child's behaviour is strongly influenced by friends or peer group.
Nevertheless, an analysis of the knowledge provided by the best predictive models has shown
that, in some cases, there are other relevant features, such as school-related, demographic
and social variables. This work is based on off-line learning: all of the techniques were
applied to a dataset collected by other people. Nonetheless, there is potential for an
automatic on-line learning environment, in which the Internet can be used to find more data
about underage alcohol addiction. This is one direction for anyone who wants to work more
deeply on this topic and overcome the limitations of this work. More research and a larger
dataset are also needed in order to understand why and how certain variables influence
children and make them dependent on alcohol.

SUMMARY

Finally, WEKA (Srivastava, 2014) was run on the student dataset and produced the following
results. The results show that the classification algorithms work effectively with good
accuracy, and the prediction is also analyzed with regression algorithms. The prediction is
effective when the “Alco” attribute is made the class attribute; in this dataset, it is an
important factor for building the predictive model. The accuracy of all the classification
algorithms (Cortez & Silva, 2008) was measured, and the k-nearest neighbor algorithm ranks
highest with 87.3563 %. Other algorithms, namely J48, Random Forest and SMO, yield 74.0421 %,
72.1264 % and 74.3295 % respectively. The regression algorithms Logistic and Simple Logistic
yield 64.6552 % and 73.3716 % respectively.

Gender is one of the most significant attributes for a drinker. Males are the most involved
with alcohol, which agrees with a 2014 alcohol report showing that men consume more alcohol
than women. The other significant feature of a drinker is social activity. A person who often
goes out with friends and cousins consumes more alcohol. This is because drinking has become
a way to celebrate good news or to socialize, and also a way to stand out, as people are
typically more outgoing when they are drunk. Other important features are additional free
time and less study time, because students usually find in alcohol a way to relax and escape
from problems.


REFERENCES

Bauer, E., & Kohavi, R. (1999). An empirical comparison of voting classification algorithms:
Bagging, boosting, and variants. Machine Learning, 36(1-2), 105-139.

Benton, S. L., Benton, S. A., & Downey, R. G. (2006). College Student Drinking, Attitudes Toward
Risks, and Drinking Consequences. Journal of Studies on Alcohol, 67(4), 543-551.
doi:10.15288/jsa.2006.67.543

Berkhin, P. (2006). A survey of clustering data mining techniques. In Grouping multidimensional
data (pp. 25-71). Springer Berlin Heidelberg.

Bernstein, M. H., Wood, M. D., & Colby, S. M. (2016). A new paradigm for credibly administering

placebo alcohol to underage drinkers. Addictive Behaviors, 52, 22-27.

doi:10.1016/j.addbeh.2015.08.004

Biesiada, J., & Duch, W. (2007). Feature Selection for High-Dimensional Data — A Pearson

Redundancy Based Filter. Advances in Soft Computing Computer Recognition Systems 2,

242-249. doi:10.1007/978-3-540-75175-5_30

Blaheta, D. (2002, July). Handling noisy training and testing data. In Proceedings of the ACL-02
conference on Empirical methods in natural language processing-Volume 10 (pp. 111-116).
Association for Computational Linguistics.

Cabibbo, L., & Torlone, R. (1998). A logical approach to multidimensional databases. In Advances in

Database Technology—EDBT'98 (pp. 183-197). Springer Berlin Heidelberg.


Ceruti, M. G., & Kamel, M. N. (1999). Preprocessing And Integration Of Data From Multiple
Sources For Knowledge Discovery. International Journal on Artificial Intelligence Tools,
08(02), 157-177. doi:10.1142/s0218213099000117

Cortez, P., & Silva, A. (2008). Using Data Mining to Predict Secondary School Student Performance.
In Brito and Teixeira Eds., Proceedings of 5th Future Business Technology Conference
(FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7

Engs, R. C. (1977). Drinking patterns and drinking problems of college students. Journal of
Studies on Alcohol, 38(11), 2144-2156. doi:10.15288/jsa.1977.38.2144

Fielding, A. H. (2007). Classification accuracy. Cluster and Classification Techniques for the

Biosciences, 179-199. doi:10.1017/cbo9780511607493.008

García, S., Luengo, J., & Herrera, F. (2015). Data Preprocessing in Data Mining. Intelligent Systems

Reference Library. doi:10.1007/978-3-319-10247-4

Hall, M. A., & Holmes, G. (2003). Benchmarking attribute selection techniques for discrete class data

mining. Knowledge and Data Engineering, IEEE Transactions on, 15(6), 1437-1447.

Han, J., Kamber, M., & Pei, J. (2011). Data mining: concepts and techniques. Elsevier.

Jordan, A. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression

and naive bayes. Advances in neural information processing systems, 14, 841.

Kabakchieva, D. (2013). Predicting Student Performance by Using Data Mining Methods for
Classification. Cybernetics and Information Technologies, 13(1). doi:10.2478/cait-2013-0006

Kohavi, R. (1995, August). A study of cross-validation and bootstrap for accuracy estimation and

model selection. In Ijcai (Vol. 14, No. 2, pp. 1137-1145).

Kretschmann, E., Fleischmann, W., & Apweiler, R. (2001). Automatic rule generation for protein
annotation with the C4.5 data mining algorithm applied on SWISS-PROT. Bioinformatics, 17(10),
920-926.


Kumar, S., Ghosh, J., & Crawford, M. M. (2001). Best-bases feature extraction algorithms for

classification of hyperspectral data. Geoscience and Remote Sensing, IEEE Transactions

on, 39(7), 1368-1379.

Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA:

University of California, School of Information and Computer Science

Pham, D. B., Clough, A. R., Nguyen, H. V., Kim, G. B., & Buettner, P. G. (2009). Alcohol

consumption and alcohol-related problems among Vietnamese medical students. Drug and

Alcohol Review, 29(2), 219-226. doi:10.1111/j.1465-3362.2009.00094.x

Rokach, L., & Maimon, O. (2009). Supervised Learning. Data Mining and Knowledge Discovery

Handbook, 133-147. doi:10.1007/978-0-387-09823-4_8

Schaeffer, K. W., & Parsons, O. A. (1986). Drinking practices and neuropsychological test

performance in sober male alcoholics and social drinkers. Alcohol, 3(3), 175-179.

doi:10.1016/0741-8329(86)90041-8

Srivastava, S. (2014). Weka: A Tool for Data preprocessing, Classification, Ensemble, Clustering and

Association Rule Mining. International Journal of Computer Applications, 88(10),

26-29. doi:10.5120/15389-3809

Tong, S., & Koller, D. (2002). Support vector machine active learning with applications to text

classification. The Journal of Machine Learning Research, 2, 45-66.

Toomey, T. L., Rosenfeld, C., & Wagenaar, A. C. (1996). The minimum legal drinking age: History,

effectiveness, and ongoing debate. Alcohol Research and Health, 20(4), 213.

Underage Drinking. (n.d.). Retrieved April 26, 2016, from

http://www.niaaa.nih.gov/alcohol-health/special-populations-co-occurring-disorders/

underage-drinking

Wiers, R. W., Hoogeveen, K. J., Sergeant, J. A., & Gunning, W. B. (1997). High‐and low‐dose

alcohol‐related expectancies and the differential associations with drinking in male and

female adolescents and young adults. Addiction, 92(7), 871-888.


Wilcox, R. R., & Keselman, H. J. (2003). Modern robust data analysis methods: measures of central

tendency. Psychological methods, 8(3), 254.

Witten, I. H., & Frank, E. (2005). Data Mining: Practical machine learning tools and techniques.

Morgan Kaufmann.

APPENDIX A:

Figure 1 Architecture of the student alcohol prediction

[Figure 1 flow: Student Maths dataset and Student Portuguese dataset -> Student Dataset (arff file) -> Attribute selection (Supervised filter) -> Classification by WEKA (SVM, Decision tree and Random forest) -> Prediction analysis and results]

In this project, the two datasets (Lichman, 2013), one for Mathematics students and one for Portuguese students, are integrated into a single student dataset, which is converted from a csv file to the arff file format supported by WEKA. The dataset contains several attributes. Using Attribute selection (Hall & Holmes, 2003) in a supervised filter, the alcohol attribute is selected and set as the class attribute. The dataset is then classified with methods such as Bagging, the K-nearest neighbor algorithm, Support Vector Machine (SVM), Decision tree and Random Forest to predict whether a student consumes alcohol.

This step-by-step process is executed in WEKA (Srivastava, 2014), and each algorithm classifies the data in its own way. The Decision tree classifies the data with IF-THEN rules over the attributes, from which the class is predicted and analyzed. The Support Vector Machine (SVM) normalizes each attribute value before classification, and its results are assessed with Precision and Recall (Tong & Koller, 2002).
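The integration and conversion steps described in this appendix can be illustrated with a minimal sketch. It is not the procedure actually run in WEKA for this project. It assumes semicolon-separated csv input, treats every column as nominal, and uses made-up rows. The column names (sex, Alco, studytime) and the helper names read_rows and to_arff are hypothetical and chosen only for illustration.

```python
import csv
import io


def read_rows(csv_text, delimiter=";"):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text), delimiter=delimiter))


def to_arff(relation, rows, class_attr):
    """Render rows as ARFF text, declaring every column as nominal and
    placing the class attribute last."""
    names = [n for n in rows[0] if n != class_attr] + [class_attr]
    lines = ["@relation " + relation, ""]
    for name in names:
        # A nominal attribute lists every distinct value it can take.
        values = sorted({row[name] for row in rows})
        lines.append("@attribute %s {%s}" % (name, ",".join(values)))
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(row[name] for name in names))
    return "\n".join(lines) + "\n"


# Toy stand-ins for the Mathematics and Portuguese files (values are made up).
maths_csv = "sex;Alco;studytime\nF;no;2\nM;yes;1\n"
portuguese_csv = "sex;Alco;studytime\nF;no;3\nM;no;2\n"

# Integration step: both files share a header, so the rows are concatenated.
students = read_rows(maths_csv) + read_rows(portuguese_csv)
arff_text = to_arff("student", students, class_attr="Alco")
print(arff_text)
```

The sketch does not quote values, so it only works for values without spaces or commas. Real data would need quoting and a decision on which columns to declare as numeric.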

APPENDIX B:

Data preprocessing results

The dataset is preprocessed with the Attribute Selection filter to remove unwanted noise. The evaluator is “CorrelationAttributeEval” and the search method is “Ranker”. The expected outcome is a set of attributes without noise, on which the classification algorithms predict with good accuracy. The attributes are treated as nominal so that the correctly and incorrectly classified instances of each class can be reported.
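The idea behind ranking attributes by their correlation with the class can be sketched as below. This is a simplified illustration, not WEKA's implementation: it assumes numeric attributes and a class encoded as 0/1, and it scores each attribute by the absolute Pearson correlation with the class before sorting. The attribute names (goout, studytime, school_id) and all values are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, class_values):
    """Return (name, score) pairs sorted by |correlation with the class|."""
    scores = {name: abs(pearson(values, class_values))
              for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data: class 1 = consumes alcohol, 0 = does not.
alco = [0, 0, 1, 1, 0, 1]
columns = {
    "goout": [1, 2, 4, 5, 2, 4],       # hypothetical attribute, tracks the class
    "studytime": [3, 2, 2, 1, 3, 2],   # hypothetical attribute, weaker inverse link
    "school_id": [1, 1, 1, 1, 1, 1],   # constant column, scores 0
}

ranking = rank_attributes(columns, alco)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Attributes at the bottom of such a ranking are the candidates for removal before classification. Where to cut the list is a judgment call that this sketch leaves open.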

Figure 2 Preprocessing Results


Figure 3 Attribute Selection method


Figure 4 Mean value of Alco attribute

Figure 5 Classification algorithm of IBK (K-nearest neighbor) results


Figure 6 Classification algorithm of SMO (Sequential Minimal Optimization) results

Figure 7 Classification algorithm of J48 results


Figure 8 Classification algorithm of Random Forest results

Figure 9 Regression algorithm of Logistic results


Figure 10 Regression algorithm of Simple Logistic results


APPENDIX C:

Graph 1 of Attribute distribution
