ANALYSIS OF SPM 2019/2020 STUDENT’S PERFORMANCE USING ID3 ALGORITHM AND
J48 CLASSIFICATION
AIMAN DZAFRI BIN HUSNI BALIS
BACHELOR OF COMPUTER SCIENCE (COMPUTER NETWORK SECURITY) WITH
HONOURS
UNIVERSITI SULTAN ZAINAL ABIDIN
2021
ANALYSIS OF SPM 2019/2020 STUDENT’S PERFORMANCE BY DATA MINING USING MACHINE LEARNING
AIMAN DZAFRI BIN HUSNI BALIS
BACHELOR OF COMPUTER SCIENCE (COMPUTER NETWORK SECURITY) WITH HONOURS
Universiti Sultan Zainal Abidin
2021
DECLARATION
I hereby declare that the report is based on my original work except for quotations and
citations, which have been duly acknowledged. I also declare that it has not been
previously or concurrently submitted for any other degree at Universiti Sultan Zainal
Abidin or other institutions.
_______________________________
Name: Aiman Dzafri Bin Husni Balis
Date:
CONFIRMATION
This is to confirm that:
The research conducted and the writing of this report were under my supervision.
_______________________________
Name: Mrs. Roslinda Binti Muda
Date:
DEDICATION
In the name of Allah, the Most Gracious and the Most Merciful. All praise is due
to Allah, who has guided me and given me the strength to complete and submit this
report, Analysis of SPM 2019/2020 Student’s Performance by Data Mining Using
Machine Learning, on time. Without His help, this study, which required untiring
effort, could not have been completed within the time limit.
I would like to take this opportunity to express my gratitude to my supervisor,
Mrs. Roslinda Binti Muda, for her supervision and inspiration throughout my final
year project. Without her time, support and guidance, it would have been impossible
for me to finish this project successfully. I thank her for her kindness. May
Allah bless her.
I would also like to extend my appreciation to both of my parents, who helped me
in various ways, through both moral and financial support, to ensure that I could
complete my project. Last but not least, I thank my classmates and course mates,
who always helped me in many ways to complete the project within the time given.
ABSTRACT
In recent years, analysing students’ performance and maintaining the standard of
education have become important problems for all educational institutions. Data
mining is concerned with developing methods for discovering knowledge from data,
which can help to improve students’ performance and address these problems. It can
also support decision making in educational or academic systems. To mine the
students’ performance data, classification models based on decision trees, namely
the ID3 algorithm and the J48 classifier, were built with 10-fold cross-validation
using WEKA. The performance of the classification models is then tested and
compared in terms of accuracy, confusion matrices and execution time.
CONTENTS
DECLARATION
CONFIRMATION
DEDICATION
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Project Background
1.3 Problem Statement
1.4 Objectives
1.5 Scope
1.6 Limitation of Work
1.7 Gantt Chart
1.8 Summary of the Chapter
CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Related Works
2.3 Summary of the Chapter
CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Research of Methodology
3.3 Framework of Knowledge Discovery Database
3.4 Decision Tree Classifiers
3.5 Iterative Dichotomiser 3
3.6 Entropy (H)
3.7 Information Gain (IG)
3.8 Project Flowchart
3.9 Summary of the Chapter
CHAPTER 4 CONCLUSION
4.1 Introduction
4.2 Future Work
REFERENCES
LIST OF TABLES
Table No. Title
Table 2.0: Attributes and Its Possible Values
Table 2.1: Categorization of Attributes
LIST OF FIGURES
Figure No. Title
Figure 1.2: Knowledge Discovery Database
Figure 3.2: Data Mining Work Methodology
Figure 3.3: Framework of KDD
Figure 3.4: Decision Tree Classification in Python
Figure 3.5: ID3 Algorithm
Figure 3.8: System Flowchart For Training of The Data
LIST OF ABBREVIATIONS
UniSZA  Universiti Sultan Zainal Abidin
SPM     Sijil Pelajaran Malaysia
KDD     Knowledge Discovery Database
MCO     Movement Control Order
ID3     Iterative Dichotomiser 3
AVG     Average
MLP     Multilayer Perceptron
H       Entropy
IG      Information Gain
CHAPTER 1
INTRODUCTION
1.1 Introduction
There is increasing research interest in using data mining in education. Data
mining plays a vital role in educational institutions and is one of the most
important areas of research, with the objective of finding important information in
data. As reported by The Star on Friday, 30 October 2020, Higher Education Minister
Datuk Dr. Noraini Ahmad said the ministry had implemented several initiatives to
facilitate students’ access to the online teaching and learning process in response
to COVID-19. Students who will sit for the SPM examination are clearly affected as
well.
The main objective of higher education is to provide quality education to
students and to improve the quality of managerial decisions. One way to achieve the
highest level of quality in the education system is to discover knowledge from
educational data in order to study the main attributes that may affect students’
performance. The discovered knowledge can be used to offer helpful and constructive
recommendations to academic planners in higher education institutes, to enhance
their decision-making process, improve students’ performance and control the
failure rate, understand students’ behavior, improve teaching skills and bring many
other benefits.
1.2 Project Background
1.2.1 Data Mining
Data mining is an interdisciplinary field, applied in astronomy, business,
computer science, economics and other domains, that discovers new patterns in large
data sets. The actual data mining task is to analyze large quantities of data in
order to extract previously unknown patterns such as groups of data records
(cluster analysis), unusual records (anomaly detection) and dependencies
(association rule mining).
These patterns can be seen as a kind of summary of the input data and used in
further analysis. Data mining tasks can be classified as follows:
Anomaly detection: Outlier, change or deviation detection. The identification
of unusual data records that might be interesting, or of data errors that require
further investigation.
Association rule learning: Dependency modelling. The search for relationships
between variables.
Clustering: The task of discovering groups and structures in the data that are
in some way or another similar, without using known structures in the data.
Classification: The task of generalizing known structure to apply to new data.
Regression: The attempt to find a function that models the data with the least
error.
Educational data mining uses many techniques such as decision trees, neural
networks, rule induction and many others. Using these techniques, many kinds of
knowledge can be discovered, such as association rules, classifications and
clusterings.
1.2.2 Knowledge Discovery Database (KDD)
Knowledge Discovery Database (KDD)[1] is the process of discovering useful
knowledge from a collection of data. This widely used data mining process includes
data preparation and selection, data cleansing, incorporating prior knowledge about
the data sets and interpreting accurate solutions from the observed results. Figure
1.2 gives a basic outline of KDD.
Figure 1.2: Knowledge Discovery Database
1.3 Problem Statement
Data mining concepts and methods can be applied in various fields.
Educational data mining is an emerging branch of data mining that is applied to
data from the field of education. According to a statement by the Education
Minister, Dr. Mohd Radzi Md Jidin, reported in The Star, the SPM and STPM
examinations were postponed to 22 February and 8 March 2021 due to the conditional
Movement Control Order (MCO) in Malaysia. This study focuses on students in the B40
category from several schools in Ladang, Kuala Terengganu, who might be affected by
the current shift to online learning. Moreover, the students sitting for the SPM
2019/2020 examination are the young generation who will become the next leaders of
this country. Their performance is therefore a concern: either they are fully
prepared for the examination, or they are still affected and need more attention
before sitting for it.
1.4 Objectives
The objective of this thesis is to address the problem statement by analysing
students’ performance through data mining using machine learning. The project
focuses on the following objectives:
I. To study the data and its patterns in WEKA.
II. To apply the Decision Tree method using the ID3 algorithm to the data in WEKA.
III. To analyze and evaluate performance on the data and obtain the best result in
WEKA.
1.5 Scope
The scope of this thesis is to understand students’ behavior and activities
based on the data sets. In addition, the scope includes studying whether students’
performance is affected by online learning, using WEKA as the tool for this project.
1.6 Limitation of Work
Predictions cannot be made without real-world data, and collecting and preparing
that data is constrained by the following factors:
I. Cost
Each data set will be collected from several schools in the Ladang, Kuala
Terengganu area. Hundreds of students will be sitting for the SPM 2019/2020
examination. Each data set will therefore first be sorted in Microsoft Excel before
being converted to an ARFF data file.
II. Time
Time constraints may arise due to the COVID-19 pandemic and the MCO in certain
places. Preparing and training on a large data set can take a long time. In
addition, there is limited time to study the whole concept of the decision tree
approach.
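The spreadsheet-to-ARFF step mentioned above can be sketched as follows. This is only a minimal illustration, assuming the Excel sheet has first been exported as CSV and that every attribute is nominal. The attribute names (attendance, result), the relation name and the helper csv_to_arff are hypothetical and are not taken from the collected data.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row plus nominal values) into ARFF text."""
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    header, records = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for col, name in enumerate(header):
        # Collect the distinct values of this column, in first-seen order.
        values = list(dict.fromkeys(rec[col] for rec in records))
        lines.append(f"@attribute {name} {{{','.join(values)}}}")
    lines += ["", "@data"]
    lines += [",".join(rec) for rec in records]
    return "\n".join(lines) + "\n"


# Hypothetical sample; the real attributes come from the collected data.
sample = """attendance,result
Good,Good
Poor,Avg
Good,Poor
"""
arff = csv_to_arff(sample, "spm_students")
print(arff)
```

The resulting text can be saved with an .arff extension and opened in WEKA. Numeric attributes or values containing spaces would need extra handling that this sketch leaves out.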
1.7 Gantt Chart
Months October November December January
No Week 3 4 1 2 3 4 1 2 3 4 1 2 3 4
1 Final Year Project I Briefing; Topic Discussion & Determination
2 Project Title Proposal
3 Proposal Writing (Chapter 1 – Introduction)
4 Proposal Writing (Chapter 2 – Literature Review)
5 Proposal Writing (Continued)
6 Proposal Progress Presentation and Panel’s Evaluation
7 Proposal Writing (Chapter 3 – Methodology)
MID SEMESTER BREAK
8 Proof of Concept (POC) Methodology Workshop
9 Final Year Project Format Writing Workshop
10 Drafting Report of Proposal
11 Submit Draft of Report to Supervisor; Preparation for Final Presentation
12 Preparation for Final Presentation and Final Report Submission
13 Final Presentation and Panel’s Evaluation
14 Final Report Submission and Supervisor’s Evaluation
1.8 Summary of the Chapter
This chapter describes the topics that introduce the project: the project
background, the problem statement, the objectives, the scope and the limitations of
the work. Together they help to organize the documentation of the project.
CHAPTER 2
LITERATURE REVIEW
2.1 Introduction
This chapter reviews the information gathered on the analysis of students’
performance using data mining and machine learning. The sources are a journal
article and a few theses that discuss in depth the educational data mining
classification techniques used to improve students’ performance. The outcome of
this review is analyzed, the constraints and limitations of existing projects are
identified, and a few improvements are applied in this project.
Chapter 1 introduced the concept of data mining in WEKA. The collected data are
transformed into a form that is acceptable to the data mining software and
separated into two sets, the training data set and the testing data set, so that
they can be imported into the system. The training set enables the system to
observe relationships between the input data and the resulting outcomes in order to
perform prediction. The testing data set contains data used to test the performance
of the model.
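The separation into training and testing sets can be illustrated with a short sketch. The 30% test ratio, the fixed seed and the placeholder records are assumptions made for this example only and are not values taken from this project.

```python
import random


def train_test_split(records, test_ratio=0.3, seed=42):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


records = [f"student_{i}" for i in range(10)]  # placeholder records
train, test = train_test_split(records)
print(len(train), len(test))
```

Shuffling before the split keeps the test set from depending on the order in which records were entered. The model is fitted on train only and then scored on test.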
2.2 Related Works
Attribute              Description                                                          Values
Graduation             Percentage of marks obtained in graduation.                          Good, Avg, Poor
Attendance             Attendance of the student.                                           Good, Avg, Poor
Assignment             Assignment performance given during the semester.                    Good, Avg, Poor
Unit Test Performance  Percentage marks obtained by a student in Unit Test.                 Good, Avg, Poor
University Result      Percentage marks obtained by the student in university examination.  Good, Avg, Poor
Table 2.0: Attributes and Its Possible Values
For data collection and preparation, Borkar and Rajeswari[2] considered data on
students pursuing the Master of Computer Application (MCA) degree at Pune
University. On the basis of the data collected, several attributes were considered
for predicting students’ performance in the university examination. The variables
used for judging the students’ performance in university results are Graduation,
Attendance, Assignment, Unit Test and University Result.
One of the important steps of the data mining process is data pre-processing,
which identifies missing values, noisy data, and irrelevant or redundant
information in the data set. In that study, the attributes mentioned above are
recorded as percentages.
Attribute            Range
Graduation%          Graduation% >= 70% = Good.
                     60% <= Graduation% < 70% = Avg.
                     Graduation% < 60% = Poor.
Attendance%          Attendance% >= 70% = Good.
                     60% <= Attendance% < 70% = Avg.
                     Attendance% < 60% = Poor.
Assignment%          Assignment% >= 70% = Good.
                     60% <= Assignment% < 70% = Avg.
                     Assignment% < 60% = Poor.
Unit Test%           Unit Test% >= 70% = Good.
                     60% <= Unit Test% < 70% = Avg.
                     Unit Test% < 60% = Poor.
University Result%   University Result% >= 70% = Good.
                     60% <= University Result% < 70% = Avg.
                     University Result% < 60% = Poor.
Table 2.1: Categorization of Attributes
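The banding in Table 2.1 amounts to a simple threshold rule, sketched below. The sketch assumes that the Poor band covers percentages below 60%, which is the only reading that does not overlap the Good and Avg bands. The function name categorize and the sample marks are illustrative and do not come from the cited study.

```python
def categorize(percent):
    """Map a percentage to the Good / Avg / Poor bands of Table 2.1."""
    if percent >= 70:
        return "Good"
    if percent >= 60:
        return "Avg"
    return "Poor"


marks = {"Graduation": 72.5, "Attendance": 64.0, "Unit Test": 48.0}  # made-up values
labels = {name: categorize(value) for name, value in marks.items()}
print(labels)
```

Converting percentages to nominal labels in this way gives ID3, which works on categorical attributes, a small fixed set of values to split on.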
In the research paper “Predicting Students Academic Performance Using
Education Data Mining”, Suchita Borkar and K. Rajeswari studied a data set of 60
students from the MCA course, obtained from the M.C.A. department of Pimpri
Chinchwad College of Engineering, Pune University. They found various association
rules between attributes such as students’ graduation percentage, attendance,
assignment work and unit test performance, and showed how these attributes affect
the students’ university result. The number of association rules found varies with
the chosen confidence value.
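To make the role of the confidence value concrete, the sketch below computes support and confidence for a single rule. The records and the rule itself are made up for illustration and are not results from the cited paper. The helper names support and confidence are likewise assumptions of this sketch.

```python
def support(transactions, items):
    """Fraction of records that contain every (attribute, value) pair in items."""
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) from the records."""
    return support(transactions, antecedent | consequent) / support(transactions, antecedent)


# Made-up records, each a set of (attribute, value) pairs.
records = [
    {("Attendance", "Good"), ("Result", "Good")},
    {("Attendance", "Good"), ("Result", "Good")},
    {("Attendance", "Good"), ("Result", "Avg")},
    {("Attendance", "Poor"), ("Result", "Poor")},
]
rule_if = {("Attendance", "Good")}
rule_then = {("Result", "Good")}
print(support(records, rule_if | rule_then))    # 2 of the 4 records match the whole rule
print(confidence(records, rule_if, rule_then))  # 2 of the 3 records with good attendance
```

Raising the minimum confidence threshold keeps only the rules whose consequent holds in a larger share of the matching records, which is why fewer rules survive at higher thresholds.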
In the research paper “An Analysis of Student’s Performance Using
Classification Algorithms”, Mrs. M.S. Mythili and Dr. A.R. Mohamed Shanavas[3]
note that WEKA is an open source software system that implements a large collection
of machine learning algorithms and is widely used in data mining applications.
Students’ academic performance is influenced by various factors such as parents’
education, locality, economic status, attendance, gender, result and many others.
The Classify panel allows the user to apply any classification algorithm to the
data set, estimate the accuracy of the resulting predictive model and visualize the
model. The decision tree classifier C4.5 (J48), Random Forest, Neural Network
(Multilayer Perceptron), the lazy classifier (IB1) and the rule-based classifier
(Decision Table) were run in WEKA, with 10-fold cross-validation chosen under “Test
options”.
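The idea behind 10-fold cross-validation can be sketched as below. This is a plain, unshuffled partition written for illustration. It omits the shuffling and stratification that a tool such as WEKA may apply, and the sample count of 25 is arbitrary.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(25, k=10))
for train_idx, test_idx in folds[:2]:
    print(len(train_idx), len(test_idx))
```

A model is trained ten times, each time holding out a different fold for testing, and the ten accuracy figures are averaged into one estimate.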
Another study reviewed is “Prediction and Analysis of Student Performance by
Data Mining in WEKA” by Agnik Dey, Abhirup Khasnabis and Ajeet Kumar[4]. The paper
presents data mining in an educational environment that identifies students’
failure patterns using the association rule mining technique. The technique is used
to find hidden patterns and to evaluate students’ performance and trends, with the
Apriori algorithm used to find associations among attributes. The students’
academic performance was evaluated based on academic and personal data collected
from the college’s last semester results, after which the J48 classification
algorithm was applied. The data mining tool used in the experiment was WEKA 3.8.2.
Based on the accuracy and the classification errors, the authors concluded that J48
was the most suitable algorithm for the data set. The Apriori algorithm was applied
to the data set in WEKA to analyze overall student performance through some of the
best rules. The data may be extended to include extra-curricular aspects and
technical skills of the students, and mined with different classification
algorithms to predict student performance.
In another study, “Mining Educational Data to Improve Student’s Performance: A
Case Study”, Mohammed M. Abu Tair and Alaa M. El-Halees[5] studied data mining in
higher education, particularly to improve graduate students’ performance. They
applied data mining techniques to discover knowledge. In particular, they
discovered association rules and sorted them using the lift metric. They also
clustered the students into groups using the K-Means clustering algorithm. Finally,
they used outlier detection to detect all outliers in the data, using two methods:
the Distance-Based Approach and the Density-Based Approach. Each of these tasks can
be used to improve the performance of graduate students.
2.3 Summary of the Chapter
This chapter has summarised the methods and algorithms used in related work to
evaluate performance on data in WEKA. This review is essential for forming ideas
and serves as a guide for carrying out the project efficiently.
CHAPTER 3
METHODOLOGY
3.1 Introduction
This chapter introduces the methodology proposed for this project and develops
the idea by presenting a framework, system model, data set and flowchart of the
project. It starts with a case study that can be used in this project, followed by
a discussion of how WEKA is used to simulate the flow of the data set in a
real-world setting.
3.2 Research of Methodology
In research methodology, preparation is essential to developing the project. A
few phases of the methodology apply to this project. Figure 3.2 shows the phases of
the project development.
Figure 3.2: Data Mining Work Methodology
Before applying data mining techniques to the data set, there should be a
methodology that governs the work. The methodology starts with the problem
definition, followed by preprocessing, both of which are discussed in the
introduction; then the data mining methods, which are association, classification,
clustering and outlier detection; then the evaluation of results and patterns; and
finally the knowledge representation process.
3.3 Framework of Knowledge Discovery Database (KDD)
Figure 3.3: Framework of KDD
The overall process of finding and interpreting patterns from data involves the
repeated application of the following steps[6]:
i. Developing an understanding of the goals of the end-user.
ii. Creating a target data set (selecting a data set, focusing on a subset of variables,
data sample).
iii. Data preprocessing (strategies for handling missing data fields, collecting
necessary information to the model).
iv. Choosing the data mining task (deciding whether the goal of the KDD process is
classification, clustering, etc).
v. Choosing the data mining algorithm (selecting methods and parameters to be
used).
vi. Data Mining (searching for patterns of interest in a particular representational
form or set).
vii. Evaluating result of discovered knowledge.
3.4 Decision Tree Classifiers
Figure 3.4: Decision Tree Classification in Python
A decision tree[7] is a flow chart resembling a tree structure, in which every
internal node is denoted by a rectangle and every leaf node by an oval. It is a
commonly used algorithm because it is easy to implement and easier to understand
than other classification algorithms. A decision tree starts with a root node that
helps users take the required actions. From this node, every node is split
recursively according to the decision tree learning algorithm. The final result is
a decision tree in which each branch represents an outcome.
A decision tree is also a visual representation of a reasoning process and is
particularly suitable for solving classification problems. Each leaf node is
labelled with a class label, which is decided by the class of the records that
ended up in that leaf during training. A leaf node may also contain a value that
depends on the average of the values of such records.
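How a record travels from the root to a leaf can be sketched with a small hand-written tree. The tree below is invented for illustration and was not learned from data. The nested-dictionary layout, the attribute values and the function name classify are assumptions of this sketch.

```python
# A hand-written tree for illustration only; it was not learned from data.
# Internal nodes are dicts {"attribute": name, "branches": {value: subtree}};
# leaves are plain class labels.
tree = {
    "attribute": "Attendance",
    "branches": {
        "Good": {
            "attribute": "Assignment",
            "branches": {"Good": "Good", "Avg": "Good", "Poor": "Avg"},
        },
        "Avg": "Avg",
        "Poor": "Poor",
    },
}


def classify(node, record):
    """Follow the branch matching the record's value until a leaf is reached."""
    while isinstance(node, dict):
        node = node["branches"][record[node["attribute"]]]
    return node


student = {"Attendance": "Good", "Assignment": "Poor"}
print(classify(tree, student))
```

Each internal node tests one attribute and each branch corresponds to one outcome of that test, so classification is a single walk down the tree.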
3.5 Iterative Dichotomiser 3
ID3, which stands for Iterative Dichotomiser 3[8], is a classification algorithm
that follows a greedy approach to building a decision tree: at each step it selects
the attribute that yields the maximum Information Gain (IG), or equivalently the
minimum Entropy (H) after the split. ID3 uses information gain as its attribute
selection measure. The main structure of building a decision tree with the ID3
algorithm is summarized in Figure 3.5.
Figure 3.5: ID3 Algorithm
3.6 Entropy (H)
Entropy is a measure of the amount of uncertainty in the dataset S. The
mathematical representation of entropy is:
\[ H(S) = \sum_{c \in C} -p(c) \log_2 p(c) \]
where,
S - The current dataset for which entropy is being calculated (this changes at
every iteration of the ID3 algorithm).
C - The set of classes in S, for example C = {yes, no}.
p(c) - The proportion of the number of elements in class c to the number of
elements in set S.
In ID3, the entropy that remains after splitting is calculated for each
remaining attribute, and the attribute with the smallest entropy is used to split
the set S in that iteration. An entropy of 0 means the set is pure, that is, all of
its elements belong to the same class.
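The entropy formula can be checked numerically with a few lines of code. This is a minimal sketch in which the class counts (9 yes, 5 no) are illustrative and do not come from the project's data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """H(S) = sum over classes c of -p(c) * log2 p(c)."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


mixed = ["yes"] * 9 + ["no"] * 5   # illustrative class counts
pure = ["yes"] * 4                 # every element in the same class
print(round(entropy(mixed), 3))
print(entropy(pure))
```

A set split evenly between two classes has an entropy of 1 bit, the maximum for two classes, while a pure set has an entropy of 0.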
3.7 Information Gain (IG)
Information gain IG(A, S) measures how much the uncertainty in S is reduced
after splitting set S on attribute A. The mathematical representation of
information gain is:
\[ IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t) \]
where,
H(S) - Entropy of set S.
T - The subsets created from splitting set S by attribute A such that
S = \bigcup_{t \in T} t.
p(t) - The proportion of the number of elements in t to the number of elements
in set S.
H(t) - Entropy of subset t.
In ID3, information gain can be calculated (instead of entropy) for each
remaining attribute. The attribute with the largest information gain is used to split the
set S on that particular iteration.
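The selection rule can be put together into a compact ID3 sketch that follows the entropy and information gain definitions given here. It is an illustration under simplifying assumptions (nominal attributes only, no pruning, no handling of missing values) and is not WEKA's implementation. The records are made up, using attribute names from Table 2.0.

```python
from collections import Counter
from math import log2


def entropy(rows, target):
    """Entropy of the target class over the given rows."""
    counts = Counter(row[target] for row in rows)
    return sum(-(n / len(rows)) * log2(n / len(rows)) for n in counts.values())


def information_gain(rows, attribute, target):
    """IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attribute], []).append(row)
    remainder = sum(len(t) / len(rows) * entropy(t, target) for t in subsets.values())
    return entropy(rows, target) - remainder


def id3(rows, attributes, target):
    """Return a nested dict tree; leaves are class labels."""
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1 or not attributes:
        # Pure node, or nothing left to split on: use the majority class.
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({row[best] for row in rows}):
        subset = [row for row in rows if row[best] == value]
        branches[value] = id3(subset, remaining, target)
    return {"attribute": best, "branches": branches}


# Made-up records using attribute names from Table 2.0.
data = [
    {"Attendance": "Good", "Assignment": "Good", "Result": "Good"},
    {"Attendance": "Good", "Assignment": "Poor", "Result": "Good"},
    {"Attendance": "Poor", "Assignment": "Good", "Result": "Poor"},
    {"Attendance": "Poor", "Assignment": "Poor", "Result": "Poor"},
]
tree = id3(data, ["Attendance", "Assignment"], "Result")
print(tree)
```

In this toy data Attendance separates the classes perfectly, so it has the larger information gain and becomes the root, while Assignment is never used.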
3.8 Project Flowchart
Figure 3.8: System Flowchart For Training of The Data
3.9 Summary of the Chapter
This chapter discussed the general architecture and techniques of data mining.
It briefly explained the data mining workflow of this project and defined the
decision tree that will be used to analyze students’ performance. The ID3 algorithm
will be implemented to obtain a better decision tree structure as the final result
of this project. The algorithm will be applied to build the decision tree in WEKA,
and the expected final result, a graph, will be presented in the next chapter.
CHAPTER 4
CONCLUSION
4.1 Introduction
This chapter concludes the documentation of this project in terms of its
contribution, that is, what the system can provide to the user. It also explains
the weaknesses and limitations of the project and gives some recommendations for
future work that can improve the system.
4.2 Future Work
This report shows how education-related data can be used to improve students’
performance. We conclude that student performance data sets are valuable for
predicting the attributes that affect students’ academic routine. Furthermore, this
process is important for improving educational quality, which is vital to keeping
students in school. We used data mining techniques to discover hidden knowledge,
namely classification with the J48 algorithm, which is used to predict the
performance of students. In future, we will aim to improve classification accuracy
by using other data mining techniques such as K-Nearest Neighbor or Naive Bayes.
REFERENCES
[1] Agnik Dey, Abhirup Khasnabis, Ajeet Kumar, “Prediction and Analysis of
Student Performance by Data Mining in WEKA”, Canal South Road,
Beliaghata Kolkata - 700015, West Bengal University of Technology.
[2] Suchita Borkar, K. Rajeswari, “Predicting Students Academic Performance
Using Education Data Mining”, IJCSMC, Vol. 2, Issue 7, July 2013, pp.
273–279.
[3] M.S. Mythili, A.R. Mohamed Shanavas, “An Analysis of Students’
Performance Using Classification Algorithms”, ISSN: 2278-0661, p-ISSN:
2278-8727, Volume 16, Issue 1, Ver. III (Jan. 2014), pp. 63–69.
[4] Agnik Dey, Abhirup Khasnabis, Ajeet Kumar, “Prediction and Analysis of
Student Performance by Data Mining in WEKA”. Retrieved from
https://www.rcciit.org/students_projects/projects/it/2018/GR4.pdf
[5] Alaa M. El-Halees, Mohammed M. Abu Tair, “Mining Educational Data to
Improve Students’ Performance: A Case Study”, International Journal of
Information and Communication Technology Research, 2012.
[6] Fayyad, Piatetsky-Shapiro, Smyth, “From Data Mining to Knowledge
Discovery: An Overview”, in Fayyad, Piatetsky-Shapiro, Smyth,
Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI
Press / The MIT Press, Menlo Park, CA, 1996, pp. 1–34.
[7] Ogunde A. O., Ajibade D. A., “A Data Mining System for Predicting
University Student’s Graduation Grades Using ID3 Decision Tree Algorithm”,
Computer Science and Information Technology, Vol. 2(1), March 2014.
[8] Retrieved from
https://medium.com/datadriveninvestor/tree-algorithms-id3-c4-5-c5-0-and-cart-413387342164