
ANALYSIS OF SPM 2019/2020 STUDENT’S PERFORMANCE USING ID3 ALGORITHM AND

J48 CLASSIFICATION

AIMAN DZAFRI BIN HUSNI BALIS

BACHELOR OF COMPUTER SCIENCE (COMPUTER NETWORK SECURITY) WITH

HONOURS

UNIVERSITI SULTAN ZAINAL ABIDIN

2021

ANALYSIS OF SPM 2019/2020 STUDENT’S PERFORMANCE BY DATA MINING USING MACHINE LEARNING

AIMAN DZAFRI BIN HUSNI BALIS

BACHELOR OF COMPUTER SCIENCE (COMPUTER NETWORK SECURITY) WITH HONOURS

Universiti Sultan Zainal Abidin

2021


DECLARATION

I hereby declare that the report is based on my original work except for quotations and

citations, which have been duly acknowledged. I also declare that it has not been

previously or concurrently submitted for any other degree at Universiti Sultan Zainal

Abidin or other institutions.

_______________________________
Name: Aiman Dzafri Bin Husni Balis

Date:


CONFIRMATION

This is to confirm that:

The research conducted and the writing of this report were under my supervision.

_______________________________
Name: Mrs. Roslinda Binti Muda

Date:


DEDICATION

In the name of Allah, the Most Gracious and the Most Merciful, I thank Allah, all praise be to Him, who has guided me and given me the strength to complete and submit this report, Analysis of SPM 2019/2020 Student’s Performance by Data Mining Using Machine Learning, in due time. Without His help, this study, which required untiring effort, could not have been completed within the time limit.

I would like to take this opportunity to express my gratitude to my supervisor, Mrs. Roslinda Binti Muda, for her supervision and inspiration throughout my final year project. Without her time, support and guidance, it would have been impossible for me to finish my project successfully. Thank you for the kindness. May Allah bless her.

I would also like to extend my appreciation to both of my parents, who helped me in various ways, through both moral and financial support, to ensure that I could complete my project. Last but not least, I thank my classmates and course mates, who always helped me in many ways to complete the project within the time given.

ABSTRACT

In recent years, analysing students’ performance and maintaining the standard of education have become important problems for all educational institutions. Data mining is concerned with developing methods for discovering knowledge from data, which can be used to improve students’ performance and to support decision making in educational or academic systems. To mine the student performance data, decision tree classification models using the ID3 algorithm and J48 were built with 10-fold cross-validation in WEKA. The performance of the classification models is then tested and compared in terms of accuracy, confusion matrices and execution time.

CONTENTS

DECLARATION
CONFIRMATION
DEDICATION
ABSTRACT
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Project Background
1.3 Problem Statement
1.4 Objectives
1.5 Scope
1.6 Limitation of Work
1.7 Gantt Chart
1.8 Summary of the Chapter

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Related Work
2.3 Summary of the Chapter

CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Research of Methodology
3.3 Framework of Knowledge Discovery Database
3.4 Decision Tree Classifiers
3.5 Iterative Dichotomiser 3
3.6 Entropy (H)
3.7 Information Gain (IG)
3.8 Project Flowchart
3.9 Summary of the Chapter

CHAPTER 4 CONCLUSION
4.1 Introduction
4.2 Future Work

REFERENCES

LIST OF TABLES

Table 2.0: Attributes and Their Possible Values

Table 2.1: Categorization of Attributes

LIST OF FIGURES

Figure 1.2: Knowledge Discovery Database

Figure 3.2: Data Mining Work Methodology

Figure 3.3: Framework of KDD

Figure 3.4: Decision Tree Classification in Python

Figure 3.5: ID3 Algorithm

Figure 3.8: System Flowchart For Training of The Data

LIST OF ABBREVIATIONS

UniSZA - Universiti Sultan Zainal Abidin

SPM - Sijil Pelajaran Malaysia

KDD - Knowledge Discovery Database

MCO - Movement Control Order

ID3 - Iterative Dichotomiser 3

AVG - Average

MLP - Multilayer Perceptron

H - Entropy

IG - Information Gain

CHAPTER 1

INTRODUCTION

1.1 Introduction

There is increasing research interest in using data mining in education. Data mining plays a vital role in educational institutions and is one of the most important areas of research, with the objective of finding important information in data. According to The Star (Friday, 30 October 2020), Higher Education Minister Datuk Dr. Noraini Ahmad said the ministry has taken steps by implementing several initiatives to facilitate students’ access to the online teaching and learning process due to COVID-19. Clearly, students who will take the SPM examination will also be affected.

The main objective of higher education is to provide quality education to students and to improve the quality of managerial decisions. One way to achieve the highest level of quality in an education system is to discover knowledge from educational data in order to study the main attributes that may affect students’ performance. The discovered knowledge can be used to offer helpful and constructive recommendations to academic planners in higher education institutes, so that they can enhance their decision-making process, improve students’ performance, control the failure rate, understand students’ behaviour, improve teaching skills and gain many other benefits.

1.2 Project Background

1.2.1 Data Mining

Data mining is an interdisciplinary field, applied in astronomy, business, computer science, economics and other areas, that discovers new patterns in large data sets. The actual data mining task is to analyze large quantities of data in order to extract previously unknown patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining).

These patterns can be seen as a kind of summary of the input data and can be used in further analysis. Data mining tasks can be classified as follows:

Anomaly detection: Outlier, change or deviation detection. The identification of unusual data records that might be interesting, or of data errors that require further investigation.

Association rule learning: Dependency modelling. The search for relationships between variables.

Clustering: The task of discovering groups and structures in the data that are in some way similar, without using known structures in the data.

Classification: The task of generalizing known structure to apply it to new data.

Regression: The attempt to find a function that models the data with the least error.

Educational data mining uses many techniques such as decision tree, neural

networks, rule induction and many others. By using these techniques, many kinds of

knowledge can be discovered such as association rules, classifications and clustering.

1.2.2 Knowledge Discovery Database (KDD)

Knowledge Discovery Database (KDD)[1] is the process of discovering useful

knowledge from a collection of data. This widely used data mining technique is a

process that includes data preparation and selection, data cleansing, incorporating

prior knowledge on data sets and interpreting accurate solutions from the observed

results. Here is a basic outline of KDD.

Figure 1.2: Knowledge Discovery Database


1.3 Problem Statement

Data mining concepts and methods can be applied in various fields. Educational data mining is an emerging branch of data mining that is applied to data from the field of education. According to a statement by the Education Minister, Dr. Mohd Radzi Md Jidin, reported in The Star, the SPM and STPM examinations were postponed to 22 February and 8 March 2021 due to the conditional Movement Control Order (MCO) in Malaysia. This project focuses on students in the B40 category from several schools in Ladang, Kuala Terengganu, who might be affected by the current online learning. These students, who will sit the SPM 2019/2020 examination, are the young generation that will go on to lead this country. Their performance is therefore a concern: either they are fully prepared for the examination, or they are still affected and need more attention before sitting it.

1.4 Objectives

The objective of this thesis is to address the problem statement by analysing the students’ performance through data mining using machine learning. The project focuses on the objectives below:

I. To study the data and its patterns in WEKA.

II. To apply the decision tree method using the ID3 algorithm to the data in WEKA.

III. To analyze and evaluate performance on the data and obtain the best result in WEKA.

1.5 Scope

The scope of this thesis is to understand the students’ behaviour and activities based on the data sets. In addition, the project uses WEKA to study whether the students’ performance was affected during online learning.

1.6 Limitation of Work

The analysis cannot be carried out without real-world data, which leads to the following limitations:

I. Cost

Each data set will be collected from several schools in the Ladang, Kuala Terengganu area. Hundreds of students will sit the SPM 2019/2020 examination. Each data set will therefore be sorted in Microsoft Excel before being converted to an ARFF data set file.

II. Time

Time constraints might arise due to the COVID-19 pandemic and the MCO in certain places. Preparing and training on a large data set can take a long time. In addition, there is limited time to study the whole concept of the decision tree approach.
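The conversion from a spreadsheet export to ARFF mentioned under the first limitation can be sketched as follows. This is only an illustration under stated assumptions: the spreadsheet is first saved as CSV, every column is treated as a nominal attribute, and names and values contain no spaces or commas (ARFF would otherwise require quoting). The function name `csv_to_arff`, the relation name and the sample rows are hypothetical and do not come from the project data.

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Convert CSV text (header row first) to ARFF text with nominal attributes."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = [f"@relation {relation}", ""]
    for index, name in enumerate(header):
        # The nominal domain of an attribute is the set of values seen in its column.
        nominal = ",".join(sorted({row[index] for row in data}))
        lines.append(f"@attribute {name} {{{nominal}}}")
    lines += ["", "@data"]
    lines += [",".join(row) for row in data]
    return "\n".join(lines) + "\n"


# Hypothetical sample rows, invented for illustration only.
sample_csv = "Attendance,Assignment,Result\nGood,Avg,Pass\nPoor,Poor,Fail\n"
arff = csv_to_arff(sample_csv, "student_performance")
print(arff)
```

The resulting text starts with an `@relation` line, declares one `@attribute` line per column and lists the records after `@data`, which is the layout WEKA expects in an ARFF file.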


1.7 Gantt Chart

Months: October (Weeks 3-4), November (Weeks 1-4), December (Weeks 1-4), January (Weeks 1-4)

No   Activity
1    Final Year Project I Briefing; Topic Discussion & Determination
2    Project Title Proposal
3    Proposal Writing (Chapter 1 – Introduction)
4    Proposal Writing (Chapter 2 – Literature Review)
5    Proposal Writing (Continued)
6    Proposal Progress Presentation and Panel’s Evaluation
7    Proposal Writing (Chapter 3 – Methodology)
     MID SEMESTER BREAK
8    Proof of Concept (POC) Methodology Workshop
9    Final Year Project Format Writing Workshop
10   Drafting Report of Proposal
11   Submit Draft of Report to Supervisor; Preparation for Final Presentation
12   Preparation for Final Presentation and Final Report Submission
13   Final Presentation and Panel’s Evaluation
14   Final Report Submission and Supervisor’s Evaluation

1.8 Summary of the Chapter

This chapter covered the introductory topics of the project: the background, the problem statement, the objectives, the scope and the limitations of the work. It thereby helps to organize the documentation of the project.

CHAPTER 2

LITERATURE REVIEW

2.1 Introduction

This chapter analyzes the information gathered on the analysis of students’ performance using data mining and machine learning. The sources are a journal article and a few theses that discuss in depth the educational data mining classification techniques used to improve students’ performance. The outcome of the information gathering is analyzed, some constraints and limitations of the existing projects are determined, and a few improvements are applied in this project.

Chapter 1 introduced the concept of data mining in WEKA. The collected data were transformed into a form acceptable to the data mining software and separated into two sets, the training data set and the testing data set, so that they could be imported into the system. The training set was used to enable the system to observe relationships between the input data and the resulting outcomes in order to perform prediction. The testing data set contains the data used to test the performance of the model.

2.2 Related Works

Attribute               Description                                                          Values
Graduation              Percentage of marks obtained in graduation.                          Good, Avg, Poor
Attendance              Attendance of the student.                                           Good, Avg, Poor
Assignment              Assignment performance given during the semester.                    Good, Avg, Poor
Unit Test Performance   Percentage marks obtained by a student in Unit Test.                 Good, Avg, Poor
University Result       Percentage marks obtained by the student in university examination.  Good, Avg, Poor

Table 2.0: Attributes and Their Possible Values

For data collection and preparation, the authors considered data on students pursuing a Master of Computer Application (MCA) degree at Pune University[2]. From the collected data, several attributes were selected to predict students’ performance in the university examination. The variables used for judging students’ performance in the university results are Graduation, Attendance, Assignment, Unit Test and University Result.

One of the important steps of the data mining process is data pre-processing, which identifies missing values, noisy data, and irrelevant or redundant information in the data set. In this study, the values of the above-mentioned attributes are expressed as percentages.

Attribute            Range
Graduation%          Graduation% >= 70% = Good.
                     60% <= Graduation% < 70% = Avg.
                     Graduation% < 60% = Poor.
Attendance%          Attendance% >= 70% = Good.
                     60% <= Attendance% < 70% = Avg.
                     Attendance% < 60% = Poor.
Assignment%          Assignment% >= 70% = Good.
                     60% <= Assignment% < 70% = Avg.
                     Assignment% < 60% = Poor.
Unit Test%           Unit Test% >= 70% = Good.
                     60% <= Unit Test% < 70% = Avg.
                     Unit Test% < 60% = Poor.
University Result%   University Result% >= 70% = Good.
                     60% <= University Result% < 70% = Avg.
                     University Result% < 60% = Poor.

Table 2.1: Categorization of Attributes
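The categorization in Table 2.1 can be sketched as a small function. This is a minimal illustration that assumes scores below 60% are labelled Poor, as the Good and Avg ranges imply. The function name `categorize` and the sample record are hypothetical.

```python
def categorize(percentage: float) -> str:
    """Map a percentage score to the Good/Avg/Poor label of Table 2.1."""
    if not 0 <= percentage <= 100:
        raise ValueError("percentage must be between 0 and 100")
    if percentage >= 70:
        return "Good"
    if percentage >= 60:
        return "Avg"
    return "Poor"


# Hypothetical student record, invented for illustration only.
record = {"Graduation": 72.5, "Attendance": 64.0, "Assignment": 58.0}
labels = {attribute: categorize(value) for attribute, value in record.items()}
print(labels)  # {'Graduation': 'Good', 'Attendance': 'Avg', 'Assignment': 'Poor'}
```

Because all five attributes share the same thresholds, one function is enough to discretize every column before the data are passed to a classifier.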

In the research paper “Predicting Students Academic Performance Using Education Data Mining”, Suchita Borkar and K. Rajeswari used a data set of 60 students from the MCA course, obtained from the M.C.A department of Pimpri Chinchwad College of Engineering, Pune University. They found various association rules between attributes such as students’ graduation percentage, attendance, assignment work and unit test performance, and showed how these attributes affect the students’ university result. The number of association rules found varies with the confidence value.

In the research paper “An Analysis of Student’s Performance Using Classification Algorithms”, Mrs. M.S. Mythili and Dr. A.R. Mohamed Shanavas[3] state that WEKA is an open source software system that implements a large collection of machine learning algorithms and is widely used in data mining applications. Students’ academic performance is influenced by various factors such as parents’ education, locality, economic status, attendance, gender, result and many others. The Classify panel allows the user to apply any classification algorithm to the data set, to estimate the accuracy of the resulting predictive model and to visualize the model. The decision tree classifier C4.5 (J48), Random Forest, neural network (Multilayer Perceptron), lazy classifier (IB1) and rule-based classifier (Decision Table) were applied in WEKA, with 10-fold cross-validation chosen under “Test Option”.
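To make the idea of 10-fold cross-validation concrete, the following sketch splits record indices into ten folds, each of which serves once as the test set while the other nine form the training set. It is a plain illustration of the technique and not WEKA's own procedure, which may differ (for example, by stratifying the folds by class). The function name `k_fold_indices` and the fixed seed are assumptions, and the sample size of 60 is used only as an example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(60, k=10))
for train, test in splits:
    print(len(train), len(test))  # 54 training and 6 test records per fold
```

The accuracy reported for a classifier is then the average over the ten test folds, so every record is used for testing exactly once.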

An extensive literature survey was carried out in “Prediction and Analysis of Student Performance by Data Mining in WEKA” by Agnik Dey, Abhirup Khasnabis and Ajeet Kumar[4]. The paper presents data mining in an educational environment that identifies students’ failure patterns using association rule mining, which is used to find hidden patterns and to evaluate students’ performance and trends. The Apriori algorithm is used to find associations among attributes. The students’ academic performance was evaluated based on academic and personal data collected from the college’s last semester results, after which the J48 classification algorithm was applied. The data mining tool used in the experiment was WEKA 3.8.2. Based on the accuracy and the classification errors, the authors conclude that J48 was the most suitable algorithm for the data set. The Apriori algorithm was applied to the data set in WEKA to analyze overall student performance through some of the best rules. The data may be extended with extra-curricular aspects and technical skills of the students and mined with different classification algorithms to predict student performance.

In other research, “Mining Educational Data to Improve Student’s Performance: A Case Study”, Mohammed M. Abu Tair and Alaa M. El-Halees[5] studied data mining in higher education, particularly to improve graduate students’ performance. They applied data mining techniques to discover knowledge. In particular, they discovered association rules and sorted them using the lift metric. They also clustered the students into groups using the K-Means clustering algorithm. Finally, they used outlier detection to detect all outliers in the data, using two methods: the Distance-Based Approach and the Density-Based Approach. Each of these tasks can be used to improve the performance of graduate students.


2.3 Summary of the Chapter

This chapter reviewed the simulations, methods and algorithms used to evaluate the performance of data in WEKA. This study is essential for gathering ideas and serves as a guide towards an efficient project.

CHAPTER 3

METHODOLOGY

3.1 Introduction

This chapter introduces the methodology proposed for this project and develops the idea by presenting a framework, system model, data set and flowchart of the project. It starts with a case study that can be used in this project, followed by a discussion of the simulation technique, which uses WEKA to simulate the flow of the data set in the real world.

3.2 Research of Methodology

In the research methodology, careful preparation is essential to developing the project. The methodology consists of a few phases. Figure 3.2 shows the phases of this project’s development.

Figure 3.2: Data Mining Work Methodology

Before applying data mining techniques to the data set, there should be a methodology that governs the work. The methodology starts with the problem definition and preprocessing, which are discussed in the introduction. It continues with the data mining methods, which are association, classification, clustering and outlier detection, followed by the evaluation of results and patterns, and finally the knowledge representation process.

3.3 Framework of Knowledge Discovery Database (KDD)

Figure 3.3: Framework of KDD

The overall process of finding and interpreting patterns from data involves the

repeated application of the following steps[6]:


i. Developing an understanding of the goals of the end-user.

ii. Creating a target data set (selecting a data set, focusing on a subset of variables,

data sample).

iii. Data preprocessing (strategies for handling missing data fields, collecting

necessary information to the model).

iv. Choosing the data mining task (deciding whether the goal of the KDD process is

classification, clustering, etc).

v. Choosing the data mining algorithm (selecting methods and parameters to be

used).

vi. Data Mining (searching for patterns of interest in a particular representational form

or set).

vii. Evaluating result of discovered knowledge.

3.4 Decision Tree Classifiers

Figure 3.4: Decision Tree Classification in Python


A decision tree[7] is a flow chart resembling a tree structure, where every internal node is denoted by a rectangle and the leaf nodes are denoted by ovals. It is a commonly used algorithm because it is easy to implement and easier to understand than other classification algorithms. A decision tree starts with a root node that helps the users to take required actions. From this node, every node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents an outcome.

A decision tree is also a visual representation of a reasoning process and is particularly suitable for solving classification problems. Each leaf node is labelled with a class label, which is decided by the class of the records that ended up in that leaf during training. A leaf node may also contain a value that depends on the average of the values of such records.

3.5 Iterative Dichotomiser 3

ID3, which stands for Iterative Dichotomiser 3[8], is a classification algorithm that follows a greedy approach to building a decision tree: at each step it selects the attribute that yields the maximum Information Gain (IG) or, equivalently, the minimum Entropy (H) after the split. ID3 uses information gain as its attribute selection method. The main structure of building a decision tree with the ID3 algorithm is summarized in Figure 3.5.

Figure 3.5: ID3 Algorithm

3.6 Entropy (H)

Entropy is a measure of the amount of uncertainty in the dataset S. The mathematical representation of entropy is:

\[ H(S) = \sum_{c \in C} -p(c) \log_2 p(c) \]

where,

S - The current dataset for which entropy is being calculated (changes every iteration of the ID3 algorithm).

C - The set of classes in S (for example, C = {yes, no}).

p(c) - The proportion of the number of elements in class c to the number of elements in set S.

In ID3, the entropy resulting from a split is calculated for each remaining attribute. The attribute with the smallest entropy is used to split the set S on that particular iteration. Entropy = 0 implies that the set is pure, meaning that all of its elements belong to the same class.
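As a worked illustration of the entropy formula, the sketch below computes H(S) for a list of class labels. The function name `entropy` and the label counts are hypothetical and chosen only to show the two extremes (a pure set and an evenly mixed set) together with one intermediate case.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(S), in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    # Sum -p(c) * log2 p(c) over every class c present in the labels.
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


print(entropy(["yes"] * 4))                         # 0.0 (pure set)
print(entropy(["yes", "no"]))                       # 1.0 (evenly mixed)
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # 0.94
```

With two classes, entropy ranges from 0 for a pure set to 1 bit for an even split, so a lower value after a split indicates more homogeneous subsets.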

3.7 Information Gain (IG)

Information gain IG(A, S) tells us how much the uncertainty in S was reduced after splitting set S on attribute A. The mathematical representation of information gain is:

\[ IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t) \]

where,

H(S) - Entropy of set S.

T - The subsets created from splitting set S by attribute A, such that \( S = \bigcup_{t \in T} t \).

p(t) - The proportion of the number of elements in t to the number of elements in set S.

H(t) - Entropy of subset t.

In ID3, information gain can be calculated (instead of entropy) for each remaining attribute. The attribute with the largest information gain is used to split the set S on that particular iteration.
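The entropy and information gain definitions of Sections 3.6 and 3.7 can be combined into a compact sketch of the ID3 procedure. This is an illustrative implementation under stated assumptions, not the code used by WEKA: all attributes are nominal, the tree is represented as nested dictionaries, ties are broken by attribute order, and the four records are invented for the example.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy H(S), in bits, of a non-empty list of class labels."""
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    remainder = sum(len(t) / len(rows) * entropy(t) for t in subsets.values())
    return entropy([row[target] for row in rows]) - remainder


def id3(rows, attributes, target):
    """Build a decision tree as nested dicts; leaves are class labels."""
    labels = [row[target] for row in rows]
    # Stop when the node is pure or no attributes remain; use the majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = defaultdict(list)
    for row in rows:
        branches[row[best]].append(row)
    return {best: {value: id3(subset, remaining, target)
                   for value, subset in branches.items()}}


# Hypothetical records, invented for illustration only.
rows = [
    {"Attendance": "Good", "Assignment": "Good", "Result": "Pass"},
    {"Attendance": "Good", "Assignment": "Poor", "Result": "Pass"},
    {"Attendance": "Poor", "Assignment": "Good", "Result": "Fail"},
    {"Attendance": "Poor", "Assignment": "Poor", "Result": "Fail"},
]
gains = {a: information_gain(rows, a, "Result") for a in ("Attendance", "Assignment")}
tree = id3(rows, ["Attendance", "Assignment"], "Result")
print(gains)  # {'Attendance': 1.0, 'Assignment': 0.0}
print(tree)   # {'Attendance': {'Good': 'Pass', 'Poor': 'Fail'}}
```

In this toy data set, Attendance separates the classes perfectly, so it has the largest information gain and becomes the root, while Assignment carries no information about the result and is never used.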


3.8 Project Flowchart

Figure 3.8: System Flowchart For Training of The Data



3.9 Summary of the Chapter

This chapter discussed the general architecture and techniques of data mining. It then briefly explained the data mining workflow of this project and defined the decision tree that will be used to analyze students’ performance. The ID3 algorithm will be implemented to obtain a better decision tree structure as the final result of this project. The algorithm will be implemented as a decision tree in WEKA, and the expected final result, a graph, can be presented in the next chapter.

CHAPTER 4

CONCLUSION

4.1 Introduction

This chapter concludes the documentation of this project. It describes the project’s contribution, that is, what the system can provide to the user, explains the weaknesses and limitations of the project, and gives some recommendations for future work that can make the system better.

4.2 Future Work

This paper shows how students’ performance can be improved using education-related data. We conclude that student performance data sets are valuable for predicting the attributes that affect students’ academic routine. Furthermore, this process is important for improving educational quality, which is vital for encouraging students to stay in school. We used data mining techniques to discover hidden knowledge, applying classification with the J48 algorithm to predict the performance of students. In future, we will improve classification accuracy by using other data mining techniques such as K-Nearest Neighbor or Naive Bayesian.

REFERENCES

[1] Agnik Dey, Abhirup Khasnabis, Ajeet Kumar, “Prediction and Analysis of Student Performance by Data Mining in WEKA”, Canal South Road, Beliaghata, Kolkata - 700015, West Bengal University of Technology.

[2] Suchita Borkar, K. Rajeswari, “Predicting Students Academic Performance Using Education Data Mining”, IJCSMC, Vol. 2, Issue 7, July 2013, pp. 273-279.

[3] M.S. Mythili, A.R. Mohamed Shanavas, “An Analysis of Students’ Performance Using Classification Algorithms”, ISSN: 2278-0661, p-ISSN: 2278-8727, Volume 16, Issue 1, Ver. III (Jan. 2014), pp. 63-69.

[4] Agnik Dey, Abhirup Khasnabis, Ajeet Kumar, “Prediction and Analysis of Student Performance by Data Mining in WEKA”. Retrieved from https://www.rcciit.org/students_projects/projects/it/2018/GR4.pdf

[5] Alaa M. El-Halees, Mohammed M. Abu Tair, “Mining Educational Data to Improve Students’ Performance: A Case Study”, International Journal of Information and Communication Technology Research, 2012.

[6] Fayyad, Piatetsky-Shapiro, Smyth, “From Data Mining to Knowledge Discovery: An Overview”, in Fayyad, Piatetsky-Shapiro, Smyth, Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press / The MIT Press, Menlo Park, CA, 1996, pp. 1-34.

[7] Ogunde A. O., Ajibade D. A., “A Data Mining System for Predicting University Student’s Graduation Grades Using ID3 Decision Tree Algorithm”, Computer Science and Information Technology, Vol. 2(1), March 2014.

[8] Retrieved from https://medium.com/datadriveninvestor/tree-algorithms-id3-c4-5-c5-0-and-cart-413387342164
