sample journal

Society of Information Technology Students Journal 1

PROPOSED FORECASTING MODEL FOR THE STUDENTS’ ACADEMIC PERFORMANCE OF BSCS STUDENTS IN NEW ERA UNIVERSITY

Teddy Eddie Q. Dispo Jr.
Libis Dike 1, Brgy. Balite St., Montalban, Rodriguez Rizal
[email protected]

Rexiel Kenneth P. Tugano
#4 Manansala St., Krus na Ligas, Diliman, Quezon
[email protected]


ABSTRACT

Data mining is accepted as a decision-making tool that can facilitate better resource utilization in terms of students’ performance. It is essential for decision-makers to obtain early feedback on academic performance and on the effectiveness of different learning strategies. In this paper, data from Computer Science students were taken and various data mining methods were applied, with the goal of improving students’ academic performance and addressing the decreasing population of Computer Science students from first year to fourth year. The descriptive method was used to analyze the data and to forecast whether Bachelor of Science in Computer Science students will finish the course within the four-year span and graduate on time. To ensure impartiality of the data, the researchers used all elements of the population as the sample, making it more inclusive and representative, so that the study would have sufficient and adequate data for greater statistical efficiency. The aim of this study is to apply different data mining techniques and determine the model that best fits the forecasting of students’ academic performance. Two decision tree methods were used because they produce rules that are easy to interpret; of these methods, the ID3 algorithm gave 98.48% accurate results.

Keywords: Data Mining, Classification, Forecasting, Decision tree, Regression, Performance

1. INTRODUCTION

The ability to predict a student’s academic performance is very important in educational environments. Prediction models that include all personal, social, psychological and other environmental variables are necessary for the effective prediction of the performance of the students. Predicting student performance with high accuracy helps identify which students need special attention in their studies. The identified students should be assisted more by the teacher so that their performance will improve in the future [1].

Data mining extracts interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data. It can be applied to a number of different applications, such as data summarization, learning classification rules, finding associations, analyzing changes and detecting anomalies [2]. Data mining is a data analysis methodology used to identify hidden knowledge in large databases, and it has been successfully used in different areas, including the educational environment. Data mining methodology is used to study students’ performance and provides many tasks that can be used in predicting and forecasting academic performance.

The reasons for the good or bad performance of students should be one of the main interests of teachers. Teachers can plan and customize their teaching program based on the feedback of the students [3]. Data mining is a powerful analytical approach that can provide effective assistance in revealing the complex relationships behind students’ grades and performance [4].

2. METHODOLOGY

2.1 DESCRIPTIVE RESEARCH

This study described the phenomena and quantitatively analyzed the main features of a collection of information. A descriptive study is one in which information is collected without changing the environment, and it can involve a one-time interaction with the groups. Correlational research determines the relationship between two or more variables. The data are collected for the various variables, and correlational statistical techniques are then applied [5].

The researchers used all elements of the population as the sample, making it more inclusive and representative, so that the study would have sufficient and adequate data for greater statistical efficiency. The researchers also used different statistical tools to evaluate the criteria of the forecasting model, such as Percentage, Mean, Standard Deviation, Percentage Error, T-test, MAPE (mean absolute percentage error) and Multiple Linear Regression. Statistical software packages such as RapidMiner, SPSS and WEKA were used to process the data for faster and more reliable results.

3. RESEARCH FRAMEWORK

Figure 1: Framework for Academic Performance

The data from the student or applicant will be stored in a database. The system will get the data from the database and from flat files and combine them to determine which indicators or predictors will be used. The large data set will be filtered through cleansing and transformation so that the predictors can be used as input values for a forecasting model that predicts the probability that a student will finish the Bachelor of Science in Computer Science course within four years, and which students will not. The decision variables serve as the independent variables in this study, and the probability of graduating is the dependent variable. Pattern recognition provides a reasonable answer for all possible inputs, and the decision makers are involved in the visualization and validation of the results for the probability of graduating. As a whole, the decision makers have the influence to decide and can iterate the process of the proposed study to make the model more efficient and accurate.

4. EXPECTED OUTPUT

This section includes the analysis, interpretation and implications of the findings from the data gathered by the researchers, and looks forward to the probable occurrences that activate and modify a process. It also discusses the types of testing performed on the forecasting model in this study.

The data that the researchers used in the study were tabulated and placed into the data file using statistical software packages.

4.1 PREDICTORS IN FORECASTING STUDENTS’ ACADEMIC PERFORMANCE

The variables used in this study were divided into two types: independent variables and dependent variables. An independent variable, also known as a predictor variable, represents the inputs or causes being tested, while a dependent variable represents the output or effect.

The researchers had internal variables, which were the profile of the respondents, including student name, student number, subjects/subject codes and grades. These variables were considered the predictors, or independent variables, for BSCS students who can finish the course within four years, while graduation was the dependent variable, or output, used in this study.

The researchers showed the predictors to be considered, which were the subject codes of the mathematics and science subjects, major subjects, and general education subjects, to easily visualize the subjects in the Computer Science curriculum from first year to fourth year, such as CS_TECH, NSTP1, ENGL_1, CS_231 GWA, MAT_172, CS_442, CS_132 GWA, CS_242, PE_3, NSTP2, ENGL_8, ENGL_2, MAT_171, CS_344, PHILO_1, CS_335 GWA, PE_2, CS_433, CS_434, MAT_341, CS_332, CS_242 GWA, PE_4, FIL_2A, POL_SCI_2, CS_142 GWA, PHY_2 GWA, CS_341 GWA, CS_141 GWA, ENGL_4, VALUES, CS_341 GWA, CS_241 GWA, PHY_1 GWA, CS_233 GWA, FIL_1, CS_331 GWA, LIT_1, CS_333, CS_432, CS_232 GWA and CS_342 GWA. These variables can be considered to have an influence on the performance of students [6].

4.2 CORRELATIONS OF THE PREDICTORS TO THE ACADEMIC PERFORMANCE OF BSCS STUDENTS

Correlation describes the degree of correspondence between two or more variables. The bivariate correlation test used here requires that both variables be measured at the scale level, so that the values are ordered and the distance between the values can be determined [7]. The researchers simplified the predictor variables into three categories: Mathematics and Science (Mat & Sci.) subjects, Major subjects and General Education (GenEd) subjects.

Pearson's correlation between variables is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson Correlation (technically called the Pearson Product Moment Correlation, or PPMC). It shows the linear relationship between two sets of data. There is a strong relationship between the variables if the correlation coefficient (r) is close to 1, which means that changes in one variable are strongly correlated with changes in the second variable. The Sig. (2-tailed) value tells whether there is a statistically significant correlation between the variables. If the Sig. (2-tailed) value is less than .01, it can be concluded that there is a significant correlation between the variables. In this case, the correlation coefficient for Major subjects is equal to .650, for Mat & Sci. subjects it is equal to .449, and for GenEd subjects it is .559, which means that Major and GenEd subjects show a moderate association, while Mat & Sci. subjects are weakly correlated. The Sig. (2-tailed) value for Major subjects, Mat & Sci. subjects and GenEd subjects is .000, which means that the correlations for Major, Mat & Sci. and GenEd subjects are all significant.
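As an illustration of the Pearson Product Moment Correlation described above, the following minimal Python sketch computes the coefficient r and the t statistic on which a two-tailed significance test is based. The grade lists and the names `pearson_r` and `t_statistic` are hypothetical and are not taken from the study's data or from SPSS.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for the null hypothesis of no linear correlation (n - 2 df)."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical grade averages for six students (illustrative values only)
major_avg = [1.50, 1.75, 2.00, 2.25, 2.50, 3.00]
overall_avg = [1.60, 1.70, 2.10, 2.20, 2.60, 2.90]

r = pearson_r(major_avg, overall_avg)
t = t_statistic(r, len(major_avg))
print(f"r = {r:.3f}, t = {t:.3f}")
```

The Sig. (2-tailed) value reported by a statistical package corresponds to the two-tailed probability of this t statistic under a t distribution with n - 2 degrees of freedom.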

4.3 DATA MINING TECHNIQUES AND ALGORITHMS

4.3.1 REGRESSION

Regression analysis is a statistical technique for studying linear relationships among variables and for predicting a continuous dependent variable from a number of independent variables. The researchers used regression analysis to help understand how the typical value of the dependent variable changes when any one of the independent variables is varied, and to model the relationship between the scalar dependent variable and the independent variables.

Figure 2: Model Summary in Multiple Linear Regression

R is the multiple correlation coefficient, that is, the correlation between the observed and the predicted values of the dependent variable in the Multiple Linear Regression model. R square is the proportion of the variance in the dependent variable that is explained by the predictors. It measures how close the data are to the fitted regression line.

The researchers ran the test on the historical data of the respondents. The value of R is equal to .780 and R square is equal to .609, which means that the model explains about 60.9% of the variability of the response data around its mean. The model summary reports R = .780, R square = .609 and adjusted R square = .419; in general, the higher the R square, the better the model fits the data. An R square of 0% would mean that the model explains none of the variability of the response data around its mean.
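The relationship among R, R square and adjusted R square can be checked with a short Python sketch. It uses the standard textbook definitions; the function names and the example numbers of cases and predictors are assumptions for illustration, since the paper does not report them.

```python
def r_squared(actual, predicted):
    """Proportion of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)           # total sum of squares
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_cases, n_predictors):
    """R square penalized for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)


# R square is the square of the multiple correlation coefficient R
r = 0.780
print(round(r ** 2, 3))  # 0.608, in line with the reported R square of about .609

# Hypothetical sample size and predictor count (not reported in the paper)
print(adjusted_r_squared(0.5, n_cases=11, n_predictors=2))
```

Adjusted R square is always at most R square, and the gap widens as more predictors are added relative to the number of cases.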

The standard error of the estimate is closely related to the standard deviation. The standard error of the estimate is equal to .063, which the researchers interpret as an error of about 6%, corresponding to an accuracy of about 94% out of 100%. An error of 6% is acceptable when using the standard error because the true value of the standard deviation is usually unknown. In such cases it is important to be clear about what has been done and to take proper account of the fact that the standard error is only an estimate. The researchers tested the accuracy of the Multiple Linear Regression using MAPE (mean absolute percentage error).

The normal probability plot compares the distribution of the residuals to a normal distribution and is used to assess whether or not a data set is approximately normally distributed. The data are plotted against a theoretical normal distribution in such a way that the points should form an approximately straight line. The diagonal line represents the normal distribution. The closer the observed cumulative probabilities of the residuals are to this line, the closer the distribution of the residuals is to the normal distribution.

Figure 3: Normal Probability Plot Using Multiple Linear Regression

The plot of expected cumulative probability shows an S shape across the set of paired data. The researchers believe that this indicates a long-tailed departure from the normal distribution, because the curve starts below the normal line, bends to follow it and ends above it. This means that there is more variance than would be expected in a normal distribution, and the researchers agree that the normal distribution can be improved upon as a model for testing.

4.3.2 DECISION TREE

A decision tree creates a tree-based classification model. It classifies cases into groups or predicts values of a dependent variable based on the values of independent variables. The procedure provides validation tools for exploratory and confirmatory classification analysis [8].

The researchers used two decision tree methods, the CHAID and ID3 algorithms. CHAID can be used for prediction as well as classification, and for detecting interaction between variables, while ID3 uses the information gain measure to choose the splitting attribute.

CHAID (Chi-squared Automatic Interaction Detection) chooses the independent predictor variable that has the strongest interaction with the dependent variable, while ID3 constructs the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node.
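The information gain measure that ID3 uses to choose the splitting attribute can be sketched in Python as follows. This is a minimal illustration of a single split decision, not the implementation used by the researchers; the rows, attribute names and labels are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the subsets produced by the split
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_split(rows, attributes, target):
    """Greedy ID3 step: pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical student records (illustrative values only)
rows = [
    {"major": "high", "gened": "high", "on_time": "yes"},
    {"major": "high", "gened": "low", "on_time": "yes"},
    {"major": "low", "gened": "high", "on_time": "no"},
    {"major": "low", "gened": "low", "on_time": "no"},
]

print(best_split(rows, ["major", "gened"], "on_time"))
```

In this toy data set the "major" attribute separates the classes perfectly, so it has the larger gain and is chosen for the root node. A full ID3 tree repeats the same step recursively on each resulting subset.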

Figure 4: Model Summary Using CHAID Method

The researchers used the CHAID method, which merges the categories of each predictor that are not significantly different with respect to the dependent variable. Figure 4 indicates that only one of the selected independent variables, OJT_441, made a contribution significant enough to be included in the model.
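The chi-squared test underlying CHAID can be illustrated with a short Python sketch that computes the Pearson chi-squared statistic for a contingency table of predictor categories against the outcome. This is a simplification under stated assumptions: an actual CHAID implementation compares adjusted p-values and merges categories, while the sketch compares raw statistics for tables of the same size. The predictor names and counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if predictor and outcome were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are predictor categories (pass, fail),
# columns are outcomes (graduated on time, did not)
tables = {
    "predictor_a": [[30, 10], [10, 30]],
    "predictor_b": [[20, 20], [20, 20]],
}

# With tables of equal size, the larger statistic indicates the stronger association
strongest = max(tables, key=lambda name: chi_square_statistic(tables[name]))
print(strongest)
```

A predictor whose table shows no association, such as "predictor_b" here, would not be selected for a split.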

Figure 5: Model Summary Produced by J48 Decision Tree

J48 shows the error level when the classifier is applied to the training data. The most important figures in the model summary are the numbers of correctly and incorrectly classified instances. Using the J48 classifier, 62% of the instances were correctly classified and 38% were incorrectly classified. The mean absolute error, which measures how close the forecasts or predictions are to the eventual outcomes, is equal to 0.0152. The result using the CHAID method is higher than that of the J48 classifier.

Table 3 shows the accuracy of the CHAID, ID3, and Multi-layer feed-forward algorithms for classification applied to the data. The CHAID technique has the highest accuracy, 72.6%, compared to the other methods. The ID3 algorithm also showed an acceptable level of accuracy, while Multi-layer feed-forward has the lowest accuracy, 50.8% [9].

Table 4 shows the accuracy and efficiency of the model. The ID3 technique has the lowest percentage error, 0.0152 or 1.52%, which indicates that the accuracy level of the given model is 98.48% out of 100% [10]. The CHAID method also showed an acceptable level of accuracy. For Multiple Linear Regression, the researchers used the Mean Absolute Percentage Error (MAPE) to calculate the efficiency of the model, which resulted in a percentage error of 2.93%. This means that the accuracy level using regression analysis is 97.07%. The Multi-layer feed-forward algorithm showed the highest percentage error.
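The arithmetic behind these accuracy figures can be sketched in Python. The sketch assumes the usual definition of MAPE and that accuracy is taken as 100% minus the percentage error, as the figures above imply; the function names and the sample values passed to `mape` are hypothetical.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def accuracy_from_error(percentage_error):
    """Accuracy level, in percent, taken as 100 minus the percentage error."""
    return 100 - percentage_error


# Hypothetical actual and forecast values (illustrative only)
print(mape([100, 200], [110, 190]))  # errors of 10% and 5% average to 7.5

# Reproducing the accuracy levels from the reported percentage errors
print(round(accuracy_from_error(2.93), 2))  # 97.07 for Multiple Linear Regression
print(round(accuracy_from_error(1.52), 2))  # 98.48 for ID3
```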

5. CONCLUSION

This study could be a great help to Computer Science students and teachers in improving students’ academic performance, trimming down the failure rate, better understanding students’ behavior, and improving teaching. This study can help build confidence in data mining techniques so that present education systems may adopt them as a strategic management tool. Grade point average (GPA) is used in higher learning institutions to discover knowledge from education data, and students’ performance plays an important role in producing the best quality graduates. Academic achievement and grades are among the main factors that can secure a stable job in life, and all students must give their greatest effort. When the variables are simplified into three categories, namely Mathematics & Science, Major, and General Education subjects, there are significant relationships among them. The result of this study indicates that data mining techniques provide effective tools for improving students’ academic performance. It shows how useful data mining can be in higher learning institutions, especially using Decision tree and Regression, particularly to predict a number and to estimate the value of the target as a function of the predictors for each case in the build data. SPSS also supports the entire analytical process, from planning to data collection, analysis, reporting and deployment of the results.

6. RECOMMENDATION

Based on the summary of findings and conclusions of the study, the researchers recommend applying this forecasting model to external variables that can influence the grades of students, such as location, social factors, behavior, and family support. The researchers also recommend applying other data mining techniques to an expanded data set with more distinctive attributes to get more accurate and efficient results. The application of data mining techniques in the educational field can be used to develop a performance monitoring and evaluation system.

7. REFERENCES

[1] Bhardwaj, B.K. and Pal, S. 2011. Data Mining: A prediction for performance improvement using classification. International Journal of Computer Science and Information Security.

[2] Hegland, M., Roberts, S., and Williams, G. “A Data Mining Tutorial”

[3] Singh, R., Tiwari, M. and Vimal, N. 2013. An Empirical Study of Application of Data Mining Techniques for Predicting Student Performance in Higher Education. International Journal of Computer Science and Mobile Computing.

[4] Tiwari, M., and Vimal, N. “Evaluation of Student performance by an Application of Data Mining Techniques”.

[5] http://www.ask.com/question/definition-of-descriptive-correlational-research

[6] Ahmad, W.F., Azrai, A., Nayan, Y., Nordin, S., and Yahya, N. 2012. A Conceptual Framework in Examining the Contributing Factors to Low Academic Achievement: Self-Efficacy, Cognitive Ability, Support System and Socio-Economic. International Conference on Management, Social Science and Humanities 2012.

[7] http://cooklibrary.towson.edu/helpguides/guides/correlationspss.pdf

[8] Chuchra, R. 2012. Use of Data Mining Techniques for the Evaluation of Student Performance: A Case Study

[9] Bharadwaj, B., Pal, S., and Yadav, S.K. 2011. Data Mining Applications: A comparative Study for Predicting Students’ performance. International Journal of Innovative Technology and Creative Engineering.

[10] Tiwari, M., and Vimal, N. “Evaluation of Student performance by an Application of Data Mining Techniques”.
