
A strategy for detection of inconsistency in evaluation of essay type answers

    Archana Shukla & Banshi D. Chaudhary

    Published online: 19 June 2013

© Springer Science+Business Media New York 2013

Abstract The quality of evaluation of essay type answer books involving multiple evaluators, for courses with a large number of enrollments, is likely to be affected by heterogeneity in the experience, expertise and maturity of the evaluators. In this paper, we present a strategy to detect anomalies in the evaluation of essay type answers by multiple evaluators, based on the relationship between the marks/grades awarded and the symbolic markers and opinionated words recorded in answer books during evaluation. Our strategy is based on the results of a survey of evaluators, analysis of a large number of evaluated essay type answer books, and our own experience of student grievances regarding marks/grades. Both the survey and the analysis of evaluated answer books identified underline, tick and cross as frequently used markers compared to circle and question mark. Further, both the opinionated words and the symbolic markers identified through the survey are used by evaluators to express either positive or negative sentiments. Evaluators show different usage patterns of these symbols when acting as a single evaluator and as one amongst multiple evaluators. Tick and cross have well defined purposes and correlate strongly with the marks awarded. The underline marker, however, is used for the dual purpose of expressing both correctness and incorrectness of answers. Our inconsistency detection strategy first identifies outliers based on the relationship between the marks/grades awarded and the number of symbols and/or opinionated words used in evaluation. Subsequently, the marks and the number of symbolic markers of outliers are compared with peer non-outlier answer books having the same marks but a different number of markers; such outlier answer books are termed anomalous. We discovered 36 anomalies out of a total of 425 evaluated answer books. We have developed a prototype tool to facilitate online evaluation of answer books and to proactively alert evaluators of possible anomalies.

Educ Inf Technol (2014) 19:899–912. DOI 10.1007/s10639-013-9264-x

A. Shukla (✉) · B. D. Chaudhary
Computer Science & Engineering, Motilal Nehru National Institute of Technology, Allahabad, India
e-mail: [email protected]

B. D. Chaudhary
e-mail: [email protected]


Keywords: Assessment · Essay type answer · Symbolic marker · Opinionated words · Sentiments · Evaluation · Outlier · Anomaly detection

    1 Introduction

Formal assessment practices in education are based on different models of learning processes, different levels of learning, the knowledge to be acquired, and the human mind. These formal assessments can be either for learning or of learning. Assessment for learning is diagnostic in nature and measures a learner's current knowledge and skills for the purpose of identifying a suitable program of learning. On the other hand, assessment of learning is generally summative in nature and is used to measure learning outcomes. Assessment for/of learning may be categorized into two types,

conventional and alternative. Conventional evaluation methods include either multiple choice questions or essay type questions, whereas alternative evaluation methods include presentation, case based evaluation and simulation. It has been observed (Struyven et al. 2005) that students' strategy of learning depends on their perception of the assessment methodology.

The majority of institutions of higher education and the High School and Intermediate Boards in India use essay type subjective assessment methodology to measure the learning achievement of students. The number of students enrolled for different examinations may vary from fewer than 100 to a few million. In courses where student enrollments are very high, evaluation of answer books generally involves a large number of evaluators with different levels of expertise, experience and maturity. The workload of evaluation is generally distributed amongst evaluators either by question or by answer book. These heterogeneities in expertise, experience and maturity are possible sources of inconsistency in the evaluation of answer books. Most elite institutions show the evaluated answer books to students for their comments, suggestions and grievances. Where this practice is not possible due to large enrollment, students are permitted to submit their grievances formally based on the marks awarded to them. The majority of grievances are due to disagreement by an examinee with the marks awarded to him/her, either in an individual question or in total. Sometimes these disagreements arise from relatively lower marks awarded to an examinee compared with his/her peers. Most institutions have either informal or formal procedures to address the grievances of examinees. However, redressal of these grievances may take from a few hours to a few months.

Several Asian countries including India have taken policy decisions to increase enrollment and to improve the quality of education through information technology. As a part of this initiative, all provincial and central governments in India have budgeted substantial funds to create online course contents and to provide computer hardware, in the form of cheap laptops or tablet-PCs, to students. It is reasonable to expect that in the coming years, IT tools will provide services for online evaluation of essay type answer books and timely redressal of grievances of examinees.

Our work is motivated by the desire to identify strategies to proactively detect inconsistencies/anomalies in the evaluation of essay type answer books in order to assist evaluators. We identified the markers/symbols used by evaluators, their purposes and their relationships with the marks awarded through a survey with teachers and research scholars.


due to large number of enrollments. It has been reported in Finlayson (1951) that the investigations of Hartog, Rhodes and Cast concluded that, no matter which method was employed to evaluate essays, the marks of individual examiners diverged widely. Finlayson (1951) investigated the reliability of the marking of essays in the context of their inclusion in standardized tests, and concluded that the inclusion of essays affects the reliability of evaluation.

Several non-proprietary and proprietary tools (Amaya: http://www.w3.org/Amaya/; Annotea Project: www.annotea.org; Co-ment: www.Co-ment.net/; A.nnotate: http://a.nnotate.com/cms-annotation.html; Foxit Reader: http://www.foxitsoftware.com/pdf/reader/; PDFill PDF Reader: http://www.pdfill.com/) are available to annotate text documents and to record opinions and sentiments using different markers. Plimmer (2010) developed a software tool, Penmarked, for evaluating programming assignments. It assists teachers in the evaluation of programming assignments using different annotation markers along with marks.

    different annotation markers along with marks. ( Burstein et al.) developed an E-rater evaluation tool based on Natural language processing (NLP) methodologyfor evaluating especially Competitive Exam such as TOFEL, GMAT, GRE, PRAXISetc. E-Rater uses the MsNLP tool for parsing all sentences in the essay. E-Rater uses a combination of statistical and NLP techniques to extract linguistic features from theessays to be graded.

The semantics of these markers are implicit and may be interpreted differently by their users. Moreover, these tools do not provide the services required in the context of performance assessment in the domain of teaching and research. The relationships between markers and performance indicators (marks/points) may help in identifying possible disagreements between examinee and examiner.

All of the above tools provide services related to either non-formal or formal assessment of the level of learning. However, they do not provide any assistance in detecting anomalies based on the symbolic markers or opinionated words recorded by evaluators during the evaluation of essay type answers.

Chandola et al. (2009) have presented an excellent survey of different algorithms to detect anomalies in different domains.

    3 Sentiment indicators

The evaluation of essay type answers is relatively complex compared to multiple choice objective questions. This complexity arises from the subjectivity in judgment about the correctness and quality of the content of the essay. Evaluators, in general, express their judgment about both quality and correctness in terms of either categorical symbols or points/marks. They also provide justification for their judgment either in the form of textual comments or in the form of symbolic annotations. They generally use adjectives, adverbs, verbs etc. in textual comments, or symbolic markers such as tick, cross, underline, circle etc., to express appreciation, correctness, incorrectness or disagreement with the content. Both the opinionated words in comments and the symbolic markers have semantics in terms of the sentiments they carry. Generally, it is assumed that the community of evaluators shares a common semantics of these indicators in terms of the sentiments they carry. However, this assumption may be invalid or only partially valid. Any disagreement in the semantics of opinionated words or markers in a multi-evaluator context is likely to lead to inconsistency in evaluation from the student's perspective. Students' satisfaction depends on a consistent relationship between these indicators and the marks/grade awarded to them.

We decided to conduct a survey with evaluators to validate the following hypotheses:

• They use either symbols or opinionated words from a common set of symbols and opinionated words to justify the marks/grades awarded by them.
• They share a common semantics in terms of the sentiments associated with the above indicators.
• They use different symbols to indicate different levels of sentiment for awarding different marks.

We conducted our survey using a questionnaire with 150 participants from four different institutions. These participants were either faculty members or graduate research students from different branches of engineering. The graduate students were either Ph.D. students with more than 2 years of experience or Master of Technology (M.Tech.) students doing their final year thesis. They are required to participate in the conduct and evaluation of course work in order to receive their scholarships. All of them had evaluated answer books of at least two examinations.

Since these participants are faculty members and research scholars, they read published research articles, review the research papers of their peers and write their own research articles. In all of these activities, especially reading published research articles, they use symbolic markers or textual comments to record their appreciation, agreement, doubts, disagreement etc. Owing to this commonality of expressing judgment or opinion through symbolic markers and opinionated words, we decided to include questions soliciting their inputs regarding the usage of these symbols in these activities in addition to the evaluation of answer books.

Our questionnaire contained twenty questions with multiple choices of answers. Respondents could select one or more options depending on the question. Questions were framed to identify the symbolic markers and opinionated words used in evaluating essay type answer books and in literature survey/review of research papers. Further, questions also solicited answers to identify the differential purposes of symbols and opinionated words and their association with strong positive or negative sentiments.

We received 137 of the 150 questionnaires which were distributed. We did not consider seven of the responses as they were only partially complete.

    3.1 Survey results

The results of our analysis are summarized below:

• Tick, Cross, Underline, Highlight, Question mark and Circle form the common set of symbols used during evaluation. The common words and phrases are:

Positive sentiment: good, very good, excellent, fair, well written, good work, well attempted, precise, neat and clean, better approach applied, interesting, to the point answer, well, well done, clear, legible, neutral.

Negative sentiment: wrong, bad, irrelevant, based on marks secured, poor, need to improve, may do better, incorrect, very bad, re-read, work hard, copied, fundamental mistake, imprecise, not well explained, not clear, not to the point, wrong assumption, calculation mistake, illegible, unplanned, 'how' explanation missing.

• 46 % of the evaluators stated that they use underline, whereas 36 %, 32 %, 27 % and 26 % use the question mark, circle, cross and tick symbols respectively, in evaluation.

• 13 % of the evaluators reported that they did not use any marker in evaluation.
• To indicate correctness or agreement with answers, 46 % of evaluators use the tick symbol, whereas 29 %, 17 %, 18 % and 8 % use underline, circle, question mark and cross respectively.

• To indicate incorrectness, 46 % of evaluators use the cross symbol, whereas 31 %, 26 %, 23 % and 4 % use underline, question mark, circle and tick respectively.

It is surprising that a small percentage of evaluators use tick for incorrectness, and cross and question mark for correctness; these findings are counterintuitive. On the other hand, a significant percentage of evaluators use underline for both correctness and incorrectness. Consequently, we concluded that tick, cross and circle have very well defined purposes in evaluation, whereas underline has a dual use. Evaluators use the tick symbol for correctness, agreement and appreciation of the content, whereas cross, circle and question mark are used for disagreement and incorrectness.

We decided to verify these survey results by analyzing the pattern of usage of these symbols in the evaluation of answer books and their relationship with the marks awarded.

4 Evaluators' preferences of sentiment indicators

We analyzed 924 answer books of four subjects: CS456 Database Management System, CS702 XML & Applications, CS606 Operating System and CS281 Functional Programming. The first two courses are offered in the 5th and 7th semesters of the eight semester Bachelor of Technology (B.Tech.) degree program, as compulsory and elective subjects respectively. The remaining two courses are offered in two different postgraduate programs, Master of Computer Application (MCA) and Master of Technology (M.Tech.) in Computer Science and Engineering. Of the total answer books, 460 were from four class tests (two each for CS702 and CS281) and the remaining 464 were from the end semester examinations of all four subjects. Subject-wise details are given in Table 1, where Si, Tj and Ek denote the ith subject, jth test and kth end semester examination. The last column of the table contains the number of evaluators involved in the evaluation of the subject.

It may be noted that the answer books of S1 and S2 were evaluated by a single evaluator, whereas the answer books of S3 and S4 were evaluated by 3 and 5 different evaluators respectively. Each evaluator of S3 and S4 evaluated either 1 or 2 questions of each answer book. These evaluators evaluated both the tests and the final examination in their respective subjects.


We counted the frequency of usage of these symbols in these answer books. The total counts for the different symbols are given in Table 2. In this table, the letters C, E, S and T stand for Class, End Semester, Subject and Test respectively, with numeric subscripts for unique identification of subjects and tests. Thus S1T1 and S1T2 are Class Tests 1 and 2 of Subject 1, S2T1 and S2T2 are Class Tests 1 and 2 of Subject 2, S1T and S2T are the counts over the class tests of Subjects 1 and 2, and C_T and E_T are the total counts over all class tests and over all end semester examinations respectively.

It is evident from columns 12 and 13 that evaluators predominantly used the tick (42 %) and underline (30 %) symbols during evaluation, compared to the other symbols. The average numbers of these symbols per answer book in the end semester exams of all four subjects are 7 and 5 respectively.
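For illustration, a tally of this kind takes only a few lines of code. The sketch below is not the authors' implementation; the annotation log format and its entries are assumptions.

```python
# Hypothetical sketch: tallying marker frequencies per answer book and
# overall, assuming an annotation log of (answer_book_id, marker) pairs.
from collections import Counter, defaultdict

annotations = [  # invented example data
    ("book_01", "tick"), ("book_01", "underline"), ("book_01", "tick"),
    ("book_02", "cross"), ("book_02", "tick"), ("book_02", "question_mark"),
]

per_book = defaultdict(Counter)   # counts of each marker per answer book
overall = Counter()               # counts of each marker over all books
for book_id, marker in annotations:
    per_book[book_id][marker] += 1
    overall[marker] += 1

total = sum(overall.values())
for marker, count in overall.most_common():
    print(f"{marker}: {count} ({100 * count / total:.0f}% of all markers)")
```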

We also observed that there is an ordering in the frequency of usage of these symbols for the class tests (C_T) and for the end semester examinations (E_T); we call this ordering the usage pattern. However, this ordering in frequency of usage is not observed for the class tests of Subjects S1 and S2 or for the end semester exams of the four individual subjects.

Further, our analysis indicated that the differences in the usage of these symbols between class tests and end semester examinations are statistically significant (p-value < …


examination for S3 and S4. We decided to investigate further whether the usage patterns of these markers differ when an evaluator acts as a single evaluator for a subject and when acting as one of multiple evaluators. Table 3 gives the counts of the different symbols for the different evaluators for the different subjects, per copy/per question. These counts are based on end semester exam answer books only.

We observe from the table that the usage patterns of these symbols are distinct for all ten evaluators. For example, Ev3 used more crosses than ticks, whereas Ev6 predominantly used the underline marker and rarely used cross and tick.

Ev1 and Ev10 represent the same evaluator in two different roles: as Ev1, he is the single evaluator for subject S1, and as Ev10, he is one of the multiple evaluators for subject S4. His usage patterns of symbols in the two roles are not identical. Further, we tested the statistical significance of the differences in the frequency of usage of the different symbols by Ev1 and Ev10 (the same person) using a two-tailed z-test. The test indicated that the differences are statistically significant except for circle and question mark.
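The paper names the test but not its exact formulation. The sketch below is one plausible reading, a two-tailed z-test for equality of two Poisson usage rates; the raw counts are hypothetical (chosen so the per-copy rates roughly echo the Ev1 and Ev10 tick columns of Table 3).

```python
# Sketch (assumed formulation): two-tailed z-test comparing the per-copy
# usage rate of a marker in two evaluation roles under a Poisson model.
import math

def rate_z_test(count1, copies1, count2, copies2):
    """z and two-tailed p for equality of two Poisson usage rates."""
    r1, r2 = count1 / copies1, count2 / copies2
    se = math.sqrt(count1 / copies1 ** 2 + count2 / copies2 ** 2)
    z = (r1 - r2) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-tailed normal tail area
    return z, p

# Hypothetical tick counts: same person as single evaluator (130 copies)
# and as one of multiple evaluators (72 copies).
z, p = rate_z_test(count1=172, copies1=130, count2=2, copies2=72)
print(f"z = {z:.2f}, two-tailed p = {p:.2g}")
```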

The results of these analyses lead us to hypothesize that the frequency of usage of the different markers during evaluation depends on the kind of subject, the kind of test and the role taken by the evaluator.

    5 Strategy for anomaly detection

Based on our experience and that of our colleagues, we have identified the following types of mistakes in evaluation:

• Non-evaluation of a question or part thereof.
• Mistakes in the addition of the marks awarded to the different questions and their parts in an answer book.
• Disagreement over the marks awarded to questions or their parts.
• Disagreement over the marks secured by a student compared to the marks secured by one or more fellow students.

The first two mistakes can be proactively eliminated with the help of a tool which facilitates online evaluation of essay type answers. On the other hand, detection of likely disagreement between evaluators and students can be done proactively only if there is an association between the marks awarded and the indicators used to justify them.

Table 3 Counts of symbols per copy/per question

Markers    Ev1   Ev2   Ev3   Ev4   Ev5   Ev6    Ev7     Ev8    Ev9    Ev10
Que_mark   0.14  0.16  0.19  0.03  0.02  0.41   0.49    0.11   0.15   1.54
Underline  0.55  0.2   1.07  0     0.08  25.83  0.0123  0.041  1.68   0.024
Cross      0.23  0.33  3.24  0.06  1.03  0.11   2.9     0.89   0.25   0.1527
Tick       1.32  0.23  2.7   0.55  3.69  4.88   1.83    2.199  2.05   0.0277
Circle     0.3   0.3   0.51  0     0.03  0.012  0.06    0.074  0.055  0.513


    5.1 Relationship between sentiment indicators and marks awarded

We investigated the relationships between the different markers and the marks awarded. To check for an association between marks and total symbols used, we computed the correlation coefficient between the total marks awarded in each answer book and the total number of markers used. The correlation coefficient was 0.3976; in other words, there is no strong relationship between the marks awarded and the total symbols used. We also computed the correlation coefficients between the marks and the counts of the individual markers. The correlation coefficient for cross was −0.57, while for all other markers it was in the range 0 to 0.4. In other words, marks are lower where the number of crosses is higher, verifying the intuitive expectation that cross is always used to indicate disagreement and mistakes.
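As a concrete sketch of this computation (assumed, not the authors' code), Pearson's r between marks and marker counts per answer book can be computed directly. The marks and ticks below are taken from cluster 1 of Table 6, while the cross counts are invented for illustration.

```python
# Minimal Pearson correlation between marks awarded and marker counts.
import math

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

marks   = [14, 19, 21, 25, 29, 37, 38, 38, 38]  # Table 6, cluster 1
ticks   = [3, 2, 2, 1, 1, 8, 0, 0, 4]           # Table 6, cluster 1
crosses = [9, 7, 6, 5, 4, 1, 2, 3, 0]           # hypothetical counts

print(f"r(marks, ticks)   = {pearson_r(marks, ticks):+.2f}")
print(f"r(marks, crosses) = {pearson_r(marks, crosses):+.2f}")
```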

We also computed the correlation coefficients between the marks and the frequency of usage of the different symbols by the different evaluators; they are given in Table 4. It is evident from the table that there is a relationship between the marks awarded and the tick marker for all evaluators. Further, there is a strong correlation between the marks awarded and the underline marker for Ev6, whereas for Ev3 there is a negative correlation between the marks awarded and the underline marker. Both Ev3 and Ev6 are part of the multiple evaluator scenario. These positive and negative correlation coefficients indicate that underline is used in both a positive and a negative sense. Use of underline by these two evaluators to express different sentiments in a multiple evaluator scenario is likely to attract grievances from students. Since each of these evaluators evaluates only 1 or 2 of the several questions in each answer book, the concerned students are likely to interpret underline in only one sense, either positive or negative. This situation is likely to lead to confusion and disagreement between examiner and examinee.

There is a shared perception among many evaluators that examinees who raise grievances belong to either the high performing or the poor performing category of students; there are few grievances from examinees who belong to neither category. Students in the high performing category are very competitive and eager to get credit for all their efforts; they scrutinize their answer books minutely, looking for opportunities to obtain better marks. On the other hand, students in the low performing category scrutinize their answer books to improve their marks with the objective of crossing the qualifying criteria.

Accordingly, we clustered the marks of the end semester examinations of the four subjects into three categories using the k-means clustering algorithm and investigated the relationships

Table 4 Correlation coefficients

Symbols    Ev1    Ev2    Ev3    Ev4   Ev5   Ev6   Ev7    Ev8   Ev9   Ev10
Underline  0.039  0.049  0.17   0.02  0.02  0.60  0.13   0.03  0.13  0.03
Cross      0.012  0.38   0.41   0.57  0.02  0.09  0.52   0.25  0.54  0.13
Tick       0.46   0.80   0.61   0.83  0.76  0.36  0.70   0.36  0.57  0.52
Q_mark     0.29   0.11   0.024  0.08  0.11  0.12  0.015  0.05  0.18  0.12
Circle     0.08   0.049  0.075  0.14  0.07  0.03  0.09   0.04  0.12  0.14


between the marks awarded and the frequency of usage of markers within these clusters. The three clusters were identified as the high mark cluster (HC), low mark cluster (LC) and average mark cluster (AC).

We computed the correlation coefficients between the marks of the three clusters and the total number of markers, and also with the total counts of the individual markers. The individual markers included in the investigation are underline, tick and cross only, as shown in Fig. 1.
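The paper specifies k-means with three clusters but not the implementation. A minimal sketch, assuming scikit-learn and hypothetical end semester totals:

```python
# Sketch: cluster 1-D marks into LC/AC/HC with k-means (k = 3).
import numpy as np
from sklearn.cluster import KMeans

marks = np.array([14, 19, 21, 25, 29, 37, 38, 41, 46, 52, 60, 63, 71])
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(marks.reshape(-1, 1))

# Map cluster labels onto LC < AC < HC by sorting the cluster centres.
order = np.argsort(km.cluster_centers_.ravel())
names = {int(lbl): name for lbl, name in zip(order, ("LC", "AC", "HC"))}
for m, lbl in zip(marks, km.labels_):
    print(m, names[int(lbl)])
```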

Since there is a strong correlation between high marks and tick symbols for all subjects, we conclude that evaluators use tick markers to express their agreement with the given answer and as justification for awarding the specific high marks. Similarly, for all subjects, cross is used to express disagreement with the given answers and as the reason for awarding low marks. On the other hand, there is no definite clarity about the purpose of underline. There is a correlation between high marks and underline for one subject with multiple evaluators, whereas for the low marks cluster there is a weak positive correlation between underline and the marks awarded. Under these circumstances, we hypothesize that the underline symbol is also generally used for disagreement, though with exceptions.

We have also tested the statistical significance of the differences in the total counts of these symbols in the three clusters of the four different subjects. The counts of these symbols in the three clusters of the different subjects are given in Table 5.

Our test indicated that the differences in the usage of these symbols across the three clusters are statistically significant except for underline.

To summarize, the tick symbol is used for appreciation, correctness and agreement, and the cross symbol is used for incorrectness and disagreement, by all evaluators. The underline symbol, on the other hand, is used in both senses, for either agreement or disagreement.

    Fig. 1 Correlation between marks awarded and different markers

Table 5 Counts of symbols in the different clusters for the different subjects

Symbols  S1HC  S1LC  S1AC  S2HC  S2LC  S2AC  S3HC  S3LC  S3AC  S4HC  S4LC  S4AC
Tick     405   42    399   18    7     27    977   329   104   390   48    187
Cross    26    50    77    9     12    55    380   113   186   43    72    107
U_Line   143   173   43    8     13    26    1458  240   497   50    37    45


Based on these analyses, we decided to use the pattern of marker usage together with the marks awarded to identify anomalies. Intuitively, the number of positive markers (ticks) should increase as marks increase. Any increase in awarded marks beyond the threshold of one standard deviation without a corresponding increase in the number of ticks is a suspected anomaly. For example, in Table 6, the answer books at serial numbers 1 and 6 in cluster 1, and at serial numbers 1, 3 and 7 in cluster 2, are anomalous: the marks awarded are lower, but the numbers of ticks are higher, compared with their peers. Conversely, the marks at serial numbers 2, 3, 4 and 5 in cluster 1 are higher than at serial number 1, but the numbers of ticks are lower.
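A minimal sketch of this criterion as we read it is given below. The exact thresholding is our assumption (one standard deviation of the tick counts within the cluster), applied to the cluster 1 data of Table 6.

```python
# Sketch (assumed reading of the criterion): within a cluster, flag an
# answer book that carries markedly more ticks than a peer who was
# awarded higher marks.
from statistics import pstdev

marks = [14, 19, 21, 25, 29, 37, 38, 38, 38]  # Table 6, cluster 1
ticks = [3, 2, 2, 1, 1, 8, 0, 0, 4]

threshold = pstdev(ticks)  # one standard deviation of tick counts
suspects = set()
for i in range(len(marks)):
    for j in range(len(marks)):
        # Book i got fewer marks than book j yet far more positive markers.
        if marks[i] < marks[j] and ticks[i] - ticks[j] > threshold:
            suspects.add(i)

print(sorted(s + 1 for s in suspects))  # 1-based serial numbers -> [1, 6]
```

Under this reading the flagged serial numbers match the paper's cluster 1 example, though other thresholds are equally defensible.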

Based on this criterion, the numbers of suspected anomalies for all four subjects are given in Table 7.

These anomalous answer books are likely candidates for attracting grievances from students. We recorded the grievances submitted by students in two of the subjects in our analysis, S1 and S2; the numbers of grievances received from them are given in the last column of Table 7.

It is important to note that all grievances for subjects S1 and S2 fell within the identified anomalous answer books for these subjects.

    6 Tool description

We have developed a tool which provides services to teachers, research scholars, examinees and academic administrators. Services provided to an evaluator include online evaluation of essay type answer books using different markers, computation of total marks, preparation of a consolidated mark sheet, highlighting of suspected anomalies, and responding to the grievances of examinees.

Services provided to researchers include annotation of published articles and management of these annotations, including their sharing amongst peers with similar research interests. The different modules of the tool have been tested and partially integrated.

Table 6 Example anomalies

Cluster 1
S.No   1   2   3   4   5   6   7   8   9
Marks  14  19  21  25  29  37  38  38  38
Tick   3   2   2   1   1   8   0   0   4

Cluster 2
Marks  25  26  33  38  40.5  41  46  46  46.5
Tick   5   2   4   3   0     2   8   3   4

Table 7 Anomalous copies

Subjects  Total copies  Anomalous  Grievances
S1        130           13         7
S2        45            5          5
S3        81            10
S4        72            8


However, the tool has not yet been deployed for the different stakeholders. In this section, we present snapshots of the GUI for evaluators only.

Figure 2 shows a page of an answer book being evaluated online using our tool, along with the marks. The top of the page shows the different annotation symbols such as tick, cross, underline, question mark etc. These markers can be placed anywhere on the answer book by an evaluator. Our tool also prevents evaluators from skipping any question without awarding marks after evaluation, and supports rechecking of individual questions by another evaluator.

Fig. 2 Snapshot of an evaluated question

Our tool also shows an alert message to evaluators during the evaluation of answer books, as in Fig. 3, asking them to re-check a specific answer if they deviate from their preferential symbol usage behavior while awarding marks, as recorded in our database from their past history of evaluating answer books.

Fig. 3 Snapshot of an alert to the evaluator
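A minimal sketch of such an alert check, assuming a stored per-question usage profile (the figures below reuse Ev1's row from Table 3; the threshold and names are hypothetical):

```python
# Sketch: alert if the marker usage in the current session deviates too
# far from the evaluator's stored historical rate per question.
profile = {"tick": 1.32, "cross": 0.23, "underline": 0.55}  # per question

def markers_to_recheck(current_counts, questions_done, tolerance=2.0):
    """Markers whose session rate deviates from the stored profile by
    more than `tolerance` times the expected per-question rate."""
    alerts = []
    for marker, expected in profile.items():
        rate = current_counts.get(marker, 0) / max(questions_done, 1)
        if abs(rate - expected) > tolerance * expected:
            alerts.append(marker)
    return alerts

# Hypothetical session: 4 questions done, unusually many crosses.
print(markers_to_recheck({"tick": 5, "cross": 4, "underline": 2}, 4))
```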

The consolidated mark list for a subject, with suspected anomalies, is shown in Fig. 4. The records displayed in red are those against which grievances may be claimed by students.

Our tool also allows students to see their answer books and to raise grievances using the feedback option, as shown in Fig. 5.

    Fig. 4 Consolidated mark sheet

    Fig. 5 Student grievance


    7 Conclusion

The major findings of our survey and analysis of answer books confirm the widely held perception that symbolic markers and opinionated words are used to communicate the justification for the marks/grades awarded in the evaluation of essay type answer books. Further, the strong correlation between these markers and the marks awarded is also validated. However, it is difficult to generalize the applicability of these findings because a significant percentage of evaluators reported that they do not use any marker, which was also substantiated in our analysis of the evaluated answer books. Another surprising result of our investigation is that evaluators have differential preferences in the usage of symbols and in their relationship with marks. Our proposed strategy for detecting inconsistency/anomaly in the evaluation of essay type answer books is simple and can be scaled up. The strategy cannot be used if evaluators do not leave any footprint of the justification for the marks they award. In the absence of indicators such as tick, cross or opinionated words, the only alternative seems to be to use NLP techniques to analyze the answers and their relationship with marks; in our opinion, this alternative is complex compared to our strategy.

    References

Birenbaum, M. (2007). Assessment and instruction preferences and their relationship with test anxiety and learning strategies. Higher Education, 53, 749–768.

Burstein, J., Leacock, C., & Swartz, R. (2001). Automated evaluation of essay and short answers. In M. Danson (Ed.), Proceedings of the Sixth International Computer Assisted Assessment Conference, Loughborough University, Loughborough, UK.

Cano, M.-D. Students' involvement in continuous assessment methodologies: A case study for a distributed information systems course. IEEE Transactions on Education, 54(3).

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1–58.

Finlayson, D. S. (1951). The reliability of the marking of essays. British Journal of Educational Psychology, 21, 126–134. doi:10.1111/j.2044-8279.1951.tb02776.x

Gijbels, D. (2005). The relationship between students' approaches to learning and the assessment of learning outcomes. European Journal of Psychology of Education, XX(4), 327–341.

Gijbels, D., & Dochy, F. (2006). Students' assessment preferences and approaches to learning: Can formative assessment make a difference? Educational Studies, 32(4), 399–409.

Jain, S., Alkhawajah, A., Larbi, E., Al-Ghamdi, M., & Al-Mustafa, Z. (2005). Evaluation of student performance in written examination in medical pharmacology. Scientific Journal of King Faisal University (Basic and Applied Sciences), 6(1), 1426–1435.

Mora, M. C., Sancho-Bru, J. L., Iserte, J. L., & Sanchez, F. T. (2012). An e-assessment approach for evaluation in engineering overcrowded groups. Computers & Education, 59(2), 732–740.

Pai, P., Sanji, N., et al. (2010). Comparative assessment in pharmacology: Multiple choice questions versus essay with focus on gender differences. Journal of Clinical and Diagnostic Research, 4(4), 2515–2520.

Plimmer, B. (2010). A comparative evaluation of annotation software for grading programming assignments. In Proceedings of the Australasian User Interface Conference.

Struyven, K., Dochy, F., & Janssens, S. (2005). Students' perceptions about evaluation and assessment in higher education: A review. Studies in Higher Education, 30(4), 331–347.
