Algorithmic Exam Generation
Omer Geiger
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-06 - 2016
Algorithmic Exam Generation
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Omer Geiger
Submitted to the Senate
of the Technion — Israel Institute of Technology
Adar 5776, Haifa, February 2016
This research was carried out under the supervision of Prof. Shaul Markovitch, in the Faculty of
Computer Science. Some results in this thesis have been published in IJCAI-15 [GM15].
ACKNOWLEDGEMENTS
I would like to express my gratitude towards all who have helped me complete this work
successfully: First of all, to my academic advisor, Prof. Shaul Markovitch, for his precious
guidance and endless patience; to our research group, Lior Friedman, Maytal Messing, and
Sarai Duek, for their feedback and support throughout the research; to Prof. Orit Hazzan, from
the Department of Education, for her unique and insightful perspective as an educator; to Prof.
Eran Yahav for his much appreciated comments and suggestions for improvements.
I would also like to thank several colleagues, each of whom provided valuable consultation
at critical junctures on the road towards completing this thesis: Alon Gil-ad, Omer Levy,
Edward Vitkin, Ran Ben-Basat, Jonathan Yaniv, Nadav Amit, Shai Moran, and Daniel Genkin.
I thank my parents Dan and Ora for their moral support and for showing me a broader
perspective of things when needed. Lastly, I thank Noa, my wife, for her love and support, and
for bearing with me patiently through this challenging journey.
The Technion’s funding of this research is hereby acknowledged.
Contents
List of Figures
Abstract 1
List of Acronyms 3
List of Symbols 5
1 Introduction 7
2 Problem Definition 11
2.1 Preliminary Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Relative Grading Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Absolute Grading Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Model-based Exam Generation (MOEG) 15
3.1 Examination Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Student Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Target Student Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Order Correlation Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Exam Utility Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Searching the Space of Exams . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.7 Wrap-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.8 Adapting to the Absolute Grading Setting . . . . . . . . . . . . . . . . . . . . 19
4 Educational Domains 23
4.1 Algebra Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1 Algebra Ability Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.2 Algebra Question Generation . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Trigonometry Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.1 Trigonometric Ability Set . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.2 Trigonometric Question Pool . . . . . . . . . . . . . . . . . . . . . . . 25
5 Evaluation 27
5.1 Empirical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Performance Over Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 The Effect of Exam Length on Performance . . . . . . . . . . . . . . . . . . . 28
5.4 The Effect of Sample Size on Performance . . . . . . . . . . . . . . . . . . . . 29
5.5 Framework Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.6 Domain Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.7 An Alternative Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Absolute Grading Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Related Work 39
6.1 Item Response Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.1 IRT Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 ICC Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.1.3 Common IRT Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Testsheet Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Intelligent Tutoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Discussion 49
7.1 Other Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Framework Expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Hebrew Abstract i
List of Figures
3.1 Pseudo-code for action landmark approximation method . . . . . . . . . . . . . . . 21
3.2 MOEG pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Trigonometry domain abilities . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Performance over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 The effect of exam length on performance . . . . . . . . . . . . . . . . . . . . 30
5.3 The effect of sample size on performance . . . . . . . . . . . . . . . . . . . . 31
5.4 The effect of εw on performance . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 The effect of εp on performance . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Algebra domain coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.7 Performance over time (absolute grading) . . . . . . . . . . . . . . . . . . . . 35
5.8 The effect of µ on performance (absolute grading) . . . . . . . . . . . . . . . . 36
5.9 The effect of σ on performance (absolute grading) . . . . . . . . . . . . . . . . 37
5.10 Question difficulty histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1 The 3pl ICC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Example testsheet composition problem formulation [HLL06] . . . . . . . . . 44
6.3 Interaction between ITS components [BSH96] . . . . . . . . . . . . . . . . . . 46
7.1 Example inferences in different domains . . . . . . . . . . . . . . . . . . . . 50
Abstract
Given a class of students, and a pool of questions in the domain of study, what subset will
constitute a “good” exam? Millions of educators are dealing with this difficult problem worldwide,
yet the task of composing exams is still performed manually. In this work we present a
novel algorithmic framework for exam composition. Our main formulation requires two input
components: a student population represented by a distribution over a set of overlay models,
each consisting of a set of mastered abilities, or actions; and a target model ordering that, given
any two student models, defines which should be graded higher. To determine the performance
of a student model on a potential question, we test whether it satisfies a disjunctive action
landmark, i.e., whether its abilities are sufficient to follow at least one solution path. Based on
these, we present a novel utility function for evaluating exams. An exam is highly evaluated
if it is expected to order the student population with high correlation to the target order. In an
alternative formulation we devised, the target ordering is replaced with a target grade mapping
indicating the desired grade for each student model. In this case, good exams are those for
which the expected grades are close to those specified by the target mapping. The merit of
our algorithmic framework is exemplified with real auto-generated questions in two domains:
middle-school algebra and trigonometric equations.
List of Acronyms
MOEG : Model-based Exam Generation
ITS : Intelligent Tutoring System
MOOC : Massive Open Online Course
MIP : Mixed Integer Programming
IRT : Item Response Theory
CTT : Classical Test Theory
ICC : Item Characteristic Curve
1-4pl : 1-4 Parameter Logistic Model
MLE : Maximum Likelihood Estimator
IIF : Item Information Function
TIF : Test Information Function
CAI : Computer-Aided Instruction
List of Symbols
Q : Question set
q : Question
A : Ability set
ψ(·, ·) : Sufficiency predicate
s : Student model in set notation
s : Student model in vector notation
M : Set of possible models
PM : Distribution over student models
e : Exam
ke : Exam length
wi : Grading weights / Ability weights
g(·, ·) : Grading function
g∗(·) : Target grade mapping
⪯∗ : Target student order
⪯e : Exam-induced student order
⪯w : Ability-weight-induced student order
⪯⊆ : Subset-relation-induced student order
⪯P : Question-pool-based student order
C(·, ·) : Order correlation measure
S(q) : Set of solution paths for question q
l(q) : The disjunctive action landmark for question q
l̂(q) : Approximation of the disjunctive action landmark for question q
Pr(·) : Probability
τ : Kendall’s Tau
εp : Experimental variable controlling student distribution
εw : Experimental variable controlling ability weights
Dlim : Depth limit in landmark approximation algorithm
SOLlim : Solutions limit in landmark approximation algorithm
Tlim : Time limit of landmark approximation algorithm
L(·|·) : Likelihood function
I(·) : Item information function
SE(·) : Standard error
Chapter 1
Introduction
Assessing the knowledge state of students is an important task addressed by educators worldwide
[Gro98]. Knowledge assessment is required not only for the purpose of determining the students’
deserved grades, but also for diagnostic evaluation used to focus the pedagogical resources on the
students’ observed shortcomings [LST12]. The most common method for such an assessment is
having the students answer an exam. Composing an exam from which the students’ knowledge
state can be precisely evaluated is a difficult task that millions of educators encounter regularly.
The importance of exam composition has increased with two major developments in
computer-aided education. The first is the growing popularity of massive open on-line courses
(MOOCs) such as Coursera, Khan Academy, edX, and Academic-Earth, which offer new
educational opportunities worldwide [YPC13]. The second is the improvement of intelligent tutoring
systems (ITS). These are software products that intelligently guide students through educational
activities [PR13].
Exams are still predominantly written manually by educators. Several attempts have been
made at automating this task, often referred to as testsheet composition [Hwa03, LST12], some
of which are based upon the statistical paradigm of item response theory [GC05, EAAA08]. In
many of these works, exam questions are considered to be atomic abstract objects represented
only by a vector of numeric features. Common features include difficulty level, solving time,
and discrimination degree. Using such a feature-vector representation, the problem is then defined as
a mixed integer programming (MIP) problem. Usually the objective (maximization) function is
the discrimination level of the entire exam while the remaining features compose the problem
constraints. Different optimization algorithms have been applied to solve such MIP problem
formulations [HLL06, HCYL08, DZWF12, WWP+09].
In these works, assuming a feature vector per question is given, the process of exam
composition is effectively automated. However, in order to apply these methods in real educational
settings, the feature vectors of all candidate questions must be determined. Alas, it remains
unclear how this is done, and this major framework component remains an atomic black box.
The reader is left to speculate that perhaps the feature vectors are manually specified by a field
expert. If so, these methods may be regarded as only semi-automatic.
In this paper, we present a novel algorithmic framework for exam generation, which requires
minimal manual specification. We generate real candidate questions, and algorithmically
determine which domain abilities they test. Student models are used to represent possible
knowledge states, allowing us to determine their performance on candidate questions. Two
different problem formulations are proposed. In the first, the user (educator) specifies a target
order between knowledge states, indicating the relation “more proficient than”, used as input
for the algorithm. The algorithm then searches the space of possible exams for one that best
reflects this ordering over the student population. In the second formulation, the user specifies a
mapping from each possible knowledge state to a deserved grade. In this case the algorithm
searches for an exam for which the resulting student grades are as close as possible to those
specified. We will refer to these two formulations as relative grading and absolute grading, for
convenience.
Building the framework requires us to overcome several difficulties. First, we need to
define a method for the user to easily supply the desired specification of either the student
order or the grade mapping, according to the problem formulation used. We assume that the
domain knowledge is represented by a set of abilities, and the knowledge state of a student
is modeled as a subset of them which she has mastered. This is known in the ITS literature
as an overlay model [Bru94]. The number of possible student models is therefore exponential,
rendering the specification of an explicit ordering or grade mapping infeasible. We developed
a way to simplify the input specification based on an additive grading scheme, for both our
problem formulations. The idea is to have the educator specify a weight per ability and define
base grades for the students as the sum of ability-weights mastered by the student. Extracting
the student ordering from the base grades, as required by the relative grading formulation, is
straightforward: a student model is considered more proficient than another if it has a higher
base grade. For the absolute grading problem formulation, an additional step is required to
transform the base grades to fit a desired distribution. The user is therefore required to supply
this distribution (e.g., N(µ, σ)) as an additional input.
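To make the additive grading scheme concrete, the following is a minimal Python sketch (illustrative only; the ability weights, the four-ability domain, and the mid-rank quantile method of curving base grades to a normal distribution are our own assumptions, not details taken from the thesis):

```python
from statistics import NormalDist

# A hypothetical 4-ability domain; weights express each ability's importance.
weights = [0.4, 0.3, 0.2, 0.1]

def base_grade(model, weights):
    """Base grade = sum of weights of the abilities mastered by the model."""
    return sum(w for bit, w in zip(model, weights) if bit)

# Relative grading: the target order simply sorts models by base grade.
models = [(1, 1, 0, 0), (0, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 1)]
ordered = sorted(models, key=lambda m: base_grade(m, weights))

# Absolute grading: "pre-exam curving" of base grades to a desired N(mu, sigma),
# here via mid-rank quantiles mapped through the inverse normal CDF.
def curve_to_normal(models, weights, mu=75, sigma=10):
    nd = NormalDist(mu, sigma)
    ranked = sorted(models, key=lambda m: base_grade(m, weights))
    n = len(ranked)
    return {m: nd.inv_cdf((i + 0.5) / n) for i, m in enumerate(ranked)}
```

The curving step preserves the base-grade ordering while placing the grades on the user-supplied distribution.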
Second, we need to define a utility function for guiding the search through the exam space.
We developed two utility functions, matching the two problem formulations. In the absolute
grading formulation, the utility function computes for each student model the difference between
the grade mapping, as specified by the teacher, and the model’s exam grade. The goal is to
minimize the expected distance under the L1 norm (or any other norm). In the relative grading
formulation, we use a correlation measure (e.g. Kendall’s Tau) between the exam-imposed
grade order and the target order, specified by the teacher. Obviously, in this case the goal is to
maximize the correlation. In both formulations, since the space of student models is prohibitively
large, we use a sample of models for estimating the utility.
The most difficult hurdle is to determine the grade of a student model on an exam. We
have developed a novel method to solve this problem, using a technique based on graph search
and planning. To do so, we restrict our attention to procedural domains, such as algebra and
geometry, where the abilities are actions, and the answer to an exam question is a sequence of
such actions. For each question, we perform a graph search to compute its action landmarks —
a set of actions that are necessary and sufficient for solving the question. As alternative solutions
may be possible, we use disjunctive action landmarks. To determine whether a student model
can solve a question, we test whether the model’s set of abilities contains at least one of the
question’s disjuncts.
The remainder of this paper is structured as follows. Chapter 2 describes our generic
problem formulation for exam generation. Chapter 3 presents MOEG, our complete framework
for MOdel based Exam Generation, applicable primarily to procedural educational domains.
All components are defined, motivated and explained in the chapter. The chapter is concluded
with a pseudo-code figure showcasing integration of all components into a unified algorithm.
Chapter 4 describes two representative educational domains from secondary school math courses:
univariate linear equations, and trigonometric equations. In Chapter 5 we present an empirical
evaluation of the algorithmic framework over these two domains. Chapter 6 constitutes a brief
comparative literature review of related work, highlighting the uniqueness of our work in the
context of others. The paper is concluded in Chapter 7, dedicated to discussing the implications
of our work and possibilities for future expansions.
Chapter 2
Problem Definition
In this chapter we present our novel formulation for the problem of exam generation. The
following section describes some preliminary definitions and is followed by the definition of
two problem variations — relative grading and absolute grading.
2.1 Preliminary Definitions
We define an examination domain as a triplet 〈Q, A, ψ〉. It is composed of a set of candidate
questions Q, a set of abilities A = {a1, a2, ..., am}, and a sufficiency predicate ψ : 2^A × Q → {0, 1},
where ψ(A′, q) = 1 iff the ability set A′ ⊆ A is sufficient to answer the question q ∈ Q.
Next, we define a student model, using the relatively simple approach known as the binary
overlay model [Bru94]. By this approach, a student model is defined as a subset of domain
abilities, s ⊆ A, mastered by the student. Therefore, a student s answers a question q ∈ Q
correctly iff ψ(s, q) = 1. The student model, also sometimes referred to as a knowledge state,
may be alternatively represented by a binary vector s with each coordinate indicating mastery
of a matching ability or lack thereof. These vector and set notations will be used throughout this
paper interchangeably.
We denote the set of all possible models as M = 2^A, but assume that not all student models
are equally likely. Therefore we denote by PM = {〈si, pi〉} the distribution over the possible
student models, where si ∈ M and pi is its proportion in the population.
An exam e of length ke is defined as a vector of ke questions and a matching vector of
associated non-negative grading weights: e = 〈〈q1, ..., qke〉, 〈w1, ..., wke〉〉. The grade of a
student model s ∈ M on exam e is simply the sum of grading weights for questions answered
correctly by the student model: g(s, e) = Σ_{1≤i≤ke} wi · ψ(s, qi). Note that in some cases we
restrict ourselves to uni-weight exams, i.e., exams where all weights w1, ..., wke are equal. We
turn to define the notion of exam utility in the two problem formulations.
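The definitions above can be sketched in a few lines of Python. This is a hypothetical illustration: questions are represented directly by their sets of alternative required-ability sets (anticipating the disjunctive action landmarks of Chapter 3), and the ability names are invented:

```python
def psi(abilities, question):
    """Sufficiency predicate psi(A', q): 1 iff the ability set covers
    the required abilities of at least one solution path."""
    return int(any(path <= abilities for path in question))

def grade(student, exam):
    """g(s, e): weighted sum over the questions the student answers correctly."""
    questions, weights = exam
    return sum(w * psi(student, q) for q, w in zip(questions, weights))

# Each question is a list of alternative ability sets (one per solution path).
q1 = [{"isolate_x"}, {"divide", "subtract"}]   # two alternative solution paths
q2 = [{"expand", "collect"}]
exam = ([q1, q2], [60, 40])                     # questions and grading weights

student = {"divide", "subtract", "collect"}
grade(student, exam)   # solves q1 via its second path, fails q2 -> 60
```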
2.2 Relative Grading Formulation
Suppose we were to ask an educator to help us compose the perfect exam for some educational
domain. Ideally, we would like the educator to specify for each pair of students, which one is
more proficient and thus deserves a higher grade. Obviously a student knowing nothing should
be ranked inferior to all others, while a student knowing everything should be ranked superior
to all others. To determine the complete order, however, we rely upon the educator’s expert
knowledge. We call the desired order, given by the educator, the target
student order, and denote it ⪯∗ ⊆ M². This is a partial order defining the binary relation “is
more proficient than” between pairs of students. Several compact and intuitive methods for
defining such an order are described in the following chapter.
Observe that any exam (e) also defines such a partial order between students (⪯e) according
to their grade. For s1, s2 ∈ M, we have that s1 ⪯e s2 ⇔ g(s1, e) ≤ g(s2, e). A good exam is
one for which the resulting student grades accurately reflect the target order, while taking into
account the model distribution PM. That is to say, it is more important to correctly order more
likely models than less likely ones. For this purpose we must make use of some correlation
function C between orders. A reasonable choice for such a correlation measure made throughout
this work is the weighted Kendall’s Tau, defined in the next chapter. We are now ready to define
a utility function for evaluating exams. Given an exam e, an order correlation function C, and a
target student ordering ⪯∗, we define the utility of e as U(e) = C(⪯e, ⪯∗).
Note that the actual exam grades of the students are of no importance in determining the
fitness of an exam. Therefore, exams of different difficulty levels may be considered equivalently
fit if their imposed student orders are similarly correlated with the target order. This may be
justified with the simple observation that the absolute exam grades may always be curved to
match any desired distribution, thereby fixing difficulty-based bias. Nonetheless, critics may
claim that post-exam grade curving is generally undesirable, and an attempt should be made to
minimize it when possible.
2.3 Absolute Grading Formulation
The alternative absolute grading formulation addresses the issue described above. Instead of a
target student order as input, this formulation uses a target grade mapping — a function defining
the deserved grade of each student model: g∗ : M → [0, 1]. As the size of M is exponential
in the number of abilities, we cannot expect a user to fully specify the mapping explicitly, and therefore present a
compact specification method in the following chapter. The fitness of an exam is defined in
terms of the distance between the actual exam grades, expressed by g, and the target exam
grades, expressed by g∗. The smaller the distance between g and g∗ under a specified norm (e.g.
L1), the more fit the exam is considered to be. As in the case of relative grading, the utility
calculation is weighted by the model distribution PM.
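As a minimal sketch of this fitness criterion (our own illustration, with a toy grade function and target mapping, not the thesis implementation), the absolute-grading utility is the negated mean L1 distance between actual and target grades over a model sample:

```python
def l1_utility(exam, g_star, sample, grade_fn):
    """Estimate exam fitness under the absolute grading formulation:
    the (negated) mean L1 distance between actual grades g(s, e) and
    target grades g*(s), over a model sample drawn from P_M."""
    dist = sum(abs(grade_fn(s, exam) - g_star(s)) for s in sample)
    return -dist / len(sample)

# Toy setting with grades in [0, 1]: the target grade is the fraction of
# abilities mastered; an exam is a tuple of ability indices it tests,
# graded uniformly (a uni-weight exam).
g_star = lambda s: sum(s) / len(s)
grade_fn = lambda s, exam: sum(s[i] for i in exam) / len(exam)
sample = [(1, 1, 0), (0, 1, 1), (1, 0, 0)]
l1_utility((0, 1), g_star, sample, grade_fn)
```

Weighting by the model distribution PM falls out of the sampling: models are drawn from PM, so likely models contribute more terms to the mean.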
While in the relative grading setting, exams of completely different difficulty levels may be
considered equally fit, this is not possible in the absolute grading setting. The user can control
the difficulty level of the exam through the target grade mapping, in addition to controlling the
ordering of students. Instead of relying upon a post-exam curving of grades to match the desired
grade distribution, this formulation enables “pre-exam curving” by which the base grades are
curved to facilitate the generation of a desirable exam.
Chapter 3
Model-based Exam Generation (MOEG)
In this chapter we present our algorithmic framework for MOdel-based Exam Generation
(MOEG). To facilitate readability, all sections of this chapter, except the last, assume the relative
grading setting. The last section is dedicated to discussing the changes needed in order to apply
the framework to the absolute grading setting.
3.1 Examination Domains
A reasonable source for candidate questions is a curriculum textbook or a collection of previous
exams. This means that Q is some finite set of questions selected by the educator or curriculum
supervisor and coded once for the purpose of generating all future exams.
A more generic approach is to devise a question-generating procedure. In Section 4.1 we
present an algorithm for automatically generating questions in algebra. Such a procedure
takes input parameters controlling aspects of solving methods, difficulty, or topics, and creates
questions. It may be applied as desired to create the entire question pool Q. Naturally, this
approach becomes more difficult with the increasing complexity of the domain. A hybrid
approach is to algorithmically produce variations of existing questions based on user refinement
of constraints [SGR12].
The set of abilities A and a sufficiency predicate ψ are assumed to be given by the educator.
However, for procedural domains, we introduce an algorithm that automatically induces the
sufficiency predicate. Procedural domains are those where questions are solved by applying a
sequence of operators or actions. Such domains can therefore be represented as a search graph,
where the vertices are intermediate solution states S, and the actions are steps executed for
solving the exercise. We assume that the set of search-graph actions is in fact the set of domain
abilities, where each ability a ∈ A is a successor function a : S → 2^S. Examples of applicable
procedural domains include algebra, geometry, trigonometry, classical mechanics, and motion
problems.
We turn to define the sufficiency predicate ψ : 2^A × Q → {0, 1} for procedural domains.
An ability set is sufficient to answer a question iff it contains all abilities needed in at least
one solution path. Helmert and Domshlak [HD09] define a disjunctive action landmark as
a set of actions such that any solution must include at least one of them. We expand the
definition to a set of sets of actions, such that each solution must contain at least one of the
sets. Let S(q) be the set of all solution paths for a question q. The disjunctive action landmark
of q is therefore l(q) ≜ {{a | a appears in t} | t ∈ S(q)}, or can be expressed equivalently as a
DNF formula: l(q) ≜ ⋁_{t∈S(q)} [⋀_{a∈t} a]. For A′ ⊆ A, q ∈ Q we have that ψ(A′, q) = 1 iff
∃Ai ∈ l(q) : Ai ⊆ A′.
In very simple domains, the set of solutions S(q) can be obtained via exhaustive search. In
more complex domains, however, such a procedure is computationally infeasible. We propose
an approximation of ψ using an anytime algorithm that collects possible solutions. The idea is
to generate random operator sequences up to a certain length limit and test if they compose a
new solution path. After a certain number of unsuccessful attempts the length limit is increased,
thereby allowing adaptation to the domain at hand. But, when a new solution is found, the length
limit is reset to the effective length of that solution. This is done in order to keep the depth limit
reasonably small, yet large enough to find additional solutions. The method is defined fully in
the pseudo-code of Figure 3.1.
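A simplified sketch of this anytime idea follows. It is illustrative only: the parameter names and the failure-count heuristic for growing the depth limit are our own, and it does not reproduce the Dlim/SOLlim/Tlim details of Figure 3.1.

```python
import random

def approx_landmark(question, ops, apply_op, solved, d0=3, sol_lim=20, tries=200):
    """Anytime approximation of the disjunctive action landmark: sample
    random operator sequences up to a depth limit; on each new solution,
    record its ability set and reset the depth limit; after repeated
    failures, grow the depth limit to adapt to the domain."""
    landmark, d_lim, fails = [], d0, 0
    while len(landmark) < sol_lim and fails < tries:
        state, used = question, set()
        for _ in range(d_lim):
            op = random.choice(ops)
            nxt = apply_op(op, state)
            if nxt is None:
                continue                     # operator not applicable here
            state, used = nxt, used | {op}
            if solved(state):
                break
        if solved(state) and used not in landmark:
            landmark.append(used)
            d_lim = max(d0, len(used))       # reset limit near solution length
            fails = 0
        else:
            fails += 1
            if fails % 50 == 0:
                d_lim += 1                   # allow longer sequences
    return landmark
```

Usage: for a toy "count down from 5" question with a single decrement operator, the method discovers the single-ability landmark once the depth limit grows large enough to reach the goal.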
3.2 Student Population
Describing the student model distribution in the general case requires explicitly defining the
probability for each possible model in M = 2^A. Due to the exponential size of this model set,
we adopt a simplifying independence assumption between abilities. By doing so, we reduce
the complexity of distribution specification from exponential to linear in |A|, while retaining
reasonable flexibility. Formally, this simplification means we assume that a randomly selected
student masters each ability ai ∈ A with probability pi ∈ [0, 1]. Furthermore, we assume
that mastery of each ability is independent of the others and that students are mutually
independent as well. It follows that the probability of a model is:

Pr(〈a1, a2, ..., a|A|〉) = ∏_{i: ai=1} pi · ∏_{i: ai=0} (1 − pi) .
In future work we intend to relax this assumption and use Bayesian networks for allowing
arbitrary dependencies [GVP90].
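Under this independence assumption, both evaluating a model's probability and sampling a student population are straightforward; the following sketch (illustrative, not the thesis code) shows both:

```python
import random

def model_prob(model, p):
    """Pr(model) = prod of p_i over mastered abilities times
    prod of (1 - p_i) over the rest."""
    prob = 1.0
    for bit, pi in zip(model, p):
        prob *= pi if bit else (1.0 - pi)
    return prob

def sample_models(p, n, rng=random):
    """Draw n student models, mastering ability a_i with probability p[i]."""
    return [tuple(int(rng.random() < pi) for pi in p) for _ in range(n)]
```

Sampling in this way is what later makes the utility estimates reflect PM: likely knowledge states simply appear more often in the sample.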
3.3 Target Student Order
Explicitly specifying an order over the set of student models is also infeasible in the general
case due to the exponential size of the model set M. We therefore propose three methods for
simple order specification. In the first method, the educator is required to specify a vector of
non-negative ability weights w = 〈w1, ..., w|A|〉, indicating the importance of each ability to
domain mastery. Given these, the proficiency level of a student model s = 〈s1, s2, ..., s|A|〉 ∈ M
is defined as the sum of its mastered ability weights, i.e., the dot product (s, w) = Σi wi · si.
Having defined a scalar proficiency level per student, the target order definition is straightforward.
For any s1, s2 ∈ M:

s1 ⪯w s2 ⇔ (s1, w) ≤ (s2, w) .
The second method uses the order induced by the subset relation (⪯⊆). A student who has
mastered all the abilities of another, as well as some additional ones, is naturally considered
more proficient:
s1 ⪯⊆ s2 ⇔ ∀i [s1[i] = 1 → s2[i] = 1] .
The advantage of this method over the first one is that it requires no input from the educator.
However, the first method allows more refined orders to be specified and is thus preferable when
the additional input is available.
The third method for target order specification is question-pool based: The teacher decides
upon a question pool P , rich enough to order the student models (almost) optimally. As exams
are much more restricted in length, we want to find a small set of questions that order the student
models similarly to the order induced by the pool. Using the question pool P , we define the
target order between student models according to the number of questions in the pool they
answer correctly:
s1 ⪯P s2 ⇔ Σ_{qi∈P} ψ(s1, qi) ≤ Σ_{qi∈P} ψ(s2, qi) .
The pool is potentially much larger than the desired exam length, and thus cannot serve as an
exam. In some cases it may be reasonable to use the entire question set Q as the pool.
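The three specification methods can be sketched as follows (our own illustration; models are binary tuples, and the sufficiency predicate is passed in as a function):

```python
def weight_key(s, w):
    """Method 1: proficiency = dot product of model bits and ability weights;
    sorting by this key yields the weight-induced order."""
    return sum(wi * si for wi, si in zip(w, s))

def subset_leq(s1, s2):
    """Method 2: s1 precedes s2 iff every ability mastered in s1 is also
    mastered in s2 (the subset-relation-induced partial order)."""
    return all(b2 >= b1 for b1, b2 in zip(s1, s2))

def pool_key(s, pool, psi):
    """Method 3: proficiency = number of pool questions answered correctly."""
    return sum(psi(s, q) for q in pool)
```

Note that methods 1 and 3 give total orders via a scalar key, while method 2 is only a partial order (incomparable models exist), matching the trade-offs discussed above.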
3.4 Order Correlation Measure
We have reduced the problem of evaluating an exam e to comparing its induced student order
(⪯e) with the target student order (⪯∗). For this, we require a correlation measure that evaluates
the similarity between the two orders. We considered several alternatives, such as Kendall’s τ
[Ken38], Goodman and Kruskal’s Γ [GK54], Somers’ d [Som62], and Kendall’s τb [Ken45].
Eventually we selected the classic Kendall’s τ for this work, but the others are also adequate
candidates. Kendall’s τ compares the number of concordant student pairs (Nc) with the number
of discordant student pairs (Nd):
Nc ≜ |{(s1, s2) ∈ M² : s1 ≺e s2 ∧ s1 ≺∗ s2}|

Nd ≜ |{(s1, s2) ∈ M² : s1 ≺e s2 ∧ s2 ≺∗ s1}|

τ ≜ (Nc − Nd) / (|M|(|M| − 1)/2) .
A value of 1 implies complete correlation, while a value of −1 implies complete inverse
correlation. Note that this measure does not account for ties in either of the orders, rendering the
full range [−1, 1] unreachable in the presence of ties.
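A direct sketch of this measure over a model sample is given below (our own illustration of the unweighted τ; the thesis ultimately uses a sample-based approximation of the weighted generalization of [KV10]). Orders are represented by scalar key functions, e.g. exam grade and target proficiency:

```python
from itertools import combinations

def kendall_tau(sample, exam_key, target_key):
    """Kendall's tau between the exam-induced and target orders over a
    model sample; tied pairs count as neither concordant nor discordant."""
    nc = nd = 0
    for s1, s2 in combinations(sample, 2):
        de = exam_key(s1) - exam_key(s2)
        dt = target_key(s1) - target_key(s2)
        if de * dt > 0:
            nc += 1          # pair ordered the same way in both orders
        elif de * dt < 0:
            nd += 1          # pair ordered oppositely
    n = len(sample)
    return (nc - nd) / (n * (n - 1) / 2)
```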
3.5 Exam Utility Function
Calculating the order correlation measure over all possible models is computationally infeasible.
Due to this practical constraint, we resort to an approximation measure based on a model sample
drawn from the given distribution PM. We define our utility function as Kendall’s τ computed
over the model sample between the target order and the exam-induced order.
Recall from Chapter 2 that we require the order correlation measure to also reflect the
distribution of student models, stressing the importance of ordering more likely models than less
likely ones. Our sample-based correlation measure meets this requirement: The likely models
are more likely to be sampled, perhaps more than once, and thus have a stronger influence on
the measure. The resulting measure is in fact an approximation of a generalization of Kendall’s
τ to non-uniform element weights, introduced recently by Kumar and Vassilvitskii [KV10].
3.6 Searching the Space of Exams
Our local search algorithm for exam generation involves three steepest ascent hill climbing
phases: adding questions, swapping questions, and adjusting grading weights. Starting with
an empty set of questions, the algorithm iteratively adds the question for which the resulting
set yields maximal utility value, using uniform grading weights. When the question set has
reached the desired exam length, the algorithm turns to consider single swaps between an exam
question and an alternative candidate. The utility-maximizing swap is performed repeatedly until a local
optimum is reached, at which point the algorithm proceeds to adjust the grading weights.
Recall that the absolute grades of the students are of no importance to us, as we are only
interested in the order imposed on the student sample. It follows that, theoretically, a good set
of weights would satisfy the desired property that every two subsets have a different sum. Using
such a weight set makes it possible to differentiate between any two students who answered
differently on at least one exam question. Their grades will surely be different, but
due to conflicting constraints they may not reflect the desired order upon all pairs. Constructing
a weight set with this property is not difficult, for example: {1 + 1/pi : pi is the i-th prime number},
{1 + 1/2^i : i ∈ N}, or even a weight set randomly generated from a continuous range (with a
theoretical probability of 1). Of course, not all such candidate weight sets are equivalent in terms
of the orderings they may impose between subsets.
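The distinct-subset-sum property can be verified exactly with rational arithmetic. A brute-force Python sketch (function and variable names are ours):

```python
from fractions import Fraction
from itertools import chain, combinations

def all_subset_sums_distinct(weights):
    """Return True iff every two subsets of `weights` have different sums.
    Exact rational arithmetic avoids spurious floating-point ties."""
    index_subsets = chain.from_iterable(
        combinations(range(len(weights)), r) for r in range(len(weights) + 1))
    sums = [sum((weights[i] for i in s), Fraction(0)) for s in index_subsets]
    return len(sums) == len(set(sums))

dyadic = [1 + Fraction(1, 2**i) for i in range(1, 7)]  # {1 + 1/2^i} from the text
uniform = [Fraction(1)] * 6                            # uniform weights, for contrast
```

`all_subset_sums_distinct(dyadic)` holds because the fractional parts form a binary expansion, while uniform weights fail the check, since any two equal-size subsets collide.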
The weight adjustment performed by our algorithm enables the construction of such a
desired weight set by applying weight perturbations of exponentially decreasing granularity.
Starting from the local optimum reached at the end of the question swapping phase, the algorithm
proceeds to perform a local search over the space of weight vectors. The search operators include
the addition of a small constant ∆ (e.g. 0.05) or its negation to any question weight. When a
local optimum is reached, the increment step ∆ is halved and the process continues this way
until no improvement is made.
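A Python sketch of this weight-adjustment phase, assuming a black-box `utility` function over weight vectors; the `min_delta` threshold stands in for the thesis's "until no improvement is made" stopping condition and is our assumption:

```python
def adjust_weights(weights, utility, delta0=0.05, min_delta=1e-4):
    """Steepest-ascent hill climbing over per-question weight
    perturbations of +/-delta, halving delta at each local optimum."""
    weights = list(weights)
    best = utility(weights)
    delta = delta0
    while delta > min_delta:
        improved = True
        while improved:                       # climb at the current granularity
            improved = False
            best_cand, best_val = None, best
            for i in range(len(weights)):     # try +/-delta on every weight
                for step in (delta, -delta):
                    cand = list(weights)
                    cand[i] += step
                    val = utility(cand)
                    if val > best_val:
                        best_cand, best_val = cand, val
            if best_cand is not None:         # take the steepest-ascent step
                weights, best, improved = best_cand, best_val, True
        delta /= 2                            # refine the granularity
    return weights
```

On a smooth surrogate utility the halving schedule drives the weights to within the final perturbation size of a local optimum.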
The algorithm produces exams that are expected to have good discriminating capabilities. It
rejects questions for which the student answers are extremely homogeneous, i.e., very difficult or
very easy ones, since such questions contribute little to the induced order correlation. Moreover,
the desired discrimination is defined by the target order given as input: questions that proficient
students (as defined by the input) are more likely than others to answer correctly are preferred.
The grading weights of exam questions are expected to behave similarly. Perhaps
contrary to initial intuition, difficult exam questions are not expected to receive high grading
weights. This behavior, attributed to the lower discriminative capability of such questions, is
reasonable in real educational settings. An exam weighted directly by difficulty generally results
in a distorted grading curve, as only the few most proficient students will answer correctly the
highly weighted questions.
3.7 Wrap-up
We show in Figure 3.2 a high-level pseudo-code for the entire exam generation procedure
described. It accepts as input the domain’s ability set A and the student model distribution PM.
For simplicity of presentation, the pseudo-code uses the default method for defining a target
student order (⪯w). Therefore the third input parameter is the teacher-specified ability weight
vector w.
3.8 Adapting to the Absolute Grading Setting
This section discusses the changes required in the framework components for use
in the alternative absolute grading setting. First, the target student order
⪯∗ ⊆ M² is replaced with the target grade mapping g∗ : M → [0, 1]. As the size of M is exponential in n, we cannot expect the user to specify the value of g∗ explicitly for each
student model. We therefore propose an alternative intuitive method for specifying g∗, by
supplying a triplet 〈w, µ, σ〉. The vector w = 〈w1, ..., wm〉 is composed of non-negative
weights indicating ability importance, as in the definition of ⪯w in Section 3.3. The µ and σ
parameters are respectively the desired mean and standard-deviation of the grade distribution.
For practical use, computing g∗ requires a student sample taken from the model distribution
PM. To compute g∗(s) for all s in the sample, we first order the sampled students by their
proficiency level. Recall that the proficiency level of s is defined as the weight sum of mastered
abilities, (s, w) = Σiwi · si. Each student s is then assigned its theoretical percentile of the
Gaussian distribution N(µ, σ).
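The following Python sketch computes g∗ over a sample as described. Assigning each student the grade at the mid-rank percentile (r + 0.5)/n is our assumption; the thesis does not fix the exact percentile convention:

```python
from statistics import NormalDist

def target_grades(proficiencies, mu=0.5, sigma=0.1):
    """Order sampled students by proficiency, then map each student to
    the grade at its percentile of the Gaussian N(mu, sigma)."""
    n = len(proficiencies)
    order = sorted(range(n), key=lambda i: proficiencies[i])
    dist = NormalDist(mu, sigma)
    grades = [0.0] * n
    for rank, i in enumerate(order):
        grades[i] = dist.inv_cdf((rank + 0.5) / n)  # mid-rank percentile
    return grades
```

The resulting grades preserve the proficiency order and are centered on µ by construction.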
Another change is the removal of the order correlation measure used for the utility function in
the relative grading setting. Instead of searching for exams ordering students in high correlation
with the target order, we desire exams for which the resulting grades are close to the target
mapping. Therefore, a distance function is required instead of the correlation measure, and the
objective is changed from maximizing the correlation to minimizing the distance. In this work
we used the L1 norm between grade vectors of sampled students as our distance function. Other
candidate distance functions include the cosine distance and the L2 and L∞ norms.
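A minimal sketch of the resulting objective (names are illustrative); dividing by the sample size gives the average grade error reported in Chapter 5:

```python
def grade_distance(exam_grades, target_grades):
    """L1 distance between the exam-induced and target grade vectors,
    computed over the same student sample."""
    assert len(exam_grades) == len(target_grades)
    return sum(abs(g - t) for g, t in zip(exam_grades, target_grades))
```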
Finally, we turn to inspect the changes required in the search method. The algorithm remains
very similar to that of the previous setting with only a couple of changes worth mentioning. In
the relative grading setting, the values of question weights were important only in comparison
to each other. Here, since the absolute grades are of interest, the absolute weights are important
as well. Hence, throughout the search, the question weights must be normalized to sum to 1
before an exam is evaluated. Additionally, the local search selects the successor that
minimizes distance rather than the one that maximizes utility.
Procedure APPROXIMATE-ACTION-LANDMARK(q)
Parameters: Tlim - runtime limit, patience parameter
SolutionOperators ← {}    # set of solution-operator sets
dlim ← 1
tries ← 0
Repeat until TIMEUP(Tlim)
    path ← follow random action sequence from q up to length dlim
    If IsSolution(path)
        ops ← {a ∈ A | a ∈ path}
        If ops ∉ SolutionOperators
            SolutionOperators ← SolutionOperators ∪ {ops}
            dlim ← length(path)
            tries ← 0
        Else
            tries ← tries + 1
    Else
        tries ← tries + 1
    If tries ≥ patience
        tries ← 0
        dlim ← dlim + 1
Return SolutionOperators
Figure 3.1: Pseudo-code for action landmark approximation method
Procedure MOEG(A, PM, w)
Q ← GENERATE-QUESTIONS()
Foreach q ∈ Q
    l(q) ← APPROXIMATE-ACTION-LANDMARK(q, A)
M ← SAMPLE-STUDENT-POPULATION(PM)
Foreach s ∈ M
    Proficiency(s) ← Σi wi · si
⪯∗ ← {(s1, s2) | Proficiency(s1) ≤ Proficiency(s2)}
Foreach (s, q) ∈ M × Q
    ψ(s, q) ← 1 if ∃t ∈ l(q) s.t. t ⊆ s, else 0
        # set notation for student s ∈ M is used for simplicity
For any exam e and students s1, s2 ∈ M:
    grade(s1, e) ≜ Σ_{1≤i≤ke} wi · ψ(s1, qi)
    ⪯e ≜ {(s1, s2) | grade(s1, e) ≤ grade(s2, e)}
    U(e) ≜ τ(⪯∗, ⪯e)    # or other correlation measure
EXAM BUILD:
    Initialize exam ← Empty Exam
    exam ← ADD-QUESTIONS(exam, U, ke)
    exam ← SWAP-QUESTIONS(exam, U)
    ∆ ← 0.05    # or any other small value
    improved ← TRUE
    While improved
        (exam, improved) ← ADJUST-WEIGHTS(exam, U, ∆)
        ∆ ← ∆/2
Figure 3.2: MOEG pseudo-code
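The answer indicator ψ used in Figure 3.2 reduces to a subset test against the question's approximated landmark sets. A minimal Python sketch (names are ours):

```python
def psi(student_abilities, landmarks):
    """psi(s, q): the student answers q correctly iff some landmark set
    (the abilities required along one solution family of q) is fully
    contained in the student's mastered abilities."""
    return 1 if any(t <= student_abilities for t in landmarks) else 0
```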
Chapter 4
Educational Domains
In this chapter we describe two fully implemented procedural domains used for the empirical
evaluation of the MOEG framework. The first domain, typical of middle-school algebra courses,
is that of single-variable linear equations. The second domain is trigonometric equations, a
subject typically found in high-school mathematics curricula.
4.1 Algebra Domain
In the Algebra domain students are asked to solve for x in single-variable linear equations such
as:
• 2x + 5 = 13
• x − 3 = −8 + 8x + 3(−2x + 3)
• 2 − (−4x + 2(x + 6 − 3x)) = 4x
4.1.1 Algebra Ability Set
The ability set A consists of 18 types of algebraic manipulations. We define the following 5
main action types and later decompose them into subtypes:
(U) Unite: Merge two consecutive terms of the same type (variable or constant), e.g., −2x + 7x ⇒U 5x.

(O) Open multiplication: Apply the distributive property to eliminate one pair of parentheses, e.g., −3(x + 2) ⇒O −3x − 6.

(D) Divide: If both sides of the equation contain one term and the coefficient of one is a divisor of the other, divide both sides by it, e.g., 2x = −8 ⇒D x = −4.

(M) Move: Move a single term from one side of the equation to the other and place its negation following a term of the same type, e.g., 5x − 2 = −6 + 3x ⇒M 5x = −6 + 2 + 3x.
(R) Rearrange: Move a single term within one side of the equation so that it follows a term of the same type, e.g., x − 8 + 7x = 10 ⇒R x + 7x − 8 = 10.
Each such action type is further decomposed into subtypes according to the parameters it
is applied over. For example, Unite (U ) is decomposed into 8 subtypes according to 3 binary
arguments: the type of terms united (variable or constant), the sign of the first term (’+’ or ’-’),
and the sign of the second. In a similar manner each action type is decomposed, giving us a
total of |A| = 18 abilities:
U(8) : {U〈t, s1, s2〉 : t ∈ {v, c}, s1, s2 ∈ {−,+}}
O(2) : {O〈s〉 : s ∈ {−,+}}
D(2) : {D〈s〉 : s ∈ {−,+}}
M(4) : {M〈t, s〉 : t ∈ {v, c}, s ∈ {−,+}}
R(2) : {R〈t〉 : t ∈ {v, c}}.
4.1.2 Algebra Question Generation
For this algebraic domain we devised a question-generating algorithm. It starts with an equation
representing the desired solution and repeatedly applies complicating operations while preserving
equation equivalence.
The algorithm receives two parameters: depth (d) and width (w). It begins with a solution
equation of the sort x = c and manipulates it by applying a short random sequence of basic
operations, resulting in an equation of the form a1x + b1 = a2x + b2, where the parameters
a1, a2, b1, b2 may be 0, 1, or any other value. The algorithm then iteratively performs d “deepening”
manipulations, transforming expressions of the sort ax + b to a′x + b′ + c(a′′x + b′′) while
maintaining that a = a′ + ca′′ and b = b′ + cb′′. These deepening manipulations are performed
on random levels of the equation tree structure, and so may be applied to an inner part created
by a previous iteration.
The algorithm continues with w “widening” iterations where a random term is split into a
pair of terms, i.e., b⇒ b′+ b′′ or ax⇒ a′x+ a′′x (where b = b′+ b′′, a = a′+ a′′). Finally, all
terms are shuffled recursively to produce a random permutation. In all manipulations performed,
the algorithm ensures that the newly formed coefficients are bounded by some constant (100).
The set of candidate exam questions in this domain, Q, was produced by applying the
described procedure 10 times with each (w, d) value pair in {0, 1, 2, 3}², resulting in a set of
size 160.
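The core equivalence invariant of a "deepening" step can be sketched as follows; the coefficient sampling ranges are our assumption, as the thesis only requires coefficients bounded by 100:

```python
import random

def deepen(a, b, bound=100):
    """One deepening step: rewrite a*x + b as a1*x + b1 + c*(a2*x + b2)
    with a = a1 + c*a2 and b = b1 + c*b2, keeping coefficients bounded."""
    while True:
        c = random.choice([k for k in range(-9, 10) if k != 0])
        a2 = random.randint(-9, 9)
        b2 = random.randint(-9, 9)
        a1, b1 = a - c * a2, b - c * b2  # enforce the equivalence invariant
        if all(abs(v) <= bound for v in (a1, b1, c, a2, b2)):
            return a1, b1, c, a2, b2
```

Whatever coefficients are drawn, the returned decomposition reconstructs the original expression, so the equation's solution is unchanged.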
4.2 Trigonometry Domain
The second domain is trigonometric equations, in which students are asked to solve for x
representing an angle parameter. Example questions from this domain include:
• 2 − 4sin(2x) = 0
• −sin²(x) + cos²(x) − cos(2x) + sin(x) = 0
• 2cos²(x) − 2sin²(x) = 1
4.2.1 Trigonometric Ability Set
The ability set A consists of 25 types of solving actions based on known trigonometric identities.
The definition chosen for these actions abstracts out basic algebraic knowledge, which
is considered a prerequisite for this domain. For example, cos(x + π/2) = 0, cos(x) = 1/2, and
1 − 2cos(3x) = 0 are all solved using the same ability: solving an atomic trigonometric equation
of canonical form cos(mx + n) = c. We call this atomic ability TrigEqConst:Cos and call
its counterparts for other trigonometric functions TrigEqConst:Sin and TrigEqConst:Tan.
Similarly, we have 3 matching abilities, TrigEqTrig:〈fun〉, for solving equations of the sort
fun(m1x + n1) = fun(m2x + n2), where 〈fun〉 ∈ {sin, cos, tan}. A third type of atomic
solution ability, QuadTrig:〈fun〉, solves quadratic equations in a basic trigonometric function,
e.g., 2cos²(3x) + 5cos(3x) − 3 = 0. The rest of the ability set is composed
of trigonometric identity applications, summarized in Figure 4.1.
4.2.2 Trigonometric Question Pool
The question pool used in this domain was collected from 3 different educational websites1,
along with questions we authored ourselves. Altogether the pool consists of 75 questions, testing
all of the defined domain abilities. Domain questions could also have been generated automatically,
but we preferred to exemplify this alternative question source for our second domain.
1 http://tutorial.math.lamar.edu/Extras/AlgebraTrigReview/SolveTrigEqn.aspx
http://dbhs.wvusd.k12.ca.us/ourpages/auto/2009/5/8/55918698/ch7 trig equation worksheet.pdf
http://perrysprecalculus.weebly.com/uploads/1/3/4/5/13450290/study guide 7.7-7.8 answers.pdf
Ability name         Pre-condition ⇒ Post-result
TrigEqConst:Sin      sin(mx+n) = c ⇒ SOL
TrigEqConst:Cos      cos(mx+n) = c ⇒ SOL
TrigEqConst:Tan      tan(mx+n) = c ⇒ SOL
TrigEqTrig:Sin       sin(m1x+n1) = sin(m2x+n2) ⇒ SOL
TrigEqTrig:Cos       cos(m1x+n1) = cos(m2x+n2) ⇒ SOL
TrigEqTrig:Tan       tan(m1x+n1) = tan(m2x+n2) ⇒ SOL
QuadTrig:Sin         a·sin²(mx+n) + b·sin(mx+n) + c = 0 ⇒ SOL
QuadTrig:Cos         a·cos²(mx+n) + b·cos(mx+n) + c = 0 ⇒ SOL
QuadTrig:Tan         a·tan²(mx+n) + b·tan(mx+n) + c = 0 ⇒ SOL
SinToCos             sin(mx+n) ⇒ cos(π/2 − mx − n)
CosToSin             cos(mx+n) ⇒ sin(π/2 − mx − n)
CosNegate            cos(−(mx+n)) ⇒ cos(mx+n)
SinNegate            sin(−(mx+n)) ⇒ −sin(mx+n)
SinSqToCosSq         sin²(mx+n) ⇒ 1 − cos²(mx+n)
CosSqToSinSq         cos²(mx+n) ⇒ 1 − sin²(mx+n)
SinSqPlusCosSq       sin²(mx+n) + cos²(mx+n) ⇒ 1
DoubleParamSinFrom   sin(2(mx+n)) ⇒ 2·sin(mx+n)cos(mx+n)
DoubleParamSinTo     2·sin(mx+n)cos(mx+n) ⇒ sin(2(mx+n))
DoubleParamCosFrom   cos(2(mx+n)) ⇒ cos²(mx+n) − sin²(mx+n)
DoubleParamCosTo     cos²(mx+n) − sin²(mx+n) ⇒ cos(2(mx+n))
SqrtEquation         ∏i ti² = ∏j sj² ⇒ ∏i ti = ±∏j sj,
                     where {ti} and {si} are basic trig. functions or constants
DivByTrig:Sin        sin(mx+n)·∏i ti = sin(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. sin(mx+n) ≠ 0
DivByTrig:Cos        cos(mx+n)·∏i ti = cos(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. cos(mx+n) ≠ 0
DivByTrig:Tan        tan(mx+n)·∏i ti = tan(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. tan(mx+n) ≠ 0
MultCosForTan        e.g., tan(x) − 2sin(x) = 0 ⇒ sin(x) − 2·sin(x)cos(x) = 0, s.t. cos(x) ≠ 0
Figure 4.1: Trigonometry domain abilities
Chapter 5
Evaluation
In this chapter we present an empirical evaluation of the MOEG framework over the two
procedural domains described in the previous chapter.
5.1 Empirical Methodology
Ideally we would have liked to evaluate the exams by measuring their fitness over the entire
population of models. Of course, this is computationally infeasible for the same reasons that led
us to use the utility sample in the first place. We therefore compromise and use another sample,
the “oracle”, for evaluation. Two things are noteworthy regarding this oracle sample: first, it is
taken independently from the utility sample, and second, it is considerably larger. The resulting
evaluation function is therefore (1) unbiased in evaluating the algorithm’s produced exams, and
(2) a better approximation of the entire student model distribution. For completeness, we present
graphs showing the values of both the guiding utility and the oracle evaluation.
We conducted experiments for both the relative grading and absolute grading settings. In
the relative grading setting four main independent variables were tested. The first two are
the size of the sample used for the utility guiding the search and the exam length ke. The
third and fourth variables control the distribution of students PM and the simulated teacher-
specified ability weights w. The student distribution is defined by a collection of ability mastery
probabilities {pi} as described in Section 3.2. The specification of these mastery probabilities
was abstracted by one variable, εp, controlling the variance between abilities. With a set value
for εp, each pi value was independently sampled from the uniform distribution Uni(0.75 ± εp).
A similar approach was used to sample the ability weights: wi ∼ Uni(1 ± εw) for a set value
of εw. The third and fourth independent variables are therefore εp and εw. In the absolute
grading setting, addressed in Section 5.8, we experimented with the parameters controlling
the target grade mapping, namely µ, σ. Default values, used unless mentioned otherwise, are
SampleSize = 400, ke = 10, εp = 0.15, εw = 0.5, µ = 0.5, σ = 0.1.
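The parameter sampling described above can be sketched as follows (the function name is illustrative):

```python
import random

def sample_setting(n_abilities, eps_p=0.15, eps_w=0.5):
    """Draw per-ability mastery probabilities p_i ~ Uni(0.75 +/- eps_p)
    and ability weights w_i ~ Uni(1 +/- eps_w)."""
    p = [random.uniform(0.75 - eps_p, 0.75 + eps_p) for _ in range(n_abilities)]
    w = [random.uniform(1 - eps_w, 1 + eps_w) for _ in range(n_abilities)]
    return p, w
```

With the default values, the algebra domain's 18 abilities receive mastery probabilities in [0.6, 0.9] and weights in [0.5, 1.5].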
A sample of size 1000 was used for the oracle. All results presented are based on 50
independent experiment runs using the same question set Q. The derivation of their action
landmarks through our anytime approximation method (Figure 3.1) was performed once (with
Tlim = 30 sec, patience = 10).
5.2 Performance Over Time
We tested the performance of the MOEG algorithm and compared it to four baseline algorithms
we defined1. Uniform generate & test generates random uni-weight exams and evaluates them
using the search utility, maintaining the best exam found so far. A variation is the weighted
generate & test, which makes a biased selection of exam questions inspired by the item response
theory concept of item information [DA09], reflecting a question’s usefulness in exams. It
selects questions with probability proportional to their information level, defined as p(1− p),
where p is the proportion of utility sample students answering the question correctly. Note that
these two baseline algorithms use a key component of our framework — the utility function.
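The biased question selection of the weighted generate & test baseline can be sketched as follows; the roulette-wheel sampling detail is our assumption:

```python
import random

def information_weighted_choice(questions, p_correct, k):
    """Draw up to k distinct questions with probability proportional to
    the item information p*(1-p), where p_correct[q] is the fraction of
    utility-sample students answering q correctly."""
    info = {q: p_correct[q] * (1 - p_correct[q]) for q in questions}
    pool = [q for q in questions if info[q] > 0]  # zero-information items are never drawn
    chosen = []
    for _ in range(min(k, len(pool))):
        r = random.random() * sum(info[q] for q in pool)
        for q in pool:                            # roulette-wheel selection
            r -= info[q]
            if r <= 0:
                chosen.append(q)
                pool.remove(q)
                break
    return chosen
```

Questions everyone answers (p = 1) or no one answers (p = 0) carry zero information and are never selected.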
Two additional baseline algorithms were devised in the algebra domain. The first is the
diversifier, which attempts to maximize the diversity of exam questions in terms of syntactic
features, under the assumption that this better differentiates between students. We defined
six question features: number of constant terms, variable terms, positive coefficients, negative
coefficients, parentheses, and overall terms. The feature values are normalized as Z-scores to
account for the different scales. The algorithm starts with a random question, then iteratively
selects a question to add, maximizing the sum over pairwise Euclidean distances between exam
questions. This is followed by a similar swapping phase.
The second is the similarity maximizer. It takes an opposite approach to the diversifier,
which could also be considered reasonable — find a good question and others similar to it, in
terms of features. The first question selected is the one maximizing utility value, and thus the
algorithm is identical to MOEG at the first data point. Further question selections are made so
that the feature distance, as defined by the diversifier, is minimized.
Figure 5.1 displays MOEG’s improvement over time during the question selection and
swapping phases, compared to the baseline competitors. We can see that the MOEG curve
surpasses all others in both domains, even with partial exams. The data shows that the performance
of the weighted and uniform generate & test baselines is practically equivalent. The diversifier and
similarity maximizer both exhibit surprisingly poor performance. Hence, we did not attempt to
define the question features these algorithms require in the trigonometric domain.
5.3 The Effect of Exam Length on Performance
We expect that longer exams will allow a better ordering of the student population. Figure 5.2
presents how the exam length ke affects the performance of the exam generation algorithm,
averaged over 50 runs. Each run was executed with independently drawn oracle and utility
samples. As expected, both curves are monotonically increasing with a diminishing slope. We
note that the utility is always higher than the oracle. This is to be expected as the search algorithm
1We could not compare MOEG to testsheet composition methods such as [HLL06], as they work with completely different input and cannot be applied to the setup we use.
[Plot: oracle value (Kendall τ) vs. time in exam evaluations; series: MOEG, Weighted G&T, Uniform G&T, Diversifier, Similarity maximizer]
Figure 5.1: Performance over time
(Algebra-top, Trigonometry-bottom)
tries to optimize this utility. The difference between the search utility and the oracle may be
considered analogous to the difference between training and testing accuracies in classification
tasks.
5.4 The Effect of Sample Size on Performance
The search process is guided by a sample-based utility. We expect that increasing the size of the
sample will improve the quality of the utility function. Figure 5.3 shows the effect of the utility
sample’s size on performance. Each point in the curve represents an average of 50 runs over
[Plot: value (Kendall τ) vs. exam length; series: Oracle, Utility]
Figure 5.2: The effect of exam length on performance
(Algebra-top, Trigonometry-bottom)
different utility samples.
Indeed, we can see that the performance of the algorithm, as measured by the oracle,
improves with the increase in sample size. The difference between the value of the oracle and
that of the search utility can be viewed as the estimation error of the search utility. We can see
that this error decreases as we use larger sample sizes.
[Plot: value (Kendall τ) vs. sample size; series: Oracle, Utility]
Figure 5.3: The effect of sample size on performance
(Algebra-top, Trigonometry-bottom)
5.5 Framework Stability
Figure 5.4 and Figure 5.5 show how εw and εp affect the oracle evaluation. The values displayed
are the mean and standard deviation of the oracle evaluation over 50 runs for each εw value
with the default εp, and vice versa. The general observation is that variance in the ability weights
and mastery probabilities has a relatively minor influence on the algorithm's performance.
We conclude that the MOEG framework is stable with respect to the model population and the
ability weights.
[Plot: oracle value (Kendall τ) vs. εw; series: Oracle]
Figure 5.4: The effect of εw on performance
(Algebra-top, Trigonometry-bottom)
5.6 Domain Coverage
In this section we examine another desirable property from an exam — domain coverage. Any
educator would, no doubt, agree that a good exam should test a large portion of the domain. We
wish to test if exams produced by MOEG exhibit good domain coverage, but to do so must first
define this concept. An ability is considered required in order to answer a question iff it is used
in all solution paths. Now, the coverage rate of an exam is defined as the number of abilities
required for at least one of the exam questions.
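Under this definition, computing the coverage rate is a union of per-question required-ability sets. A minimal sketch (the `required` mapping is assumed given by the landmark analysis):

```python
def coverage_rate(exam_questions, required):
    """Number of abilities required (used on every solution path) by at
    least one exam question; `required` maps question -> ability set."""
    covered = set()
    for q in exam_questions:
        covered |= required[q]
    return len(covered)
```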
Figure 5.6 compares the coverage rate of the 50 uni-weight MOEG-generated exams in
[Plot: oracle value (Kendall τ) vs. εp; series: Oracle]
Figure 5.5: The effect of εp on performance
(Algebra-top, Trigonometry-bottom)
algebra, using default parameters, against a pool of 1000 random exams. It is evident that
MOEG exams exhibit significantly higher coverage rates — a desirable property. One may
wonder why the maximal coverage displayed is 10 abilities while there are 18 abilities in the
algebra domain. Data shows that 10 abilities is, in fact, the maximal coverage rate possible,
achieved by an exam of all 160 questions in Q. The reason for this is that many abilities are
not required for any question since they are interchangeable. For example, −2x = −8 can
be solved by dividing by a negative value (−2), or by swapping the sides of both terms and
then dividing by a positive value (2). Therefore, no abilities are considered to be required for
this question. In the trigonometric domain this phenomenon is even more pronounced: most
[Histogram: frequency of exam coverage rates (4–10 abilities); series: Random, MOEG]
Figure 5.6: Algebra domain coverage
questions have no required abilities at all. The matching graph for the trigonometric domain
was therefore omitted.
5.7 An Alternative Search Algorithm
Our search algorithm first finds a set of questions and then finds a set of weights. An interesting
variation of this algorithm is to mix these two tasks by allowing question swapping also during
the weight adjustment phase. It may have been reasonable to expect that this would result in
better performance due to the additional flexibility we allow the algorithm. However, results
show that this is not the case. Over 50 runs with the same samples used for both algorithms, the
proposed alternative algorithm yields nearly equivalent results. The runtime required, however,
is significantly shorter for the original, as may be expected due to the smaller branching factor
in the search.
5.8 Absolute Grading Setting
In what follows, we present the results of our experiments with the absolute grading setting.
Figure 5.7 shows the improvement of MOEG over time compared to the baselines. The values
displayed are in terms of the average grade error, i.e., the L1 distance between the exam grade
vector and the target grade vector, divided by their length.
Next, we examine the effect µ and σ, which define the target grade mapping, have upon
performance. The results of these two experiments are presented in Figure 5.8 and Figure 5.9.
Both figures exhibit U-shaped plots, but for different reasons. The µ parameter effectively
[Plot: oracle evaluation vs. time in exam evaluations; series: MOEG, Uniform G&T, Weighted G&T]
Figure 5.7: Performance over time (absolute grading)
(Algebra-top, Trigonometry-bottom)
controls the difficulty level of exam questions. The higher the mean target grade is, the easier
exam questions need to be with respect to the student population. Therefore, when this parameter
is taken to any extreme, the fitness of most candidate questions decreases. To confirm this
explanation, we extracted a difficulty histogram from our question sets, displayed in Figure 5.10.
The difficulty of each question is defined as the proportion of oracle models able to answer it
(over all 50 runs). The concentration of questions around the optimal µ values (in both domains)
allows the algorithm more flexibility in selecting domain questions of the appropriate difficulty,
thereby resulting in better performance there.
We turn to explain the effect σ has on performance as depicted in Figure 5.9. Consider the
[Plot: error (average diff) vs. target mean grade µ; series: Oracle]
Figure 5.8: The effect of µ on performance (absolute grading)
(Algebra-top, Trigonometry-bottom)
problem of matching the grade distribution with the default mean µ = 0.5 and extreme STD
σ = 0. In such a case, the algorithm is required to find an exam for which all student grades
tend to 0.5, regardless of ability level. This is, of course, a very difficult task since all questions
inherently favor the more proficient students, i.e., they are more likely to be answered by them.
As expected, the algorithm’s performance improves as σ increases from 0 up to an optimal
performance for some σ. After this point, the target grade mapping gradually becomes harder
to comply with. This is because students with similar ability vectors usually answer the same
questions correctly. Increasing the value of σ induces a target grade mapping where similar
students are assigned grades which are more dissimilar, making the problem more difficult.
[Plot: error (average diff) vs. target grade STD σ; series: Oracle]
Figure 5.9: The effect of σ on performance (absolute grading)
(Algebra-top, Trigonometry-bottom)
[Histogram: question frequency vs. success proportion]
Figure 5.10: Question difficulty histogram
(Algebra-top, Trigonometry-bottom)
Chapter 6
Related Work
This chapter presents three interrelated research fields relevant
to the task of exam generation. We begin with a short survey of item response theory
(IRT), a statistical paradigm for test design and analysis. The purpose of the survey is to provide
a flavor of a different approach for dealing with problems similar to ours. It is followed by
the discussion of a group of works addressing testsheet composition as integer programming
problems, utilizing various optimization techniques. The chapter is concluded with an overview
of Intelligent Tutoring Systems (ITS), a promising class of educational software. Our work was
inspired by ITS theory, and the notion of the student model was adopted from it. In future work it
may be desirable to use more sophisticated student models than those used in this work. Several
such options are discussed here. Test generation serves an important role in tutoring; therefore
we believe our work may prove beneficial to the ITS research community.
6.1 Item Response Theory
In the following section we present a short survey of Item Response Theory (IRT), introducing
the core principles, assumptions, and concepts of the theory.
6.1.1 IRT Overview
IRT is a statistical paradigm dealing with the design, analysis, and scoring of both tests and
questionnaires, based on statistical models derived from response data. Two major populations
use IRT to aid their cause: educators administering educational tests, and psychologists adminis-
tering psychological questionnaires [ER13]. The main difference between these applications of
the theory is that responses to educational tests generally indicate ability, while responses to
psychological questionnaires generally indicate beliefs or attitudes. Other than these different
interpretations of item responses, the models are quite similar. We shall focus the discussion
in this survey on educational applications. Common IRT educational tasks include test devel-
opment, test equating, and evaluation of item bias [HS85], as well as adaptive testing [GC05],
criterion-referenced measurement, and appropriateness measurement [LR79, HT83].
The term item serves as an abstraction for different types of questions, including true or
false, multiple choice, fill in the blank, and short-answer questions. Most attention has been
devoted to dichotomous items in which an answer may be either correct or incorrect, but
generalizations to polytomous items, in which each response has a different score value, have
also been made [ON06, NO10]. A correct response is assumed to suggest a certain level of
ability expressed in the form of an unobserved ability parameter θ, associated with the student.
In some cases the ability tested may be identified explicitly, for example, as general intelligence,
mathematical proficiency, or reading comprehension capability. However, such an explicit
identification is often not possible. Fortunately, it is not actually necessary. The theory
assumes the existence of some underlying ability even when the questions are of different nature
and common grounds are not easily identified. Therefore, the ability level is sometimes also
referred to as a latent trait. Classic IRT models make an assumption of unidimensionality,
by which test performance depends on a single ability parameter. Although multidimensional
models have also evolved [Rec09], unidimensional models are still most commonly used as
they simplify mathematical analysis considerably while retaining reasonable fit with respect to
empirical data.
IRT, also known as modern test theory, is generally viewed as an improvement over classical
test theory [HVdL82, HS85]. It relaxes its predecessor’s assumption that all items are
parallel instruments [AHHI94]. IRT characterizes each item with an Item
Characteristic Curve (ICC) - mapping a student’s ability level to success probability. The theory
provides methods for estimating item parameters (e.g., difficulty, discrimination) and student
ability levels from response data. A nice property of these methods is that item parameter
estimation does not depend on the overall ability and diversity levels of the examinees. The
converse is also true: student ability levels may be estimated using any set of items. Several
new testing problems have been made addressable with IRT: test design, identification of biased
items, equating of test scores, and identifying maximum discriminatory power of items.
6.1.2 ICC Models
The core assumption of IRT is that the probability of student success on a certain item is
dependent on the student’s ability level, or latent trait, denoted θ [Tuc46]. The mapping
between the ability level and the probability of success on an item is defined by its ICC (Item
Characteristic Curve). The first ICC models defined were of the normal-ogive family [Lor52].
These imposed considerable computational difficulties leading to the development of the logistic
model family used today. The two-parameter logistic model (2pl) [Bir57, Bir58b, Bir58a,
Bir68] defines the ICC as follows:

P_i^{2pl}(θ) = e^{D a_i(θ−b_i)} / (1 + e^{D a_i(θ−b_i)}).   (6.1)
Observe that the success probability increases as θ − b_i increases. That is, the “stronger” a
student is relative to b_i, the more likely he is to succeed; at θ = b_i the success probability
is exactly 0.5. The a_i parameter is proportional to the slope of P_i(θ) at the point
θ = b_i. A steep slope at b_i indicates that an item discriminates very well between students with
abilities in the proximity of θ = b_i. The b_i and a_i parameters are therefore called the item’s
difficulty and discrimination parameters, respectively. The D parameter is used for scaling and is
typically set to 1.7 [Hal52].
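As a concrete illustration of Equation 6.1, the 2pl curve can be computed directly. The following is a minimal sketch; the item parameters (a = 2, b = 0.5) are illustrative values, not drawn from any calibrated item pool:

```python
import math

D = 1.7  # standard scaling constant [Hal52]

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of success at ability theta."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

# At theta = b the success probability is exactly 0.5, and it grows
# monotonically with theta - b:
print(icc_2pl(0.5, a=2.0, b=0.5))   # -> 0.5
print(icc_2pl(1.5, a=2.0, b=0.5))   # well above 0.5
print(icc_2pl(-0.5, a=2.0, b=0.5))  # well below 0.5
```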
Other dichotomous IRT models are described by the number of parameters they employ
[TO01]. The 1pl model assumes equal discrimination for all items; its ICC is obtained by
substituting a_i = 1 in Equation 6.1. A special property may be observed when comparing
ICCs of two items in this model, known as specific objectivity. This sometimes desirable
property states that, given a set of students of different abilities, they will all agree regarding
the difficulty order of an item set. Two 1pl ICC curves with different difficulty parameters
will never overlap, hence the model is regarded as sample independent. Although theoretically
appealing on these grounds, the model makes two strong assumptions often criticized as being
unrealistic simplifications [Tra83]. The first is that all items have equal discriminating power,
and the second is that the probability of a student guessing the correct answer is negligible.
The 2pl model allows items to vary in discriminating power, while accommodating
the possibility of guessing led to the introduction of the 3pl model. The third parameter
introduced (c_i) serves as a horizontal lower asymptote of the ICC. Therefore, the success
probability of a student whose ability level approaches −∞ is defined to be c_i.
The c_i parameter reflects situations where students holding no relevant knowledge successfully
guess correct responses. It is therefore called the guessing or pseudo-guessing parameter.
For example, in a 5-option multiple-choice setting, c_i = 0.2 would be considered
reasonable. Figure 6.1 graphically shows the ICC function of the 3-parameter logistic model:
P_i^{3pl}(θ) = c_i + (1 − c_i) · e^{D a_i(θ−b_i)} / (1 + e^{D a_i(θ−b_i)}).   (6.2)
The 4pl model [McD67] accommodates the possibility that even the most proficient students
may sometimes be wrong. Thus, the fourth parameter (γ_i) may be viewed as the counterpart of
c_i in that it serves as a horizontal upper asymptote of the ICC. The 4pl model does not provide
any particular practical gains [BL81], and is thus less often used in the field than the 1pl, 2pl, and
3pl models.
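The lower asymptote of the 3pl curve in Equation 6.2 is easy to verify numerically. A minimal sketch, again with made-up item parameters:

```python
import math

D = 1.7

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC with pseudo-guessing floor c."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# The curve approaches c (not 0) as theta -> -inf, and 1 as theta -> +inf;
# at theta = b the success probability is c + (1 - c)/2.
print(icc_3pl(-10.0, a=2.0, b=0.5, c=0.2))  # close to 0.2
print(icc_3pl(10.0, a=2.0, b=0.5, c=0.2))   # close to 1.0
print(icc_3pl(0.5, a=2.0, b=0.5, c=0.2))    # -> 0.6
```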
A large variety of other model types were developed to address different test
settings, specifically items which are not dichotomously scored. One notable variation is by
Bock and Samejima, who address polytomously scored items [Boc72, Sam72]. In this setting,
differential scoring of answer options is said to improve reliability and validity of mental test
scores [WS70]. The ICC curve for dichotomous items is replaced by a set of item option
characteristic curves, each representing the probability of a certain answer given the ability
level. It may be expected that for a reasonable item, the curve for the correct option will be
monotonically increasing, but this is not necessarily true for curves of other options.
[Plot: success probability vs. latent ability trait, annotated with the pseudo-guessing parameter (c = 0.2), difficulty parameter (b = 0.5), and discrimination parameter (a = 2)]
Figure 6.1: The 3pl ICC curve
6.1.3 Common IRT Tasks
IRT aims to provide a basis for making predictions, estimates, or inferences both about student
abilities and item characteristics. Since the ability score θ is never explicitly observable, it must
be deduced from response data. A complication arises since the selection of an accurate model
and its parameter values must also be made congruently. A common workflow for obtaining
ability score estimates includes the following steps: data collection, model selection, parameter
estimation, and scaling of score values [HS85].
A common IRT task is the estimation of a student’s ability score: Given a student’s response
data to a set of items, and assuming that the item ICC model and parameters are known, what is
a good estimate for the student’s ability score? A classic approach for dealing with this problem
uses the maximum likelihood estimator (MLE). Since the ICC functions in the typical models
are non-linear, numerical procedures, e.g., Newton-Raphson, are usually applied. The task of
ability score estimation is naturally expanded to cases in which we wish to estimate student
ability and item difficulty simultaneously. Procedures based on conditional MLE have been
developed to handle such simultaneous estimation tasks. Another approach for dealing with
these is the Bayesian approach, building upon distributions of student ability scores [Owe75].
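The MLE procedure described above can be sketched for the 2pl model, where the Newton-Raphson update has a closed form: the score is Σ D a_i (u_i − P_i(θ)) and its derivative is −Σ D² a_i² P_i(θ)(1 − P_i(θ)). The response pattern below is invented for illustration; like the textbook MLE, this diverges for all-correct or all-wrong patterns:

```python
import math

D = 1.7

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def estimate_theta(responses, iters=50):
    """Newton-Raphson MLE of theta from dichotomous responses.

    responses: list of (u, a, b) with u in {0, 1} and known item
    parameters a (discrimination) and b (difficulty).
    """
    theta = 0.0
    for _ in range(iters):
        # Score (first derivative of the log-likelihood):
        score = sum(D * a * (u - p_2pl(theta, a, b)) for u, a, b in responses)
        # Second derivative (always negative, so Newton steps uphill):
        hess = -sum((D * a) ** 2 * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b))
                    for _, a, b in responses)
        theta -= score / hess
    return theta

# A student who solves the two easier items but misses the hard one:
data = [(1, 1.0, -1.0), (1, 1.0, 0.0), (0, 1.0, 1.0)]
theta_hat = estimate_theta(data)
```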
We conclude this short survey with a discussion of the test construction task in IRT. Recall
from the discussion of ICC models that an item generally discriminates best between students
with ability scores in proximity of the item’s difficulty level. Intuitively one may say that the
item provides the most “information” regarding students around that ability level. This notion
of item information is formally defined as the Item Information Function (IIF), mapping ability
level to item information. It is naturally expanded to the Test Information Function (TIF), defined
as the sum of the IIFs of the test’s items. The TIF provides valuable information regarding the ability ranges
for which the test is most effective. Student ability estimates are generally most accurate for
levels in which test-information is high. With some a priori knowledge of the student ability
distribution, tests may be designed to optimize information in the dominant ability regions, thus
improving test reliability [The85]. Another use of TIFs is for qualification exams, where the
students either pass or fail and the actual grades are of no importance. The test designer may
then focus on selecting items with difficulty levels near the qualification threshold, resulting in a
test with maximal information where desired.
6.2 Testsheet Composition
A group of works address the task of exam construction as a combinatorial optimization problem,
known as testsheet composition. The definition of an exam in these works is similar to ours:
a subset of questions from a given collection of candidates, usually without grading weights
assigned. Candidate items are represented by feature vectors expressing different aspects of
them, e.g., difficulty level, solving time, discrimination degree, and association with various
topics [HLL06]. These item parameters, given to the model as input, are assumed to be somehow
acquired from real items. In web-based systems, or when response data is available, previous
test records are analyzed in order to approximate and update the difficulty level, discrimination
degree, and other item parameters [LBE90].
Given the identifying parameters of all candidate items and the input constraints on the
exam, the problem is formulated in terms of a mixed integer programming (MIP) problem with
multiple assessment criteria [Hwa03]. A decision variable is associated with each candidate
item, indicating whether or not the item is selected for the testsheet being constructed. Typical
problem constraints include a required expected solving time, a specified exam length, and a
desired coverage level of topics. Finally, the objective function to be maximized is typically
defined as the discrimination degree of the entire exam. We note that the underlying assumption
in such formulations is similar to ours, i.e., good exams are those capable of discriminating well
between students of different proficiency levels.
Figure 6.2 shows a typical problem formulation of this type by Hwang et al. [HLL06].
The authors explain that the problem expands upon the knapsack problem, which is NP-hard;
they therefore propose a heuristic approximation based on sequential optimization of problem
parameters. An expansion of the model was later proposed for composing a collection of
equivalent testsheets simultaneously without multiple appearances of items [HCYL08]. Creating
parallel testsheets was addressed in other works using different problem formulations. Belov
et al. define the task as a maximum set packing problem, where a maximal number of non-
overlapping testsheets is sought [BA06]. Ishii et al. formulated the problem as a maximum
clique problem, where nodes represent feasible tests, edges represent pairwise-compliance with
overlapping constraints, and the objective is to maximize the number of testsheets complying
with all pairwise-constraints [ISU14].
A large variety of other problem variations have also been investigated. One such example is
the adaptive composition of testsheet requirements according to the various assessment purposes,
addressed in a work by Lin, Su, and Tseng [LST12]. They defined four different types of such
assessment purposes: displacement, summative, formative, and diagnostic. An item selection
Given:      discrimination coefficient of item i: d_i (i = 1, ..., n)
            association degree of item i and concept j: r_ij (i = 1, ..., n; j = 1, ..., m)
            expected time to answer item i: t_i (i = 1, ..., n)
            lower bound on expected relevance of concept j: h_j (j = 1, ..., m)
            bounds on expected time for answering the testsheet: [l, u]
Find:       item decision variables x_i (i = 1, ..., n)
Maximize:   Z = (Σ_{i=1}^{n} d_i x_i) / (Σ_{i=1}^{n} x_i)
Subject to: Σ_{i=1}^{n} r_ij x_i ≥ h_j,  j = 1, ..., m
            l ≤ Σ_{i=1}^{n} t_i x_i ≤ u
            x_i ∈ {0, 1},  i = 1, ..., n

Figure 6.2: Example testsheet composition problem formulation [HLL06]
strategy was applied to filter item candidates, thus reducing problem complexity to a more
manageable level. Genetic algorithms were then used to approximate the optimal solution, as in other
testsheet composition works [HLTL05].
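For toy instances, the formulation in Figure 6.2 can be solved exactly by enumeration, which makes its structure concrete. The instance data below are made up, and real item pools require the heuristics discussed above:

```python
from itertools import product

def best_testsheet(d, r, t, h, l, u):
    """Exhaustive solver for a small testsheet-composition instance:
    maximize average discrimination subject to per-concept relevance
    lower bounds h[j] and total-time bounds [l, u]. Exponential in the
    number of items, so only usable for tiny pools.
    """
    n, m = len(d), len(h)
    best, best_z = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if sum(x) == 0:
            continue
        if not (l <= sum(t[i] * x[i] for i in range(n)) <= u):
            continue
        if any(sum(r[i][j] * x[i] for i in range(n)) < h[j] for j in range(m)):
            continue
        z = sum(d[i] * x[i] for i in range(n)) / sum(x)
        if z > best_z:
            best, best_z = x, z
    return best, best_z

# Toy instance: 4 items, 2 concepts (made-up numbers).
d = [0.9, 0.7, 0.5, 0.3]          # discrimination coefficients
t = [10, 5, 8, 4]                 # expected solving times
r = [[1, 0], [0, 1], [1, 1], [0, 1]]  # item-concept association
sheet, z = best_testsheet(d, r, t, h=[1, 1], l=10, u=20)
print(sheet, z)  # -> (1, 1, 0, 0) 0.8
```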
Integer programming formulations have been linked to item response theory, previously
discussed, as well as to classical test theory [AJW98]. A group of works define a binary integer
program seeking a minimal set of questions, subject to constraints on item information as
defined in IRT [ABTvdL91, The85]. Another work presents a maximin model for test construction
that requires only the relative shape of the target test information function to be specified [vvLBT89]. Yet another
[EAAA08] uses abductive machine learning techniques for identifying item subsets informative
of student proficiency levels, based on IRT models.
A wide range of techniques have been used to solve different testsheet composition formu-
lations. Hwang [Hwa03] applies a fuzzy logic approach to determine difficulty levels of test
items according to the learning status and personal features of students. He then uses clustering
techniques with dynamic programming for constructing feasible testsheets in accordance with
specified requirements. Armstrong et al. define a mathematical programming model maximizing
the classical test theory notion of test reliability, and use network theory and Lagrangian
relaxation techniques to solve their formulation [AJW98]. Other works [HYY06, DZWF12]
use tabu search [Glo89, Glo90], a metaheuristic method for guiding exploration of the solu-
tion space. Duan et al. [DZWF12] apply an analytic hierarchy process [Saa80] in order to
merge multiple objective functions into a unified weighted sum objective. They then combine
biogeography-based optimization [Sim08] with tabu search to solve the composition problem.
Wang et al. [WWP+09] use differential evolution [SP97], a parallel direct search algorithm, for
testsheet composition.
In our work, we searched the papers described above for baseline algorithms to compete
against. The objective of most testsheet composition formulations is the maximization of
overall test discrimination, as in ours. Nonetheless, the definition of discrimination differs
substantially. In testsheet composition, test discrimination is defined as the sum (or average)
of item discrimination parameter over included items. In the absence of additional constraints,
selecting the most discriminatory items would lead to an optimal solution. But of course
additional constraints are present, thereby complicating the problem and requiring different
approximation solutions.
In our formulation, the notion of test discrimination is more subtle and defined through the
order correlation measure. It is not enough for a test to distinguish between different students;
the discrimination must be in accordance with the target student order. This means that proficient
students must be more highly evaluated than less proficient ones, and not the other way around.
In order to unify the problem formulations, we need to define a discrimination parameter as a
single scalar per item. Using our framework components, a reasonable such definition is the
order correlation of the singleton exam composed of the one item being evaluated. But selecting
items maximizing this parameter leads to poor exams. In the extreme case, equivalent items
with maximal discrimination will be selected. The resulting exam will be of the same utility as the
singleton exam of any selected item. Therefore, it makes little sense to define discrimination
parameters per item in our formulation. Other item parameters (difficulty, length, topic coverage,
etc.) may also be derived in our setting, but no constraints on these are included in the problem
definition. The non-linear nature of exam utility in our formulation, as well as the absence of
constraints on other item parameters, make reasonable comparison impossible.
6.3 Intelligent Tutoring Systems
The most common teaching methods in schools and academic institutions today are based
upon frontal presentations by the educator. This traditional approach, derived from educational
essentialism, places the teacher as responsible for transferring knowledge to the students, and
the student as a passive listener [SSZ08]. It has been empirically established that one-on-
one tutoring is a significantly more effective method of teaching [Blo84]. This difference is
quantified as higher grade levels and more time on task, and is also expressed in qualitative aspects
such as interest and attitude towards learning. Tutoring allows the educator more adaptivity to
the academic, motivational, and affective needs of the student, and shifts the student’s role in
the learning process from passive listener to active participant.
The obvious problem with tutoring is that it is impossible to implement in the general public
education system due to practical constraints of time and human resources. This led to the development
of a large variety of educational software over the last fifty years or so, starting with
computer-aided instruction (CAI) applications. A CAI typically contains a collection of educational items
such as reading materials, exercises, exams, and audio clips, sequenced in a predefined order.
Alternatively, the student is allowed to navigate through the material as desired, using some
hypermedia interface. In either case, such educational software generally serves as a passive
electronic book.
[Diagram: the Student, Domain, Expert, Pedagogical, and Communication components and their mutual interactions]
Figure 6.3: Interaction between ITS components [BSH96]
Intelligent Tutoring Systems (ITSs) are the next evolutionary step in educational software.
An ITS is intended to be more of an interactive tutor than a passive knowledge source, adapting
to the student through learning interactions. Brusilovsky states that knowledge sequencing and
task sequencing are key functions of true intelligent tutoring systems [Bru92]. In addition to
such knowledge of teaching strategies, an ITS possesses knowledge of the domain and of the
learner [HS73, SP94]. An ITS consists of five main components [BSH96, Woo]: the domain
knowledge, student model, expert model, pedagogical module, and communication model. The
mutual interaction between system components is depicted in Figure 6.3, adopted from Beck,
Stern, and Haugsjaa [BSH96].
The domain module generally maintains a structured representation of the domain taught,
perhaps an ontology of topics and subtopics related to one another by relations such as
“prerequisite of,” “generalization of,” “harder than,” etc. [MB+00, SMM04]. The expert module uses
the domain knowledge and serves as something of a model of the optimal student; the goal of the
tutoring process is to bring student proficiency to the expert level depicted by the expert
module. This module often takes the form of a “runnable” model, meaning that it provides functionality
for generating a complete correct solution. The pedagogical module is responsible for guiding
the tutoring session. This includes both macro-level tutoring strategies and micro-level exercise
guidance. Ideally, it will support a variety of teaching strategies suitable for a heterogeneous
population of students who may differ in perceptual, cognitive, and affective aspects [JG95].
Lastly, the student module includes a model of the student using the system, used to make
inferences regarding the appropriate tutoring actions. Chrysafiadi and Virvou [CV13] state three
guiding questions which need to be addressed when building a student model: What to model?
How to model? and Why (i.e, for what purpose) to model? The student model may contain
both domain dependent and independent characteristics [YKG10], including dynamic features
such as knowledge and skills, errors and misconceptions, learning styles, references, affective
and cognitive factors, and meta-cognitive factors. In addition, static features such as age, class,
grades, mother tongue, etc. may also prove informative to the tutor.
Nine major approaches for student modeling are currently established in ITS research, all
implemented in existing systems [CV13]:
• Overlay model
• Buggy/Perturbation model
• Stereotyping models
• Bayesian-network models
• Fuzzy-logic models
• Machine-learning methods
• Ontology-based models
• Constraint-based models
• Cognitive-theory approaches
In the overlay model approach, a student’s knowledge state is composed of incomplete but
correct knowledge. Such a model may include binary variables indicating mastery or lack thereof,
but expansions have been made to variables indicating finer proficiency levels. An expansion
to belief vectors indicating ability level estimations has been proposed [BSW97], along with
an updating scheme accounting for acquisition and retention factors of the student. Another
expansion is the perturbation model where mal-knowledge or buggy knowledge may also be
included in the student model [ND08]. Stereotyping models assign users to common stereotypes
based on similarity of initial information such as class and grade level. Stereotyping offers a
reasonable way to deal with new users for whom little information is known [Ric79, Ric83].
More complicated models address uncertainty in student behavior using Bayesian networks or
fuzzy logic. Models based on a machine learning approach use observations of user behavior as
input examples and attempt to learn predictions of future actions [WPB01]. These sophisticated
student modeling approaches offer interesting expansion opportunities for our work, which
currently uses the relatively simple binary overlay model.
Chapter 7
Discussion
The new algorithmic framework presented enables automatic composition of well-balanced
exams, expected to produce the desired ordering of students by proficiency level. Different
exam versions are commonly required for the purpose of make-up exams, practice exams, or
for testing different classes [HCYL08]. Educators may now easily produce any number of
alternative exam versions with absolutely no additional effort. Automation of the composition
process introduces a new level of confidence in the fairness of an exam and in the equivalence
of different exam versions.
Our framework uses a distribution of student models to guide its search for good exams.
In Section 3.2 we show a method for parametric specification of such a distribution. These
p_i parameters can be estimated from a population of real students. With the development of
MOOCs, an abundance of educational data has become available. This data may potentially be
used to build more accurate model distribution approximations.
The following section is dedicated to discussing the applicability of MOEG to other domains,
followed by a section discussing some expansion possibilities for the framework.
7.1 Other Domains
The MOEG framework is applicable to any domain where a set of cognitive abilities A and
a sufficiency predicate ψ may somehow be defined. Automatically deducing ψ via action
landmarks further requires that solutions be represented as paths in search graphs, with A as the
operator set. Several such procedural domains come to mind, one of which is geometry.
A classic ITS paper [ABY85] presents a tutor for teaching geometrical reasoning. The tutor
utilizes a library of inference rules: triangle congruency/similarity, properties of polygons,
transitivity of angle/segment congruency, etc. It is implemented as a production system and
used to generate proofs, using different combinations of production rules. This solving process
may naturally be formalized as a graph search: graph states represent collections of statements
deduced so far, and search operators are applications of the production rules. An operator,
successfully applied, results in a new state with the inferred statement added to those of the
previous state. Goal states are those containing the desired claim of the geometry question.
Analytical geometry:  Circle C with origin (1, 0) and radius 2 ⇒ C: (x − 1)² + y² = 4
                      Point (3, y) on circle C ⇒ y = 0
Trigonometry:         In △ABC: |BC| = 2, ∠B = 90°, ∠A = 30° ⇒ |AC| = |BC| / sin(∠A) = 4
Motion problems:      Position at t = 0 is x₀ = 10 [km], constant velocity V = 5 [km/h] ⇒ x(t) = 10 + 5t

Figure 7.1: Example inferences in different domains
The idea of inference steps as operators is also applicable in various computational domains.
Consider, for example, inferring the intersection point of two functions in analytical geometry,
the length of a right triangle’s hypotenuse in trigonometry, or the location at time t of an object
moving according to x(t) in motion problems. These example inferences may serve as arcs on
solution paths of three different domains in which diverse sets of useful inference rules exist.
Automatically deducing ψ for questions is also possible for domain types other than search
spaces. For example, domains where solutions may be obtained by automated theorem provers
are also MOEG-applicable with a simple extension. The abilities A will be the axioms, lemmas,
and theorems that a student should know, while solutions will be complete proof trees. Given a
proof tree representing a solution, the axioms at the leaves will be considered the set of required
abilities for the solution. Collecting these ability sets from different proofs found by the theorem
prover results in sufficiency predicates of familiar form: disjunctive action landmarks.
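The extension just sketched can be made concrete: given each proof as a tree whose leaves are axioms, collect the leaf sets into a disjunctive sufficiency predicate. A minimal sketch with hypothetical axiom and rule names:

```python
def leaf_abilities(proof):
    """Collect the axioms at the leaves of a proof tree.

    A proof is either an axiom name (a leaf) or a (rule, subproofs) pair.
    """
    if isinstance(proof, str):
        return {proof}
    _rule, subproofs = proof
    abilities = set()
    for sub in subproofs:
        abilities |= leaf_abilities(sub)
    return abilities

def sufficiency_predicate(proofs):
    """One ability set per proof; the predicate is their disjunction."""
    ability_sets = [frozenset(leaf_abilities(p)) for p in proofs]
    return lambda mastered: any(s <= mastered for s in ability_sets)

# Two alternative proofs of the same claim (hypothetical axioms):
proofs = [("modus_ponens", ["ax1", ("and_elim", ["ax2"])]), "ax3"]
psi = sufficiency_predicate(proofs)
print(psi({"ax3"}))          # True: knows the second proof's axiom
print(psi({"ax1"}))          # False: the first proof also needs ax2
```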
In other domains, different methods for deducing ψ may exist. Consider the domain of
combinatorial problems in which problems are given in text, e.g., “How many non-empty
subsets does a set of size N have?” It is natural here to define domain abilities as combinatorial
concepts such as non-redundant combination (nrc), summation principle (sum), redundant
permutation (rp), or subtraction principle (sub). Automatically deducing ψ from the question
text alone is beyond the state of the art, but given the set of possible solutions, the task
becomes feasible using standard syntactic parsers. As an example, the solution set for the
question above would be {Σ_{i=1}^{N} (N choose i), 2^N − 1}, where the resulting sufficiency predicate is
ψ = (nrc ∧ sum) ∨ (rp ∧ sub).
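Under the binary overlay student model, where a student is represented by the set of mastered abilities, evaluating such a predicate is a pair of subset tests. A minimal sketch for the example above:

```python
# Binary overlay student model: the set of mastered abilities.
def psi(mastered):
    """Sufficiency predicate psi = (nrc AND sum) OR (rp AND sub)."""
    return {"nrc", "sum"} <= mastered or {"rp", "sub"} <= mastered

print(psi({"nrc", "sum"}))        # True: masters the first solution path
print(psi({"rp", "sub", "nrc"}))  # True: masters the second
print(psi({"nrc", "sub"}))        # False: neither path fully mastered
```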
Lifting the requirement for automatic deduction of ψ altogether expands the group of
candidate domains substantially. The framework may even be applied to declarative domains
such as history, for example. In history, the set of abilities can be a list of historical thinking
skills1: historical causation, patterns of continuity, periodization, comparison, contextualization,
historical argumentation, appropriate use of relevant historical evidence, historical interpreta-
tion, and historical synthesis. For such a domain, designing an algorithm for computing the
required abilities per question is beyond the state of the art, as it requires deep natural language
understanding. However, it should be relatively easy for an educator to identify these manually.
7.2 Framework Expansions
The main contribution of our work is the presentation of a complete framework for exam
generation based on a new notion of exam fitness. In order to demonstrate the utility of our
approach, some simplifying assumptions had to be made in the various framework components.
This work serves as a starting point from which several expansions can be made in future work.
This section is dedicated to discussing some possible expansions and their implications for the
framework.
Let us start by considering the student models. Using student models from the ITS literature
for population-based exam generation, rather than for student-tailored online curriculum planning,
is a new idea that we have not found in the literature. Nonetheless, in this work we adopt
a basic form of student model, the binary overlay model. This selection imposes two main
limitations. First, an ability may either be mastered or not, with no intermediate options possible.
And second, no incorrect knowledge may be expressed in the model. Incorporating non-binary
overlay models or perturbation models [CV13] to deal with these limitations will allow more
realistic student modeling.
Consider, for example, using models composed of belief vectors [BSW97], indicating
non-binary levels of abilities. Along with the incorporation of such models, a more refined
sufficiency predicate is also required. A reasonable option is to define the sufficiency predicate
by a vector of threshold ability levels per solution path of a question. Using such a predicate,
we can conclude that a student model answers a question correctly iff all the model ability
levels are above their matching thresholds in at least one solution path. Several options exist for
defining the threshold ability values per solution path and deducing them automatically. One
option is to define a range of ability levels depending on aspects of the ability-instantiation (e.g.,
characteristics of the arguments in the algebra domain). The threshold abilities of a solution
path will be the maximum ability levels of all ability instantiations on the path. Note that the
resulting predicate is still a disjunctive action landmark, but with threshold conditions on ability
levels instead of atomic binary levels.
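To make this concrete, a minimal sketch of such a threshold-based predicate follows; the ability names, levels, and threshold values are purely illustrative and not taken from our domains:

```python
# Illustrative sketch of a threshold-based sufficiency predicate.
# A student model now maps abilities to levels in [0, 1], and each solution
# path carries per-ability thresholds; the predicate remains a disjunction.

def answers_correctly(student_levels, solution_paths):
    """Return True iff at least one solution path has all of its
    ability thresholds met by the student's ability levels."""
    return any(
        all(student_levels.get(ability, 0.0) >= threshold
            for ability, threshold in path.items())
        for path in solution_paths
    )

# Two hypothetical solution paths for a single question.
paths = [
    {"unite_constants": 0.6, "divide_by_coefficient": 0.8},
    {"unite_variables": 0.7},
]
student = {"unite_constants": 0.9, "unite_variables": 0.5}
print(answers_correctly(student, paths))  # False: no path is fully satisfied
```

Setting every threshold to 1.0 and every ability level to 0 or 1 recovers the original binary overlay predicate as a special case.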
Another expansion of the sufficiency predicate involves lifting the assumption of determin-
ism and replacing the logical predicate with a probabilistic estimator. This seems to be a more
reasonable model of the true world, where even the most skilled students may sometimes fail to
1See advancesinap.collegeboard.org/english-history-and-social-science/historical-thinking-skills
answer a question, even if they have mastered the required skill. Somewhat similar models are
used in item response theory, where the predicted student performance is defined by the item
characteristic curve, mapping ability level to success probability. However, the unidimensional-
ity assumption of standard IRT models, by which a single latent ability parameter is used to
predict student performance, imposes severe limitations on our setting. A core characteristic of
the MOEG framework is that students differ in the abilities they have mastered. This means
that a student may be able to answer one question but not another, while a different student may
be able to answer the second but not the first. Collapsing the student model to a single ability
parameter will cancel this desirable characteristic of the framework. Therefore, incorporating
probabilistic IRT performance prediction into MOEG would reasonably require using the less
common multidimensional latent traits [Rec09]. In such a case, the dimensions of the latent trait
may coincide with our notion of abilities, preferably after their expansion to non-binary values.
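As a concrete illustration of this direction, the sketch below combines per-ability logistic item characteristic curves, in the spirit of multidimensional IRT, into a probabilistic success estimator. The slope parameter, the ability names, and the independence assumption between paths are simplifying assumptions of this sketch, not part of the framework:

```python
# Illustrative probabilistic sufficiency estimator: success probability is a
# logistic function of the gap between a student's ability levels and a
# solution path's thresholds, multiplied over the abilities of the path.
import math

def path_success_probability(levels, thresholds, slope=5.0):
    """Probability of executing one solution path, as the product of
    per-ability logistic curves centered at each threshold."""
    p = 1.0
    for ability, theta in thresholds.items():
        gap = levels.get(ability, 0.0) - theta
        p *= 1.0 / (1.0 + math.exp(-slope * gap))
    return p

def answer_probability(levels, solution_paths):
    """Probability of answering correctly via at least one path,
    assuming (simplistically) that the paths are independent."""
    p_fail = 1.0
    for path in solution_paths:
        p_fail *= 1.0 - path_success_probability(levels, path)
    return 1.0 - p_fail
```

A student whose levels are well above all thresholds obtains a success probability close to, but below, one, reflecting the observation that even skilled students occasionally fail.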
The method for acquiring the action landmarks of questions is another framework aspect
which may be improved. The algorithm for this purpose, presented in Figure 3.1, is simple
yet generic. It requires only the representation of solutions as paths in a search space and
nothing more regarding the specific domain. Although the algorithm is appealing in this respect,
and is theoretically applicable to any search-space domain, practical constraints may pose
difficulties. Consider cases in which very specific combinations of actions are required to reach
a solution and the ability set is considerably large. In such cases, the current method will require
a long time, perhaps even unreasonably so, to find solutions. A desirable expansion of the
landmark algorithm is to use a heuristic function in order to lead the search in an informed
manner. With a heuristic function at hand, a simple modification of the current algorithm
may be to select the next action weighted by the heuristic improvement of each action, either
stochastically or deterministically. This variation has the potential to shorten the algorithm's
runtime and to generate more “human-like” solutions, assuming a quality heuristic function.
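A sketch of such a heuristic-weighted selection step is given below; the heuristic and action-application interfaces are placeholders for illustration, not the actual implementation of the algorithm in Figure 3.1:

```python
# Illustrative heuristic-guided action selection for the landmark search:
# the next action is drawn stochastically, with probability proportional to
# how much it improves a heuristic estimate of distance to a solution.
import random

def select_next_action(state, actions, heuristic, apply_action):
    """Pick an action with probability proportional to its heuristic
    improvement, clipped at a small positive floor so that non-improving
    actions remain possible (preserving completeness of the search)."""
    h0 = heuristic(state)
    weights = []
    for a in actions:
        improvement = h0 - heuristic(apply_action(state, a))
        weights.append(max(improvement, 1e-3))
    return random.choices(actions, weights=weights, k=1)[0]
```

Replacing `random.choices` with an argmax over the weights yields the deterministic variant.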
Finally, we turn to discuss the methods used for specifying the student population and
sampling it. Recall that, for simplicity, we assumed independence between the abilities. This
simplification is rather crude and is generally expected to reflect reality quite inaccurately.
Consider a student, in our algebra domain, who has mastered the
abilities of uniting variable terms of all argument sign combinations. It would be reasonable to
expect that such a student has also mastered the simpler ability of uniting two positive constant
terms. Bayesian networks may be used to model generic ability dependencies in the student
population [GVP90].
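For illustration, a minimal two-ability network expressing such a prerequisite dependency can be sampled as follows; the structure and conditional probabilities are hypothetical:

```python
# Hypothetical two-node Bayesian network over abilities: mastery of the
# advanced ability (uniting signed variable terms) depends on mastery of
# the simpler prerequisite (uniting positive constant terms).
import random

def sample_student():
    basic = random.random() < 0.7   # P(basic mastered) = 0.7
    # Conditional probability table encodes the dependency:
    # mastering the advanced ability is far likelier given the basic one.
    p_adv = 0.5 if basic else 0.05
    advanced = random.random() < p_adv
    return {"unite_positive_constants": basic,
            "unite_signed_variables": advanced}
```

Sampling a population from this network, almost every student who mastered the advanced ability has also mastered the prerequisite, unlike under the independence assumption.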
7.3 Concluding Remarks
This paper presents a generic framework for automating exam composition, an important task in
the pedagogical process. Our work is driven by the philosophical question “what constitutes a
good exam?” Existing test-sheet composition methods answer that a good exam is one with a
high discrimination degree. But what hides behind this mysterious numeric value, assumed to be
a given feature per question? In order to assess how well an exam discriminates between students
of different proficiency levels, we define student models and algorithmically deduce their
performance on real exam questions. From this predicted performance, the discrimination
between student models, in the form of a grading order, may be obtained.
With a teacher-specified order of student models by proficiency, we may define what constitutes
a “good” exam. Under our approach, a good exam is one that orders students according to their
proficiency level, as defined by the teacher. We believe this is an important step towards making
AI techniques practical for improving education.
Bibliography
[ABTvdL91] Jos J Adema, Ellen Boekkooi-Timminga, and Wim J van der Linden.
Achievement test construction using 0–1 linear programming. European
Journal of Operational Research, 55(1):103–111, 1991.
[ABY85] John R Anderson, C Franklin Boyle, and Gregg Yost. The geometry tutor.
In IJCAI, pages 1–7, 1985.
[AHHI94] Arnold Alphen, Ruud Halfens, Arie Hasman, and Tjaart Imbos. Likert or
Rasch? Nothing is more applicable than good theory. Journal of Advanced
Nursing, 20(1):196–201, 1994.
[AJW98] Ronald D Armstrong, Douglas H Jones, and Zhaobo Wang. Optimization
of classical reliability in test construction. Journal of Educational and
Behavioral Statistics, 23(1):1–17, 1998.
[BA06] Dmitry I Belov and Ronald D Armstrong. A constraint programming
approach to extract the maximum number of non-overlapping test forms.
Computational Optimization and Applications, 33(2-3):319–332, 2006.
[Bir57] A Birnbaum. Efficient design and use of tests of a mental ability for
various decision-making problems. Randolph Air Force Base, Texas: Air
University, School of Aviation Medicine, 26, 1957.
[Bir58a] A Birnbaum. Further considerations of efficiency in tests of a mental ability.
Series report no. 17, Project no. 7755-23, USAF School of Aviation Medicine,
Randolph Air Force Base, 1958.
[Bir58b] A Birnbaum. On the estimation of mental ability. Series Rep, (15):7755–23,
1958.
[Bir68] A Birnbaum. Some latent trait models and their use in inferring an exami-
nee’s ability. Statistical theories of mental test scores, 1968.
[BL81] Mark A Barton and Frederic M Lord. An upper asymptote for the
three-parameter logistic item-response model. ETS Research Report Series,
1981(1):i–8, 1981.
[Blo84] Benjamin S Bloom. The 2 sigma problem: The search for methods of group
instruction as effective as one-to-one tutoring. Educational Researcher,
13(6):4–16, 1984.
[Boc72] R Darrell Bock. Estimating item parameters and latent ability when re-
sponses are scored in two or more nominal categories. Psychometrika,
37(1):29–51, 1972.
[Bru92] Peter L Brusilovsky. A framework for intelligent knowledge sequencing
and task sequencing. In Intelligent Tutoring Systems, pages 499–506.
Springer, 1992.
[Bru94] PL Brusilovskiy. The construction and application of student models in
intelligent tutoring systems. Journal of Computer and Systems Sciences
International, 32(1):70–89, 1994.
[BSH96] Joseph Beck, Mia Stern, and Erik Haugsjaa. Applications of AI in education.
Crossroads, 3(1):11–15, 1996.
[BSW97] J. Beck, M. Stern, and B.P. Woolf. Using the student model to control
problem difficulty. Courses and Lectures - International Centre for Mechanical
Sciences, pages 277–288, 1997.
[CV13] Konstantina Chrysafiadi and Maria Virvou. Student modeling approaches:
A literature review for the last decade. Expert Systems with Applications,
2013.
[DA09] RJ De Ayala. The Theory and Practice of Item Response Theory. The
Guilford Press, 2009.
[DZWF12] Hong Duan, Wei Zhao, Gaige Wang, and Xuehua Feng. Test-sheet composition
using analytic hierarchy process and hybrid metaheuristic algorithm
TS/BBO. Mathematical Problems in Engineering, 2012, 2012.
[EAAA08] El-Sayed M El-Alfy and Radwan E Abdel-Aal. Construction and analysis
of educational tests using abductive machine learning. Computers &
Education, 51(1):1–16, 2008.
[ER13] Susan E Embretson and Steven P Reise. Item response theory for psycholo-
gists. Psychology Press, 2013.
[GC05] Eduardo Guzman and Ricardo Conejo. Self-assessment in a feasible,
adaptive web-based testing system. IEEE Transactions on Education,
48(4):688–695, 2005.
[GK54] Leo A Goodman and William H Kruskal. Measures of association for
cross classifications. Journal of the American Statistical Association,
49(268):732–764, 1954.
[Glo89] Fred Glover. Tabu search - Part I. ORSA Journal on Computing,
1(3):190–206, 1989.
[Glo90] Fred Glover. Tabu search - Part II. ORSA Journal on Computing,
2(1):4–32, 1990.
[GM15] Omer Geiger and Shaul Markovitch. Algorithmic exam generation. In
Proceedings of the 24th International Conference on Artificial Intelligence,
pages 1149–1155. AAAI Press, 2015.
[Gro98] Norman E Gronlund. Assessment of student achievement. ERIC, 1998.
[GVP90] Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in
Bayesian networks. Networks, 20(5):507–534, 1990.
[Hal52] David C Haley. Estimation of the dosage mortality relationship when the
dose is subject to error. Technical report, DTIC Document, 1952.
[HCYL08] Gwo-Jen Hwang, Hui-Chun Chu, Peng-Yeng Yin, and Ji-Yu Lin. An inno-
vative parallel test sheet composition approach to meet multiple assessment
criteria for national tests. Computers & Education, 51(3):1058–1072, 2008.
[HD09] Malte Helmert and Carmel Domshlak. Landmarks, critical paths and
abstractions: what’s the difference anyway? In ICAPS, pages 162–169,
2009.
[HLL06] Gwo-Jen Hwang, Bertrand MT Lin, and Tsung-Liang Lin. An effective
approach for test-sheet composition with large-scale item banks. Computers
& Education, 46(2):122–139, 2006.
[HLTL05] Gwo-Jen Hwang, Bertrand MT Lin, Hsien-Hao Tseng, and Tsung-Liang
Lin. On the development of a computer-assisted testing system with genetic
test sheet-generating approach. IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews, 35(4):590–594, 2005.
[HS73] JR Hartley and Derek H Sleeman. Towards more intelligent teaching
systems. International Journal of Man-Machine Studies, 5(2):215–236,
1973.
[HS85] Ronald K Hambleton and Hariharan Swaminathan. Item Response Theory:
Principles and Applications, volume 7. Springer Science & Business
Media, 1985.
[HT83] DL Harnisch and KK Tatsuoka. A comparison of appropriateness indices
based on item response theory. Applications of item response theory, pages
104–122, 1983.
[HVdL82] Ronald K Hambleton and Wim J Van der Linden. Advances in item
response theory and applications: An introduction. 1982.
[Hwa03] Gwo-Jen Hwang. A test-sheet-generating algorithm for multiple assess-
ment requirements. IEEE Transactions on Education, 46(3):329–337,
2003.
[HYY06] Gwo-Jen Hwang, Peng-Yeng Yin, and Shu-Heng Yeh. A tabu search
approach to generating test sheets for multiple assessment criteria. IEEE
Transactions on Education, 49(1):88–97, 2006.
[ISU14] Takatoshi Ishii, Pokpong Songmuang, and Maomi Ueno. Maximum clique
algorithm and its approximation for uniform test form assembly. IEEE
Transactions on Learning Technologies, 7(1):83–95, 2014.
[JG95] Waynne Blue James and Daniel L Gardner. Learning styles: Implications
for distance learning. New directions for adult and continuing education,
(67):19–31, 1995.
[Ken38] Maurice G Kendall. A new measure of rank correlation. Biometrika, pages
81–93, 1938.
[Ken45] Maurice G Kendall. The treatment of ties in ranking problems. Biometrika,
pages 239–251, 1945.
[KV10] Ravi Kumar and Sergei Vassilvitskii. Generalized distances between rank-
ings. In Proceedings of the 19th International Conference on World Wide
Web, pages 571–580. ACM, 2010.
[LBE90] Pedro Lira, M Bronfman, and J Eyzaguirre. Multitest II: A program for
the generation, correction, and analysis of multiple choice tests. IEEE
Transactions on Education, 33(4):320–325, 1990.
[Lor52] Frederic M Lord. The relation of test score to the trait underlying the test.
ETS Research Bulletin Series, 1952(2):517–549, 1952.
[LR79] Michael V Levine and Donald B Rubin. Measuring the appropriateness
of multiple-choice test scores. Journal of Educational and Behavioral
Statistics, 4(4):269–290, 1979.
[LST12] Huan-Yu Lin, Jun-Ming Su, and Shian-Shyong Tseng. An adaptive test
sheet generation mechanism using genetic algorithm. Mathematical Prob-
lems in Engineering, 2012, 2012.
[MB+00] Riichiro Mizoguchi, Jacqueline Bourdeau, et al. Using ontological
engineering to overcome common AI-ED problems. Journal of Artificial
Intelligence and Education, 11:107–121, 2000.
[McD67] Roderick P McDonald. Nonlinear factor analysis. Psychometric mono-
graphs, 1967.
[ND08] Loc Nguyen and Phung Do. Learner model in adaptive learning. World
Academy of Science, Engineering and Technology, 45:395–400, 2008.
[NO10] M.L. Nering and R. Ostini. Handbook of Polytomous Item Response Theory
Models. Routledge, 2010.
[ON06] Remo Ostini and Michael L. Nering. Polytomous item response theory
models. Number 144. Sage, 2006.
[Owe75] Roger J Owen. A bayesian sequential procedure for quantal response in
the context of adaptive mental testing. Journal of the American Statistical
Association, 70(350):351–356, 1975.
[PR13] Martha C Polson and J Jeffrey Richardson. Foundations of Intelligent
Tutoring Systems. Psychology Press, 2013.
[Rec09] Mark Reckase. Multidimensional Item Response Theory. Springer, 2009.
[Ric79] Elaine Rich. User modeling via stereotypes. Cognitive science, 3(4):329–
354, 1979.
[Ric83] Elaine Rich. Users are individuals individualizing user models. Interna-
tional Journal of Man-Machine Studies, 18(3):199–214, 1983.
[Saa80] Thomas L Saaty. The analytic hierarchy process: Planning, priority setting,
resource allocation. McGraw-Hill, New York, 1980.
[Sam72] Fumiko Samejima. A general model for free-response data. Psychometrika
Monograph Supplement, 1972.
[SGR12] Rohit Singh, Sumit Gulwani, and Sriram K Rajamani. Automatically
generating algebra problems. In AAAI, 2012.
[Sim08] Dan Simon. Biogeography-based optimization. IEEE Transactions on
Evolutionary Computation, 12(6):702–713, 2008.
[SMM04] Pramuditha Suraweera, Antonija Mitrovic, and Brent Martin. The role of
domain ontology in knowledge acquisition for ITSs. In Intelligent Tutoring
Systems, pages 207–216. Springer, 2004.
[Som62] Robert H Somers. A new asymmetric measure of association for ordinal
variables. American Sociological Review, pages 799–811, 1962.
[SP94] Valerie J Shute and Joseph Psotka. Intelligent tutoring systems: Past,
present, and future. Technical report, DTIC Document, 1994.
[SP97] Rainer Storn and Kenneth Price. Differential evolution–a simple and
efficient heuristic for global optimization over continuous spaces. Journal
of Global Optimization, 11(4):341–359, 1997.
[SSZ08] Myra Sadker, David Miller Sadker, and Karen Zittleman. Teachers, Schools,
and Society. McGraw Hill New York, 2008.
[The85] TJJM Theunissen. Binary programming and test design. Psychometrika,
50(4):411–420, 1985.
[TO01] David Thissen and Maria Orlando. Item Response Theory for Items Scored
in Two Categories. Lawrence Erlbaum Associates Publishers, 2001.
[Tra83] RE Traub. A priori considerations in choosing an item response model.
Applications of Item Response Theory, 57:70, 1983.
[Tuc46] Ledyard R Tucker. Maximum validity of a test with equivalent items.
Psychometrika, 11(1):1–13, 1946.
[vvLBT89] Wim J van der Linden and Ellen Boekkooi-Timminga. A maximin
model for IRT-based test design with practical constraints. Psychometrika,
54(2):237–247, 1989.
[Woo] Beverly Woolf. AI in education. In S. Shapiro, editor, Encyclopedia of
Artificial Intelligence. Wiley Interscience, New York.
[WPB01] Geoffrey I Webb, Michael J Pazzani, and Daniel Billsus. Machine learning
for user modeling. User Modeling and User-Adapted Interaction, 11(1-
2):19–29, 2001.
[WS70] Marilyn W Wang and Julian C Stanley. Differential weighting: A review
of methods and empirical studies. Review of Educational Research, pages
663–705, 1970.
[WWP+09] Feng-rui Wang, Wen-hong Wang, Quan-ke Pan, Feng-chao Zuo, and
JJ Liang. A novel online test-sheet composition approach for web-based
testing. In IEEE International Symposium on IT in Medicine & Education,
volume 1, pages 700–705, 2009.
[YKG10] Guangbing Yang, K Kinshuk, and S Graf. A practical student model for a
location-aware and context-sensitive personalized adaptive learning system.
In 2010 International Conference on Technology for Education (T4E),
pages 130–133. IEEE, 2010.
[YPC13] Li Yuan, Stephen Powell, and JISC CETIS. Moocs and open education:
Implications for higher education. Cetis White Paper, 2013.
Abstract

Assessing students' knowledge is an important task that challenges educators worldwide. Student assessment is required not only in order to assign appropriate grades, but also for diagnostic purposes, so that pedagogical resources can be focused on the main observed weaknesses. The prevalent assessment method is a uniform exam, which the students are asked to solve. Composing an exam from which a student's knowledge state can be assessed accurately is a complex task that educators face frequently.

The importance of the exam-composition task has recently grown with the development of two major educational technologies. The first is the expanding activity of online courses, such as those of Coursera, Khan Academy, and others, which make academic studies accessible to the masses in the information age. The second is the improvement in intelligent tutoring systems, which enable intelligent, adaptive guidance of students in real time.

Nowadays, exams are still composed by educators, mostly by hand. Several attempts have been made to automate the exam-composition process. Some of these attempts were based on a statistical theory for the design and analysis of exams called Item Response Theory. In a considerable portion of these works, exam questions are treated as atomic items characterized solely by a collection of numeric attributes. Such attributes include difficulty level, estimated solution time, discrimination degree (between students), and others. This representation allows the exam-composition problem to be formulated as an integer programming problem. The objective function is usually the overall discrimination degree of the exam, while the remaining attributes are used to define the problem constraints. A wide variety of optimization algorithms have been proposed for solving such formulations, including genetic algorithms, tabu search, the simplex algorithm, and swarm optimization.

In these works, assuming the attribute values of the questions are known, the exam-composition process is indeed performed automatically. Usually, however, there is no explicit reference to the origin of those numeric values, and they are regarded as input known in advance to the optimization problem. To apply these methods in practice, on the other hand, the values of all attributes of the candidate questions must first be determined. These methods may therefore be regarded as semi-automatic.

In this work we present a novel algorithmic framework for exam composition that requires a minimum of manual definitions from the teacher. We create a collection of real questions and determine algorithmically which cognitive abilities they test. We use student models to represent possible knowledge states, and infer their performance in answering the questions algorithmically.

We propose two different formulations of the exam-composition problem. In the first, the teacher is required to define an order relation over the various knowledge states, denoting a "more knowledgeable than" relationship, which serves as input to the algorithm. The algorithm searches the space of possible exams for one that reflects the desired order over the student population. In the second problem definition, the teacher defines a mapping from knowledge states to the absolute grades he considers appropriate. In this case the algorithm searches for an exam for which the grade curve of the student population is as close as possible to the teacher-defined mapping. For convenience, we refer to these two problem definitions as the "comparative problem" and the "absolute problem".

Building our algorithmic framework required overcoming several difficulties. First, we needed a method by which the teacher could provide a compact specification of the order relation, or alternatively of the grade mapping, according to the problem type. We assume that knowledge in the examined domain is represented by a set of abilities; accordingly, a student's knowledge state is represented by the subset of abilities the student has mastered. The number of possible knowledge states is exponential in the number of abilities, so an explicit definition of an order relation over them (or of a grade mapping) is infeasible. We therefore developed a method, based on an additive scoring scheme, that eases the definition of these inputs and serves us in both problem definitions. The basic idea is that the teacher assigns each ability an importance score in the examined domain; this definition uniquely determines a base score for every possible knowledge state, namely the sum of the importance scores of the abilities mastered in it. In the comparative problem, the order relation between knowledge states is then extracted directly from the base scores: a knowledge state is considered more knowledgeable than another if its base score is higher. In the absolute problem, an additional step is required to translate the base scores into the desired grades on the exam itself. In this case the teacher is required to set the mean and standard deviation of the desired grade curve, and the grade mapping is uniquely determined by normalizing the base scores to the corresponding Gaussian distribution.

Second, we needed to define a utility function over exams that would guide the algorithm through the space of possible exams. We defined two utility functions, one for each of the two problem definitions. In the absolute problem we compute the distance between the grade vector defined by the teacher and the grades that the evaluated exam induces on the knowledge states; the goal is to minimize the distance between these two grade vectors under the Manhattan norm (or another). In the comparative problem we use a measure of agreement (Kendall's tau) between the order relation defined by the teacher and the one induced by the exam; here, the goal is to maximize the agreement between these orders. In both problem definitions the space of knowledge states is exponential, and therefore a representative sample of knowledge states is used.

The greatest difficulty was determining the grade of a knowledge state on an exam. For this purpose we developed a novel method that employs planning and graph search. The method is applicable in procedural knowledge domains, such as algebra and geometry, where the abilities are computation or inference operations and the solution of a question is obtained through a sequence of operations. For every potential question in the given collection, we perform a graph search that computes a necessary and sufficient condition for solving the question. Since several different solution paths may exist, the condition is defined as a disjunction of factors, each representing the set of abilities along one solution path. To determine a model's success on a question, we check whether the set of abilities it has mastered satisfies the question's logical condition.

The exam-composition method we developed is demonstrated here on two knowledge domains common in secondary education. The first is solving linear algebraic equations with a single unknown, a domain taught in middle schools worldwide and considered basic mathematical knowledge. The second is solving trigonometric equations, a more advanced domain usually taught in the advanced tracks of high school. For both domains we defined and implemented a complete set of abilities for solving the exercises, simulating the basic steps a typical student in the domain performs. In algebra we built an automatic mechanism for generating exercises, while in trigonometry we relied on online sources to create the large question pool from which the exam is composed.

For both domains we carried out an empirical study examining the quality of the generated exams as well as the stability of the system with respect to its various parameters. To evaluate the exams, we used an evaluation function over a sample of knowledge states drawn independently of the sample used to define the algorithm's utility function, and significantly larger than it. We pitted four competitors against our algorithm. The results show that the exams generated by our algorithm are better than those of the competitors in both domains and under both problem definitions.

In addition, we conducted experiments examining the effect of exam length and of the size of the knowledge-state sample used by the utility function on output quality. The observed qualitative trend matched our expectations in both cases. As exam length grows, so does the algorithm's freedom to choose a variety of questions for the exam, and consequently the exams are of higher quality. Likewise, as the knowledge-state sample grows, the quality of the utility function rises, and with it the quality of the generated exams. A further set of experiments examined the stability of the system with respect to user-input parameters, in particular the knowledge-state population and the input determining the objective function. The experiments show that the system is stable with respect to these parameters.

Finally, we discuss the applicability of the method to academic domains beyond the two demonstrated. The method as a whole is applicable in any domain in which one can define a set of abilities, together with conditions, in terms of those abilities, that are necessary and sufficient for solving the questions. Automatic inference of these logical conditions is performed here over procedural domains, which we chose to represent as state spaces. Such domains include proofs in geometry, spatial trigonometry, motion problems, analytic geometry, and in fact nearly any domain in which questions are solved by a sequence of logical or computational inferences. Automatic inference of the solution conditions is also possible in other types of spaces using different methods, for example in problem types solved with automatic proving tools. In such domains the set of abilities can be defined as the set of lemmas, and the condition for solving a question determined according to the lemmas used in its various solutions. In other domains, where the conditions for answering a question are hard to determine automatically, it may still be possible to do so based on the set of solutions. In combinatorics, for example, given a collection of mathematical expressions constituting acceptable solutions, one can parse them syntactically with automatic tools and, from the syntactic structure, determine the abilities required for each solution path.