Algorithmic Exam Generation
Omer Geiger
Technion - Computer Science Department - M.Sc. Thesis MSC-2016-06 - 2016
Algorithmic Exam Generation
Research Thesis
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Computer Science
Omer Geiger
Submitted to the Senate
of the Technion — Israel Institute of Technology
Adar 5776, Haifa, February 2016
This research was carried out under the supervision of Prof. Shaul Markovitch, in the Faculty of
Computer Science. Some results in this thesis have been published in IJCAI-15 [GM15].
ACKNOWLEDGEMENTS
I would like to express my gratitude towards all who have helped me complete this work
successfully: First of all, to my academic advisor, Prof. Shaul Markovitch, for his precious
guidance and endless patience; to our research group, Lior Friedman, Maytal Messing, and
Sarai Duek, for their feedback and support throughout the research; to Prof. Orit Hazzan, from
the Department of Education, for her unique and insightful perspective as an educator; to Prof.
Eran Yahav for his much appreciated comments and suggestions for improvements.
I would also like to thank several colleagues, each of whom provided valuable consultation
at critical junctures on the road towards completing this thesis: Alon Gil-ad, Omer Levy,
Edward Vitkin, Ran Ben-Basat, Jonathan Yaniv, Nadav Amit, Shai Moran, and Daniel Genkin.
I thank my parents Dan and Ora for their moral support and for showing me a broader
perspective of things when needed. Lastly, I thank Noa, my wife, for her love and support, and
for bearing with me patiently through this challenging journey.
The Technion’s funding of this research is hereby acknowledged.
Contents
List of Figures
Abstract 1
List of Acronyms 3
List of Symbols 5
1 Introduction 7
2 Problem Definition 11
2.1 Preliminary Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Relative Grading Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3 Absolute Grading Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3 Model-based Exam Generation (MOEG) 15
3.1 Examination Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Student Population . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Target Student Order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Order Correlation Measure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Exam Utility Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Searching the Space of Exams . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.7 Wrap-up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.8 Adapting to the Absolute Grading Setting . . . . . . . . . . . . . . . . . . . . 19
4 Educational Domains 23
4.1 Algebra Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1 Algebra Ability Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.2 Algebra Question Generation . . . . . . . . . . . . . . . . . . . . . . 24
4.2 Trigonometry Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.2.1 Trigonometric Ability Set . . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.2 Trigonometric Question Pool . . . . . . . . . . . . . . . . . . . . . . . 25
5 Evaluation 27
5.1 Empirical Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.2 Performance Over Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 The Effect of Exam Length on Performance . . . . . . . . . . . . . . . . . . . 28
5.4 The Effect of Sample Size on Performance . . . . . . . . . . . . . . . . . . . . 29
5.5 Framework Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.6 Domain Coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.7 An Alternative Search Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.8 Absolute Grading Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6 Related Work 39
6.1 Item Response Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.1 IRT Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1.2 ICC Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
6.1.3 Common IRT Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Testsheet Composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.3 Intelligent Tutoring Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
7 Discussion 49
7.1 Other Domains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
7.2 Framework Expansions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Hebrew Abstract i
List of Figures
3.1 Pseudo-code for action landmark approximation method . . . . . . . . . . . . . . . 21
3.2 MOEG pseudo-code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.1 Trigonometry domain abilities . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.1 Performance over time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.2 The effect of exam length on performance . . . . . . . . . . . . . . . . . . . . 30
5.3 The effect of sample size on performance . . . . . . . . . . . . . . . . . . . . 31
5.4 The effect of εw on performance . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 The effect of εp on performance . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.6 Algebra domain coverage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5.7 Performance over time (absolute grading) . . . . . . . . . . . . . . . . . . . . 35
5.8 The effect of µ on performance (absolute grading) . . . . . . . . . . . . . . . . 36
5.9 The effect of σ on performance (absolute grading) . . . . . . . . . . . . . . . . 37
5.10 Question difficulty histogram . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
6.1 The 3pl ICC curve . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
6.2 Example testsheet composition problem formulation [HLL06] . . . . . . . . . 44
6.3 Interaction between ITS components [BSH96] . . . . . . . . . . . . . . . . . . 46
7.1 Example inferences in different domains . . . . . . . . . . . . . . . . . . . . 50
Abstract
Given a class of students, and a pool of questions in the domain of study, what subset will
constitute a “good” exam? Millions of educators are dealing with this difficult problem worldwide,
yet the task of composing exams is still performed manually. In this work we present a
novel algorithmic framework for exam composition. Our main formulation requires two input
components: a student population represented by a distribution over a set of overlay models,
each consisting of a set of mastered abilities, or actions; and a target model ordering that, given
any two student models, defines which should be graded higher. To determine the performance
of a student model on a potential question, we test whether it satisfies a disjunctive action
landmark, i.e., whether its abilities are sufficient to follow at least one solution path. Based on
these, we present a novel utility function for evaluating exams. An exam is highly evaluated
if it is expected to order the student population with high correlation to the target order. In an
alternative formulation we devised, the target ordering is replaced with a target grade mapping
indicating the desired grade for each student model. In this case, good exams are those for
which the expected grades are close to those specified by the target mapping. The merit of
our algorithmic framework is exemplified with real auto-generated questions in two domains:
middle-school algebra and trigonometric equations.
List of Acronyms
MOEG : Model-based Exam Generation
ITS : Intelligent Tutoring System
MOOC : Massive Open Online Course
MIP : Mixed Integer Programming
IRT : Item Response Theory
CTT : Classical Test Theory
ICC : Item Characteristic Curve
1-4pl : 1-4 Parameter Logistic Model
MLE : Maximum Likelihood Estimator
IIF : Item Information Function
TIF : Test Information Function
CAI : Computer-Aided Instruction
List of Symbols
Q : Question set
q : Question
A : Ability set
ψ(·, ·) : Sufficiency predicate
s : Student model in set notation
s : Student model in vector notation
M : Set of possible models
PM : Distribution over student models
e : Exam
ke : Exam length
wi : Grading weights / Ability weights
g(·, ·) : Grading function
g∗(·) : Target grade mapping
⪯∗ : Target student order
⪯e : Exam-induced student order
⪯w : Ability-weight-induced student order
⪯⊆ : Subset-relation-induced student order
⪯P : Question-pool-based student order
C(·, ·) : Order correlation measure
S(q) : Set of solution paths for question q
l(q) : The disjunctive action landmark for question q
l̂(q) : Approximation of the disjunctive action landmark for question q
Pr(·) : Probability
τ : Kendall’s Tau
εp : Experimental variable controlling student distribution
εw : Experimental variable controlling ability weights
Dlim : Depth limit in landmark approximation algorithm
SOLlim : Solutions limit in landmark approximation algorithm
Tlim : Time limit of landmark approximation algorithm
L(·|·) : Likelihood function
I(·) : Item information function
SE(·) : Standard error
Chapter 1
Introduction
Assessing the knowledge state of students is an important task addressed by educators worldwide
[Gro98]. Knowledge assessment is required not only for the purpose of determining the students’
deserved grades, but also for diagnostic evaluation used to focus the pedagogical resources on the
students’ observed shortcomings [LST12]. The most common method for such an assessment is
having the students answer an exam. Composing an exam from which the students’ knowledge
state can be precisely evaluated is a difficult task that millions of educators encounter regularly.
The importance of exam composition has increased with two major developments in
computer-aided education. The first is the growing popularity of massive open on-line courses
(MOOCs) such as Coursera, Khan Academy, edX, and Academic-Earth, which offer new
educational opportunities worldwide [YPC13]. The second is the improvement of intelligent tutoring
systems (ITS). These are software products that intelligently guide students through educational
activities [PR13].
Exams are still predominantly written manually by educators. Several attempts have been
made at automating this task, often referred to as testsheet composition [Hwa03, LST12], some
of which are based upon the statistical paradigm of item response theory [GC05, EAAA08]. In
many of these works, exam questions are considered to be atomic abstract objects represented
only by a vector of numeric features. Common features include difficulty level, solving time,
and discrimination degree. Using such a feature-vector representation, the problem is then defined as
a mixed integer programming (MIP) problem. Usually the objective (maximization) function is
the discrimination level of the entire exam while the remaining features compose the problem
constraints. Different optimization algorithms have been applied to solve such MIP problem
formulations [HLL06, HCYL08, DZWF12, WWP+09].
In these works, assuming a feature vector per question is given, the process of exam
composition is effectively automated. However, in order to apply these methods in real educational
settings, the feature vectors of all candidate questions must be determined. Alas, it remains
unclear how this is done, and this major framework component remains an atomic black box.
The reader is left to speculate that perhaps the feature vectors are manually specified by a field
expert. If so, these methods may be regarded as only semi-automatic.
In this paper, we present a novel algorithmic framework for exam generation, which requires
minimal manual specification. We generate real candidate questions, and algorithmically
determine which domain abilities they test. Student models are used to represent possible
knowledge states, allowing us to determine their performance on candidate questions. Two
different problem formulations are proposed. In the first, the user (educator) specifies a target
order between knowledge states, indicating the relation “more proficient than”, used as input
for the algorithm. The algorithm then searches the space of possible exams for one that best
reflects this ordering over the student population. In the second formulation, the user specifies a
mapping from each possible knowledge state to a deserved grade. In this case the algorithm
searches for an exam for which the resulting student grades are as close as possible to those
specified. We will refer to these two formulations as relative grading and absolute grading, for
convenience.
Building the framework requires us to overcome several difficulties. First, we need to
define a method for the user to easily supply the desired specification of either the student
order or the grade mapping, according to the problem formulation used. We assume that the
domain knowledge is represented by a set of abilities, and the knowledge state of a student
is modeled as a subset of them which she has mastered. This is known in the ITS literature
as an overlay model [Bru94]. The number of possible student models is therefore exponential,
rendering the specification of an explicit ordering or grade mapping infeasible. We developed
a way to simplify the input specification based on an additive grading scheme, for both our
problem formulations. The idea is to have the educator specify a weight per ability and define
base grades for the students as the sum of ability-weights mastered by the student. Extracting
the student ordering from the base grades, as required by the relative grading formulation, is
straightforward: a student model is considered more proficient than another if it has a higher
base grade. For the absolute grading problem formulation, an additional step is required to
transform the base grades to fit a desired distribution. The user is therefore required to supply
this distribution (e.g., N(µ, σ)) as an additional input.
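To make the additive grading scheme concrete, the following is a minimal Python sketch (illustrative only; the ability weights, the four-ability domain, and the mid-rank quantile method of curving base grades to a normal distribution are our own assumptions, not details taken from the thesis):

```python
from statistics import NormalDist

# A hypothetical 4-ability domain; weights express each ability's importance.
weights = [0.4, 0.3, 0.2, 0.1]

def base_grade(model, weights):
    """Base grade = sum of weights of the abilities mastered by the model."""
    return sum(w for bit, w in zip(model, weights) if bit)

# Relative grading: the target order simply sorts models by base grade.
models = [(1, 1, 0, 0), (0, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 1)]
ordered = sorted(models, key=lambda m: base_grade(m, weights))

# Absolute grading: "pre-exam curving" of base grades to a desired N(mu, sigma),
# here via mid-rank quantiles mapped through the inverse normal CDF.
def curve_to_normal(models, weights, mu=75, sigma=10):
    nd = NormalDist(mu, sigma)
    ranked = sorted(models, key=lambda m: base_grade(m, weights))
    n = len(ranked)
    return {m: nd.inv_cdf((i + 0.5) / n) for i, m in enumerate(ranked)}
```

The curving step preserves the base-grade ordering while placing the grades on the user-supplied distribution.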
Second, we need to define a utility function for guiding the search through the exam space.
We developed two utility functions, matching the two problem formulations. In the absolute
grading formulation, the utility function computes for each student model the difference between
the grade mapping, as specified by the teacher, and the model’s exam grade. The goal is to
minimize the expected distance under the L1 norm (or any other norm). In the relative grading
formulation, we use a correlation measure (e.g. Kendall’s Tau) between the exam-imposed
grade order and the target order, specified by the teacher. Obviously, in this case the goal is to
maximize the correlation. In both formulations, since the space of student models is prohibitively
large, we use a sample of models for estimating the utility.
The most difficult hurdle is to determine the grade of a student model on an exam. We
have developed a novel method to solve this problem, using a technique based on graph search
and planning. To do so, we restrict our attention to procedural domains, such as algebra and
geometry, where the abilities are actions, and the answer to an exam question is a sequence of
such actions. For each question, we perform a graph search to compute its action landmarks —
a set of actions that are necessary and sufficient for solving the question. As alternative solutions
may be possible, we use disjunctive action landmarks. To determine whether a student model
can solve a question, we test whether the model’s set of abilities contains at least one of the
question’s disjuncts.
The remainder of this paper is structured as follows. Chapter 2 describes our generic
problem formulation for exam generation. Chapter 3 presents MOEG, our complete framework
for MOdel based Exam Generation, applicable primarily to procedural educational domains.
All components are defined, motivated and explained in the chapter. The chapter is concluded
with a pseudo-code figure showcasing integration of all components into a unified algorithm.
Chapter 4 describes two representative educational domains from secondary school math courses:
univariate linear equations, and trigonometric equations. In Chapter 5 we present an empirical
evaluation of the algorithmic framework over these two domains. Chapter 6 constitutes a brief
comparative literature review of related work, highlighting the uniqueness of our work in the
context of others. The paper is concluded in Chapter 7, dedicated to discussing the implications
of our work and possibilities for future expansions.
Chapter 2
Problem Definition
In this chapter we present our novel formulation for the problem of exam generation. The
following section describes some preliminary definitions and is followed by the definition of
two problem variations — relative grading and absolute grading.
2.1 Preliminary Definitions
We define an examination domain as a triplet 〈Q, A, ψ〉. It is composed of a set of candidate
questions Q, a set of abilities A = {a1, a2, ..., am}, and a sufficiency predicate ψ : 2^A × Q → {0, 1},
where ψ(A′, q) = 1 iff the ability set A′ ⊆ A is sufficient to answer the question q ∈ Q.
Next, we define a student model, using the relatively simple approach known as the binary
overlay model [Bru94]. By this approach, a student model is defined as a subset of domain
abilities, s ⊆ A, mastered by the student. Therefore, a student s answers a question q ∈ Q
correctly iff ψ(s, q) = 1. The student model, also sometimes referred to as a knowledge state,
may be alternatively represented by a binary vector s with each coordinate indicating mastery
of a matching ability or lack thereof. These vector and set notations will be used throughout this
paper interchangeably.
We denote the set of all possible models as M = 2^A, but assume that not all student models
are equally likely. Therefore we denote by PM = {〈si, pi〉} the distribution over the possible
student models, where si ∈ M and pi is its proportion in the population.
An exam e of length ke is defined as a vector of ke questions and a matching vector of
associated non-negative grading weights: e = 〈〈q1, ..., qke〉, 〈w1, ..., wke〉〉. The grade of a
student model s ∈ M on exam e is simply the sum of grading weights for questions answered
correctly by the student model: g(s, e) = Σ_{1≤i≤ke} wi · ψ(s, qi). Note that in some cases we
restrict ourselves to uni-weight exams, i.e., exams where all weights w1, ..., wke are equal. We
turn to define the notion of exam utility in the two problem formulations.
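The definitions above can be sketched in a few lines of Python. This is a hypothetical illustration: questions are represented directly by their sets of alternative required-ability sets (anticipating the disjunctive action landmarks of Chapter 3), and the ability names are invented:

```python
def psi(abilities, question):
    """Sufficiency predicate psi(A', q): 1 iff the ability set covers
    the required abilities of at least one solution path."""
    return int(any(path <= abilities for path in question))

def grade(student, exam):
    """g(s, e): weighted sum over the questions the student answers correctly."""
    questions, weights = exam
    return sum(w * psi(student, q) for q, w in zip(questions, weights))

# Each question is a list of alternative ability sets (one per solution path).
q1 = [{"isolate_x"}, {"divide", "subtract"}]   # two alternative solution paths
q2 = [{"expand", "collect"}]
exam = ([q1, q2], [60, 40])                     # questions and grading weights

student = {"divide", "subtract", "collect"}
grade(student, exam)   # solves q1 via its second path, fails q2 -> 60
```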
2.2 Relative Grading Formulation
Suppose we were to ask an educator to help us compose the perfect exam for some educational
domain. Ideally, we would like the educator to specify for each pair of students, which one is
more proficient and thus deserves a higher grade. Obviously a student knowing nothing should
be ranked inferior to all others, while a student knowing everything should be ranked superior
to all others. To determine the complete order, however, we rely upon the educator’s expert
knowledge. We call the desired order, given by the educator, the target
student order, and denote it ⪯∗ ⊆ M². This is a partial order defining the binary relation “is
more proficient than” between pairs of students. Several compact and intuitive methods for
defining such an order are described in the following chapter.
Observe that any exam (e) also defines such a partial order between students (⪯e) according
to their grade. For s1, s2 ∈ M, we have that s1 ⪯e s2 ⇔ g(s1, e) ≤ g(s2, e). A good exam is
one for which the resulting student grades accurately reflect the target order, while taking into
account the model distribution PM. That is to say, it is more important to correctly order more
likely models than less likely ones. For this purpose we must make use of some correlation
function C between orders. A reasonable choice for such a correlation measure made throughout
this work is the weighted Kendall’s Tau, defined in the next chapter. We are now ready to define
a utility function for evaluating exams. Given an exam e, an order correlation function C, and a
target student ordering ⪯∗, we define the utility of e as U(e) = C(⪯e, ⪯∗).
Note that the actual exam grades of the students are of no importance in determining the
fitness of an exam. Therefore, exams of different difficulty levels may be considered equivalently
fit if their imposed student orders are similarly correlated with the target order. This may be
justified with the simple observation that the absolute exam grades may always be curved to
match any desired distribution, thereby fixing difficulty-based bias. Nonetheless, critics may
claim that post-exam grade curving is generally undesirable, and an attempt should be made to
minimize it when possible.
2.3 Absolute Grading Formulation
The alternative absolute grading formulation addresses the issue described above. Instead of a
target student order as input, this formulation uses a target grade mapping — a function defining
the deserved grade of each student model: g∗ : M → [0, 1]. As the size of M is exponential
in the number of abilities, we cannot expect a user to fully specify the mapping explicitly, and therefore present a
compact specification method in the following chapter. The fitness of an exam is defined in
terms of the distance between the actual exam grades, expressed by g, and the target exam
grades, expressed by g∗. The smaller the distance between g and g∗ under a specified norm (e.g.
L1), the more fit the exam is considered to be. As in the case of relative grading, the utility
calculation is weighted by the model distribution PM.
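As a minimal sketch of this fitness criterion (our own illustration, with a toy grade function and target mapping, not the thesis implementation), the absolute-grading utility is the negated mean L1 distance between actual and target grades over a model sample:

```python
def l1_utility(exam, g_star, sample, grade_fn):
    """Estimate exam fitness under the absolute grading formulation:
    the (negated) mean L1 distance between actual grades g(s, e) and
    target grades g*(s), over a model sample drawn from P_M."""
    dist = sum(abs(grade_fn(s, exam) - g_star(s)) for s in sample)
    return -dist / len(sample)

# Toy setting with grades in [0, 1]: the target grade is the fraction of
# abilities mastered; an exam is a tuple of ability indices it tests,
# graded uniformly (a uni-weight exam).
g_star = lambda s: sum(s) / len(s)
grade_fn = lambda s, exam: sum(s[i] for i in exam) / len(exam)
sample = [(1, 1, 0), (0, 1, 1), (1, 0, 0)]
l1_utility((0, 1), g_star, sample, grade_fn)
```

Weighting by the model distribution PM falls out of the sampling: models are drawn from PM, so likely models contribute more terms to the mean.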
While in the relative grading setting, exams of completely different difficulty levels may be
considered equally fit, this is not possible in the absolute grading setting. The user can control
the difficulty level of the exam through the target grade mapping, in addition to controlling the
ordering of students. Instead of relying upon a post-exam curving of grades to match the desired
grade distribution, this formulation enables “pre-exam curving” by which the base grades are
curved to facilitate the generation of a desirable exam.
Chapter 3
Model-based Exam Generation (MOEG)
In this chapter we present our algorithmic framework for MOdel-based Exam Generation
(MOEG). To facilitate readability, all sections of this chapter, except the last, assume the relative
grading setting. The last section is dedicated to discussing the changes needed in order to apply
the framework to the absolute grading setting.
3.1 Examination Domains
A reasonable source for candidate questions is a curriculum textbook or a collection of previous
exams. This means that Q is some finite set of questions selected by the educator or curriculum
supervisor and coded once for the purpose of generating all future exams.
A more generic approach is to devise a question-generating procedure. In Section 4.1 we
present an algorithm for automatically generating questions in algebra. Such a procedure
takes input parameters controlling aspects of solving methods, difficulty, or topics, and creates
questions. It may be applied as desired to create the entire question pool Q. Naturally, this
approach becomes more difficult with the increasing complexity of the domain. A hybrid
approach is to algorithmically produce variations of existing questions based on user refinement
of constraints [SGR12].
The set of abilities A and a sufficiency predicate ψ are assumed to be given by the educator.
However, for procedural domains, we introduce an algorithm that automatically induces the
sufficiency predicate. Procedural domains are those where questions are solved by applying a
sequence of operators or actions. Such domains can therefore be represented as a search graph,
where the vertices are intermediate solution states S, and the actions are steps executed for
solving the exercise. We assume that the set of search-graph actions is in fact the set of domain
abilities, where each ability a ∈ A is a successor function a : S → 2^S. Examples of applicable
procedural domains include algebra, geometry, trigonometry, classical mechanics, and motion
problems.
We turn to define the sufficiency predicate ψ : 2^A × Q → {0, 1} for procedural domains.
An ability set is sufficient to answer a question iff it contains all abilities needed in at least
one solution path. Helmert and Domshlak [HD09] define a disjunctive action landmark as
a set of actions such that any solution must include at least one of them. We expand the
definition to a set of sets of actions, such that each solution must contain at least one of the
sets. Let S(q) be the set of all solution paths for a question q. The disjunctive action landmark
of q is therefore l(q) ≜ {{a | a appears in t} | t ∈ S(q)}, or can be expressed equivalently as a
DNF formula: l(q) ≜ ⋁_{t∈S(q)} [⋀_{a∈t} a]. For A′ ⊆ A, q ∈ Q we have that ψ(A′, q) = 1 iff
∃Ai ∈ l(q) : Ai ⊆ A′.
In very simple domains, the set of solutions S(q) can be obtained via exhaustive search. In
more complex domains, however, such a procedure is computationally infeasible. We propose
an approximation of ψ using an anytime algorithm that collects possible solutions. The idea is
to generate random operator sequences up to a certain length limit and test if they compose a
new solution path. After a certain number of unsuccessful attempts the length limit is increased,
thereby allowing adaptation to the domain at hand. But, when a new solution is found, the length
limit is reset to the effective length of that solution. This is done in order to keep the depth limit
reasonably small, yet large enough to find additional solutions. The method is defined fully in
the pseudo-code of Figure 3.1.
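A simplified sketch of this anytime idea follows. It is illustrative only: the parameter names and the failure-count heuristic for growing the depth limit are our own, and it does not reproduce the Dlim/SOLlim/Tlim details of Figure 3.1.

```python
import random

def approx_landmark(question, ops, apply_op, solved, d0=3, sol_lim=20, tries=200):
    """Anytime approximation of the disjunctive action landmark: sample
    random operator sequences up to a depth limit; on each new solution,
    record its ability set and reset the depth limit; after repeated
    failures, grow the depth limit to adapt to the domain."""
    landmark, d_lim, fails = [], d0, 0
    while len(landmark) < sol_lim and fails < tries:
        state, used = question, set()
        for _ in range(d_lim):
            op = random.choice(ops)
            nxt = apply_op(op, state)
            if nxt is None:
                continue                     # operator not applicable here
            state, used = nxt, used | {op}
            if solved(state):
                break
        if solved(state) and used not in landmark:
            landmark.append(used)
            d_lim = max(d0, len(used))       # reset limit near solution length
            fails = 0
        else:
            fails += 1
            if fails % 50 == 0:
                d_lim += 1                   # allow longer sequences
    return landmark
```

Usage: for a toy "count down from 5" question with a single decrement operator, the method discovers the single-ability landmark once the depth limit grows large enough to reach the goal.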
3.2 Student Population
Describing the student model distribution in the general case requires explicitly defining the
probability for each possible model in M = 2^A. Due to the exponential size of this model set,
we adopt a simplifying independence assumption between abilities. By doing so, we reduce
the complexity of distribution specification from exponential to linear in |A|, while retaining
reasonable flexibility. Formally, this simplification means we assume that a randomly selected
student masters each ability ai ∈ A with probability pi ∈ [0, 1]. Furthermore, we assume
that mastery of each ability is independent of the others and that students are mutually
independent as well. It follows that the probability of a model is:

Pr(〈a1, a2, ..., a|A|〉) = ∏_{i: ai=1} pi · ∏_{i: ai=0} (1 − pi) .
In future work we intend to relax this assumption and use Bayesian networks for allowing
arbitrary dependencies [GVP90].
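Under this independence assumption, both evaluating a model's probability and sampling a student population are straightforward; the following sketch (illustrative, not the thesis code) shows both:

```python
import random

def model_prob(model, p):
    """Pr(model) = prod of p_i over mastered abilities times
    prod of (1 - p_i) over the rest."""
    prob = 1.0
    for bit, pi in zip(model, p):
        prob *= pi if bit else (1.0 - pi)
    return prob

def sample_models(p, n, rng=random):
    """Draw n student models, mastering ability a_i with probability p[i]."""
    return [tuple(int(rng.random() < pi) for pi in p) for _ in range(n)]
```

Sampling in this way is what later makes the utility estimates reflect PM: likely knowledge states simply appear more often in the sample.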
3.3 Target Student Order
Explicitly specifying an order over the set of student models is also infeasible in the general
case due to the exponential size of the model set M. We therefore propose three methods for
simple order specification. In the first method, the educator is required to specify a vector of
non-negative ability weights w = 〈w1, ..., w|A|〉, indicating the importance of each ability to
domain mastery. Given these, the proficiency level of a student model s = 〈s1, s2, ..., s|A|〉 ∈ M
is defined as the sum of its mastered ability weights, i.e., the dot product (s, w) = Σi wi · si.
Having defined a scalar proficiency level per student, the target order definition is straightforward.
For any s1, s2 ∈ M:

s1 ⪯w s2 ⇔ (s1, w) ≤ (s2, w) .
The second method uses the order induced by the subset relation (⪯⊆). A student who has
mastered all the abilities of another, as well as some additional ones, is naturally considered
more proficient:
s1 ⪯⊆ s2 ⇔ ∀i [s1[i] = 1 → s2[i] = 1] .
The advantage of this method over the first one is that it requires no input from the educator.
However, the first method allows more refined orders to be specified and is thus preferable when
the additional input is available.
The third method for target order specification is question-pool based: The teacher decides
upon a question pool P , rich enough to order the student models (almost) optimally. As exams
are much more restricted in length, we want to find a small set of questions that order the student
models similarly to the order induced by the pool. Using the question pool P , we define the
target order between student models according to the number of questions in the pool they
answer correctly:
s1 ⪯P s2 ⇔ Σ_{qi∈P} ψ(s1, qi) ≤ Σ_{qi∈P} ψ(s2, qi) .
The pool is potentially much larger than the desired exam length, and thus cannot serve as an
exam. In some cases it may be reasonable to use the entire question set Q as the pool.
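The three specification methods can be sketched as follows (our own illustration; models are binary tuples, and the sufficiency predicate is passed in as a function):

```python
def weight_key(s, w):
    """Method 1: proficiency = dot product of model bits and ability weights;
    sorting by this key yields the weight-induced order."""
    return sum(wi * si for wi, si in zip(w, s))

def subset_leq(s1, s2):
    """Method 2: s1 precedes s2 iff every ability mastered in s1 is also
    mastered in s2 (the subset-relation-induced partial order)."""
    return all(b2 >= b1 for b1, b2 in zip(s1, s2))

def pool_key(s, pool, psi):
    """Method 3: proficiency = number of pool questions answered correctly."""
    return sum(psi(s, q) for q in pool)
```

Note that methods 1 and 3 give total orders via a scalar key, while method 2 is only a partial order (incomparable models exist), matching the trade-offs discussed above.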
3.4 Order Correlation Measure
We have reduced the problem of evaluating an exam e to comparing its induced student order
(⪯e) with the target student order (⪯∗). For this, we require a correlation measure that evaluates
the similarity between the two orders. We considered several alternatives, such as Kendall’s τ
[Ken38], Goodman and Kruskal’s Γ [GK54], Somers’ d [Som62], and Kendall’s τb [Ken45].
Eventually we selected the classic Kendall’s τ for this work, but the others are also adequate
candidates. Kendall’s τ compares the number of concordant student pairs (Nc) with the number
of discordant student pairs (Nd):
Nc ≜ |{(s1, s2) ∈ M² : s1 ≺e s2 ∧ s1 ≺∗ s2}|

Nd ≜ |{(s1, s2) ∈ M² : s1 ≺e s2 ∧ s2 ≺∗ s1}|

τ ≜ (Nc − Nd) / (|M|(|M| − 1)/2) .
A value of 1 implies complete correlation, while a value of −1 implies complete inverse
correlation. Note that this measure does not account for ties in either of the orders, rendering the
full range [−1, 1] unreachable in the presence of ties.
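A direct sketch of this measure over a model sample is given below (our own illustration of the unweighted τ; the thesis ultimately uses a sample-based approximation of the weighted generalization of [KV10]). Orders are represented by scalar key functions, e.g. exam grade and target proficiency:

```python
from itertools import combinations

def kendall_tau(sample, exam_key, target_key):
    """Kendall's tau between the exam-induced and target orders over a
    model sample; tied pairs count as neither concordant nor discordant."""
    nc = nd = 0
    for s1, s2 in combinations(sample, 2):
        de = exam_key(s1) - exam_key(s2)
        dt = target_key(s1) - target_key(s2)
        if de * dt > 0:
            nc += 1          # pair ordered the same way in both orders
        elif de * dt < 0:
            nd += 1          # pair ordered oppositely
    n = len(sample)
    return (nc - nd) / (n * (n - 1) / 2)
```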
3.5 Exam Utility Function
Calculating the order correlation measure over all possible models is computationally infeasible.
Due to this practical constraint, we resort to an approximation measure based on a model sample
drawn from the given distribution PM. We define our utility function as Kendall’s τ computed
over the model sample between the target order and the exam-induced order.
Recall from Chapter 2 that we require the order correlation measure to also reflect the
distribution of student models, stressing the importance of ordering more likely models than less
likely ones. Our sample-based correlation measure meets this requirement: The likely models
are more likely to be sampled, perhaps more than once, and thus have a stronger influence on
the measure. The resulting measure is in fact an approximation of a generalization of Kendall’s
τ to non-uniform element weights, introduced recently by Kumar and Vassilvitskii [KV10].
3.6 Searching the Space of Exams
Our local search algorithm for exam generation involves three steepest ascent hill climbing
phases: adding questions, swapping questions, and adjusting grading weights. Starting with
an empty set of questions, the algorithm iteratively adds the question for which the resulting
set yields maximal utility value, using uniform grading weights. When the question set has
reached the desired exam length, the algorithm turns to consider single swaps between an exam
question and an alternative candidate. The utility-maximizing swap is performed repeatedly until a local
optimum is reached, at which point the algorithm proceeds to adjust the grading weights.
Recall that the absolute grades of the students are of no importance to us, as we are only
interested in the order imposed on the student sample. It follows that, theoretically, a good set
of weights would satisfy the desired property that every two subsets have a different sum. Using
such a weight set makes it possible to differentiate between any two students who answered
differently on at least one exam question. Their grades will surely be different, but
due to conflicting constraints they may not reflect the desired order upon all pairs. Constructing
a weight set with this property is not difficult, for example: {1 + 1/pi : pi is the i-th prime number},
{1 + 1/2^i : i ∈ N}, or even a weight set randomly generated from a continuous range (with a
theoretical probability of 1). Of course, not all such candidate weight sets are equivalent in terms
of the orderings they may impose between subsets.
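The distinct-subset-sum property can be verified exactly with rational arithmetic. A brute-force Python sketch (function and variable names are ours):

```python
from fractions import Fraction
from itertools import chain, combinations

def all_subset_sums_distinct(weights):
    """Return True iff every two subsets of `weights` have different sums.
    Exact rational arithmetic avoids spurious floating-point ties."""
    index_subsets = chain.from_iterable(
        combinations(range(len(weights)), r) for r in range(len(weights) + 1))
    sums = [sum((weights[i] for i in s), Fraction(0)) for s in index_subsets]
    return len(sums) == len(set(sums))

dyadic = [1 + Fraction(1, 2**i) for i in range(1, 7)]  # {1 + 1/2^i} from the text
uniform = [Fraction(1)] * 6                            # uniform weights, for contrast
```

`all_subset_sums_distinct(dyadic)` holds because the fractional parts form a binary expansion, while uniform weights fail the check, since any two equal-size subsets collide.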
The weight adjustment performed by our algorithm enables the construction of such a
desired weight set by applying weight perturbations of exponentially decreasing granularity.
Starting from the local optimum reached at the end of the question swapping phase, the algorithm
proceeds to perform a local search over the space of weight vectors. The search operators include
the addition of a small constant ∆ (e.g. 0.05) or its negation to any question weight. When a
local optimum is reached, the increment step ∆ is halved and the process continues this way
until no improvement is made.
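A Python sketch of this weight-adjustment phase, assuming a black-box `utility` function over weight vectors; the `min_delta` threshold stands in for the thesis's "until no improvement is made" stopping condition and is our assumption:

```python
def adjust_weights(weights, utility, delta0=0.05, min_delta=1e-4):
    """Steepest-ascent hill climbing over per-question weight
    perturbations of +/-delta, halving delta at each local optimum."""
    weights = list(weights)
    best = utility(weights)
    delta = delta0
    while delta > min_delta:
        improved = True
        while improved:                       # climb at the current granularity
            improved = False
            best_cand, best_val = None, best
            for i in range(len(weights)):     # try +/-delta on every weight
                for step in (delta, -delta):
                    cand = list(weights)
                    cand[i] += step
                    val = utility(cand)
                    if val > best_val:
                        best_cand, best_val = cand, val
            if best_cand is not None:         # take the steepest-ascent step
                weights, best, improved = best_cand, best_val, True
        delta /= 2                            # refine the granularity
    return weights
```

On a smooth surrogate utility the halving schedule drives the weights to within the final perturbation size of a local optimum.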
The algorithm produces exams that are expected to have good discriminating capabilities. It
rejects questions for which the student answers are extremely homogeneous, i.e., very difficult or
very easy ones, since such questions contribute little to the induced order correlation. Moreover,
the desired discrimination is defined by the target order given as input: questions that proficient
students (as defined by the input) are more likely than others to answer correctly are preferred.
The grading weights of exam questions are expected to behave similarly. Perhaps
contrary to initial intuition, difficult exam questions are not expected to receive high grading
weights. This behavior, attributed to the lower discriminative capability of such questions, is
reasonable in real educational settings. An exam weighted directly by difficulty generally results
in a distorted grading curve, as only the few most proficient students will answer correctly the
highly weighted questions.
3.7 Wrap-up
We show in Figure 3.2 a high-level pseudo-code for the entire exam generation procedure
described. It accepts as input the domain’s ability set A and the student model distribution PM.
For simplicity of presentation, the pseudo-code uses the default method for defining a target
student order (⪯w). Therefore the third input parameter is the teacher-specified ability weight
vector w.
3.8 Adapting to the Absolute Grading Setting
This section discusses the changes required in the framework components for use
in the alternative absolute grading setting. First, the target student order
⪯∗ ⊆ M² is replaced with the target grade mapping g∗ : M → [0, 1]. As the size of M is exponential in n, we cannot expect the user to specify the value of g∗ explicitly for each
student model. We therefore propose an alternative intuitive method for specifying g∗, by
supplying a triplet 〈w, µ, σ〉. The vector w = 〈w1, ..., wm〉 is composed of non-negative
weights indicating ability importance, as in the definition of ⪯w in Section 3.3. The µ and σ
parameters are respectively the desired mean and standard-deviation of the grade distribution.
For practical use, computing g∗ requires a student sample taken from the model distribution
PM. To compute g∗(s) for all s in the sample, we first order the sampled students by their
proficiency level. Recall that the proficiency level of s is defined as the weight sum of mastered
abilities, (s, w) = Σiwi · si. Each student s is then assigned its theoretical percentile of the
Gaussian distribution N(µ, σ).
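The following Python sketch computes g∗ over a sample as described. Assigning each student the grade at the mid-rank percentile (r + 0.5)/n is our assumption; the thesis does not fix the exact percentile convention:

```python
from statistics import NormalDist

def target_grades(proficiencies, mu=0.5, sigma=0.1):
    """Order sampled students by proficiency, then map each student to
    the grade at its percentile of the Gaussian N(mu, sigma)."""
    n = len(proficiencies)
    order = sorted(range(n), key=lambda i: proficiencies[i])
    dist = NormalDist(mu, sigma)
    grades = [0.0] * n
    for rank, i in enumerate(order):
        grades[i] = dist.inv_cdf((rank + 0.5) / n)  # mid-rank percentile
    return grades
```

The resulting grades preserve the proficiency order and are centered on µ by construction.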
Another change is the removal of the order correlation measure used for the utility function in
the relative grading setting. Instead of searching for exams ordering students in high correlation
with the target order, we desire exams for which the resulting grades are close to the target
mapping. Therefore, a distance function is required instead of the correlation measure, and the
objective is changed from maximizing the correlation to minimizing the distance. In this work
we used the L1 norm between grade vectors of sampled students as our distance function. Other
candidate distance functions include the cosine distance and the L2 and L∞ norms.
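A minimal sketch of the resulting objective (names are illustrative); dividing by the sample size gives the average grade error reported in Chapter 5:

```python
def grade_distance(exam_grades, target_grades):
    """L1 distance between the exam-induced and target grade vectors,
    computed over the same student sample."""
    assert len(exam_grades) == len(target_grades)
    return sum(abs(g - t) for g, t in zip(exam_grades, target_grades))
```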
Finally, we turn to inspect the changes required in the search method. The algorithm remains
very similar to that of the previous setting with only a couple of changes worth mentioning. In
the relative grading setting, the values of question weights were important only in comparison
to each other. Here, since the absolute grades are of interest, the absolute weights are important
as well. Hence, throughout the search, the question weights must be normalized to sum to 1
before an exam is evaluated. Additionally, the local search selects the successor that
minimizes distance rather than the one that maximizes utility.
Procedure APPROXIMATE-ACTION-LANDMARK(q)
Parameters: Tlim - runtime limit, patience parameter
SolutionOperators ← {}    # set of solution-operator sets
dlim ← 1
tries ← 0
Repeat until TIMEUP(Tlim)
    path ← follow random action sequence from q up to length dlim
    If IsSolution(path)
        ops ← {a ∈ A | a ∈ path}
        If ops ∉ SolutionOperators
            SolutionOperators ← SolutionOperators ∪ {ops}
            dlim ← length(path)
            tries ← 0
        Else
            tries ← tries + 1
    Else
        tries ← tries + 1
    If tries ≥ patience
        tries ← 0
        dlim ← dlim + 1
Return SolutionOperators
Figure 3.1: Pseudo-code for action landmark approximation method
Procedure MOEG(A, PM, w)
Q ← GENERATE-QUESTIONS()
Foreach q ∈ Q
    l(q) ← APPROXIMATE-ACTION-LANDMARK(q, A)
M ← SAMPLE-STUDENT-POPULATION(PM)
Foreach s ∈ M
    Proficiency(s) ← Σi wi · si
⪯∗ ← {(s1, s2) | Proficiency(s1) ≤ Proficiency(s2)}
Foreach (s, q) ∈ M × Q
    ψ(s, q) ← 1 if ∃t ∈ l(q) s.t. t ⊆ s, else 0
        # set notation for student s ∈ M is used for simplicity
For any exam e and students s1, s2 ∈ M:
    grade(s1, e) ≜ Σ_{1≤i≤ke} wi · ψ(s1, qi)
    ⪯e ≜ {(s1, s2) | grade(s1, e) ≤ grade(s2, e)}
    U(e) ≜ τ(⪯∗, ⪯e)    # or other correlation measure
EXAM BUILD:
    Initialize exam ← Empty Exam
    exam ← ADD-QUESTIONS(exam, U, ke)
    exam ← SWAP-QUESTIONS(exam, U)
    ∆ ← 0.05    # or any other small value
    improved ← TRUE
    While improved
        (exam, improved) ← ADJUST-WEIGHTS(exam, U, ∆)
        ∆ ← ∆/2
Figure 3.2: MOEG pseudo-code
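The answer indicator ψ used in Figure 3.2 reduces to a subset test against the question's approximated landmark sets. A minimal Python sketch (names are ours):

```python
def psi(student_abilities, landmarks):
    """psi(s, q): the student answers q correctly iff some landmark set
    (the abilities required along one solution family of q) is fully
    contained in the student's mastered abilities."""
    return 1 if any(t <= student_abilities for t in landmarks) else 0
```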
Chapter 4
Educational Domains
In this chapter we describe two fully implemented procedural domains used for the empirical
evaluation of the MOEG framework. The first domain, typical of middle-school algebra courses,
is that of single-variable linear equations. The second domain is trigonometric equations, a
subject typically found in high-school mathematics curricula.
4.1 Algebra Domain
In the Algebra domain students are asked to solve for x in single-variable linear equations such
as:
• 2x + 5 = 13
• x − 3 = −8 + 8x + 3(−2x + 3)
• 2 − (−4x + 2(x + 6 − 3x)) = 4x
4.1.1 Algebra Ability Set
The ability set A consists of 18 types of algebraic manipulations. We define the following 5
main action types and later decompose them into subtypes:
(U) Unite: Merge two consecutive terms of the same type (variable or constant), e.g., −2x + 7x ⇒U 5x.

(O) Open multiplication: Apply the distributive property to eliminate one pair of parentheses, e.g., −3(x + 2) ⇒O −3x − 6.

(D) Divide: If both sides of the equation contain one term and the coefficient of one is a divisor of the other, divide both sides by it, e.g., 2x = −8 ⇒D x = −4.

(M) Move: Move a single term from one side of the equation to the other and place its negation following a term of the same type, e.g., 5x − 2 = −6 + 3x ⇒M 5x = −6 + 2 + 3x.
(R) Rearrange: Move a single term within one side of the equation so that it follows a term of the same type, e.g., x − 8 + 7x = 10 ⇒R x + 7x − 8 = 10.
Each such action type is further decomposed into subtypes according to the parameters it
is applied over. For example, Unite (U ) is decomposed into 8 subtypes according to 3 binary
arguments: the type of terms united (variable or constant), the sign of the first term (’+’ or ’-’),
and the sign of the second. In a similar manner each action type is decomposed, giving us a
total of |A| = 18 abilities:
U(8) : {U〈t, s1, s2〉 : t ∈ {v, c}, s1, s2 ∈ {−,+}}
O(2) : {O〈s〉 : s ∈ {−,+}}
D(2) : {D〈s〉 : s ∈ {−,+}}
M(4) : {M〈t, s〉 : t ∈ {v, c}, s ∈ {−,+}}
R(2) : {R〈t〉 : t ∈ {v, c}}.
4.1.2 Algebra Question Generation
For this algebraic domain we devised a question-generating algorithm. It starts with an equation
representing the desired solution and repeatedly applies complicating operations while preserving
equation equivalence.
The algorithm receives two parameters: depth (d) and width (w). It begins with a solution
equation of the sort x = c and manipulates it by applying a short random sequence of basic
operations, resulting in an equation of the form a1x + b1 = a2x + b2, where the parameters
a1, a2, b1, b2 may be 0, 1, or any other value. The algorithm then iteratively performs d “deepening”
manipulations, transforming expressions of the sort ax + b to a′x + b′ + c(a′′x + b′′) while
maintaining that a = a′ + ca′′ and b = b′ + cb′′. These deepening manipulations are performed
on random levels of the equation tree structure, and so may be applied to an inner part created
by a previous iteration.
The algorithm continues with w “widening” iterations where a random term is split into a
pair of terms, i.e., b⇒ b′+ b′′ or ax⇒ a′x+ a′′x (where b = b′+ b′′, a = a′+ a′′). Finally, all
terms are shuffled recursively to produce a random permutation. In all manipulations performed,
the algorithm ensures that the newly formed coefficients are bounded by some constant (100).
The set of candidate exam questions in this domain, Q, was produced by applying the
described procedure 10 times with each (w, d) value pair in {0, 1, 2, 3}², resulting in a set of
size 160.
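The core equivalence invariant of a "deepening" step can be sketched as follows; the coefficient sampling ranges are our assumption, as the thesis only requires coefficients bounded by 100:

```python
import random

def deepen(a, b, bound=100):
    """One deepening step: rewrite a*x + b as a1*x + b1 + c*(a2*x + b2)
    with a = a1 + c*a2 and b = b1 + c*b2, keeping coefficients bounded."""
    while True:
        c = random.choice([k for k in range(-9, 10) if k != 0])
        a2 = random.randint(-9, 9)
        b2 = random.randint(-9, 9)
        a1, b1 = a - c * a2, b - c * b2  # enforce the equivalence invariant
        if all(abs(v) <= bound for v in (a1, b1, c, a2, b2)):
            return a1, b1, c, a2, b2
```

Whatever coefficients are drawn, the returned decomposition reconstructs the original expression, so the equation's solution is unchanged.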
4.2 Trigonometry Domain
The second domain is trigonometric equations, in which students are asked to solve for x
representing an angle parameter. Example questions from this domain include:
• 2 − 4sin(2x) = 0
• −sin²(x) + cos²(x) − cos(2x) + sin(x) = 0
• 2cos²(x) − 2sin²(x) = 1
4.2.1 Trigonometric Ability Set
The ability set A consists of 25 types of solving actions based on known trigonometric identities.
The definition chosen for these actions abstracts out basic algebraic knowledge, which
is considered a prerequisite for this domain. For example, cos(x + π/2) = 0, cos(x) = 1/2, and
1 − 2cos(3x) = 0 are all solved using the same ability: solving an atomic trigonometric equation
of canonical form cos(mx + n) = c. We call this atomic ability TrigEqConst:Cos and call
its counterparts for other trigonometric functions TrigEqConst:Sin and TrigEqConst:Tan.
Similarly, we have 3 matching abilities, TrigEqTrig:〈fun〉, for solving equations of the sort
fun(m1x + n1) = fun(m2x + n2), where 〈fun〉 ∈ {sin, cos, tan}. A third type of atomic
solution ability, QuadTrig:〈fun〉, solves quadratic equations in a basic trigonometric function,
e.g., 2cos²(3x) + 5cos(3x) − 3 = 0. The rest of the ability set is composed
of trigonometric identity applications, summarized in Figure 4.1.
4.2.2 Trigonometric Question Pool
The question pool used in this domain was collected from 3 different educational websites1,
along with questions we authored ourselves. Altogether the pool consists of 75 questions, testing
all of the defined domain abilities. Domain questions could also have been generated automatically,
but we preferred to exemplify this alternative question source for our second domain.
1 http://tutorial.math.lamar.edu/Extras/AlgebraTrigReview/SolveTrigEqn.aspx
http://dbhs.wvusd.k12.ca.us/ourpages/auto/2009/5/8/55918698/ch7 trig equation worksheet.pdf
http://perrysprecalculus.weebly.com/uploads/1/3/4/5/13450290/study guide 7.7-7.8 answers.pdf
Ability name         Pre-condition ⇒ Post-result
TrigEqConst:Sin      sin(mx+n) = c ⇒ SOL
TrigEqConst:Cos      cos(mx+n) = c ⇒ SOL
TrigEqConst:Tan      tan(mx+n) = c ⇒ SOL
TrigEqTrig:Sin       sin(m1x+n1) = sin(m2x+n2) ⇒ SOL
TrigEqTrig:Cos       cos(m1x+n1) = cos(m2x+n2) ⇒ SOL
TrigEqTrig:Tan       tan(m1x+n1) = tan(m2x+n2) ⇒ SOL
QuadTrig:Sin         a·sin²(mx+n) + b·sin(mx+n) + c = 0 ⇒ SOL
QuadTrig:Cos         a·cos²(mx+n) + b·cos(mx+n) + c = 0 ⇒ SOL
QuadTrig:Tan         a·tan²(mx+n) + b·tan(mx+n) + c = 0 ⇒ SOL
SinToCos             sin(mx+n) ⇒ cos(π/2 − mx − n)
CosToSin             cos(mx+n) ⇒ sin(π/2 − mx − n)
CosNegate            cos(−(mx+n)) ⇒ cos(mx+n)
SinNegate            sin(−(mx+n)) ⇒ −sin(mx+n)
SinSqToCosSq         sin²(mx+n) ⇒ 1 − cos²(mx+n)
CosSqToSinSq         cos²(mx+n) ⇒ 1 − sin²(mx+n)
SinSqPlusCosSq       sin²(mx+n) + cos²(mx+n) ⇒ 1
DoubleParamSinFrom   sin(2(mx+n)) ⇒ 2·sin(mx+n)cos(mx+n)
DoubleParamSinTo     2·sin(mx+n)cos(mx+n) ⇒ sin(2(mx+n))
DoubleParamCosFrom   cos(2(mx+n)) ⇒ cos²(mx+n) − sin²(mx+n)
DoubleParamCosTo     cos²(mx+n) − sin²(mx+n) ⇒ cos(2(mx+n))
SqrtEquation         ∏i ti² = ∏j sj² ⇒ ∏i ti = ±∏j sj,
                     where {ti} and {si} are basic trig. functions or constants
DivByTrig:Sin        sin(mx+n)·∏i ti = sin(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. sin(mx+n) ≠ 0
DivByTrig:Cos        cos(mx+n)·∏i ti = cos(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. cos(mx+n) ≠ 0
DivByTrig:Tan        tan(mx+n)·∏i ti = tan(mx+n)·∏j sj ⇒ ∏i ti = ∏j sj, s.t. tan(mx+n) ≠ 0
MultCosForTan        e.g., tan(x) − 2sin(x) = 0 ⇒ sin(x) − 2·sin(x)cos(x) = 0, s.t. cos(x) ≠ 0
Figure 4.1: Trigonometry domain abilities
Chapter 5
Evaluation
In this chapter we present an empirical evaluation of the MOEG framework over the two
procedural domains described in the previous chapter.
5.1 Empirical Methodology
Ideally we would have liked to evaluate the exams by measuring their fitness over the entire
population of models. Of course, this is computationally infeasible for the same reasons that led
us to use the utility sample in the first place. We therefore compromise and use another sample,
the “oracle”, for evaluation. Two things are noteworthy regarding this oracle sample: first, it is
taken independently from the utility sample, and second, it is considerably larger. The resulting
evaluation function is therefore (1) unbiased in evaluating the algorithm’s produced exams, and
(2) a better approximation of the entire student model distribution. For completeness, we present
graphs showing the values of both the guiding utility and the oracle evaluation.
We conducted experiments for both the relative grading and absolute grading settings. In
the relative grading setting four main independent variables were tested. The first two are
the size of the sample used for the utility guiding the search and the exam length ke. The
third and fourth variables control the distribution of students PM and the simulated teacher-
specified ability weights w. The student distribution is defined by a collection of ability mastery
probabilities {pi} as described in Section 3.2. The specification of these mastery probabilities
was abstracted by one variable, εp, controlling the variance between abilities. With a set value
for εp, each pi value was independently sampled from the uniform distribution Uni(0.75 ± εp).
A similar approach was used to sample the ability weights: wi ∼ Uni(1 ± εw) for a set value
of εw. The third and fourth independent variables are therefore εp and εw. In the absolute
grading setting, addressed in Section 5.8, we experimented with the parameters controlling
the target grade mapping, namely µ, σ. Default values, used unless mentioned otherwise, are
SampleSize = 400, ke = 10, εp = 0.15, εw = 0.5, µ = 0.5, σ = 0.1.
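The parameter sampling described above can be sketched as follows (the function name is illustrative):

```python
import random

def sample_setting(n_abilities, eps_p=0.15, eps_w=0.5):
    """Draw per-ability mastery probabilities p_i ~ Uni(0.75 +/- eps_p)
    and ability weights w_i ~ Uni(1 +/- eps_w)."""
    p = [random.uniform(0.75 - eps_p, 0.75 + eps_p) for _ in range(n_abilities)]
    w = [random.uniform(1 - eps_w, 1 + eps_w) for _ in range(n_abilities)]
    return p, w
```

With the default values, the algebra domain's 18 abilities receive mastery probabilities in [0.6, 0.9] and weights in [0.5, 1.5].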
A sample of size 1000 was used for the oracle. All results presented are based on 50
independent experiment runs using the same question set Q. The derivation of their action
landmarks through our anytime approximation method (Figure 3.1) was performed once (with
Tlim = 30 sec, patience = 10).
5.2 Performance Over Time
We tested the performance of the MOEG algorithm and compared it to four baseline algorithms
we defined1. Uniform generate & test generates random uni-weight exams and evaluates them
using the search utility, maintaining the best exam found so far. A variation is the weighted
generate & test, which makes a biased selection of exam questions inspired by the item response
theory concept of item information [DA09], reflecting a question’s usefulness in exams. It
selects questions with probability proportional to their information level, defined as p(1− p),
where p is the proportion of utility sample students answering the question correctly. Note that
these two baseline algorithms use a key component of our framework — the utility function.
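The biased question selection of the weighted generate & test baseline can be sketched as follows; the roulette-wheel sampling detail is our assumption:

```python
import random

def information_weighted_choice(questions, p_correct, k):
    """Draw up to k distinct questions with probability proportional to
    the item information p*(1-p), where p_correct[q] is the fraction of
    utility-sample students answering q correctly."""
    info = {q: p_correct[q] * (1 - p_correct[q]) for q in questions}
    pool = [q for q in questions if info[q] > 0]  # zero-information items are never drawn
    chosen = []
    for _ in range(min(k, len(pool))):
        r = random.random() * sum(info[q] for q in pool)
        for q in pool:                            # roulette-wheel selection
            r -= info[q]
            if r <= 0:
                chosen.append(q)
                pool.remove(q)
                break
    return chosen
```

Questions everyone answers (p = 1) or no one answers (p = 0) carry zero information and are never selected.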
Two additional baseline algorithms were devised in the algebra domain. The first is the
diversifier, which attempts to maximize the diversity of exam questions in terms of syntactic
features, under the assumption that this better differentiates between students. We defined
six question features: number of constant terms, variable terms, positive coefficients, negative
coefficients, parentheses, and overall terms. The feature values are normalized as Z-scores to
account for the different scales. The algorithm starts with a random question, then iteratively
selects a question to add, maximizing the sum over pairwise Euclidean distances between exam
questions. This is followed by a similar swapping phase.
The second is the similarity maximizer. It takes an opposite approach to the diversifier,
which could also be considered reasonable — find a good question and others similar to it, in
terms of features. The first question selected is the one maximizing utility value, and thus the
algorithm is identical to MOEG at the first data point. Further question selections are made so
that the feature distance, as defined by the diversifier, is minimized.
Figure 5.1 displays MOEG’s improvement over time during the question selection and
swapping phases, compared to the baseline competitors. We can see that the MOEG curve
surpasses all others in both domains, even with partial exams. The data shows that the performance
of the weighted and uniform generate & test baselines is practically equivalent. The diversifier and
similarity maximizer both exhibit surprisingly poor performance. Hence, we did not attempt to
define the question features these algorithms require in the trigonometric domain.
5.3 The Effect of Exam Length on Performance
We expect that longer exams will allow a better ordering of the student population. Figure 5.2
presents how the exam length ke affects the performance of the exam generation algorithm,
averaged over 50 runs. Each run was executed with independently drawn oracle and utility
samples. As expected, both curves are monotonically increasing with a diminishing slope. We
note that the utility is always higher than the oracle. This is to be expected as the search algorithm
1We could not compare MOEG to testsheet composition methods such as [HLL06], as they work with completely different input and cannot be applied to the setup we use.
[Plot: oracle value (Kendall τ) vs. time in exam evaluations; series: MOEG, Weighted G&T, Uniform G&T, Diversifier, Similarity maximizer]
Figure 5.1: Performance over time
(Algebra-top, Trigonometry-bottom)
tries to optimize this utility. The difference between the search utility and the oracle may be
considered analogous to the difference between training and testing accuracies in classification
tasks.
5.4 The Effect of Sample Size on Performance
The search process is guided by a sample-based utility. We expect that increasing the size of the
sample will improve the quality of the utility function. Figure 5.3 shows the effect of the utility
sample’s size on performance. Each point in the curve represents an average of 50 runs over
[Plot: value (Kendall τ) vs. exam length; series: Oracle, Utility]
Figure 5.2: The effect of exam length on performance
(Algebra-top, Trigonometry-bottom)
different utility samples.
Indeed, we can see that the performance of the algorithm, as measured by the oracle,
improves with the increase in sample size. The difference between the value of the oracle and
that of the search utility can be viewed as the estimation error of the search utility. We can see
that this error decreases as we use larger sample sizes.
[Plot: value (Kendall τ) vs. sample size; series: Oracle, Utility]
Figure 5.3: The effect of sample size on performance
(Algebra-top, Trigonometry-bottom)
5.5 Framework Stability
Figure 5.4 and Figure 5.5 show how εw and εp affect the oracle evaluation. The values displayed
are the mean and standard deviation of the oracle evaluation over 50 runs for each εw value
with the default εp, and vice versa. The general observation is that variance in the ability weights
and mastery probabilities has a relatively minor influence on the algorithm's performance.
We conclude that the MOEG framework is stable with respect to the model population and the
ability weights.
[Plot: oracle value (Kendall τ) vs. εw; series: Oracle]
Figure 5.4: The effect of εw on performance
(Algebra-top, Trigonometry-bottom)
5.6 Domain Coverage
In this section we examine another desirable property from an exam — domain coverage. Any
educator would, no doubt, agree that a good exam should test a large portion of the domain. We
wish to test if exams produced by MOEG exhibit good domain coverage, but to do so must first
define this concept. An ability is considered required in order to answer a question iff it is used
in all solution paths. Now, the coverage rate of an exam is defined as the number of abilities
required for at least one of the exam questions.
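Under this definition, computing the coverage rate is a union of per-question required-ability sets. A minimal sketch (the `required` mapping is assumed given by the landmark analysis):

```python
def coverage_rate(exam_questions, required):
    """Number of abilities required (used on every solution path) by at
    least one exam question; `required` maps question -> ability set."""
    covered = set()
    for q in exam_questions:
        covered |= required[q]
    return len(covered)
```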
Figure 5.6 compares the coverage rate of the 50 uni-weight MOEG-generated exams in
[Plot: oracle value (Kendall τ) vs. εp; series: Oracle]
Figure 5.5: The effect of εp on performance
(Algebra-top, Trigonometry-bottom)
algebra, using default parameters, against a pool of 1000 random exams. It is evident that
MOEG exams exhibit significantly higher coverage rates — a desirable property. One may
wonder why the maximal coverage displayed is 10 abilities while there are 18 abilities in the
algebra domain. Data shows that 10 abilities is, in fact, the maximal coverage rate possible,
achieved by an exam of all 160 questions in Q. The reason for this is that many abilities are
not required for any question since they are interchangeable. For example, −2x = −8 can
be solved by dividing by a negative value (−2), or by swapping the sides of both terms and
then dividing by a positive value (2). Therefore, no abilities are considered to be required for
this question. In the trigonometric domain this phenomenon is even more pronounced: most
[Histogram: frequency of exam coverage rates (4–10 abilities); series: Random, MOEG]
Figure 5.6: Algebra domain coverage
questions have no required abilities at all. The matching graph for the trigonometric domain
was therefore omitted.
5.7 An Alternative Search Algorithm
Our search algorithm first finds a set of questions and then finds a set of weights. An interesting
variation of this algorithm is to mix these two tasks by allowing question swapping also during
the weight adjustment phase. It may have been reasonable to expect that this would result in
better performance due to the additional flexibility we allow the algorithm. However, results
show that this is not the case. Over 50 runs with the same samples used for both algorithms, the
proposed alternative algorithm yields nearly equivalent results. The runtime required, however,
is significantly shorter for the original, as may be expected due to the smaller branching factor
in the search.
5.8 Absolute Grading Setting
In what follows, we present the results of our experiments with the absolute grading setting.
Figure 5.7 shows the improvement of MOEG over time compared to the baselines. The values
displayed are in terms of the average grade error, i.e., the L1 distance between the exam grade
vector and the target grade vector, divided by their length.
Next, we examine the effect µ and σ, which define the target grade mapping, have upon
performance. The results of these two experiments are presented in Figure 5.8 and Figure 5.9.
Both figures exhibit U-shaped plots, but for different reasons. The µ parameter effectively
[Plot: oracle evaluation vs. time in exam evaluations; series: MOEG, Uniform G&T, Weighted G&T]
Figure 5.7: Performance over time (absolute grading)
(Algebra-top, Trigonometry-bottom)
controls the difficulty level of exam questions. The higher the mean target grade is, the easier
exam questions need to be with respect to the student population. Therefore, when this parameter
is taken to any extreme, the fitness of most candidate questions decreases. To confirm this
explanation, we extracted a difficulty histogram from our question sets, displayed in Figure 5.10.
The difficulty of each question is defined as the proportion of oracle models able to answer it
(over all 50 runs). The concentration of questions around the optimal µ values (in both domains)
allows the algorithm more flexibility in selecting domain questions of the appropriate difficulty,
thereby resulting in better performance there.
We turn to explain the effect σ has on performance as depicted in Figure 5.9. Consider the
[Plot: error (average diff) vs. target mean grade µ; series: Oracle]
Figure 5.8: The effect of µ on performance (absolute grading)
(Algebra-top, Trigonometry-bottom)
problem of matching the grade distribution with the default mean µ = 0.5 and extreme STD
σ = 0. In such a case, the algorithm is required to find an exam for which all student grades
tend to 0.5, regardless of ability level. This is, of course, a very difficult task since all questions
inherently favor the more proficient students, i.e., they are more likely to be answered by them.
As expected, the algorithm’s performance improves as σ increases from 0 up to an optimal
performance for some σ. After this point, the target grade mapping gradually becomes harder
to comply with. This is because students with similar ability vectors usually answer the same
questions correctly. Increasing the value of σ induces a target grade mapping where similar
students are assigned grades which are more dissimilar, making the problem more difficult.
[Plot: error (average diff) vs. target grade STD σ; series: Oracle]
Figure 5.9: The effect of σ on performance (absolute grading)
(Algebra-top, Trigonometry-bottom)
[Histogram: question frequency vs. success proportion]
Figure 5.10: Question difficulty histogram
(Algebra-top, Trigonometry-bottom)
Chapter 6
Related Work
This chapter presents three interrelated research fields relevant
to the task of exam generation. We begin with a short survey of item response theory
(IRT), a statistical paradigm for test design and analysis. The purpose of the survey is to provide
a flavor of a different approach for dealing with problems similar to ours. It is followed by
the discussion of a group of works addressing testsheet composition as integer programming
problems, utilizing various optimization techniques. The chapter is concluded with an overview
of Intelligent Tutoring Systems (ITS), a promising class of educational software. Our work was
inspired by ITS theory, and the notion of the student model was adopted from it. In future work it
may be desirable to use more sophisticated student models than those used in this work. Several
such options are discussed here. Test generation serves an important role in tutoring; therefore
we believe our work may prove beneficial to the ITS research community.
6.1 Item Response Theory
In the following section we present a short survey of Item Response Theory (IRT), introducing
the core principles, assumptions, and concepts of the theory.
6.1.1 IRT Overview
IRT is a statistical paradigm dealing with the design, analysis, and scoring of both tests and
questionnaires, based on statistical models derived from response data. Two major populations
use IRT to aid their cause: educators administering educational tests, and psychologists adminis-
tering psychological questionnaires [ER13]. The main difference between these applications of
the theory is that responses to educational tests generally indicate ability, while responses to
psychological questionnaires generally indicate beliefs or attitudes. Other than these different
interpretations of item responses, the models are quite similar. We shall focus the discussion
in this survey on educational applications. Common IRT educational tasks include test devel-
opment, test equating, and evaluation of item bias [HS85], as well as adaptive testing [GC05],
criterion-referenced measurement, and appropriateness measurement [LR79, HT83].
The term item serves as an abstraction for different types of questions, including true or
false, multiple choice, fill in the blank, and short-answer questions. Most attention has been
devoted to dichotomous items in which an answer may be either correct or incorrect, but
generalizations to polytomous items, in which each response has a different score value, have
also been made [ON06, NO10]. A correct response is assumed to suggest a certain level of
ability expressed in the form of an unobserved ability parameter θ, associated with the student.
In some cases the ability tested may be identified explicitly, for example, as general intelligence,
mathematical proficiency, or reading comprehension capability. However, such an explicit
identification is often not possible. Fortunately, it is not actually necessary. The theory
assumes the existence of some underlying ability even when the questions are of different nature
and common grounds are not easily identified. Therefore, the ability level is sometimes also
referred to as a latent trait. Classic IRT models make an assumption of unidimensionality,
by which test performance depends on a single ability parameter. Although multidimensional
models have also evolved [Rec09], unidimensional models are still most commonly used as
they simplify mathematical analysis considerably while retaining reasonable fit with respect to
empirical data.
IRT, also known as modern test theory, is generally viewed as an improvement over classical
test theory [HVdL82, HS85]. It relaxes its predecessor’s assumption that all items are
parallel instruments [AHHI94]. IRT characterizes each item with an Item
Characteristic Curve (ICC) - mapping a student’s ability level to success probability. The theory
provides methods for estimating item parameters (e.g., difficulty, discrimination) and student
ability levels from response data. A nice property of these methods is that item parameter
estimation does not depend on the overall ability and diversity levels of the examinees. The
converse is also true: student ability levels may be estimated using any set of items. Several
new testing problems have been made addressable with IRT: test design, identification of biased
items, equating of test scores, and identifying maximum discriminatory power of items.
6.1.2 ICC Models
The core assumption of IRT is that the probability of student success on a certain item is
dependent on the student’s ability level, or latent trait, denoted θ [Tuc46]. The mapping
between the ability level and the probability of success on an item is defined by its ICC (Item
Characteristic Curve). The first ICC models defined were of the normal-ogive family [Lor52].
These imposed considerable computational difficulties leading to the development of the logistic
model family used today. The two-parameter logistic model (2pl) [Bir57, Bir58b, Bir58a,
Bir68] defines the ICC as follows:

P_i^{2pl}(θ) = e^{D a_i(θ−b_i)} / (1 + e^{D a_i(θ−b_i)}).   (6.1)
Observe that the success probability increases as θ − b_i increases. That is, the “stronger” a
student is relative to b_i, the more likely he is to succeed; at θ = b_i the success probability
is exactly 0.5. The a_i parameter is proportional to the slope of P_i(θ) at the point
θ = b_i. A steep slope at b_i indicates that an item discriminates very well between students with
abilities in the proximity of θ = b_i. The b_i and a_i parameters are therefore called the item’s
difficulty and discrimination parameters, respectively. The D parameter is used for scaling and is
typically set to 1.7 [Hal52].
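As a concrete illustration of Equation 6.1, the 2pl curve can be computed directly. The following is a minimal sketch; the item parameters (a = 2, b = 0.5) are illustrative values, not drawn from any calibrated item pool:

```python
import math

D = 1.7  # standard scaling constant [Hal52]

def icc_2pl(theta, a, b):
    """Two-parameter logistic ICC: probability of success at ability theta."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

# At theta = b the success probability is exactly 0.5, and it grows
# monotonically with theta - b:
print(icc_2pl(0.5, a=2.0, b=0.5))   # -> 0.5
print(icc_2pl(1.5, a=2.0, b=0.5))   # well above 0.5
print(icc_2pl(-0.5, a=2.0, b=0.5))  # well below 0.5
```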
Other dichotomous IRT models are described by the number of parameters they employ
[TO01]. The 1pl model assumes equal discrimination for all items; its ICC is obtained by
substituting a_i = 1 in Equation 6.1. A special property may be observed when comparing
ICCs of two items in this model, known as specific objectivity. This sometimes desirable
property states that, given a set of students of different abilities, they will all agree regarding
the difficulty order of an item set. Two 1pl ICC curves with different difficulty parameters
will never overlap, hence the model is regarded as sample independent. Although theoretically
appealing on these grounds, the model makes two strong assumptions often criticized as being
unrealistic simplifications [Tra83]. The first is that all items have equal discriminating power,
and the second is that the probability of a student guessing the correct answer is negligible.
The 2pl model allows items to vary in discriminating power, while accommodating
the possibility of guessing led to the introduction of the 3pl model. The third parameter
introduced (c_i) serves as a horizontal lower asymptote of the ICC. Therefore, the success
probability of a student whose ability level approaches −∞ is defined to be c_i.
The c_i parameter reflects situations where students holding no relevant knowledge successfully
guess correct responses. It is therefore called the guessing or pseudo-guessing parameter.
For example, in a 5-option multiple-choice setting, c_i = 0.2 would be considered
reasonable. Figure 6.1 graphically shows the ICC function of the 3-parameter logistic model:
P_i^{3pl}(θ) = c_i + (1 − c_i) · e^{D a_i(θ−b_i)} / (1 + e^{D a_i(θ−b_i)}).   (6.2)
The 4pl model [McD67] accommodates the possibility that even the most proficient students
may sometimes be wrong. Thus, the fourth parameter (γ_i) may be viewed as the counterpart of
c_i in that it serves as a horizontal upper asymptote of the ICC. The 4pl model does not provide
any particular practical gains [BL81], and is thus less often used in the field than the 1pl, 2pl, and
3pl models.
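The lower asymptote of the 3pl curve in Equation 6.2 is easy to verify numerically. A minimal sketch, again with made-up item parameters:

```python
import math

D = 1.7

def icc_3pl(theta, a, b, c):
    """Three-parameter logistic ICC with pseudo-guessing floor c."""
    z = D * a * (theta - b)
    return c + (1.0 - c) * math.exp(z) / (1.0 + math.exp(z))

# The curve approaches c (not 0) as theta -> -inf, and 1 as theta -> +inf;
# at theta = b the success probability is c + (1 - c)/2.
print(icc_3pl(-10.0, a=2.0, b=0.5, c=0.2))  # close to 0.2
print(icc_3pl(10.0, a=2.0, b=0.5, c=0.2))   # close to 1.0
print(icc_3pl(0.5, a=2.0, b=0.5, c=0.2))    # -> 0.6
```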
A large variety of other model types were developed to address different test
settings, specifically items which are not dichotomously scored. One notable variation is by
Bock and Samejima, who address polytomously scored items [Boc72, Sam72]. In this setting,
differential scoring of answer options is said to improve reliability and validity of mental test
scores [WS70]. The ICC curve for dichotomous items is replaced by a set of item option
characteristic curves, each representing the probability of a certain answer given the ability
level. It may be expected that for a reasonable item, the curve for the correct option will be
monotonically increasing, but this is not necessarily true for curves of other options.
[Plot: success probability vs. latent ability trait, annotated with the pseudo-guessing parameter (c = 0.2), difficulty parameter (b = 0.5), and discrimination parameter (a = 2)]
Figure 6.1: The 3pl ICC curve
6.1.3 Common IRT Tasks
IRT aims to provide a basis for making predictions, estimates, or inferences both about student
abilities and item characteristics. Since the ability score θ is never explicitly observable, it must
be deduced from response data. A complication arises since the selection of an accurate model
and its parameter values must also be made congruently. A common workflow for obtaining
ability score estimates includes the following steps: data collection, model selection, parameter
estimation, and scaling of score values [HS85].
A common IRT task is the estimation of a student’s ability score: Given a student’s response
data to a set of items, and assuming that the item ICC model and parameters are known, what is
a good estimate for the student’s ability score? A classic approach for dealing with this problem
uses the maximum likelihood estimator (MLE). Since the ICC functions in the typical models
are non-linear, numerical procedures, e.g., Newton-Raphson, are usually applied. The task of
ability score estimation is naturally expanded to cases in which we wish to estimate student
ability and item difficulty simultaneously. Procedures based on conditional MLE have been
developed to handle such simultaneous estimation tasks. Another approach for dealing with
these is the Bayesian approach, building upon distributions of student ability scores [Owe75].
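The MLE procedure described above can be sketched for the 2pl model, where the Newton-Raphson update has a closed form: the score is Σ D a_i (u_i − P_i(θ)) and its derivative is −Σ D² a_i² P_i(θ)(1 − P_i(θ)). The response pattern below is invented for illustration; like the textbook MLE, this diverges for all-correct or all-wrong patterns:

```python
import math

D = 1.7

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def estimate_theta(responses, iters=50):
    """Newton-Raphson MLE of theta from dichotomous responses.

    responses: list of (u, a, b) with u in {0, 1} and known item
    parameters a (discrimination) and b (difficulty).
    """
    theta = 0.0
    for _ in range(iters):
        # Score (first derivative of the log-likelihood):
        score = sum(D * a * (u - p_2pl(theta, a, b)) for u, a, b in responses)
        # Second derivative (always negative, so Newton steps uphill):
        hess = -sum((D * a) ** 2 * p_2pl(theta, a, b) * (1 - p_2pl(theta, a, b))
                    for _, a, b in responses)
        theta -= score / hess
    return theta

# A student who solves the two easier items but misses the hard one:
data = [(1, 1.0, -1.0), (1, 1.0, 0.0), (0, 1.0, 1.0)]
theta_hat = estimate_theta(data)
```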
We conclude this short survey with a discussion of the test construction task in IRT. Recall
from the discussion of ICC models that an item generally discriminates best between students
with ability scores in proximity of the item’s difficulty level. Intuitively one may say that the
item provides the most “information” regarding students around that ability level. This notion
of item information is formally defined as the Item Information Function (IIF), mapping ability
level to item information. It is naturally expanded to the Test Information Function (TIF), defined
as the sum of the IIFs of the test’s items. The TIF provides valuable information regarding the ability ranges
for which the test is most effective. Student ability estimates are generally most accurate for
levels in which test-information is high. With some a priori knowledge of the student ability
distribution, tests may be designed to optimize information in the dominant ability regions, thus
improving test reliability [The85]. Another use of TIFs is for qualification exams, where the
students either pass or fail and the actual grades are of no importance. The test designer may
then focus on selecting items with difficulty levels near the qualification threshold, resulting in a
test with maximal information where desired.
6.2 Testsheet Composition
A group of works address the task of exam construction as a combinatorial optimization problem,
known as testsheet composition. The definition of an exam in these works is similar to ours:
a subset of questions from a given collection of candidates, usually without grading weights
assigned. Candidate items are represented by feature vectors expressing different aspects of
them, e.g., difficulty level, solving time, discrimination degree, and association with various
topics [HLL06]. These item parameters, given to the model as input, are assumed to be somehow
acquired from real items. In web-based systems, or when response data is available, previous
test records are analyzed in order to approximate and update the difficulty level, discrimination
degree, and other item parameters [LBE90].
Given the identifying parameters of all candidate items and the input constraints on the
exam, the problem is formulated in terms of a mixed integer programming (MIP) problem with
multiple assessment criteria [Hwa03]. A decision variable is associated with each candidate
item, indicating whether or not the item is selected for the testsheet being constructed. Typical
problem constraints include a required expected solving time, a specified exam length, and a
desired coverage level of topics. Finally, the objective function to be maximized is typically
defined as the discrimination degree of the entire exam. We note that the underlying assumption
in such formulations is similar to ours, i.e., good exams are those capable of discriminating well
between students of different proficiency levels.
Figure 6.2 shows a typical problem formulation of this type by Hwang et al. [HLL06].
The authors explain that the problem expands upon the knapsack problem, which is NP-hard;
they therefore propose a heuristic approximation based on sequential optimization of problem
parameters. An expansion of the model was later proposed for composing a collection of
equivalent testsheets simultaneously without multiple appearances of items [HCYL08]. Creating
parallel testsheets was addressed in other works using different problem formulations. Belov
et al. define the task as a maximum set packing problem, where a maximal number of non-
overlapping testsheets is sought [BA06]. Ishii et al. formulated the problem as a maximum
clique problem, where nodes represent feasible tests, edges represent pairwise-compliance with
overlapping constraints, and the objective is to maximize the number of testsheets complying
with all pairwise-constraints [ISU14].
A large variety of other problem variations have also been investigated. One such example is
the adaptive composition of testsheet requirements according to the various assessment purposes,
addressed in a work by Lin, Su, and Tseng [LST12]. They defined four different types of such
assessment purposes: displacement, summative, formative, and diagnostic. An item selection
Given:      discrimination coefficient of item i: d_i (i = 1, ..., n)
            association degree of item i and concept j: r_ij (i = 1, ..., n; j = 1, ..., m)
            expected time to answer item i: t_i (i = 1, ..., n)
            lower bound on expected relevance of concept j: h_j (j = 1, ..., m)
            bounds on expected time for answering the testsheet: [l, u]
Find:       item decision variables x_i (i = 1, ..., n)
Maximize:   Z = (Σ_{i=1}^{n} d_i x_i) / (Σ_{i=1}^{n} x_i)
Subject to: Σ_{i=1}^{n} r_ij x_i ≥ h_j,  j = 1, ..., m
            l ≤ Σ_{i=1}^{n} t_i x_i ≤ u
            x_i ∈ {0, 1},  i = 1, ..., n

Figure 6.2: Example testsheet composition problem formulation [HLL06]
strategy was applied to filter item candidates, thus reducing problem complexity to a more
manageable level. Genetic algorithms were then used to approximate the optimal solution, as in other
testsheet composition works [HLTL05].
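For toy instances, the formulation in Figure 6.2 can be solved exactly by enumeration, which makes its structure concrete. The instance data below are made up, and real item pools require the heuristics discussed above:

```python
from itertools import product

def best_testsheet(d, r, t, h, l, u):
    """Exhaustive solver for a small testsheet-composition instance:
    maximize average discrimination subject to per-concept relevance
    lower bounds h[j] and total-time bounds [l, u]. Exponential in the
    number of items, so only usable for tiny pools.
    """
    n, m = len(d), len(h)
    best, best_z = None, float("-inf")
    for x in product((0, 1), repeat=n):
        if sum(x) == 0:
            continue
        if not (l <= sum(t[i] * x[i] for i in range(n)) <= u):
            continue
        if any(sum(r[i][j] * x[i] for i in range(n)) < h[j] for j in range(m)):
            continue
        z = sum(d[i] * x[i] for i in range(n)) / sum(x)
        if z > best_z:
            best, best_z = x, z
    return best, best_z

# Toy instance: 4 items, 2 concepts (made-up numbers).
d = [0.9, 0.7, 0.5, 0.3]          # discrimination coefficients
t = [10, 5, 8, 4]                 # expected solving times
r = [[1, 0], [0, 1], [1, 1], [0, 1]]  # item-concept association
sheet, z = best_testsheet(d, r, t, h=[1, 1], l=10, u=20)
print(sheet, z)  # -> (1, 1, 0, 0) 0.8
```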
Integer programming formulations have been linked to item response theory, previously
discussed, as well as to classical test theory [AJW98]. A group of works define a binary integer
program seeking a minimal set of questions, subject to constraints on item information as
defined in IRT [ABTvdL91, The85]. Another work presents a maximin model for test construction
that requires only the relative shape of the target test information function to be specified [vvLBT89]. Yet another
[EAAA08] uses abductive machine learning techniques for identifying item subsets informative
of student proficiency levels, based on IRT models.
A wide range of techniques have been used to solve different testsheet composition formu-
lations. Hwang [Hwa03] applies a fuzzy logic approach to determine difficulty levels of test
items according to the learning status and personal features of students. He then uses clustering
techniques with dynamic programming for constructing feasible testsheets in accordance with
specified requirements. Armstrong et al. define a mathematical programming model maximizing
the classical test theory notion of test reliability, and use network theory and Lagrangian
relaxation techniques to solve their formulation [AJW98]. Other works [HYY06, DZWF12]
use tabu search [Glo89, Glo90], a metaheuristic method for guiding exploration of the solu-
tion space. Duan et al. [DZWF12] apply an analytic hierarchy process [Saa80] in order to
merge multiple objective functions into a unified weighted sum objective. They then combine
biogeography-based optimization [Sim08] with tabu search to solve the composition problem.
Wang et al. [WWP+09] use differential evolution [SP97], a parallel direct search algorithm, for
testsheet composition.
In our work, we searched the papers described above for baseline algorithms to compete
against. The objective of most testsheet composition formulations is the maximization of
overall test discrimination, as in ours. Nonetheless, the definition of discrimination differs
substantially. In testsheet composition, test discrimination is defined as the sum (or average)
of item discrimination parameter over included items. In the absence of additional constraints,
selecting the most discriminatory items would lead to an optimal solution. But of course
additional constraints are present, thereby complicating the problem and requiring different
approximation solutions.
In our formulation, the notion of test discrimination is more subtle and defined through the
order correlation measure. It is not enough for a test to distinguish between different students;
the discrimination must be in accordance with the target student order. This means that proficient
students must be more highly evaluated than less proficient ones, and not the other way around.
In order to unify the problem formulations, we need to define a discrimination parameter as a
single scalar per item. Using our framework components, a reasonable such definition is the
order correlation of the singleton exam composed of the one item being evaluated. But selecting
items maximizing this parameter leads to poor exams. In the extreme case, equivalent items
with maximal discrimination will be selected. The resulting exam will be of the same utility as the
singleton exam of any selected item. Therefore, it makes little sense to define discrimination
parameters per item in our formulation. Other item parameters (difficulty, length, topic coverage,
etc.) may also be derived in our setting, but no constraints on these are included in the problem
definition. The non-linear nature of exam utility in our formulation, as well as the absence of
constraints on other item parameters, make reasonable comparison impossible.
6.3 Intelligent Tutoring Systems
The most common teaching methods in schools and academic institutions today are based
upon frontal presentations by the educator. This traditional approach, derived from educational
essentialism, places the teacher as responsible for transferring knowledge to the students, and
the student as a passive listener [SSZ08]. It has been empirically established that one-on-
one tutoring is a significantly more effective method of teaching [Blo84]. This difference is
quantified as higher grade levels and more time on task, and is also expressed in qualitative aspects
such as interest and attitude towards learning. Tutoring allows the educator more adaptivity to
the academic, motivational, and affective needs of the student, and shifts the student’s role in
the learning process from passive listener to active participant.
The obvious problem with tutoring is that it is impossible to implement in the general public
education system due to practical constraints of time and human resources. This led to the development
of a large variety of educational software over the last fifty years or so, starting with
computer-aided instruction (CAI) applications. A CAI typically contains a collection of educational items
such as reading materials, exercises, exams, and audio clips, sequenced in a predefined order.
Alternatively, the student is allowed to navigate through the material as desired, using some
hypermedia interface. In either case, such educational software generally serves as a passive
electronic book.
[Diagram: the Student, Domain, Expert, Pedagogical, and Communication components and their mutual interactions]
Figure 6.3: Interaction between ITS components [BSH96]
Intelligent Tutoring Systems (ITSs) are the next evolutionary step in educational software.
An ITS is intended to be more of an interactive tutor than a passive knowledge source, adapting
to the student through learning interactions. Brusilovsky states that knowledge sequencing and
task sequencing are key functions of true intelligent tutoring systems [Bru92]. In addition to
such knowledge of teaching strategies, an ITS possesses knowledge of the domain and of the
learner [HS73, SP94]. An ITS consists of five main components [BSH96, Woo]: the domain
knowledge, student model, expert model, pedagogical module, and communication model. The
mutual interaction between system components is depicted in Figure 6.3, adopted from Beck,
Stern, and Haugsjaa [BSH96].
The domain module generally maintains a structured representation of the domain taught,
perhaps an ontology of topics and subtopics related to one another by relations such as
“prerequisite of,” “generalization of,” “harder than,” etc. [MB+00, SMM04]. The expert module uses
the domain knowledge and serves as something of a model of the optimal student; the goal of the
tutoring process is to bring student proficiency to the expert level depicted by the expert
module. This module often takes the form of a “runnable” model, meaning that it provides functionality
for generating a complete correct solution. The pedagogical module is responsible for guiding
the tutoring session. This includes both macro-level tutoring strategies and micro-level exercise
guidance. Ideally, it will support a variety of teaching strategies suitable for a heterogeneous
population of students who may differ in perceptual, cognitive, and affective aspects [JG95].
Lastly, the student module includes a model of the student using the system, used to make
inferences regarding the appropriate tutoring actions. Chrysafiadi and Virvou [CV13] state three
guiding questions which need to be addressed when building a student model: What to model?
How to model? and Why (i.e, for what purpose) to model? The student model may contain
both domain dependent and independent characteristics [YKG10], including dynamic features
such as knowledge and skills, errors and misconceptions, learning styles, references, affective
and cognitive factors, and meta-cognitive factors. In addition, static features such as age, class,
grades, mother tongue, etc. may also prove informative to the tutor.
Nine major approaches for student modeling are currently established in ITS research, all
implemented in existing systems [CV13]:
• Overlay model
• Buggy/Perturbation model
• Stereotyping models
• Bayesian-network models
• Fuzzy-logic models
• Machine-learning methods
• Ontology-based models
• Constraint-based models
• Cognitive-theory approaches
In the overlay model approach, a student’s knowledge state is composed of incomplete but
correct knowledge. Such a model may include binary variables indicating mastery or lack thereof,
but expansions have been made to variables indicating finer proficiency levels. An expansion
to belief vectors indicating ability level estimations has been proposed [BSW97], along with
an updating scheme accounting for acquisition and retention factors of the student. Another
expansion is the perturbation model where mal-knowledge or buggy knowledge may also be
included in the student model [ND08]. Stereotyping models assign users to common stereotypes
based on similarity of initial information such as class and grade level. Stereotyping offers a
reasonable way to deal with new users for whom little information is known [Ric79, Ric83].
More complicated models address uncertainty in student behavior using Bayesian networks or
fuzzy logic. Models based on a machine learning approach use observations of user behavior as
input examples and attempt to learn predictions of future actions [WPB01]. These sophisticated
student modeling approaches offer interesting expansion opportunities for our work, which
currently uses the relatively simple binary overlay model.
Chapter 7
Discussion
The new algorithmic framework presented enables automatic composition of well-balanced
exams, expected to produce the desired ordering of students by proficiency level. Different
exam versions are commonly required for the purpose of make-up exams, practice exams, or
for testing different classes [HCYL08]. Educators may now easily produce any number of
alternative exam versions with absolutely no additional effort. Automation of the composition
process introduces a new level of confidence in the fairness of an exam and in the equivalence
of different exam versions.
Our framework uses a distribution of student models to guide its search for good exams.
In Section 3.2 we show a method for parametric specification of such a distribution. These
p_i parameters can be estimated from a population of real students. With the development of
MOOCs, an abundance of educational data has become available. This data may potentially be
used to build more accurate model distribution approximations.
The following section is dedicated to discussing the applicability of MOEG to other domains,
followed by a section discussing some expansion possibilities for the framework.
7.1 Other Domains
The MOEG framework is applicable to any domain where a set of cognitive abilities A and
a sufficiency predicate ψ may somehow be defined. Automatically deducing ψ via action
landmarks further requires that solutions be represented as paths in search graphs, with A as the
operator set. Several such procedural domains come to mind, one of which is geometry.
A classic ITS paper [ABY85] presents a tutor for teaching geometrical reasoning. The tutor
utilizes a library of inference rules: triangle congruency/similarity, properties of polygons,
transitivity of angle/segment congruency, etc. It is implemented as a production system and
used to generate proofs, using different combinations of production rules. This solving process
may naturally be formalized as a graph search: graph states represent collections of statements
deduced so far, and search operators are applications of the production rules. An operator,
successfully applied, results in a new state with the inferred statement added to those of the
previous state. Goal states are those containing the desired claim of the geometry question.
Analytical geometry:  Circle C with origin (1, 0) and radius 2 ⇒ C: (x − 1)² + y² = 4
                      Point (3, y) on circle C ⇒ y = 0
Trigonometry:         In △ABC: |BC| = 2, ∠B = 90°, ∠A = 30° ⇒ |AC| = |BC| / sin(∠A) = 4
Motion problems:      Position at t = 0 is x₀ = 10 [km], constant velocity V = 5 [km/h] ⇒ x(t) = 10 + 5t

Figure 7.1: Example inferences in different domains
The idea of inference steps as operators is also applicable in various computational domains.
Consider, for example, inferring the intersection point of two functions in analytical geometry,
the length of a right triangle’s hypotenuse in trigonometry, or the location at time t of an object
moving according to x(t) in motion problems. These example inferences may serve as arcs on
solution paths of three different domains in which diverse sets of useful inference rules exist.
Automatically deducing ψ for questions is also possible for domain types other than search
spaces. For example, domains where solutions may be obtained by automated theorem provers
are also MOEG-applicable with a simple extension. The abilities A will be the axioms, lemmas,
and theorems that a student should know, while solutions will be complete proof trees. Given a
proof tree representing a solution, the axioms at the leaves will be considered the set of required
abilities for the solution. Collecting these ability sets from different proofs found by the theorem
prover results in sufficiency predicates of familiar form: disjunctive action landmarks.
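The extension just sketched can be made concrete: given each proof as a tree whose leaves are axioms, collect the leaf sets into a disjunctive sufficiency predicate. A minimal sketch with hypothetical axiom and rule names:

```python
def leaf_abilities(proof):
    """Collect the axioms at the leaves of a proof tree.

    A proof is either an axiom name (a leaf) or a (rule, subproofs) pair.
    """
    if isinstance(proof, str):
        return {proof}
    _rule, subproofs = proof
    abilities = set()
    for sub in subproofs:
        abilities |= leaf_abilities(sub)
    return abilities

def sufficiency_predicate(proofs):
    """One ability set per proof; the predicate is their disjunction."""
    ability_sets = [frozenset(leaf_abilities(p)) for p in proofs]
    return lambda mastered: any(s <= mastered for s in ability_sets)

# Two alternative proofs of the same claim (hypothetical axioms):
proofs = [("modus_ponens", ["ax1", ("and_elim", ["ax2"])]), "ax3"]
psi = sufficiency_predicate(proofs)
print(psi({"ax3"}))          # True: knows the second proof's axiom
print(psi({"ax1"}))          # False: the first proof also needs ax2
```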
In other domains, different methods for deducing ψ may exist. Consider the domain of
combinatorial problems in which problems are given in text, e.g., “How many non-empty
subsets does a set of size N have?” It is natural here to define domain abilities as combinatorial
concepts such as non-redundant combination (nrc), summation principle (sum), redundant
permutation (rp), or subtraction principle (sub). Automatically deducing ψ from the question
text alone is beyond the state of the art, but given the set of possible solutions, the task
becomes feasible using standard syntactic parsers. As an example, the solution set for the
question above would be {Σ_{i=1}^{N} (N choose i), 2^N − 1}, where the resulting sufficiency predicate is
ψ = (nrc ∧ sum) ∨ (rp ∧ sub).
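Under the binary overlay student model, where a student is represented by the set of mastered abilities, evaluating such a predicate is a pair of subset tests. A minimal sketch for the example above:

```python
# Binary overlay student model: the set of mastered abilities.
def psi(mastered):
    """Sufficiency predicate psi = (nrc AND sum) OR (rp AND sub)."""
    return {"nrc", "sum"} <= mastered or {"rp", "sub"} <= mastered

print(psi({"nrc", "sum"}))        # True: masters the first solution path
print(psi({"rp", "sub", "nrc"}))  # True: masters the second
print(psi({"nrc", "sub"}))        # False: neither path fully mastered
```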
Lifting the requirement for automatic deduction of ψ altogether expands the group of
candidate domains substantially. The framework may even be applied to declarative domains
such as history, for example. In history, the set of abilities can be a list of historical thinking
skills1: historical causation, patterns of continuity, periodization, comparison, contextualization,
historical argumentation, appropriate use of relevant historical evidence, historical interpreta-
tion, and historical synthesis. For such a domain, designing an algorithm for computing the
required abilities per question is beyond the state of the art, as it requires deep natural language
understanding. However, it should be relatively easy for an educator to identify these manually.
7.2 Framework Expansions
The main contribution of our work is the presentation of a complete framework for exam
generation based on a new notion of exam fitness. In order to demonstrate the utility of our
approach, some simplifying assumptions had to be made in the various framework components.
This work serves as a starting point from which several expansions can be made in future work.
This section is dedicated to discussing some possible expansions and their implications for the
framework.
Let us start by considering the student models. Using student models from the ITS literature
for population-based exam generation, rather than for student-tailored online curriculum planning,
is a new idea that we have not found in the literature. Nonetheless, in this work we adopt
a basic form of student model, the binary overlay model. This selection imposes two main
limitations. First, an ability may either be mastered or not, with no intermediate options possible.
And second, no incorrect knowledge may be expressed in the model. Incorporating non-binary
overlay models or perturbation models [CV13] to deal with these limitations will allow more
realistic student modeling.
Consider, for example, using models composed of belief vectors [BSW97], indicating
non-binary levels of abilities. Along with the incorporation of such models, a more refined
sufficiency predicate is also required. A reasonable option is to define the sufficiency predicate
by a vector of threshold ability levels per solution path of a question. Using such a predicate,
we can conclude that a student model answers a question correctly iff all the model ability
levels are above their matching thresholds in at least one solution path. Several options exist for
defining the threshold ability values per solution path and deducing them automatically. One
option is to define a range of ability levels depending on aspects of the ability-instantiation (e.g.,
characteristics of the arguments in the algebra domain). The threshold abilities of a solution
path will be the maximum ability levels of all ability instantiations on the path. Note that the
resulting predicate is still a disjunctive action landmark, but with threshold conditions on ability
levels instead of atomic binary levels.
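To make this concrete, a minimal sketch of such a threshold-based predicate follows; the ability names, levels, and threshold values are purely illustrative and not taken from our domains:

```python
# Illustrative sketch of a threshold-based sufficiency predicate.
# A student model now maps abilities to levels in [0, 1], and each solution
# path carries per-ability thresholds; the predicate remains a disjunction.

def answers_correctly(student_levels, solution_paths):
    """Return True iff at least one solution path has all of its
    ability thresholds met by the student's ability levels."""
    return any(
        all(student_levels.get(ability, 0.0) >= threshold
            for ability, threshold in path.items())
        for path in solution_paths
    )

# Two hypothetical solution paths for a single question.
paths = [
    {"unite_constants": 0.6, "divide_by_coefficient": 0.8},
    {"unite_variables": 0.7},
]
student = {"unite_constants": 0.9, "unite_variables": 0.5}
print(answers_correctly(student, paths))  # False: no path is fully satisfied
```

Setting every threshold to 1.0 and every ability level to 0 or 1 recovers the original binary overlay predicate as a special case.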
Another expansion of the sufficiency predicate involves lifting the assumption of determin-
ism and replacing the logical predicate with a probabilistic estimator. This seems to be a more
reasonable model of the true world, where even the most skilled students may sometimes fail to
1See advancesinap.collegeboard.org/english-history-and-social-science/historical-thinking-skills
answer a question, even if they have mastered the required skill. Somewhat similar models are
used in item response theory, where the predicted student performance is defined by the item
characteristic curve, mapping ability level to success probability. However, the unidimensional-
ity assumption of standard IRT models, by which a single latent ability parameter is used to
predict student performance, imposes severe limitations on our setting. A core characteristic of
the MOEG framework is that students differ in the abilities they have mastered. This means
that a student may be able to answer one question but not another, while a different student may
be able to answer the second but not the first. Collapsing the student model to a single ability
parameter will cancel this desirable characteristic of the framework. Therefore, incorporating
probabilistic IRT performance prediction into MOEG would reasonably require using the less
common multidimensional latent traits [Rec09]. In such a case, the dimensions of the latent trait
may coincide with our notion of abilities, preferably after their expansion to non-binary values.
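As a concrete illustration of this direction, the sketch below combines per-ability logistic item characteristic curves, in the spirit of multidimensional IRT, into a probabilistic success estimator. The slope parameter, the ability names, and the independence assumption between paths are simplifying assumptions of this sketch, not part of the framework:

```python
# Illustrative probabilistic sufficiency estimator: success probability is a
# logistic function of the gap between a student's ability levels and a
# solution path's thresholds, multiplied over the abilities of the path.
import math

def path_success_probability(levels, thresholds, slope=5.0):
    """Probability of executing one solution path, as the product of
    per-ability logistic curves centered at each threshold."""
    p = 1.0
    for ability, theta in thresholds.items():
        gap = levels.get(ability, 0.0) - theta
        p *= 1.0 / (1.0 + math.exp(-slope * gap))
    return p

def answer_probability(levels, solution_paths):
    """Probability of answering correctly via at least one path,
    assuming (simplistically) that the paths are independent."""
    p_fail = 1.0
    for path in solution_paths:
        p_fail *= 1.0 - path_success_probability(levels, path)
    return 1.0 - p_fail
```

A student whose levels are well above all thresholds obtains a success probability close to, but below, one, reflecting the observation that even skilled students occasionally fail.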
The method for acquiring the action landmarks of questions is another framework aspect
which may be improved. The algorithm for this purpose, presented in Figure 3.1, is simple
yet generic. It requires only the representation of solutions as paths in a search space and
nothing more regarding the specific domain. Although the algorithm is appealing in this respect,
and is theoretically applicable to any search-space domain, practical constraints may pose
difficulties. Consider cases in which very specific combinations of actions are required to reach
a solution and the ability set is considerably large. In such cases, the current method will require
a long time, perhaps even unreasonably so, to find solutions. A desirable expansion of the
landmark algorithm is to use a heuristic function in order to lead the search in an informed
manner. With a heuristic function at hand, a simple modification of the current algorithm
may be to select the next action weighted by the heuristic improvement of each action, either
stochastically or deterministically. This variation has the potential to shorten the algorithm's
runtime and to generate more “human-like” solutions, assuming a quality heuristic function.
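A sketch of such a heuristic-weighted selection step is given below; the heuristic and action-application interfaces are placeholders for illustration, not the actual implementation of the algorithm in Figure 3.1:

```python
# Illustrative heuristic-guided action selection for the landmark search:
# the next action is drawn stochastically, with probability proportional to
# how much it improves a heuristic estimate of distance to a solution.
import random

def select_next_action(state, actions, heuristic, apply_action):
    """Pick an action with probability proportional to its heuristic
    improvement, clipped at a small positive floor so that non-improving
    actions remain possible (preserving completeness of the search)."""
    h0 = heuristic(state)
    weights = []
    for a in actions:
        improvement = h0 - heuristic(apply_action(state, a))
        weights.append(max(improvement, 1e-3))
    return random.choices(actions, weights=weights, k=1)[0]
```

Replacing `random.choices` with an argmax over the weights yields the deterministic variant.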
Finally, we turn to discuss the methods used for specifying the student population and
sampling it. Recall that, for simplicity, we assumed independence between the abilities. This
simplification is rather crude and is generally expected to reflect reality quite inaccurately.
Consider a student, in our algebra domain, who has mastered the
abilities of uniting variable terms of all argument sign combinations. It would be reasonable to
expect that such a student has also mastered the simpler ability of uniting two positive constant
terms. Bayesian networks may be used to model generic ability dependencies in the student
population [GVP90].
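For illustration, a minimal two-ability network expressing such a prerequisite dependency can be sampled as follows; the structure and conditional probabilities are hypothetical:

```python
# Hypothetical two-node Bayesian network over abilities: mastery of the
# advanced ability (uniting signed variable terms) depends on mastery of
# the simpler prerequisite (uniting positive constant terms).
import random

def sample_student():
    basic = random.random() < 0.7   # P(basic mastered) = 0.7
    # Conditional probability table encodes the dependency:
    # mastering the advanced ability is far likelier given the basic one.
    p_adv = 0.5 if basic else 0.05
    advanced = random.random() < p_adv
    return {"unite_positive_constants": basic,
            "unite_signed_variables": advanced}
```

Sampling a population from this network, almost every student who mastered the advanced ability has also mastered the prerequisite, unlike under the independence assumption.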
7.3 Concluding Remarks
This paper presents a generic framework for automating exam composition, an important task in
the pedagogical process. Our work is driven by the philosophical question “what constitutes a
good exam?” Existing test-sheet composition methods answer that a good exam is one with a
high discrimination degree. But what hides behind this mysterious numeric value, assumed to be
a given feature per question? In order to assess how well an exam discriminates between students
of different proficiency levels, we define student models and algorithmically deduce their
performance on real exam questions. From this predicted performance, the discrimination
between student models, in the form of a grading order, may be obtained.
With a teacher-specified order of student models by proficiency, we may define what constitutes
a “good” exam. Under our approach, a good exam is one that orders students according to their
proficiency level, as defined by the teacher. We believe this is an important step towards making
AI techniques practical for improving education.
Bibliography
[ABTvdL91] Jos J Adema, Ellen Boekkooi-Timminga, and Wim J van der Linden.
Achievement test construction using 0–1 linear programming. European
Journal of Operational Research, 55(1):103–111, 1991.
[ABY85] John R Anderson, C Franklin Boyle, and Gregg Yost. The geometry tutor.
In IJCAI, pages 1–7, 1985.
[AHHI94] Arnold Alphen, Ruud Halfens, Arie Hasman, and Tjaart Imbos. Likert or
Rasch? Nothing is more applicable than good theory. Journal of Advanced
Nursing, 20(1):196–201, 1994.
[AJW98] Ronald D Armstrong, Douglas H Jones, and Zhaobo Wang. Optimization
of classical reliability in test construction. Journal of Educational and
Behavioral Statistics, 23(1):1–17, 1998.
[BA06] Dmitry I Belov and Ronald D Armstrong. A constraint programming
approach to extract the maximum number of non-overlapping test forms.
Computational Optimization and Applications, 33(2-3):319–332, 2006.
[Bir57] A Birnbaum. Efficient design and use of tests of a mental ability for
various decision-making problems. Randolph Air Force Base, Texas: Air
University, School of Aviation Medicine, 26, 1957.
[Bir58a] A Birnbaum. Further considerations of efficiency in tests of a mental ability.
Series report no. 17, Project no. 7755-23, USAF School of Aviation Medicine,
Randolph Air Force Base, 1958.
[Bir58b] A Birnbaum. On the estimation of mental ability. Series Rep, (15):7755–23,
1958.
[Bir68] A Birnbaum. Some latent trait models and their use in inferring an exami-
nee’s ability. Statistical theories of mental test scores, 1968.
[BL81] Mark A Barton and Frederic M Lord. An upper asymptote for the
three-parameter logistic item-response model. ETS Research Report Series,
1981(1):i–8, 1981.
[Blo84] Benjamin S Bloom. The 2 sigma problem: The search for methods of group
instruction as effective as one-to-one tutoring. Educational Researcher,
13(6):4–16, 1984.
[Boc72] R Darrell Bock. Estimating item parameters and latent ability when re-
sponses are scored in two or more nominal categories. Psychometrika,
37(1):29–51, 1972.
[Bru92] Peter L Brusilovsky. A framework for intelligent knowledge sequencing
and task sequencing. In Intelligent Tutoring Systems, pages 499–506.
Springer, 1992.
[Bru94] PL Brusilovskiy. The construction and application of student models in
intelligent tutoring systems. Journal of Computer and Systems Sciences
International, 32(1):70–89, 1994.
[BSH96] Joseph Beck, Mia Stern, and Erik Haugsjaa. Applications of AI in education.
Crossroads, 3(1):11–15, 1996.
[BSW97] J. Beck, M. Stern, and B.P. Woolf. Using the student model to control
problem difficulty. Courses and Lectures - International Centre for Mechanical
Sciences, pages 277–288, 1997.
[CV13] Konstantina Chrysafiadi and Maria Virvou. Student modeling approaches:
A literature review for the last decade. Expert Systems with Applications,
2013.
[DA09] RJ De Ayala. The Theory and Practice of Item Response Theory. The
Guilford Press, 2009.
[DZWF12] Hong Duan, Wei Zhao, Gaige Wang, and Xuehua Feng. Test-sheet composition
using analytic hierarchy process and hybrid metaheuristic algorithm
TS/BBO. Mathematical Problems in Engineering, 2012, 2012.
[EAAA08] El-Sayed M El-Alfy and Radwan E Abdel-Aal. Construction and analysis
of educational tests using abductive machine learning. Computers &
Education, 51(1):1–16, 2008.
[ER13] Susan E Embretson and Steven P Reise. Item response theory for psycholo-
gists. Psychology Press, 2013.
[GC05] Eduardo Guzman and Ricardo Conejo. Self-assessment in a feasible,
adaptive web-based testing system. IEEE Transactions on Education,
48(4):688–695, 2005.
[GK54] Leo A Goodman and William H Kruskal. Measures of association for
cross classifications. Journal of the American Statistical Association,
49(268):732–764, 1954.
[Glo89] Fred Glover. Tabu search - Part I. ORSA Journal on Computing,
1(3):190–206, 1989.
[Glo90] Fred Glover. Tabu search - Part II. ORSA Journal on Computing,
2(1):4–32, 1990.
[GM15] Omer Geiger and Shaul Markovitch. Algorithmic exam generation. In
Proceedings of the 24th International Conference on Artificial Intelligence,
pages 1149–1155. AAAI Press, 2015.
[Gro98] Norman E Gronlund. Assessment of student achievement. ERIC, 1998.
[GVP90] Dan Geiger, Thomas Verma, and Judea Pearl. Identifying independence in
Bayesian networks. Networks, 20(5):507–534, 1990.
[Hal52] David C Haley. Estimation of the dosage mortality relationship when the
dose is subject to error. Technical report, DTIC Document, 1952.
[HCYL08] Gwo-Jen Hwang, Hui-Chun Chu, Peng-Yeng Yin, and Ji-Yu Lin. An inno-
vative parallel test sheet composition approach to meet multiple assessment
criteria for national tests. Computers & Education, 51(3):1058–1072, 2008.
[HD09] Malte Helmert and Carmel Domshlak. Landmarks, critical paths and
abstractions: what’s the difference anyway? In ICAPS, pages 162–169,
2009.
[HLL06] Gwo-Jen Hwang, Bertrand MT Lin, and Tsung-Liang Lin. An effective
approach for test-sheet composition with large-scale item banks. Computers
& Education, 46(2):122–139, 2006.
[HLTL05] Gwo-Jen Hwang, Bertrand MT Lin, Hsien-Hao Tseng, and Tsung-Liang
Lin. On the development of a computer-assisted testing system with genetic
test sheet-generating approach. IEEE Transactions on Systems, Man, and
Cybernetics, Part C: Applications and Reviews, 35(4):590–594, 2005.
[HS73] JR Hartley and Derek H Sleeman. Towards more intelligent teaching
systems. International Journal of Man-Machine Studies, 5(2):215–236,
1973.
[HS85] Ronald K Hambleton and Hariharan Swaminathan. Item Response Theory:
Principles and Applications, volume 7. Springer Science & Business
Media, 1985.
[HT83] DL Harnisch and KK Tatsuoka. A comparison of appropriateness indices
based on item response theory. Applications of item response theory, pages
104–122, 1983.
[HVdL82] Ronald K Hambleton and Wim J Van der Linden. Advances in item
response theory and applications: An introduction. 1982.
[Hwa03] Gwo-Jen Hwang. A test-sheet-generating algorithm for multiple assess-
ment requirements. IEEE Transactions on Education, 46(3):329–337,
2003.
[HYY06] Gwo-Jen Hwang, Peng-Yeng Yin, and Shu-Heng Yeh. A tabu search
approach to generating test sheets for multiple assessment criteria. IEEE
Transactions on Education, 49(1):88–97, 2006.
[ISU14] Takatoshi Ishii, Pokpong Songmuang, and Maomi Ueno. Maximum clique
algorithm and its approximation for uniform test form assembly. IEEE
Transactions on Learning Technologies, 7(1):83–95, 2014.
[JG95] Waynne Blue James and Daniel L Gardner. Learning styles: Implications
for distance learning. New directions for adult and continuing education,
(67):19–31, 1995.
[Ken38] Maurice G Kendall. A new measure of rank correlation. Biometrika, pages
81–93, 1938.
[Ken45] Maurice G Kendall. The treatment of ties in ranking problems. Biometrika,
pages 239–251, 1945.
[KV10] Ravi Kumar and Sergei Vassilvitskii. Generalized distances between rank-
ings. In Proceedings of the 19th International Conference on World Wide
Web, pages 571–580. ACM, 2010.
[LBE90] Pedro Lira, M Bronfman, and J Eyzaguirre. Multitest II: A program for
the generation, correction, and analysis of multiple choice tests. IEEE
Transactions on Education, 33(4):320–325, 1990.
[Lor52] Frederic M Lord. The relation of test score to the trait underlying the test.
ETS Research Bulletin Series, 1952(2):517–549, 1952.
[LR79] Michael V Levine and Donald B Rubin. Measuring the appropriateness
of multiple-choice test scores. Journal of Educational and Behavioral
Statistics, 4(4):269–290, 1979.
[LST12] Huan-Yu Lin, Jun-Ming Su, and Shian-Shyong Tseng. An adaptive test
sheet generation mechanism using genetic algorithm. Mathematical Prob-
lems in Engineering, 2012, 2012.
[MB+00] Riichiro Mizoguchi, Jacqueline Bourdeau, et al. Using ontological
engineering to overcome common AI-ED problems. Journal of Artificial
Intelligence and Education, 11:107–121, 2000.
[McD67] Roderick P McDonald. Nonlinear factor analysis. Psychometric mono-
graphs, 1967.
[ND08] Loc Nguyen and Phung Do. Learner model in adaptive learning. World
Academy of Science, Engineering and Technology, 45:395–400, 2008.
[NO10] M.L. Nering and R. Ostini. Handbook of Polytomous Item Response Theory
Models. Routledge, 2010.
[ON06] Remo Ostini and Michael L. Nering. Polytomous item response theory
models. Number 144. Sage, 2006.
[Owe75] Roger J Owen. A bayesian sequential procedure for quantal response in
the context of adaptive mental testing. Journal of the American Statistical
Association, 70(350):351–356, 1975.
[PR13] Martha C Polson and J Jeffrey Richardson. Foundations of Intelligent
Tutoring Systems. Psychology Press, 2013.
[Rec09] Mark Reckase. Multidimensional Item Response Theory. Springer, 2009.
[Ric79] Elaine Rich. User modeling via stereotypes. Cognitive science, 3(4):329–
354, 1979.
[Ric83] Elaine Rich. Users are individuals individualizing user models. Interna-
tional Journal of Man-Machine Studies, 18(3):199–214, 1983.
[Saa80] Thomas L Saaty. The analytic hierarchy process: Planning, priority setting,
resource allocation. McGraw-Hill, New York, 1980.
[Sam72] Fumiko Samejima. A general model for free-response data. Psychometrika
Monograph Supplement, 1972.
[SGR12] Rohit Singh, Sumit Gulwani, and Sriram K Rajamani. Automatically
generating algebra problems. In AAAI, 2012.
[Sim08] Dan Simon. Biogeography-based optimization. IEEE Transactions on
Evolutionary Computation, 12(6):702–713, 2008.
[SMM04] Pramuditha Suraweera, Antonija Mitrovic, and Brent Martin. The role of
domain ontology in knowledge acquisition for ITSs. In Intelligent Tutoring
Systems, pages 207–216. Springer, 2004.
[Som62] Robert H Somers. A new asymmetric measure of association for ordinal
variables. American Sociological Review, pages 799–811, 1962.
[SP94] Valerie J Shute and Joseph Psotka. Intelligent tutoring systems: Past,
present, and future. Technical report, DTIC Document, 1994.
[SP97] Rainer Storn and Kenneth Price. Differential evolution–a simple and
efficient heuristic for global optimization over continuous spaces. Journal
of Global Optimization, 11(4):341–359, 1997.
[SSZ08] Myra Sadker, David Miller Sadker, and Karen Zittleman. Teachers, Schools,
and Society. McGraw Hill New York, 2008.
[The85] TJJM Theunissen. Binary programming and test design. Psychometrika,
50(4):411–420, 1985.
[TO01] David Thissen and Maria Orlando. Item Response Theory for Items Scored
in Two Categories. Lawrence Erlbaum Associates Publishers, 2001.
[Tra83] RE Traub. A priori considerations in choosing an item response model.
Applications of Item Response Theory, 57:70, 1983.
[Tuc46] Ledyard R Tucker. Maximum validity of a test with equivalent items.
Psychometrika, 11(1):1–13, 1946.
[vvLBT89] Wim J van der Linden and Ellen Boekkooi-Timminga. A maximin
model for IRT-based test design with practical constraints. Psychometrika,
54(2):237–247, 1989.
[Woo] Beverly Woolf. AI in education. In S. Shapiro, editor, Encyclopedia of
Artificial Intelligence. Wiley Interscience, New York.
[WPB01] Geoffrey I Webb, Michael J Pazzani, and Daniel Billsus. Machine learning
for user modeling. User Modeling and User-Adapted Interaction, 11(1-
2):19–29, 2001.
[WS70] Marilyn W Wang and Julian C Stanley. Differential weighting: A review
of methods and empirical studies. Review of Educational Research, pages
663–705, 1970.
[WWP+09] Feng-rui Wang, Wen-hong Wang, Quan-ke Pan, Feng-chao Zuo, and
JJ Liang. A novel online test-sheet composition approach for web-based
testing. In IEEE International Symposium on IT in Medicine & Education,
volume 1, pages 700–705, 2009.
[YKG10] Guangbing Yang, K Kinshuk, and S Graf. A practical student model for a
location-aware and context-sensitive personalized adaptive learning system.
In 2010 International Conference on Technology for Education (T4E),
pages 130–133. IEEE, 2010.
[YPC13] Li Yuan, Stephen Powell, and JISC CETIS. Moocs and open education:
Implications for higher education. Cetis White Paper, 2013.
Abstract

Assessing students' knowledge is an important task that challenges educators worldwide. Student assessment is required not only in order to assign appropriate grades, but also for diagnostic purposes, so that pedagogical resources can be focused on the main observed weaknesses. The prevalent assessment method is a uniform exam, which the students are asked to solve. Composing an exam from which a student's knowledge state can be assessed accurately is a complex task that educators face frequently.

The importance of the exam-composition task has recently grown with the development of two major educational technologies. The first is the expanding activity of online courses, such as those of Coursera, Khan Academy, and others, which make academic studies accessible to the masses in the information age. The second is the improvement in intelligent tutoring systems, which enable intelligent, adaptive guidance of students in real time.

Nowadays, exams are still composed by educators, mostly by hand. Several attempts have been made to automate the exam-composition process. Some of these attempts were based on a statistical theory for the design and analysis of exams called Item Response Theory. In a considerable portion of these works, exam questions are treated as atomic items characterized solely by a collection of numeric attributes. Such attributes include difficulty level, estimated solution time, discrimination degree (between students), and others. This representation allows the exam-composition problem to be formulated as an integer programming problem. The objective function is usually the overall discrimination degree of the exam, while the remaining attributes are used to define the problem constraints. A wide variety of optimization algorithms have been proposed for solving such formulations, including genetic algorithms, tabu search, the simplex algorithm, and swarm optimization.

In these works, assuming the attribute values of the questions are known, the exam-composition process is indeed performed automatically. Usually, however, there is no explicit reference to the origin of those numeric values, and they are regarded as input known in advance to the optimization problem. To apply these methods in practice, on the other hand, the values of all attributes of the candidate questions must first be determined. These methods may therefore be regarded as semi-automatic.

In this work we present a novel algorithmic framework for exam composition that requires a minimum of manual definitions from the teacher. We create a collection of real questions and determine algorithmically which cognitive abilities they test. We use student models to represent possible knowledge states, and infer their performance in answering the questions algorithmically.

We propose two different formulations of the exam-composition problem. In the first, the teacher is required to define an order relation over the various knowledge states, denoting a "more knowledgeable than" relationship, which serves as input to the algorithm. The algorithm searches the space of possible exams for one that reflects the desired order over the student population. In the second problem definition, the teacher defines a mapping from knowledge states to the absolute grades he considers appropriate. In this case the algorithm searches for an exam for which the grade curve of the student population is as close as possible to the teacher-defined mapping. For convenience, we refer to these two problem definitions as the "comparative problem" and the "absolute problem".

Building our algorithmic framework required overcoming several difficulties. First, we needed a method by which the teacher could provide a compact specification of the order relation, or alternatively of the grade mapping, according to the problem type. We assume that knowledge in the examined domain is represented by a set of abilities; accordingly, a student's knowledge state is represented by the subset of abilities the student has mastered. The number of possible knowledge states is exponential in the number of abilities, so an explicit definition of an order relation over them (or of a grade mapping) is infeasible. We therefore developed a method, based on an additive scoring scheme, that eases the definition of these inputs and serves us in both problem definitions. The basic idea is that the teacher assigns each ability an importance score in the examined domain; this definition uniquely determines a base score for every possible knowledge state, namely the sum of the importance scores of the abilities mastered in it. In the comparative problem, the order relation between knowledge states is then extracted directly from the base scores: a knowledge state is considered more knowledgeable than another if its base score is higher. In the absolute problem, an additional step is required to translate the base scores into the desired grades on the exam itself. In this case the teacher is required to set the mean and standard deviation of the desired grade curve, and the grade mapping is uniquely determined by normalizing the base scores to the corresponding Gaussian distribution.

Second, we needed to define a utility function over exams that would guide the algorithm through the space of possible exams. We defined two utility functions, one for each of the two problem definitions. In the absolute problem we compute the distance between the grade vector defined by the teacher and the grades that the evaluated exam induces on the knowledge states; the goal is to minimize the distance between these two grade vectors under the Manhattan norm (or another). In the comparative problem we use a measure of agreement (Kendall's tau) between the order relation defined by the teacher and the one induced by the exam; here, the goal is to maximize the agreement between these orders. In both problem definitions the space of knowledge states is exponential, and therefore a representative sample of knowledge states is used.

The greatest difficulty was determining the grade of a knowledge state on an exam. For this purpose we developed a novel method that employs planning and graph search. The method is applicable in procedural knowledge domains, such as algebra and geometry, where the abilities are computation or inference operations and the solution of a question is obtained through a sequence of operations. For every potential question in the given collection, we perform a graph search that computes a necessary and sufficient condition for solving the question. Since several different solution paths may exist, the condition is defined as a disjunction of factors, each representing the set of abilities along one solution path. To determine a model's success on a question, we check whether the set of abilities it has mastered satisfies the question's logical condition.

The exam-composition method we developed is demonstrated here on two knowledge domains common in secondary education. The first is solving linear algebraic equations with a single unknown, a domain taught in middle schools worldwide and considered basic mathematical knowledge. The second is solving trigonometric equations, a more advanced domain usually taught in the advanced tracks of high school. For both domains we defined and implemented a complete set of abilities for solving the exercises, simulating the basic steps a typical student in the domain performs. In algebra we built an automatic mechanism for generating exercises, while in trigonometry we relied on online sources to create the large question pool from which the exam is composed.

For both domains we carried out an empirical study examining the quality of the generated exams as well as the stability of the system with respect to its various parameters. To evaluate the exams, we used an evaluation function over a sample of knowledge states drawn independently of the sample used to define the algorithm's utility function, and significantly larger than it. We pitted four competitors against our algorithm. The results show that the exams generated by our algorithm are better than those of the competitors in both domains and under both problem definitions.

In addition, we conducted experiments examining the effect of exam length and of the size of the knowledge-state sample used by the utility function on output quality. The observed qualitative trend matched our expectations in both cases. As exam length grows, so does the algorithm's freedom to choose a variety of questions for the exam, and consequently the exams are of higher quality. Likewise, as the knowledge-state sample grows, the quality of the utility function rises, and with it the quality of the generated exams. A further set of experiments examined the stability of the system with respect to user-input parameters, in particular the knowledge-state population and the input determining the objective function. The experiments show that the system is stable with respect to these parameters.

Finally, we discuss the applicability of the method to academic domains beyond the two demonstrated. The method as a whole is applicable in any domain in which one can define a set of abilities, together with conditions, in terms of those abilities, that are necessary and sufficient for solving the questions. Automatic inference of these logical conditions is performed here over procedural domains, which we chose to represent as state spaces. Such domains include proofs in geometry, spatial trigonometry, motion problems, analytic geometry, and in fact nearly any domain in which questions are solved by a sequence of logical or computational inferences. Automatic inference of the solution conditions is also possible in other types of spaces using different methods, for example in problem types solved with automatic proving tools. In such domains the set of abilities can be defined as the set of lemmas, and the condition for solving a question determined according to the lemmas used in its various solutions. In other domains, where the conditions for answering a question are hard to determine automatically, it may still be possible to do so based on the set of solutions. In combinatorics, for example, given a collection of mathematical expressions constituting acceptable solutions, one can parse them syntactically with automatic tools and, from the syntactic structure, determine the abilities required for each solution path.