
Learning Representations of Student Knowledge and Educational Content

Siddharth Reddy [email protected]
Department of Computer Science, Cornell University, Ithaca, NY, USA
Knewton, Inc., 100 5th Avenue, New York, NY 10011, USA

Igor Labutov [email protected]
Department of Electrical and Computer Engineering, Cornell University, Ithaca, NY, USA

Thorsten Joachims [email protected]
Department of Computer Science, Cornell University, Ithaca, NY, USA

Abstract

Students in online courses generate large amounts of data that can be used to personalize the learning process and improve quality of education. The goal of this work is to develop a statistical model of students and educational content that can be used for a variety of tasks, such as adaptive lesson sequencing and learning analytics. We formulate this problem as a regularized maximum-likelihood embedding of students, lessons, and assessments from historical student-module interactions. Akin to collaborative filtering for recommender systems, the algorithm does not require students or modules to be described by features, but instead learns a representation using access traces. An empirical evaluation on large-scale data from Knewton, an adaptive learning technology company, shows that this approach predicts assessment results more accurately than baseline models and is able to discriminate between lesson sequences that lead to mastery and failure.

1. Introduction

The popularity of online education platforms has soared in recent years. Companies like Coursera and EdX offer Massive Open Online Courses (MOOCs) that attract millions of students and high-calibre instructors. Khan Academy has become a hugely popular repository of videos and interactive materials on a wide range of subjects. E-learning products offered by universities and textbook publishers are

Proceedings of the 31st International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).

also gaining traction. These platforms improve access to high-quality educational content for anyone connected to the Internet. As a result, people who would otherwise lack the opportunity are able to consume materials like video lectures and problem sets from courses offered at top universities. However, in these online environments learners often lack the personalized instruction and coaching that can potentially lead to significant improvements in educational outcomes. Furthermore, the educational content may be contributed by many authors without a formal underlying structure. Intelligent systems that learn about the educational properties of the content, guide learners through custom lesson plans, and quickly adapt through feedback could help learners take advantage of large and heterogeneous collections of educational content to achieve their goals.

The extensive literature on intelligent tutoring systems (ITS) and computer-assisted instruction (CAI) dates back to the 1960s. Early efforts focused on approximating the behavior of a human tutor through rule-based systems that taught students South American geography (Carbonell, 1970), electronics troubleshooting (Lesgold et al., 1988), and programming in Lisp (Corbett & Anderson, 1994). Today's online education platforms differ from early ITSes in their ability to gather data at scale, which facilitates the use of machine learning techniques to improve the educational experience. Relatively little academic work has been done to design systems that use the massive amounts of data generated by students in online courses to provide personalized learning tools. Learning and content analytics (Lan et al., 2014c), instructional scaffolding in educational games (O'Rourke et al., 2015), hint generation (Piech et al., 2015b), and feedback propagation (Piech et al., 2015a) are a few topics currently being explored in the personalized learning space.

Our aim is to build a domain-agnostic framework for modeling students and content that can be used in many online learning products for applications such as adaptive lesson sequencing, learning analytics, and content analytics. A common data source available in such products is a stream of interaction data, or access traces, that log student interactions with modules of course content. These access traces come in the form "Student A completed Lesson B" and "Student C passed Assessment D." Lessons are content modules that introduce or reinforce concepts; for example, an animation of cellular respiration or a paragraph of text on Newton's first law of motion. Assessments are content modules with pass-fail results that test student skills; for example, a true-or-false question halfway through a video lecture. By relying on a coarse-grained, binary assessment result, we are able to gracefully handle many types of assessments (e.g., free response and multiple choice) so long as a student response can be labelled as correct or incorrect.

We use access traces to embed students, lessons, and assessments together in a joint semantic space, yielding a representation that can be used to reason about the relationship between students and content (e.g., the likelihood of passing an assessment, or the skill gains achieved by completing a lesson). The model is evaluated on synthetic data, as well as large-scale real data from Knewton, an education technology company that offers personalized recommendations and activity analytics for online courses (Knewton, 2015). The data set consists of 2.18 million access traces from over 7,000 students, recorded in 1,939 classrooms over a combined period of 5 months.

2. Related Work

Our work builds on the existing literature in psychometric user modeling. The Rasch model estimates the probability of a student passing an assessment using latent concept proficiency and assessment difficulty parameters (Rasch, 1993). The two-parameter logistic item response theory (2PL IRT) model adds an assessment discriminability parameter to the result likelihood (Linden & Hambleton, 1997). Both models assume that a map from assessments to a small number of underlying concepts is known a priori. We propose a data-driven method of learning content representations that does not require a priori knowledge of the content-to-concept mapping. Though this approach sacrifices the interpretability of expert ratings, it has two advantages: 1) it does not require labor-intensive expert annotation of content, and 2) it can evolve the representation over time as existing content is modified or new content is introduced.

Lan et al. propose a sparse factor analysis (SPARFA) approach to modeling graded learner responses that uses assessment-concept associations, concept proficiencies, and assessment difficulty (Lan et al., 2014c). The algorithm does not rely on an expert concept map, but instead learns assessment-concept associations from the data. Multi-dimensional item response theory (Reckase, 2009) also learns these associations from the data. We extend the ideas behind SPARFA and multi-dimensional item response theory to include a model of student learning from lesson modules, which is a key prerequisite for recommending personalized lesson sequences.

Bayesian Knowledge Tracing (BKT) uses a Hidden Markov Model to model the evolution of student knowledge (which is discretized into a finite number of states) over time (Corbett & Anderson, 1994). Further work (González-Brenes et al., 2014) has modified the BKT framework to include the effects of lessons (described by features) through an input-output Hidden Markov Model. Similarly, SPARFA has been extended to model time-varying student knowledge and the effects of lesson modules (Lan et al., 2014a). Item response theory has also been extended to capture temporal changes in student knowledge (Ekanadham & Karklin, 2015; Sohl-Dickstein, 2014). Recurrent neural networks have been used to trace student knowledge over time and model lesson effects (Piech et al., 2015c). Similar ideas for estimating temporal student knowledge from binary-valued responses have appeared in the cognitive modeling literature (Prerau et al., 2008; Smith et al., 2004). We extend this work to a multi-dimensional setting where student knowledge lies in a continuous state space and lesson prerequisites modulate knowledge gains from lesson modules.

Our model also builds on previous work that uses temporal embeddings to predict music playlists (Moore et al., 2013). While Moore et al. focused on embedding objects (songs) in a metric space, we propose a non-metric embedding where the distances between objects (students, assessments, and lessons) are not symmetric, capturing the natural progression in difficulty of assessments and the positive growth of student knowledge.

3. Embedding Model

We now describe the probabilistic embedding model that places students, lessons, and assessments in a joint semantic space that we call the latent skill space. Students have trajectories through the latent skill space, while assessments and lessons are placed at fixed locations. Formally, a student is represented as a set of $d$ latent skill levels $\vec{s} \in \mathbb{R}_+^d$; a lesson module is represented as a vector of skill gains $\vec{\ell} \in \mathbb{R}_+^d$ and a set of prerequisite skill requirements $\vec{q} \in \mathbb{R}_+^d$; an assessment module is represented as a set of skill requirements $\vec{a} \in \mathbb{R}_+^d$.

Figure 1. A graphical model of student learning and testing, i.e., a continuous state space Hidden Markov Model with inputs and outputs. $\vec{s}$ = student knowledge state, $\vec{\ell}$ = lesson skill gains, $\vec{q}$ = lesson prerequisites, $\vec{a}$ = assessment requirements, and $R$ = result. An assessment can be completed by different students, and a student can complete different assessments, multiple times, with each attempt producing a separate result. Lessons can likewise be completed by different students multiple times. A student's knowledge state depends on their knowledge state at the previous timestep.

Students interact with lessons and assessments in the following way. First, a student can be tested on an assessment module with a pass-fail result $R \in \{\pm 1\}$, where the likelihood of passing is high when a student has skill levels that exceed the assessment requirements, and vice versa. Second, a student can work on lesson modules to improve skill levels over time. To fully realize the skill gains associated with completing a lesson module, a student must satisfy its prerequisites (fulfilling some of the prerequisites to some extent results in relatively smaller gains; see Equation 3 for details). Time is discretized such that at every timestep $t \in \mathbb{N}$, a student completes a lesson and may complete zero or more assessments. The evolution of student knowledge can be formalized as the graphical model in Figure 1, and the following subsections elaborate on the details of this model.

3.1. Modeling Assessment Results

For student $\vec{s}$, assessment $\vec{a}$, and result $R$,

$$\Pr(R = r) = \frac{1}{1 + \exp\left(-r \cdot \Delta(\vec{s}, \vec{a})\right)} \tag{1}$$

where $r \in \{\pm 1\}$ and $\Delta(\vec{s}, \vec{a}) = \frac{\vec{s} \cdot \vec{a}}{\|\vec{a}\|} - \|\vec{a}\| + \gamma_s + \gamma_a$. $\vec{s}$ and $\vec{a}$ are constrained to be non-negative (for details, see Section 4). A pass result is indicated by $r = 1$, and a fail by $r = -1$. The term $\frac{\vec{s} \cdot \vec{a}}{\|\vec{a}\|}$ can be rewritten as $\|\vec{s}\|\cos(\theta)$, where $\theta$ is the angle between $\vec{s}$ and $\vec{a}$; it can be interpreted as "relevant skill". The term $\|\vec{a}\|$ can be interpreted as general (i.e., not concept-specific) assessment difficulty. The expression $\frac{\vec{s} \cdot \vec{a}}{\|\vec{a}\|} - \|\vec{a}\|$ is visualized in Figure 2.

Figure 2. Geometric intuition underlying the parametrization of the assessment result likelihood (Equation 1). Only the length of the projection of the student's skill vector $\vec{s}$ onto the assessment vector $\vec{a}$ affects the pass likelihood of that assessment, meaning only the "relevant" skills (with respect to the assessment) should determine the result.

The bias term $\gamma_s$ is a student-specific term that captures a student's general (assessment-invariant and time-invariant) ability to pass; it can be interpreted as a measure of how well the student guesses correct answers. The bias term $\gamma_a$ is a module-specific term that captures an assessment's general (student-invariant and time-invariant) difficulty. $\gamma_a$ differs from the $\|\vec{a}\|$ difficulty term in that it is not bounded; see Section 4 for details. These bias terms are analogous to the bias terms used for modeling song popularity in (Chen et al., 2012).

Our choice of $\Delta$ differs from classic multi-dimensional item response theory, which uses $\Delta(\vec{s}, \vec{a}) = \vec{s} \cdot \vec{a} + \gamma_a$ where $\vec{s}$ and $\vec{a}$ are not bounded (although in practice, suitable priors are imposed on these parameters). We have also chosen to use the logistic link function instead of the normal ogive.
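To make Equation 1 concrete, here is a minimal NumPy sketch of the pass likelihood. This is an illustrative sketch, not the authors' implementation; the function name is ours, and the bias terms default to zero.

```python
import numpy as np

def pass_probability(s, a, gamma_s=0.0, gamma_a=0.0):
    """Pr(R = 1) under Equation 1 for a student with skill vector `s`
    and an assessment with requirement vector `a` (both non-negative).

    `gamma_s` and `gamma_a` are the unbounded student and assessment
    bias terms; they default to zero in this sketch.
    """
    a_norm = np.linalg.norm(a)
    # "relevant skill" minus general difficulty, plus bias terms
    delta = s.dot(a) / a_norm - a_norm + gamma_s + gamma_a
    return 1.0 / (1.0 + np.exp(-delta))
```

A student whose relevant skill exceeds the assessment's general difficulty ($\Delta > 0$) passes with probability above 0.5; the bias terms shift $\Delta$ without being constrained to be non-negative.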

3.2. Modeling Student Learning from Lessons

For a student $\vec{s}$ who worked on a lesson with skill gains $\vec{\ell}$ and no prerequisites at time $t+1$, the updated student state is

$$\vec{s}_{t+1} \sim \mathcal{N}\left(\vec{s}_t + \vec{\ell},\ \Sigma\right) \tag{2}$$

where the covariance matrix $\Sigma = I_d \sigma^2$ is diagonal. For a lesson with prerequisites $\vec{q}$,

$$\vec{s}_{t+1} \sim \mathcal{N}\!\left(\vec{s}_t + \vec{\ell} \cdot \frac{1}{1 + \exp\left(-\Delta(\vec{s}_t, \vec{q})\right)},\ \Sigma\right) \tag{3}$$

where $\Delta(\vec{s}_t, \vec{q}) = \frac{\vec{s}_t \cdot \vec{q}}{\|\vec{q}\|} - \|\vec{q}\|$. The intuition behind this equation is that the skill gain from a lesson should be weighted according to how well the student satisfies the lesson prerequisites. A student can compensate for a lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites. The same principle applies to satisfying assessment skill requirements. Figure 3 illustrates the vector field of skill gains possible for different students under different lesson prerequisites. Without prerequisites, the vector field is uniform.

Figure 3. $\vec{\ell}$ = skill gains, $\vec{q}$ = prerequisites. Three panels (Q = (0.30, 0.30), Q = (0.70, 0.30), and Q = (0.30, 0.70), each with L = (0.50, 1.00)) illustrate the vector field of skill gains possible for different students under different lesson prerequisites. A student can compensate for lack of prerequisites in one skill through excess strength in another skill, but the extent to which this trade-off is possible depends on the lesson prerequisites.
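The learning updates in Equations 2 and 3 can be sketched as follows. This is an illustrative sketch (the function name is ours); setting `sigma=0` recovers the deterministic mean update, and omitting `prereqs` recovers Equation 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def lesson_update(s_t, gains, prereqs=None, sigma=0.5):
    """Sample the next knowledge state after a lesson (Equations 2-3).

    `s_t` is the current skill vector, `gains` the lesson's skill-gain
    vector, `prereqs` the optional prerequisite vector, and `sigma` the
    standard deviation of the Gaussian learning update.
    """
    if prereqs is None:
        weight = 1.0  # no prerequisites: full gains (Equation 2)
    else:
        q_norm = np.linalg.norm(prereqs)
        delta = s_t.dot(prereqs) / q_norm - q_norm
        weight = 1.0 / (1.0 + np.exp(-delta))  # prerequisite satisfaction
    mean = s_t + gains * weight
    return rng.normal(mean, sigma)
```

When a student exactly meets the prerequisites ($\Delta = 0$), the weight is 0.5, so only half the nominal skill gains are realized in expectation; exceeding the prerequisites pushes the weight toward 1.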

Our model differs from (Lan et al., 2014a) in that we explicitly model the effects of prerequisite knowledge on gains from lessons. Lan et al. model gains from a lesson as an affine transformation of the student's knowledge state plus an additive term similar to $\vec{\ell}$.

4. Parameter Estimation

We compute maximum-likelihood estimates of the model parameters $\Theta$ by maximizing the following objective function:

$$L(\Theta) = \sum_{\mathcal{A}} \log \Pr(R \mid \vec{s}_t, \vec{a}, \gamma_s, \gamma_a) + \sum_{\mathcal{L}} \log \Pr(\vec{s}_{t+1} \mid \vec{s}_t, \vec{\ell}, \vec{q}) - \beta \cdot \lambda(\Theta) \tag{4}$$

where $\mathcal{A}$ is the set of assessment interactions, $\mathcal{L}$ is the set of lesson interactions, $\lambda(\Theta)$ is a regularization term that penalizes the $L_2$ norms of the embedding parameters (not the bias terms), and $\beta$ is a regularization parameter. Non-negativity constraints on the embedding parameters (not the bias terms) are enforced.

$L_2$ regularization is used to penalize the size of the embedding parameters to prevent overfitting. The bias terms are not bounded or regularized. This allows $-\|\vec{a}\| + \gamma_a$ to be positive for assessment modules that are especially easy, and $\frac{\vec{s} \cdot \vec{a}}{\|\vec{a}\|} + \gamma_s$ to be negative for students who fail especially often. We solve the optimization problem with box constraints using the L-BFGS-B algorithm (Zhu et al., 1997). We randomly initialize parameters and run the iterative optimization until the relative difference between consecutive objective function evaluations is less than $10^{-3}$. Averaging validation accuracy over multiple runs during cross-validation reduces sensitivity to the random initialization (since the objective function is non-convex).
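A minimal sketch of this estimation procedure, assuming SciPy's L-BFGS-B implementation: it fits only student and assessment embeddings (no lessons or bias terms) on a toy data set of four assessment interactions. The data, variable names, and regularization constant are illustrative, not the paper's.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: (student index, assessment index, result in {+1, -1}).
interactions = [(0, 0, 1), (0, 1, -1), (1, 0, 1), (1, 1, 1)]
n_students, n_assessments, d = 2, 2, 2
beta = 0.01  # regularization constant (illustrative value)

def unpack(theta):
    S = theta[:n_students * d].reshape(n_students, d)
    A = theta[n_students * d:].reshape(n_assessments, d)
    return S, A

def neg_log_likelihood(theta):
    """Negative of Equation 4, restricted to assessment interactions."""
    S, A = unpack(theta)
    nll = beta * np.sum(theta ** 2)  # L2 penalty on embedding parameters
    for i, j, r in interactions:
        a_norm = np.linalg.norm(A[j]) + 1e-9
        delta = S[i].dot(A[j]) / a_norm - a_norm
        nll += np.logaddexp(0.0, -r * delta)  # -log Pr(R = r), stable form
    return nll

theta0 = np.random.default_rng(0).uniform(0.1, 1.0, (n_students + n_assessments) * d)
# Box constraints enforce non-negativity of the embedding parameters.
result = minimize(neg_log_likelihood, theta0, method="L-BFGS-B",
                  bounds=[(0, None)] * theta0.size)
S, A = unpack(result.x)
```

In the full model, $\Theta$ also includes lesson gains, prerequisites, and the unbounded bias terms, and the lesson interactions contribute Gaussian log-densities to the objective.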

5. Experiments on Synthetic Data

To verify the correctness of our model and to illustrate the properties of the embedding geometry that the model captures, we conducted a series of experiments on small, synthetically-generated interaction histories. Each scenario is intended to demonstrate a different feature of the model (e.g., recovering student knowledge and assessment requirements in the absence of lessons, or recovering sensible skill gain vectors for different lessons). For the sake of simplicity, the embeddings are made without bias terms. The figures shown are annotated versions of plots made by our embedding software.

5.1. Recovering Assessments

The embedding can recover intuitive assessment requirements and student skill levels from small histories without lessons. See Figures 7 and 8 for examples. These examples also illustrate the fact that a multidimensional embedding is more expressive than a one-dimensional embedding, i.e., increasing the embedding dimension improves the model's ability to capture the dynamics of complicated scenarios.

5.2. Recovering Lessons

The embedding can recover orthogonal skill gain vectors for lessons that deal with two different skills. See Figure 9 for an example in two dimensions.

The embedding can recover sensible lesson prerequisites. In Figure 10, we recover prerequisites that explain a scenario where a strong student realizes significant gains from a lesson while weaker students realize nearly zero gains from the same lesson.

6. Experiments on Online Course Data

We use data processed by Knewton, an adaptive learning technology company. Knewton's infrastructure uses student-module access traces to generate personalized recommendations and activity analytics for partner organizations with online learning products. The data describe interactions between college students and two science textbooks. Both data sets are filtered to eliminate students with fewer than five lesson interactions and content modules with fewer than five student interactions. To avoid spam interactions and focus on the outcomes of initial student attempts, we only consider the first interaction between a student and an assessment (subsequent interactions between that student and assessment are ignored). Each content module in our data has both a lesson and an assessment component. To simulate a more general setting where the sets of lesson and assessment modules are disjoint, each module is randomly assigned to the set of lessons or the set of assessments. We performed our experiments several times using different random assignments, and observed that the results did not change significantly.
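The filtering steps above (first attempts only, then dropping sparse students and modules) can be sketched as follows. The trace tuple layout and function name are illustrative, not the paper's actual schema.

```python
from collections import Counter

def filter_traces(traces, min_count=5):
    """Preprocess a list of access traces as described above.

    Each trace is a (student_id, module_id, timestamp, result) tuple;
    this field layout is illustrative.
    """
    # Keep only the first interaction between a student and a module.
    firsts = {}
    for student, module, t, result in sorted(traces, key=lambda x: x[2]):
        firsts.setdefault((student, module), (student, module, t, result))
    traces = list(firsts.values())

    # Iteratively drop students/modules with too few interactions,
    # since removing one can push the other below the threshold.
    while True:
        by_student = Counter(s for s, m, t, r in traces)
        by_module = Counter(m for s, m, t, r in traces)
        kept = [x for x in traces
                if by_student[x[0]] >= min_count and by_module[x[1]] >= min_count]
        if len(kept) == len(traces):
            return kept
        traces = kept
```

The iterative loop matters because dropping a sparse student can leave a module below the threshold, and vice versa; the loop runs until the kept set is stable.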

Figure 4. The directed graph of module sequences for all students in Book A. $V$ = set of assessment and lesson modules; $(n_1, n_2) \in E$ with edge weight $k$ if $k$ students had a transition from module $n_1$ to $n_2$ in their access traces. This graph shows that module sequences are heavily directed by the order imposed by the textbook, teachers, or a recommender system.

Figure 5. Distribution of the lengths of student histories (number of interactions) in the Book A data set.

Figure 6. Distribution of the number of assessment interactions at each timestep in the Book A data set. Long chains of assessment interactions are rare, implying that students complete lesson modules with consistent frequency.


Table 1. Lesion analysis showing which components of an embedding contribute most to prediction accuracy (AUC). Y/N flags indicate which parameters are included (student $s$, assessment $a$, lesson gains $\ell$, prerequisites $q$, bias terms $\gamma$); "Rand." and "Last" denote the student history truncation style.

| # | d | s | a | ℓ | q | γ | A Rand. Train | A Rand. Test | A Last Train | A Last Test | B Rand. Train | B Rand. Test | B Last Train | B Last Test |
|---|---|---|---|---|---|---|---------------|--------------|--------------|-------------|---------------|--------------|--------------|-------------|
| 1 | 1 | Y | N | N | N | N | 0.659 | 0.605 | 0.659 | 0.626 | 0.723 | 0.696 | 0.723 | 0.686 |
| 2 | 1 | N | Y | N | N | N | 0.743 | 0.734 | 0.743 | 0.737 | 0.716 | 0.712 | 0.715 | 0.661 |
| 3 | 2 | Y | Y | N | N | N | 0.994 | 0.687 | 0.994 | 0.687 | 0.992 | 0.659 | 0.991 | 0.614 |
| 4 | 2 | Y | Y | N | N | Y | 0.995 | 0.673 | 0.994 | 0.705 | 0.993 | 0.745 | 0.992 | 0.726 |
| 5 | 2 | Y | Y | Y | N | N | 0.900 | 0.713 | 0.897 | 0.730 | 0.896 | 0.733 | 0.897 | 0.696 |
| 6 | 2 | Y | Y | Y | N | Y | 0.873 | 0.741 | 0.872 | 0.756 | 0.869 | 0.787 | 0.871 | 0.748 |
| 7 | 2 | Y | Y | Y | Y | N | 0.898 | 0.709 | 0.898 | 0.727 | 0.902 | 0.742 | 0.898 | 0.694 |
| 8 | 2 | Y | Y | Y | Y | Y | 0.890 | 0.732 | 0.889 | 0.753 | 0.872 | 0.787 | 0.882 | 0.749 |

6.1. Data

Here, we give summary statistics of the data sets we use to evaluate our model.

6.1.1. Book A

This data set was collected from 869 classrooms from January 1, 2014 through June 1, 2014. It contains 834,811 interactions, 3,471 students, 3,374 lessons, 3,480 assessments, and an average assessment pass rate of 0.712. See Figures 4, 5, and 6 for additional data exploration.

6.1.2. Book B

This data set was collected from 1,070 classrooms from January 1, 2014 through June 1, 2014. It contains 1,349,541 interactions, 3,563 students, 3,843 lessons, 3,807 assessments, and an average assessment pass rate of 0.693.

6.2. Assessment Result Prediction

We evaluate the embedding model on the task of predicting results of held-out assessment interactions using the following scheme:

• We use ten-fold cross-validation to compute the validation accuracies of variations of the embedding model.

• On each fold, we train on the full histories of 90% of students and the truncated histories of 10% of students, and validate on the assessment interactions immediately following the truncated histories.

• Truncations are made at random locations in student histories, or just before the last batch of assessment interactions for a given student (maximizing the size of the training set).

We start by constructing two baseline models that predict the student pass rate and the assessment pass rate, respectively. The student baseline is equivalent to a one-dimensional embedding of only students that is not regularized or bounded (see row 1 of Table 1). Analogously, the assessment baseline is equivalent to a one-dimensional embedding of only assessments that is not regularized or bounded (see row 2 of Table 1). We then gradually add components to the embedding model to examine their effects on prediction accuracy. Initially, we omit lessons and bias terms from the embedding, and only consider students and assessments. We progressively include lesson parameters $\vec{\ell}$ without prerequisites, prerequisite parameters $\vec{q}$ for lessons, and bias terms $\gamma$. Each variant of the model corresponds to a row in Table 1. Our performance metric is area under the ROC curve (AUC), which measures the discriminative ability of a binary classifier that assigns probabilities to class membership.
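The student-pass-rate baseline and the AUC metric can be sketched as follows. This is illustrative: the rank-sum AUC below assumes untied scores to keep the sketch short, and the interaction tuple layout is ours.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U)
    statistic; assumes no tied scores for brevity."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def student_pass_rate_baseline(train, held_out):
    """Score each held-out (student, assessment, result) interaction
    with that student's pass rate in the training interactions."""
    rates = {}
    for student, _, r in train:
        passed, total = rates.get(student, (0, 0))
        rates[student] = (passed + (r == 1), total + 1)
    return [rates[s][0] / rates[s][1] for s, _, _ in held_out]
```

The assessment baseline is symmetric: score each held-out interaction with the assessment's training pass rate instead.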

Table 1 shows AUC performance on the assessment result prediction task for different data sets (Book A and Book B), student history truncation styles (random and last), and models. All embeddings on both data sets use the default parameter values σ² = 0.5 and β = 10⁻⁶, which were selected in exploratory experiments. All p-values are computed using Welch's t-test (Welch, 1947) for the results in the (Book A, Last, Test) column. From these results, we observe the following:

• Including lessons in the embedding improves performance significantly (p = 0.003 for rows 3 vs. 5, p = 0.0005 for rows 4 vs. 6).

• Including prerequisites gives a modest performance gain on Book B and a statistically insignificant decrease in performance on Book A (p = 0.913 for rows 5 vs. 7, p = 0.744 for rows 6 vs. 8).

• Including bias terms gives a large performance gain (p = 0.02 for rows 5 vs. 6, p = 0.028 for rows 7 vs. 8).

• The assessment baseline performs surprisingly well on Book A (see row 2).



Figure 7. A one-dimensional embedding, where a single latent skill is enough to explain the data. The key observation here is that the model recovered positive skill gains for L1, and "correctly" arranged students and assessments in the latent space. Initially, Carter fails both assessments, so his skill level is behind the requirements of both assessments. Lee passes A1 but fails A2, so his skill level is beyond the requirement for A1, but behind the requirement for A2. In an effort to improve their results, Lee and Carter complete lesson L1 and retake both assessments. Now Carter passes A1, but still fails A2, so his new skill level is ahead of the requirement for A1 but behind the requirement for A2. Lee passes both assessments, so his new skill level exceeds the requirements for A1 and A2. This clear difference in results before and after completing lesson L1 implies that L1 had a positive effect on Lee and Carter's skill levels, hence the non-zero skill gain vector recovered for L1.


Figure 8. A two-dimensional embedding, where an intransitivity in assessment results requires more than one latent skill to explain. The key observation here is that the assessments are embedded on two different axes, meaning they require two completely independent skills. This makes sense, since student results on A1 are uncorrelated with results on A2. Fogell fails both assessments, so his skill levels are behind the requirements for A1 and A2. McLovin passes both assessments, so his skill levels are beyond the requirements for A1 and A2. Evan and Seth are each able to pass one assessment but not the other. Since the assessments have independent requirements, this implies that Evan and Seth have independent skill sets (i.e., Evan has enough of skill 2 to pass A2 but not enough of skill 1 to pass A1, and Seth has enough of skill 1 to pass A1 but not enough of skill 2 to pass A2).


Figure 9. We replicate the setting in Figure 8, then add two new students, Slater and Michaels, and two new lesson modules, L1 and L2. Slater is initially identical to Evan, while Michaels is initially identical to Seth. Slater reads lesson L1, then passes assessments A1 and A2. Michaels reads lesson L2, then passes assessments A1 and A2. The key observation here is that the skill gain vectors recovered for the two lesson modules are orthogonal, meaning they help students satisfy completely independent skill requirements. This makes sense, since initially Slater was lacking in skill 1 while Michaels was lacking in skill 2, but after completing their lessons they passed their assessments, showing that they gained from their respective lessons what they were lacking initially.


Figure 10. We replicate the setting in Figure 8, then add a new assessment module A3 and a new lesson module L1. All students initially fail assessment A3, then read lesson L1, after which McLovin passes A3 while everyone else still fails A3. The key observation here is that McLovin is the only student who initially satisfies the prerequisites for L1, so he is the only student who realizes significant gains.


One issue that may have affected the findings is the biased nature of student paths, which has been discussed by González-Brenes et al. (2014). In the data, we observe that student paths are heavily directed along common routes through modules. We conjecture that this bias dulls the effect of including prerequisites in the embedding, allowing the assessment baseline to perform well in certain settings, and causing the inclusion of bias terms to give a large performance boost. Most students approach an assessment with the same background knowledge, so an embedding that captures the variation among students who work on the same assessment is less valuable. In a regime where students who work on an assessment come from a variety of skill backgrounds, the embedding may outperform the baselines by a wider margin.

Here, we explore the parameter space of the embedding model by varying the regularization constant β, learning update variance σ², and embedding dimension d:

• From Figure 11, we find that regularization via β is not necessary and that any small value of β gives good performance. Our conjecture is that the Gaussian lesson model already acts as a regularizer. Results for varying d are shown in Figure 12. In summary, we find that increasing the embedding dimension d substantially improves performance for embedding models without bias terms, but has little effect on performance for embeddings with bias terms. The former is expected, since without bias terms the embedding itself must model general student passing ability and general assessment difficulty.

• From Figure 13, very small and very large σ perform poorly. For very large σ, the Gaussian steps caused by lesson interactions have so much variance that they might as well not occur, causing the embedding with lessons to effectively degenerate into an embedding without lessons. We see this occur when performance for "d = 2, with bias" approaches that of "d = 2, without lessons" as σ is increased.

Here, we measure the sensitivity of model performance to the size of the training set:

• Performance is mostly affected by a student's recent history. We see this in the plateau of validation accuracy once a student's history is extended more than fifty interactions into the past in Figure 19.

• The number of full student histories in the training set has a strong effect on performance (via the quality of module embeddings). We see this in the positive relationship between validation accuracy and the number of full histories in Figure 20.

Figure 11. Validation AUC as a function of the regularization constant β for four models: d = 2 with bias, d = 2 without bias, d = 1 students only, and d = 1 assessments only. Book A, fixed student history truncations, σ² = 0.5. Note that a high regularization constant for "d = 2, with bias" simulates a one-parameter logistic item response theory (1PL IRT) model.

Figure 12. Validation AUC as a function of embedding dimension d ∈ {2, 5, 10, 20, 50} for β ∈ {10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻²}, with bias terms (top panel) and without bias terms (bottom panel), alongside the "d = 1, only assessments" baseline. Book A, fixed student history truncations, σ² = 0.5.

Figure 13. Validation AUC as a function of the learning update variance σ² for four models: d = 2 with bias, d = 2 without lessons, d = 1 students only, and d = 1 assessments only. Book A, fixed student history truncations, β = 10⁻⁶.


Figure 14. Student bias γs vs. student pass rate. Student pass rate appears to be the result of passing γs through a logistic function, which corresponds to our model of assessment results (see Equation 1).

Figure 15. Assessment bias minus assessment difficulty vs. assessment pass rate. Assessment pass rate appears to be the result of passing −‖~a‖ + γa through a logistic function, which corresponds to our model of assessment results (see Equation 1).

6.3. Qualitative Evaluation

Here, we explore the recovered model parameters of a two-dimensional embedding of the Book A data set. In Figures 14 and 15, we see that the recovered student and assessment parameters are tied to student and assessment pass rates in the training data. In Figure 16, we see that the learning signals aggregated across students grow over time (i.e., students are generally learning and improving their skills over time); Figure 18 gives an example of one of these learning signals, or "student trajectories". Together, these figures demonstrate the ability of the embedding to recover sensible model parameters.

6.4. Lesson Sequence Discrimination

The ability to predict future performance of students on assessments, while a useful metric for evaluating the learned

Figure 16. Magnitude of classroom skills over time. Students' skill levels, i.e. the Frobenius norm of the matrix of embeddings of all students at a given time t, improve over time.

Figure 17. Lesson prerequisite magnitude ‖~q‖ vs. lesson skill gain magnitude ‖~ℓ‖ (see Equation 3). The reason for the shape of this plot is unclear.

Figure 18. A two-dimensional embedding of a student over time. White indicates where the student starts, and black indicates where the student ends. The variance of the learning update σ2 affects the tortuosity of student trajectories: smaller σ leads to paths that show less student forgetting (i.e. do not double back on themselves), while larger σ allows for more freedom.


Figure 19. Sensitivity to depth of student history. Area under ROC curve is computed for assessment interactions at T ≈ 201 after training from T − depth to T. For students who do not have assessment interactions at T = 201, we select the soonest T > 201 such that the student has assessment interactions. Performance is mostly affected by a student's recent history: validation accuracy plateaus sharply once a student's history is extended more than 50 interactions into the past.

Figure 20. Sensitivity to number of full student histories. Area under ROC curve is computed for 10% of students, using fixed history truncations. The number of full student histories in the training set has a strong effect on performance (via the quality of module embeddings), as seen in the positive relationship between validation accuracy and the number of full histories.

Figure 21. A schematic diagram of a bubble, where triangles are lessons and squares are assessments. The green path is the recommended path (note that recommendations are personalized to each student).

embedding, does not address the more important task of adaptive tutoring via customized lesson sequence recommendation. We introduce a surrogate task for evaluating the sequence recommendation performance of the model, based entirely on observational data of student interactions, by assessing the model's ability to recommend "productive" paths amongst several alternatives.

The size of the data set creates a unique opportunity to leverage the variability in learning paths to simulate the setting of a controlled experiment. For this evaluation, we use a larger version of the Book A data set examined in Section 6.2, containing 14,707 students and 14,327 content modules. We find that the data contains many instances of student paths that share the same lesson module at the beginning and the same assessment module at the end, but contain different lessons along the way. We call these instances bubbles (see Figure 21 for an example); they present themselves as a sort of natural experiment on the relative merits of two different learning progressions. We can thus use these bubbles to evaluate the ability of an embedding to recommend a learning sequence that leads to success, as measured by the relative performance of students who take the recommended vs. the not-recommended path to the assessment module at the end of the bubble.

We use the full histories of 70% of students to embed lesson and assessment modules, then train on the histories of held-out students up to the beginning of a bubble. The lesson sequence for a student is then played over the initial student embedding, using the learning update (Equation 3) to compute an expected student embedding at the end of the bubble (which can be used to predict the passing likelihood for the final assessment using Equation 1). The path that leads the student to a higher pass likelihood on the final assessment is the "recommended" path. Our performance measure is E[(E[R′] − E[R]) / E[R]], where R′ ∈ {0, 1} is the outcome at the end of the recommended path and R ∈ {0, 1} is the outcome at the end of the other path (0 is failing and 1 is passing). This measure can be interpreted as the "expected gain" (averaged over many bubbles) from taking recommended paths, or how "successful" the paths recommended


Figure 22. Expected gain from taking the recommended path vs. a threshold on the minimum absolute difference between the true pass rates of the two bubble paths, for 1-NN, 3-NN, and 5-NN propensity score matching, no matching, and random recommendations. The y-axis ticks on the left correspond to the gain curves, and the gray bins (right y-axis) show the number of bubbles that meet the threshold on the x-axis. Bubbles are filtered to meet the following criteria: at least ten students take each branch, each branch must contain at least two lessons, and both branches must contain the same number of lessons.

by the model are when compared to the alternative.
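The expected-gain measure above can be sketched as a simple computation over bubble outcomes. This is a minimal illustrative sketch; the function and variable names are our own, not from the paper.

```python
import numpy as np

def expected_gain(recommended_outcomes, alternative_outcomes):
    """Expected relative gain (E[R'] - E[R]) / E[R] for one bubble.

    recommended_outcomes: 0/1 results of students on the recommended path
    alternative_outcomes: 0/1 results of students on the other path
    """
    pass_rec = np.mean(recommended_outcomes)  # estimate of E[R']
    pass_alt = np.mean(alternative_outcomes)  # estimate of E[R]
    return (pass_rec - pass_alt) / pass_alt

# The reported metric averages the per-bubble gains over many bubbles
# (toy outcome lists here, purely for illustration).
gains = [expected_gain([1, 1, 0, 1], [1, 0, 0, 1]),
         expected_gain([1, 1, 1, 0], [0, 1, 0, 1])]
overall = float(np.mean(gains))
```

Note the measure is undefined when no student passes the alternative path; in practice such bubbles would need to be filtered out or smoothed.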

This observational study is potentially confounded by many hidden variables. For example, it may be that one group of students systematically takes recommended paths while another group of students does not, leading to results at the end of a bubble that are mostly dictated by the teachers directing the groups, or other student-specific hidden factors, rather than path quality. To best approximate the settings of a randomized controlled trial in our observational study, we use the standard propensity score matching approach for de-biasing observational data (Rosenbaum & Rubin, 1983; Caliendo & Kopeinig, 2008). The key idea behind propensity score matching is to subset the observed data in a way that balances the distribution of the features ("hidden variables") describing subjects in the two conditions, as would be expected in a randomized experiment. The validity of any conclusion drawn from the observational data de-biased in this way hinges on the assumption that all confounding variables that determine self-selection have been accountedted for in the features prior to matching. In this study, we hypothesize that the set of all lesson modules and assessment modules (with outcomes) that the learner attempted throughout his or her duration in the online system is sufficient to compensate for any self-selection in the taken learning paths. Formally, we represent learners in a feature space X such that Xij ∈ {−1, 0, 1}, where Xij = 1 if student i passed module j (lessons are always "passed"), Xij = 0 if student i has not completed module j, and Xij = −1 if student i failed module j.

We use PCA to map X to a low-dimensional feature space where students are described by 1,000 features, which capture 80% of the variance in the original 14,327 features. A logistic regression model with L2 regularization is used to estimate the probability of a student following the recommended branch of a bubble, i.e. the propensity score, given the student features (the regularization constant is selected using cross-validation to maximize average log-likelihood on held-out students). Within each bubble, students who took their recommended branch are matched with their nearest neighbors (by absolute difference in propensity scores) from the group of students who did not take their recommended branch. Matching is done with replacement (so the same student can be selected as a nearest neighbor multiple times) to improve matching quality, trading off bias for variance. Multiple nearest neighbors can be matched (we examine the effect of varying the number of nearest neighbors), trading off variance for bias.
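The matching pipeline above (PCA, then a logistic propensity model, then nearest-neighbor matching with replacement) can be sketched as follows. This is an assumption-laden toy version: the matrix sizes, random seed, and helper names are ours, and we use scikit-learn defaults rather than the paper's cross-validated regularization constant.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-in for the student-by-module matrix X with entries in {-1, 0, 1}
# (1 = passed, 0 = not attempted, -1 = failed); sizes are illustrative only.
X = rng.choice([-1, 0, 1], size=(200, 50))
took_recommended = rng.integers(0, 2, size=200)  # "treatment" indicator

# Reduce dimensionality, then fit a propensity model P(treated | features).
features = PCA(n_components=10, random_state=0).fit_transform(X)
propensity = LogisticRegression(C=1.0, max_iter=1000).fit(
    features, took_recommended).predict_proba(features)[:, 1]

def match(propensity, treated, k=1):
    """Match each treated student to its k nearest control students by
    absolute difference in propensity score, with replacement."""
    treated_idx = np.flatnonzero(treated == 1)
    control_idx = np.flatnonzero(treated == 0)
    pairs = {}
    for i in treated_idx:
        dist = np.abs(propensity[control_idx] - propensity[i])
        pairs[i] = control_idx[np.argsort(dist)[:k]].tolist()
    return pairs

matches = match(propensity, took_recommended, k=3)
```

Comparing outcomes within these matched pairs, rather than across all students, is what approximates the randomized-experiment setting.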

Naturally, our evaluation metric of gain in the pass rate from following a recommended path depends strongly on the relative merits of the recommended and alternative paths. Figure 22 shows that the two-dimensional embedding model with lessons, prerequisites, and bias terms (the same configuration as row 8 in Table 1) is able to recommend more successful paths when there is a significant difference between the quality of the two paths, as measured by the absolute difference in pass rates between the two paths (regardless of the choice of propensity score matching).

7. Outlook and Future Work

7.1. Model Extensions

An issue with our model is that we cannot embed new modules that do not have any existing access traces. One way to solve this cold start problem is to jointly embed modules with semantic tags, i.e. to impose a prior distribution on the locations of modules with certain tags. This approach has been previously explored in the context of music playlist prediction (Moore et al., 2012). Embedding with tags has the added benefit of making the dimensions of the latent skill space more interpretable, as demonstrated in (Lan et al., 2014b).

Another approach to solving the cold start problem is to use an expert content-to-concept map to impose a prior on module embeddings that captures (1) the grouping of modules by underlying concept and (2) the prerequisite relationships between concepts (which translate into prerequisite relationships between modules). The objective function (recall Section 4) would penalize the cosine distance between modules governed by the same concept, and reward arrangements of modules that reflect prerequisite edges in the concept graph.
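The first half of such a penalty (same-concept modules pulled together via cosine distance) might look like the following. This is our own illustrative sketch of one possible term, not the paper's objective; the function names and the pairwise-sum form are assumptions.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 minus cosine similarity; 0 when u and v point the same way.
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def concept_prior_penalty(module_embeddings, concept_of, weight=1.0):
    """Penalize cosine distance between embeddings of modules that the
    expert content-to-concept map assigns to the same concept."""
    total = 0.0
    names = list(module_embeddings)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if concept_of[a] == concept_of[b]:
                total += cosine_distance(module_embeddings[a],
                                         module_embeddings[b])
    return weight * total
```

A second term rewarding prerequisite edges in the concept graph could be added analogously, e.g. by encouraging a module's embedding to have a large inner product with its prerequisites' directions.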


The variance parameter σ2 in the learning update (Equation 2) can be used to model offline learning and forgetting. Though we treat it as a constant across students in earlier experiments, it can also be estimated as a student-specific parameter and as a function of the real time elapsed between lesson interactions.

The forgetting effect, where student knowledge diminishes during the gap between interactions, may be modeled as a penalty in the skill gains from a lesson that increases with the time elapsed since the previous interaction. A substantially similar approach has been explored in (Lan et al., 2014a).
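One simple instantiation of such a penalty, chosen by us for illustration rather than taken from the paper, is to shrink a lesson's skill gain exponentially in the time elapsed since the previous interaction:

```python
import numpy as np

def decayed_gain(lesson_gain, elapsed_time, decay_rate=0.1):
    """Illustrative forgetting penalty: exponentially shrink the skill
    gain from a lesson as the gap since the last interaction grows."""
    return np.exp(-decay_rate * elapsed_time) * np.asarray(lesson_gain)
```

With zero elapsed time the full gain is applied; with a long gap the lesson contributes little, mimicking forgetting between sessions.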

7.2. Adaptive Lesson Sequencing

Given a student's current skill levels and a set of assessments the student wants to pass, what is the optimal sequence of lessons for the student? We can formulate the problem in two ways: specify a minimum pass likelihood for the goal assessments and find the shortest path to mastery, or specify a time constraint and find the path that leads to the highest pass likelihood for the goal assessments while not exceeding the prescribed length. Both problems can be tackled by using a course embedding to specify a Markov Decision Process (Bellman, 1957), where the state is given by the student embedding, the set of possible actions corresponds to the set of lesson modules that the student can work on, the transition function is the learning update (Equation 3), and the reward function is the likelihood of passing all goal assessments (Equation 1).
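The planning formulation above can be sketched, under heavy simplification, as a greedy one-step planner over the embedding model. The `pass_likelihood` and `apply_lesson` functions below are schematic stand-ins for Equations 1 and 3 (noise omitted, and the exact functional forms are our assumptions), not the paper's models.

```python
import numpy as np

def pass_likelihood(student, assessment):
    # Schematic stand-in for Equation 1: logistic of alignment minus difficulty.
    return 1.0 / (1.0 + np.exp(-(student @ assessment - np.linalg.norm(assessment))))

def apply_lesson(student, lesson_gain):
    # Schematic stand-in for the learning update (Equation 3), noise omitted.
    return student + lesson_gain

def greedy_sequence(student, lessons, goal_assessments, max_len=5, threshold=0.8):
    """Pick lessons one at a time, each time choosing the lesson whose
    update most raises the minimum pass likelihood over the goals."""
    path = []
    for _ in range(max_len):
        score = lambda s: min(pass_likelihood(s, a) for a in goal_assessments)
        if score(student) >= threshold:
            break
        best = max(lessons,
                   key=lambda name: score(apply_lesson(student, lessons[name])))
        student = apply_lesson(student, lessons[best])
        path.append(best)
    return path, student
```

This greedy policy is myopic; solving the full MDP (e.g. by value iteration over a discretized student-embedding state space) could find shorter paths the one-step lookahead misses.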

An issue we have not considered is the spacing effect (Dempster, 1988). If a student learns on a known time frame (e.g., 1–3pm on weekdays), it is possible for the adaptive tutor to adjust the learning schedule to present review modules in a timely manner or help a student cram. For theoretical work on the timing of modules, see (Novikoff et al., 2012).

7.3. Learning and Content Analytics

We speculate that a course embedding can be used to measure the following characteristics of educational content: lesson quality, through the magnitude of the skill gain vector ‖~ℓ‖; the diversity of skills and concept knowledge relevant for a course, through the embedding dimension that optimizes validation accuracy on the assessment result prediction task; and the availability of content that teaches and assesses different skills, through the density of modules embedded in different locations in the latent skill space. The following characteristics of student learning may also be of interest: the rate of knowledge acquisition, through the velocity of a student moving through the latent skill space; and the rate of knowledge loss (i.e. forgetting), through the amount of backtracking in a student trajectory.

8. Conclusions

We presented a general model that learns a representation of student knowledge and educational content that can be used for personalized instruction, learning analytics, and content analytics. The key idea lies in using a multi-dimensional embedding to capture the dynamics of learning and testing. Using a large-scale data set collected in real-world classrooms, we (1) demonstrate the ability of the model to successfully predict learning outcomes and (2) introduce an offline methodology as a proxy for assessing the ability of the model to recommend personalized learning paths. We show that our model is able to successfully discriminate between personalized learning paths that lead to mastery and those that lead to failure.

Acknowledgements

We would like to thank members of Knewton for valuable feedback on this work. This research was funded in part by the National Science Foundation under Awards IIS-1247637 and IIS-1217686, and through the Cornell Presidential Research Scholars program.

References

Bellman, Richard. A Markovian decision process. Indiana Univ. Math. J., 6:679–684, 1957. ISSN 0022-2518.

Caliendo, Marco and Kopeinig, Sabine. Some practical guidance for the implementation of propensity score matching. Journal of Economic Surveys, 22(1):31–72, 2008.

Carbonell, Jaime R. AI in CAI: An artificial-intelligence approach to computer-assisted instruction. IEEE Transactions on Man-Machine Systems, 11(4):190–202, 1970.

Chen, Shuo, Moore, Josh L., Turnbull, Douglas, and Joachims, Thorsten. Playlist prediction via metric embedding. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 714–722. ACM, 2012.

Corbett, Albert T. and Anderson, John R. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 4(4):253–278, 1994. ISSN 0924-1868. doi: 10.1007/BF01099821. URL http://dx.doi.org/10.1007/BF01099821.

Dempster, Frank N. The spacing effect: A case study in the failure to apply the results of psychological research. American Psychologist, 43(8):627, 1988.

Ekanadham, Chaitanya and Karklin, Yan. T-SKIRT: Online estimation of student proficiency in an adaptive learning system. Machine Learning for Education Workshop at ICML, 2015.

González-Brenes, JP, Huang, Yun, and Brusilovsky, Peter. General features in knowledge tracing: Applications to multiple subskills, temporal item response theory, and expert knowledge. In Proceedings of the 7th International Conference on Educational Data Mining, 2014.

Knewton. The Knewton platform: A general-purpose adaptive learning infrastructure. Technical report, 2015. URL http://www.knewton.com/wp-content/uploads/knewton-technical-white-paper-201501.pdf.

Lan, Andrew S., Studer, Christoph, and Baraniuk, Richard G. Time-varying learning and content analytics via sparse factor analysis. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 452–461. ACM, 2014a.

Lan, Andrew S., Studer, Christoph, Waters, Andrew E., and Baraniuk, Richard G. Tag-aware ordinal sparse factor analysis for learning and content analytics. arXiv preprint arXiv:1412.5967, 2014b.

Lan, Andrew S., Waters, Andrew E., Studer, Christoph, and Baraniuk, Richard G. Sparse factor analysis for learning and content analytics. J. Mach. Learn. Res., 15(1):1959–2008, January 2014c. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=2627435.2670314.

Lesgold, Alan et al. Sherlock: A coached practice environment for an electronics troubleshooting job. 1988.

Linden, W. J. and Hambleton, Ronald K. Handbook of Modern Item Response Theory. New York, 1997.

Moore, J., Chen, Shuo, Joachims, T., and Turnbull, D. Learning to embed songs and tags for playlist prediction. In Conference of the International Society for Music Information Retrieval (ISMIR), pp. 349–354, 2012.

Moore, J., Chen, Shuo, Turnbull, D., and Joachims, T. Taste over time: The temporal dynamics of user preferences. In Conference of the International Society for Music Information Retrieval (ISMIR), pp. 401–406, 2013.

Novikoff, Timothy P., Kleinberg, Jon M., and Strogatz, Steven H. Education of a model student. Proceedings of the National Academy of Sciences, 109(6):1868–1873, 2012. doi: 10.1073/pnas.1109863109. URL http://www.pnas.org/content/109/6/1868.abstract.

O'Rourke, Eleanor, Andersen, Erik, Gulwani, Sumit, and Popovic, Zoran. A framework for automatically generating interactive instructional scaffolding. 2015.

Piech, Chris, Huang, Jonathan, Nguyen, Andy, Phulsuksombati, Mike, Sahami, Mehran, and Guibas, Leonidas. Learning program embeddings to propagate feedback on student code. arXiv preprint arXiv:1505.05969, 2015a.

Piech, Chris, Sahami, Mehran, Huang, Jonathan, and Guibas, Leonidas. Autonomously generating hints by inferring problem solving policies. 2015b.

Piech, Chris, Spencer, Jonathan, Huang, Jonathan, Ganguli, Surya, Sahami, Mehran, Guibas, Leonidas J., and Sohl-Dickstein, Jascha. Deep knowledge tracing. CoRR, abs/1506.05908, 2015c. URL http://arxiv.org/abs/1506.05908.

Prerau, Michael J., Smith, Anne C., Eden, Uri T., Yanike, Marianna, Suzuki, Wendy A., and Brown, Emery N. A mixed filter algorithm for cognitive state estimation from simultaneously recorded continuous and binary measures of performance. Biological Cybernetics, 99(1):1–14, 2008.

Rasch, Georg. Probabilistic models for some intelligence and attainment tests. ERIC, 1993.

Reckase, M. D. Multidimensional Item Response Theory. Springer Publishing Company, Incorporated, first edition, 2009.

Rosenbaum, Paul R. and Rubin, Donald B. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.

Smith, Anne C., Frank, Loren M., Wirth, Sylvia, Yanike, Marianna, Hu, Dan, Kubota, Yasuo, Graybiel, Ann M., Suzuki, Wendy A., and Brown, Emery N. Dynamic analysis of learning in behavioral experiments. The Journal of Neuroscience, 24(2):447–461, 2004.

Sohl-Dickstein, Jascha. Temporal multi-dimensional item response theory, 2014. URL http://lytics.stanford.edu/datadriveneducation/slides/sohldickstein.pdf.

Welch, B. L. The generalization of Student's problem when several different population variances are involved. Biometrika, 34(1-2):28–35, 1947. doi: 10.1093/biomet/34.1-2.28. URL http://biomet.oxfordjournals.org/content/34/1-2/28.short.

Zhu, Ciyou, Byrd, Richard H., Lu, Peihuang, and Nocedal, Jorge. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.