
Web as a textbook: Curating Targeted Learning Paths through the Heterogeneous Learning Resources on the Web.

Igor Labutov
Cornell University
[email protected]

Hod Lipson
Columbia University
[email protected]

ABSTRACT
A growing subset of the web today is aimed at teaching and explaining technical concepts with varying degrees of detail and to a broad range of target audiences. Content such as tutorials, blog articles and lecture notes is becoming more prevalent in many technical disciplines and provides up-to-date technical coverage with widely different levels of prerequisite assumptions on the part of the reader. We propose a task of organizing heterogeneous educational resources on the web into a structure akin to a textbook or a course, allowing the learner to navigate a sequence of web-pages that take them from point A (their prior knowledge) to point B (material they want to learn). We approach this task by 1) performing a shallow term-level classification of what concepts are explained and assumed in any given text, and 2) using this representation to connect web resources that explain concepts to those web resources where the same concepts are assumed. The main contributions of this paper are 1) a supervised classification approach to identifying explained and assumed terms in a document and 2) an algorithm for finding optimal paths through the web resources given the constraints of the user's goal and prior knowledge.

Keywords
web resources; optimizing learning

1. INTRODUCTION
No scholar is born at the frontier of knowledge — early learning and lifelong learning both play a defining role in shaping the research vector of an academic [7]. More alarming, recent research [6] demonstrates that the pre-career idle time of an up-and-coming researcher has been on a steady rise during the last century, a trend attributed to the "burden of knowledge" phenomenon — the inflation of the body of prerequisite prior knowledge to be mastered before one is able to contribute to the field with original research. The hypothesis of [9, 10] is that facilitating effective early and lifelong learning practices is a viable way of easing the "burden of knowledge".

While physical textbooks and classrooms traditionally assumed the role of knowledge curators, they also present a bottleneck in today's rapidly growing web of up-to-date technical and academic content — peer-reviewed articles, lecture notes, tutorials, slides, etc. — from academics and "citizen scientists" alike. An automatic approach for "weaving" natural curricular progressions through the web of such heterogeneous academic/educational content, we believe, will catalyze early and lifelong learning by creating more efficient and goal-oriented curricula targeted to the level of the audience.

The web is the only collection of resources today where attempting this task becomes meaningful and promising. The reason is that the web contains an extensive amount of diversity in its content, i.e. content that explains the same concepts but in many different ways. Naturally, this diversity reflects the diversity of the people who create this content, their backgrounds, styles of learning and ways of thinking about complex concepts, and can thus naturally match learners with similar characteristics. We believe that this diversity can be leveraged to create learning pathways that are not bound to traditional curricula, which are often constrained for no better reason than historical precedent. We propose instead to optimize a curriculum directly for what you want to know given what you already know.

We propose to tackle the problem of curriculum mining on the web, which, broadly, involves linking technical resources on the web to other resources that explain a subset of the concepts assumed in the original document. We propose to decompose the task into 1) understanding what a document explains and what it assumes on the part of the reader, and 2) using this document-level representation to sequence documents that guide the learner from their current state of knowledge towards their goal, for example, understanding a specific research paper or a set of lecture notes.

We propose a term-centric approach for inducing curricular relations between any pair of documents. Naturally, understanding a technical concept is more than being familiar with its surface term, and in this view an approach that operates at the level of individual terms may appear to be naïve. After all, to explain a new concept is to put together existing concepts in a novel way [13], and in the process introduce convenient nomenclature. However, we hypothesize that, by virtue of seeking the shortest sequence of documents that "cover" (explain) multiple terms at once, the resulting bottleneck will implicitly "prefer" to link to prerequisite documents that introduce and explain whole concepts, i.e. groups of terms, as opposed to introducing terms one document at a time (an extreme example would be presenting a sequence of pages from a dictionary, each defining a single term independently; this is clearly undesirable). It will be our running assumption that there exists a correlation between knowledge of the terms and understanding of the overarching concept.

Thus, to a first-order approximation, we model technical documents as "bags of terms", and in the interest of tractability set forth the following set of modeling assumptions:

• Assumption 1 A document is a bag of technical terms (a multiset) that is further partitioned into two multisets, E (Explained) and A (Assumed), corresponding to the role (aspect) of the term within the document:

Explained: The term appears in a context that furthers the understanding of the concept corresponding to the term.

Assumed: The concept corresponding to the term is assumed to be familiar, and is required for understanding the context in which it appears.

• Assumption 2 The degree of reliance on the knowledge of a particular term in the document is proportional to the frequency of the term in the Assumed multiset, i.e. which concepts are fundamental to the understanding of the document and which are auxiliary is reflected in the number of occurrences of the corresponding terms.
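As a minimal sketch of this representation (the example terms and counts below are illustrative, not taken from the annotated corpora), a document can be stored as two term multisets:

```python
# A document as two term multisets (Assumption 1); Assumption 2 treats the
# per-term counts in the Assumed multiset as the degree of reliance.
from collections import Counter

doc = {
    "explained": Counter({"expectation maximization": 9, "latent variable": 3}),
    "assumed":   Counter({"maximum likelihood": 5, "likelihood function": 2}),
}

# Relative reliance on each assumed term, proportional to its frequency.
total = sum(doc["assumed"].values())
reliance = {term: count / total for term, count in doc["assumed"].items()}
```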

As an illustration, consider the following excerpt from Christopher Bishop's classic textbook Pattern Recognition and Machine Learning, from the chapter that introduces the concept of Expectation Maximization:

Expectation Maximization
An elegant and powerful method for finding maximum likelihood solutions for models with latent variables is called the expectation maximization algorithm, or EM algorithm.

In the excerpt above, the term expectation maximization appears in the Explained aspect, while maximum likelihood appears in the Assumed aspect. Understanding the concept of Maximum Likelihood is a prerequisite for understanding Expectation Maximization. It is no surprise that most resources that introduce the concept of Expectation Maximization implicitly assume that the reader is familiar with Maximum Likelihood. Academic and educational literature is fraught with such implicit assumptions, which may be challenging to unravel for a learner, especially one new to the area. Note that on the surface it may seem that detecting instances of explained terms in the text is equivalent to finding instances of term definitions – a well-studied task – but it is not so. Especially in technical disciplines, explaining a concept requires much more than giving a definition. A document defining a term may or may not actually explain the concept behind it. For example, a document may define a term to refresh the reader's memory but otherwise assume the reader's familiarity with it. On the other hand, a document may explain a term without ever giving a one-sentence definition.

Finally, the proposed dichotomy may appear to be a gross oversimplification, ignoring the entire continuum of pragmatics between the two extremes. We argue that while binary term-level classification alone may not capture the fine-grained aspect of any one term, combining it with the context of the entire document will enable us to unravel the prerequisite relationships between documents.

2. RELATED WORK
Evidence of information overload in traditional textbooks. A formal study of textbook organization conducted by [1] on a corpus of textbooks from India quantitatively addresses the issue known as the "mentioning problem" [12], where "concepts are encountered before they have been adequately explained and forces students to randomly 'knock around' the textbook". The work of [1] suggests that many traditional textbooks suffer from the resulting phenomenon of "information burden" and provides diagnostic metrics for evaluating it. A user study conducted by [2], though limited to electronic textbooks, demonstrated the utility of a navigational aid that links concepts and terms within a textbook and allows the user to navigate according to their own preferences. This suggests the potential utility of tools that expand such "navigational ability" outside textbooks.

Attempts at manual curriculum curation. There have been at least two efforts that we are aware of that attempt to manually create "paths" between a selected set of resources on the web — two educational start-ups, Metacademy [5] and Knewton [4]. While motivated by the same goal, we believe that manual web-scale curriculum curation is akin to a manually-curated directory of the web (not too different from the original Yahoo directory from the 1990s), i.e. offering poor scaling capability in the dynamic, growing landscape of educational content on the web.

Attempts at automatic curriculum curation. Most relevant to our task is the work of [11], which attempts to infer prerequisite relationships between pairs of Wikipedia articles. They frame the problem of prerequisite prediction as "link prediction" between a pair of pages, using primarily graph-derived features (e.g. hyperlink structure) and some content-derived features (e.g. article titles). In contrast to their approach, we do not assume any existing structure connecting the web resources (e.g. within Wikipedia), as the majority of the educational content on the web is unstructured. Our approach also naturally facilitates a scalable assimilation of new content, as we require only a document-scoped term-level classification, without needing to explicitly construct or update a prerequisite graph. Furthermore, we develop an approach for optimizing curricular paths using the proposed representation. More recent work by [8] develops a method that does not rely on a manual annotation of prerequisite relations as in [11], and instead uses the statistics of concept references in a pair of pages to determine the prerequisite relation between them. Similar to [11], their focus is on pairwise link prediction, in contrast to our goal of globally optimizing a learning curriculum.


Figure 1: (a) ROC curves for the task of binary aspect classification. (b) AUC (left y-axis) of aspect classification for terms with a maximum document rank given on the x-axis. The shaded region shows the number of terms up to the given maximum rank (right y-axis).

3. MODEL
3.1 Modeling explanations
We model the problem of identifying the explained and assumed terms in a document as a term-level binary classification task, i.e. each term in the document is classified into one of the two categories. Although simple from an implementation perspective, this task is made difficult by the lack of annotated data in this domain. In this work, we rely on (i) manual annotation of the term aspects performed by us for one of the textbooks (Rice University's statistics text) and (ii) explicit annotations from the index of Bishop's Pattern Recognition and Machine Learning textbook that were made by the author of the text (the annotation is in the form of a location in the text where a particular concept is explained).

The Rice University Online Statistics Education: An Interactive Multimedia Course of Study textbook, from here on referred to as statsbook, consists of a total of 112 units, with a median of 12.5 unique technical terms per unit, for a total of 339 different technical terms in the book. We scrape the text content of the book from the web, replace all mathematical formulae and symbols with special tokens, and manually annotate each technical term mention with its representative form from the index, e.g. annotating normally distributed with normal distribution. Manual term annotation obviates the need for introducing a word-sense disambiguation component and the additional errors it would bring. We process the PRML dataset in an identical manner.

Each technical term in every unit of the book was annotated with the binary {explain, assume} aspect, following the definitions outlined above. While for most terms the application of these definitions is fairly unambiguous, for a significant number of term mentions the aspects are not mutually exclusive, i.e. the term may be construed to belong to both aspects simultaneously. Often, in using (assuming) a term to explain a related concept, something about the assumed term is also explained as a side effect. The degree to which the explanation is distributed between the terms is difficult to judge objectively, and may vary between distinct mentions of the terms in different parts of the same document. We adopt a simple strategy for "breaking ties" in such cases: if we judge a term as having been intended by the author to be explained in the given context, we mark it with the explain aspect; otherwise, the term is assumed to be assumed. In total, across the entire statsbook corpus, 1878 terms were annotated for their aspect (note that the same term appears in multiple documents with potentially different aspects), with a class ratio of 537 terms belonging to the explain aspect and 1341 terms belonging to the assume aspect.

The PRML dataset contains a total of 3883 annotated terms, with 222 terms belonging to the explain aspect and 3661 terms belonging to the assume aspect. The aspect of a term was determined from the index of the book, which explicitly specifies the pages where a term is explained.

A logistic regression model (LIBLINEAR [3] with the default regularization parameter) was trained to predict the binary aspect of the terms and evaluated with 10-fold stratified cross-validation. A set of lexical and dependency features describing the context of each term (within a 1-sentence window), positional features describing the location of the term's mentions within the document and the sentences in which the term appeared, and the frequency rank of the term within the document were employed. We compare the performance of a classifier that uses all of these features with one that uses only the rank. A classifier that is given rank as the only feature will essentially learn a rank "threshold" that decides the aspect of the term within the document, i.e. predict all terms above a certain rank as explained.
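A minimal sketch of this setup is given below, assuming the features have already been extracted per term; the feature function shown is a simplified stand-in for the lexical, dependency, positional and rank features described above, and the document fields it reads (mentions, term_rank, etc.) are hypothetical.

```python
# Term-aspect classification sketch: logistic regression (LIBLINEAR backend)
# over per-term feature dicts, scored with 10-fold stratified cross-validation.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

def term_features(term, doc):
    """Simplified context/position/rank features for one term in one document."""
    sent_idx, tok_idx, context_words = doc["mentions"][term][0]  # first mention
    feats = {
        "rank": doc["term_rank"][term],                  # frequency rank in the document
        "first_sent_frac": sent_idx / doc["n_sentences"],
        "first_tok_frac": tok_idx / doc["n_tokens"],
    }
    for w in context_words:                              # bag of surrounding words
        feats["ctx=" + w] = 1.0
    return feats

def evaluate_aspect_classifier(examples):
    """examples: list of (term, doc, label) with label 1 = explained, 0 = assumed."""
    X = [term_features(term, doc) for term, doc, _ in examples]
    y = [label for _, _, label in examples]
    model = make_pipeline(DictVectorizer(), LogisticRegression(solver="liblinear"))
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
```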

Figure 1(a) summarizes the performance of aspect prediction with the classifier trained using both linguistic and rank features (Rank+Text, AUC=0.76) versus a classifier trained using only the rank (Rank only, AUC=0.66) for the statsbook corpus. As expected, rank is predictive of the aspect, but contextual linguistic cues provide a significant boost.

Keeping our end goal in mind, under Assumption 2 stated in the introduction, we hypothesize that the frequency rank of a term in a document correlates with the degree to which the term is either assumed or explained in that document. In the downstream task of linking documents to their prerequisites, getting the aspects of the more frequent terms correct is arguably more important than getting those of the terms that appear only once or twice. We evaluate the performance of our aspect classifier as a function of the term's rank. Figure 1(b) illustrates predictive performance (AUC) on subsets of the data stratified by the term's frequency rank. We observe a favorable trend of increasing predictive performance for higher-ranked terms. An obvious explanation is that more frequent terms accumulate a larger set of features describing them (since each mention of the term contributes its context features), effectively decreasing variance in the predictions.

3.2 Optimal learning paths
Consider now that we have a large collection of documents (e.g. tutorials, papers, textbook chapters). Each such document explains some concepts but also assumes the reader's knowledge of other concepts (e.g. a tutorial may explain the concept of normal distribution, but may assume knowledge of probability and distribution). Assume that we can reliably classify each term in each document into either the Explained or the Assumed category, and that we have a user who is interested in understanding a specific (target) document (or a set of target documents). The goal is to give the user a self-contained sequence of documents of minimal length that explains all of the concepts needed to understand the target document.

Figure 2: Each document is represented by a blue shaded region: the top part corresponds to the explained set $E_i$ and the bottom part corresponds to the assumed set $A_i$. Red dots correspond to terms. This is an example of a feasible solution, where each document is covered.

Formally, each document $d_i$ in our collection consists of two sets of terms: the explained terms $E_i = E(d_i)$ and the assumed terms $A_i = A(d_i)$. A term in any document is either explained or assumed, but not both, i.e. $A_i \cap E_i = \emptyset$. We say that the document $d_i$ is covered by a prerequisite set of documents $P_i$ when:

$$A_i \subseteq \bigcup_{d_j \in P_i} E(d_j)$$

In other words, a document is covered when every one of its assumed terms is explained by at least one document in the prerequisite set. For any prerequisite set that covers this document, the documents in the prerequisite set need to be covered as well, recursively, until all documents have been covered. We assume the existence of documents with no prerequisites (leaves), i.e. documents for which $A_i = \emptyset$. The goal is to find the smallest self-contained set of documents $P$, i.e. a set of documents such that all the documents in $P$ are covered and $d_0 \in P$, where $d_0 = \{A_0, E_0\}$ is the target document of interest to the user. Figure 2 illustrates a feasible solution to an example problem. Without additional restrictions, solutions to this problem can contain cyclical dependencies. Such cycles do not make sense in our setting. Thus an important restriction is that the set of documents $P$ can be ordered such that every document in the sequence is covered by the preceding documents in the sequence. Let $p$ be a sequence of documents of length $K$, where $p_k$ is the $k$th document in the sequence; then we seek:

$$\begin{aligned}
\text{minimize} \quad & |p| \\
\text{s.t.} \quad & \forall k: \; A(p_k) \subseteq \bigcup_{k'=0}^{k-1} E(p_{k'}) \\
& d_0 \in p
\end{aligned} \qquad (1)$$
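As a minimal sketch of the coverage condition behind this program (assuming, purely for illustration, that each document is given as a pair of explained/assumed term sets), a candidate sequence can be checked for feasibility as follows:

```python
# Feasibility check for a candidate sequence: every document's assumed terms
# must be explained by documents appearing earlier in the sequence.
def is_feasible_sequence(seq, docs):
    """seq: list of document indices; docs[i] = (explained_set, assumed_set)."""
    explained_so_far = set()
    for idx in seq:
        explained, assumed = docs[idx]
        if not assumed <= explained_so_far:   # some assumed term not yet covered
            return False
        explained_so_far |= explained
    return True
```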

ILP formulation
We formulate an Integer Linear Program (ILP) that finds a minimum-length self-contained sequence $p$ of at most $K$ documents such that it covers a user's document of interest $d_0$. Consider that we have a total of $D$ documents. We define the following variables:

$x_{ki} \in \{0, 1\}$: document $d_i$ is in the $k$th position in the sequence

We define the following constants:

$e_{ij} \in \{0, 1\}$: term $j$ is explained in document $i$
$a_{ij} \in \{0, 1\}$: term $j$ is assumed in document $i$

Each assumed term in the document at position $k$ must be explained by at least one document up to (but not including) the document at position $k$. This can be expressed via the following constraint:

$$\sum_{k'=0}^{k-1} \sum_{i=1}^{D} e_{ij} \, x_{k'i} \;\geq\; \sum_{i=1}^{D} a_{ij} \, x_{ki} \qquad \forall j, \forall k$$

Each position in the sequence contains at most 1 document:

$$\sum_{i=1}^{D} x_{ki} \leq 1 \qquad \forall k$$

The user's preference for covering a document of interest $d_0$ is an additional constraint:

$$\sum_{k=1}^{K} x_{k0} = 1$$

Finally, the objective is to minimize the number of documents in the sequence:

$$\text{minimize} \quad \sum_{k=1}^{K} \sum_{i=1}^{D} x_{ki}$$

The above formulation also allows us to directly incorporate the user's prior knowledge into the optimization problem. If we represent a user as a set of explained terms, i.e. terms that the user is assumed to have mastered, then the constraints corresponding to these terms may simply be dropped from the formulation.

In the most general case, this formulation has $D^2$ variables and $O(D^2 \times V)$ constraints, where $V$ is the number of terms in the vocabulary. In practice, however, we will often limit the maximum allowable sequence length to a fairly small constant (e.g. 10, as done in our experiments), reducing the order of the problem to $O(D)$ variables and $O(D \times V)$ constraints.

While in extremely large settings (hundreds of thousands of documents), even with a small $K$, solving this ILP directly is infeasible, in practice we find that we can obtain exact solutions using LP relaxation and vanilla Branch and Bound (using GLPK¹) within several seconds, even with as many as 1,000 documents and hundreds of terms. Developing an approximation algorithm based on rounding the LP solution is ongoing work.
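For concreteness, the following is a minimal sketch of this ILP using the PuLP modeling library (an assumption on our part; the experiments above use GLPK directly). Documents are again represented as illustrative (explained, assumed) term-set pairs, and the known_terms argument implements the constraint-dropping for the user's prior knowledge described above.

```python
# Sketch of the path-optimization ILP: x[k][i] = 1 iff document i occupies
# position k; every assumed term must be explained by an earlier document.
import pulp

def optimal_path(docs, target=0, K=10, known_terms=frozenset()):
    """docs: list of (explained_set, assumed_set); returns document indices in order."""
    D = len(docs)
    vocab = set().union(*(E | A for E, A in docs))
    prob = pulp.LpProblem("curriculum", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (range(K), range(D)), cat="Binary")

    # Objective: minimize the number of documents placed in the sequence.
    prob += pulp.lpSum(x[k][i] for k in range(K) for i in range(D))

    # Each position holds at most one document.
    for k in range(K):
        prob += pulp.lpSum(x[k][i] for i in range(D)) <= 1

    # Coverage: assumed terms (not already known by the user) must be
    # explained by some document in an earlier position.
    for j in vocab - set(known_terms):
        for k in range(K):
            prob += (
                pulp.lpSum(x[kp][i] for kp in range(k)
                           for i in range(D) if j in docs[i][0])
                >= pulp.lpSum(x[k][i] for i in range(D) if j in docs[i][1])
            )

    # The target document must appear somewhere in the sequence.
    prob += pulp.lpSum(x[k][target] for k in range(K)) == 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))  # any ILP backend, e.g. GLPK, works
    return [i for k in range(K) for i in range(D)
            if (pulp.value(x[k][i]) or 0) > 0.5]
```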

¹ https://www.gnu.org/software/glpk/


Figure 3: Term aspect classification is useful at the task of recovering prerequisites for units within a textbook. The y-axis is the average AUC at the task of predicting whether a particular unit is a prerequisite of another unit, based on three metrics, plotted against prerequisite depth on the x-axis. The metric that incorporates the Explain/Assume classifier performs best (solid line).

4. EVALUATION

4.1 Prerequisites
In order to evaluate the Explain/Assume classifier in an end-to-end setting, we employ the output of this classifier in the task of predicting prerequisites in a dataset where the prerequisites have been explicitly annotated. One such resource is Rice University's Online Statistics Textbook, which in addition to the text content provides an explicit dependency graph annotating prerequisite relations between pairs of units (units are at the level of chapter sections). We propose a metric for scoring a pair of units according to their prerequisite relationship based only on the terminology of both units and the output of the Explain/Assume classifier. The proposed "prerequisite score" is defined as follows:

$$P(d_a \to d_b) = \frac{\sum_{t_i \in d_b} n^a_i \, \mathbb{1}[t_i \text{ assumed in } d_b \wedge \text{explained in } d_a]}{\sum_{t_j \in d_a} n^a_j \, \mathbb{1}[t_j \text{ explained in } d_a]}$$

where $n^a_i$ is the number of occurrences of term $i$ in document $d_a$. Since the above score is guaranteed to be in the $[0, 1]$ range, we can interpret it as a probability $P(d_a \to d_b)$, the probability that document $a$ is a prerequisite of document $b$. There is an intuitive interpretation of the above score: a document can be considered a strong prerequisite of a target document when it explains all of the assumed terms in the target document and nothing more. We can convince ourselves that in this case the score as defined above will be equal to 1. A document that explains too many unrelated concepts will suffer a penalty with respect to its prerequisite score to another document. Furthermore, we consider the relative frequency of the explained term in the prerequisite document as an additional signal of that term's importance. We find that this additional information increases the performance of prerequisite classification (discussed at the end of this section).

Because the output of the Explain/Assume classifier is a probability rather than a class, we can relax the above score to directly incorporate the uncertainty of the classification:

$$P(d_a \to d_b) = \frac{\sum_{t_i \in d_b} n^a_i \, P(t_i \text{ explained in } d_a)}{\sum_{t_j \in d_a} n^a_j \, P(t_j \text{ explained in } d_a)} \qquad (2)$$

Note that in addition to relaxing the requirement of an explicit Explain or Assume label, we also drop the requirement that only the assumed terms need to be explained to count towards the prerequisite score. This distinction is optional, but it encodes an important assumption about the kinds of "prerequisites" that this score will discover. This also brings up the importance of being precise about the definition of a prerequisite. A document a is a strict prerequisite of document b if document a explains a subset of the assumptions in document b. However, we can relax this definition by not requiring that the terms explained in the prerequisite (a) are strictly assumed in the target (b). In other words, a document that explains a subset of the terms also explained in the target, and nothing else, will have a score of 1 according to the above equation. In practice this corresponds to documents that explain the same concepts but in a simpler way (since they explain only a subset of the explained concepts in the target), and this is often a desired behavior in a learning sequence. For example, before reading a more advanced article on Support Vector Machines, the learner might want to read a more basic introduction to Support Vector Machines, although from the perspective of term classification, both documents explain the same concept.
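A minimal sketch of the relaxed score in Equation (2) is shown below. The containers counts_a (term counts in document a) and p_explained_a (the classifier's probability that a term is explained in document a) are illustrative stand-ins for the output of the pipeline described above.

```python
# Relaxed prerequisite score P(a -> b), Equation (2): weight each term of b by
# its count in a and the probability that a explains it, normalized by the
# same quantity summed over all terms of a.
def prereq_score(terms_b, counts_a, p_explained_a):
    num = sum(counts_a.get(t, 0) * p_explained_a.get(t, 0.0) for t in terms_b)
    den = sum(n * p_explained_a.get(t, 0.0) for t, n in counts_a.items())
    return num / den if den > 0 else 0.0
```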

4.1.1 Reconstructing prerequisites
Rice University's Online Statistics Textbook provides a valuable resource for evaluating the effectiveness of the Explain/Assume classification at the task of predicting prerequisite relations between documents. The textbook consists of 112 units at the granularity of chapter sections, annotated as a directed graph, i.e. specifying a directed edge between a pair of units if one unit is considered a prerequisite of the other. We process the raw HTML files of the textbook by removing markup, segmenting sentences and extracting terminology (obtained from the index) features as described in Section 3.1. We pose the problem of prerequisite relation prediction as a standard binary classification task, i.e. predicting for each pair of units in the book whether one unit is a prerequisite of another, where we consider a pair of units to be in a gold-standard prerequisite relation if there is a directed path between them in the graph. AUC is a convenient metric for evaluating performance in this prediction task, as the output of our scoring metric (Equation 2) is already scaled between 0 and 1. Note that the model trained only on the PRML corpus was used for term-aspect classification in this task. Figure 3 illustrates the results for three different models, as a function of the prerequisite depth, i.e. stratifying the classification results for a pair of units by the maximum distance between them in the graph. The three models evaluated are as follows:

• Model Prerequisite score is computed with Equation 2.

• Baseline 1 Prerequisite score is computed with Equation 2, but with all $n^a_i$, $n^a_j$ and $P(t_\cdot \text{ explained in } \cdot)$ set to 1. This baseline is equivalent to the ratio between the number of overlapping terms in a pair of documents and the number of terms in the prerequisite, i.e. $\frac{|d_a \cap d_b|}{|d_a|}$.


• Baseline 2 Prerequisite score is computed with Equation 2, but with $P(t_\cdot \text{ explained in } \cdot)$ set to 1.

Each baseline illustrates the effect of not including a component of the scoring function in Equation 2. Our first conclusion from the results in Figure 3 is that the output of the Explain/Assume classifier provides an important signal in predicting the prerequisite relationship between documents. Furthermore, the relative frequency of the explained terms in the prerequisite document provides an additional gain in performance. This can be explained by Figure 1(b): the performance of the Explain/Assume classifier is greater in the higher term-frequency regime; discounting low-frequency terms (which are also likely less important to the content) reduces the classification noise and boosts performance at the prerequisite prediction task. An additional observation is that the performance of the pairwise prerequisite classification improves for pairs of units that are closer in the graph, i.e. with fewer units in between. This is easily explained: units that are farther apart typically share less terminology, making the estimates based on terminology overlap noisier.

It is also interesting to note that the simplest baseline, which considers only the ratio of overlapping terms between a pair of documents to the total number of terms in the prerequisite document, does surprisingly well, especially for pairs of documents that are closer together. This can be explained as follows: in a sequence of units like those in a textbook, units that are prerequisites tend to be less advanced, i.e. contain less terminology, since less of it has been introduced up to that point. Thus, units that are prerequisites, at least in a textbook, are fairly predictable from the relative frequency of overlapping terms alone.
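As a minimal sketch of the evaluation protocol in this section (assuming the textbook's prerequisite annotations are available as a directed graph in which an edge u -> v means u is a prerequisite of v, and that a pairwise scoring function such as Equation (2) is given), the gold labels and AUC might be computed as follows:

```python
# Gold prerequisite pairs via directed reachability, scored with AUC.
import itertools
import networkx as nx
from sklearn.metrics import roc_auc_score

def evaluate_prereq_prediction(graph, score):
    """graph: nx.DiGraph of units; score(a, b) estimates P(a is a prerequisite of b)."""
    y_true, y_score = [], []
    for a, b in itertools.permutations(graph.nodes, 2):
        y_true.append(1 if b in nx.descendants(graph, a) else 0)  # path a ~> b exists
        y_score.append(score(a, b))
    return roc_auc_score(y_true, y_score)
```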

4.2 Scaling to the web
We collect and release two web corpora of educational content in the areas of Machine Learning and Statistics. Both corpora were collected using the Bing Search API, by querying for short permutations of terms collected from the indices of the Pattern Recognition and Machine Learning and Rice University's Online Statistics textbooks. The two corpora contain 42,000 and 1,000 documents respectively – a mixture of HTML and PDF files, pre-processed and converted to plain text. The smaller corpus was collected with a smaller set of query keywords and is used primarily to rapidly validate the proposed model for path optimization. Consequently, because of its smaller term vocabulary, the smaller corpus is significantly less noisy (fewer irrelevant documents). The union of the terminology from the indices of both textbooks was used as the vocabulary in processing each document. Additionally, terminology variations and abbreviations were consolidated using the link data from Wikipedia, e.g. the terms EM, E-M and Expectation-Maximization are all mapped to the same concept of EM in the terminology extraction stage.
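As a minimal sketch of this consolidation step (the variant table below is illustrative; in the pipeline above it is derived from Wikipedia link data), each extracted surface form is mapped to a canonical concept before further processing:

```python
# Map surface-form variants and abbreviations to one canonical concept.
VARIANTS = {
    "em": "expectation maximization",
    "e-m": "expectation maximization",
    "expectation-maximization": "expectation maximization",
}

def normalize_term(surface):
    key = surface.lower().strip()
    return VARIANTS.get(key, key)
```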

Following the extraction of terminology from each webpage, each term is classified using the Explain/Assume classifier trained on the Pattern Recognition and Machine Learning textbook. We train this classifier in a fully supervised setting using all of the annotated data. In the next several sections, we present an analysis of the two web corpora and demonstrate the effectiveness of the proposed approach to connecting educational resources on the web.

4.3 Diversity of assumptions
The web is a unique setting that, unlike a traditional textbook or course, offers a multitude of diverse explanations of the same concept. This diversity potentially enables a level of personalization that is not possible in traditional resources. We can analyze the diversity in the educational content on the web by looking at a slice of the web resources that share the same topic but differ in their underlying assumptions and explanations. Figure 4 illustrates two articles that are both on the topic of Expectation Maximization. However, the two articles differ significantly in their assumptions about the background of the reader. Article 1 (left in Figure 4) is a very basic introduction to the topic and does not assume knowledge of even the concept of maximum likelihood, which under most traditional curricula is assumed to be a prerequisite. Article 2 (right in Figure 4), however, assumes knowledge of many more concepts, such as posterior probability, likelihood function and maximum likelihood. This difference in the distribution of the underlying assumptions is explained by the fact that Article 1 is a very basic introduction to the topic, intended for an audience not in the area of statistics or machine learning. Article 2, however, is a significantly more thorough and technical introduction to the concept of the Expectation Maximization algorithm and thus assumes significantly more prerequisite background in the areas of statistics and machine learning. It is important to note that this distinction between the two documents cannot be easily made from their titles or other surface cues: both documents are approximately the same length and their titles do not give away the level of technical detail. Their text content, however, provides the necessary cues to this information.

Figure 4: An example of two different web-pages about the same topic, Expectation Maximization, together with each page's terminology and its classification into either the Explained class (green) or the Assumed class (red). Observe that the two pages, while about the same topic, differ in what they assume about the reader. The article on the left is a very basic introduction to this topic, while the article on the right is written for experts.

4.4 Fundamental prerequisites
Figure 5 illustrates the result of optimizing a learning path over the web corpus of 1,000 documents for a target web-page on the topic of "Maximum Likelihood Estimation". Sequences were optimized using the ILP formulation described in Section 3.2 with the GLPK Branch and Bound solver. Red rectangles correspond to terms for which the predicted label is assumed in the given document, and blue otherwise. In addition to the term-coverage diagram, we also illustrate the prerequisite dependencies extracted from the term-coverage data: a directed edge is drawn to a document from the closest prerequisite in the sequence that covers at least one assumed term in the document. In the example in Figure 5, the target web-page is a fairly technical article on Maximum Likelihood Estimation that assumes the reader's understanding of concepts such as the likelihood function, which is pivotal for understanding the concept of maximum likelihood. As a consequence, the web-page placed immediately before it in the optimal sequence is a set of slides that gives a more basic introduction to maximum likelihood. Furthermore, the original target article assumes the reader's familiarity with Generalized Linear Models (which is in fact the topic of the previous section of the lecture notes in that series, indicating it as a prerequisite). The resulting sequence also contains an additional prerequisite on this topic. Finally, an interesting observation is that while the target article is fairly advanced in its assumptions about the reader's knowledge of probability, it actually goes into surprising depth in explaining the concept of a derivative and maximizing a function using derivatives from scratch, which is another important prerequisite to the concept of maximum likelihood. This is highly unconventional in traditional textbook and course curricula. This again underlines the advantage of working with assumptions at the document level, allowing us to leverage the diversity in explanations to find "shortcuts" through the learning paths.
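A minimal sketch of this edge-drawing rule is given below, again representing documents as illustrative (explained, assumed) term-set pairs:

```python
# Convert an optimal sequence into a dependency diagram: each document gets an
# edge from the closest preceding document that explains one of its assumed terms.
def prereq_edges(seq, docs):
    """seq: document indices in order; docs[i] = (explained_set, assumed_set)."""
    edges = []
    for pos, i in enumerate(seq):
        assumed = docs[i][1]
        for prev in reversed(seq[:pos]):          # scan backwards: closest first
            if docs[prev][0] & assumed:           # explains at least one assumed term
                edges.append((prev, i))
                break
    return edges
```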

Figure 6 provides additional insightful examples of the generated sequences, extracted from the term-coverage data of each sequence. Figure 6(d) is another example where the target document is a fairly advanced introduction to the topic (Expectation Maximization), which is preceded by a more gentle introduction to the same topic, as well as an additional prerequisite (Maximum Likelihood), which is a common prerequisite for this topic. Note, however, that while Maximum Likelihood is traditionally considered a prerequisite for learning about Expectation Maximization, this is not the case for the more basic introduction to this topic (What is the Expectation Maximization algorithm), as that particular introduction aims to give a very high-level understanding of the topic without burdening the reader with additional prerequisite requirements. Therefore, in that particular sequence, the reader is first given a gentle introduction to the topic, and then the necessary prerequisite (Maximum Likelihood) for understanding the more advanced introduction.

4.4.1 Error analysis
The extracted sequences are not without errors. These errors stem from several potential sources, as a fairly involved pipeline lies between the raw document and the resulting optimal sequence, providing an opportunity for errors to propagate through the different stages. We break down these errors by their source to give a better understanding of how these problems need to be addressed in future work:

Terminology extraction: The greatest source of errors stems from errors in terminology extraction. There are two types of errors involved in terminology extraction: false negatives (missing terms) and false positives (term-sense disambiguation errors). False negatives are more difficult to detect and often result in missing prerequisites; missing terms are especially problematic when relying on a finite vocabulary.

Explain/Assume classification: The second greatest source of errors comes from the mistakes made by the aspect classifier. Classifying an explained term as an assumed term creates unnecessary prerequisites, while the reverse results in missing potentially important prerequisites.

Path optimization: Because we solve the optimization problem exactly (i.e. find a global optimum), there are no errors stemming from the optimization itself (this will become a potential source of errors, however, when an approximation scheme, e.g. LP rounding, is used to obtain an approximate solution). However, the formulation of the optimization problem can be improved so as to introduce robustness to errors in the earlier stages of the pipeline. As path optimization is the final stage that produces the final output, its sensitivity to the errors in terminology extraction and term aspect classification is directly reflected in the resulting output. Introducing robustness to these errors directly in the formulation of the optimization problem is potentially the most effective way to address the issues in the earlier stages of the pipeline. One issue with the current formulation is its inability to incorporate the relative frequency of a term into the optimization objective: ideally, terms that appear less frequently in a document should have lower precedence for coverage than those that appear more frequently (Assumption 2 in the Introduction). The example in Figure 5 demonstrates this lack of robustness in the third document, where the appearance of the term integral creates an additional sequence of documents that cover this concept. From our earlier analysis in Section 3.1, we have shown that the errors of the Explain/Assume classifier are directly related to the relative frequency of the terms, and thus incorporating these frequencies as weights into the optimization would potentially be the most effective way to deal with this noise.

Figure 5: An example optimal sequence for the target document on Maximum Likelihood Estimation. Left: the term-coverage diagram. Each column represents a single web-page and each row a single term. Red rectangles correspond to terms that are classified as assumed in the corresponding document and blue corresponds to the explained terms. Right: the term-coverage diagram is converted into a directed graph whereby an edge is drawn to a document from its closest prerequisite that explains at least one assumed term.

Figure 6: Additional examples of optimal paths generated from the 1,000-document web-page corpus for a select set of target web-pages. See text for details.

5. CONCLUSION
We developed what we believe is the first end-to-end approach towards automatic curriculum extraction from the web, relying on the following pipeline: 1) extracting what is assumed vs. what is explained in a single document and then 2) connecting these documents into a sequence, ensuring that the progression builds up the knowledge of the learner gradually towards their goal. We developed algorithms that address both of these components: 1) a supervised approach for learning a term aspect classifier from a very small set of annotated examples and 2) an optimization problem for learning path recommendation based on the user's learning goals. To the best of our knowledge, we for the first time demonstrate and leverage the most unique characteristic of the web in the domain of learning: diversity, i.e. the presence of content that explains the same concepts but in many different ways and from many different angles. This property of the web opens the door to personalized learning sequences that leverage the differences in explanations to find the most effective paths and shortcuts through the Internet. Finally, we outlined a set of important challenges that need to be addressed in order to make this task a practical reality at web scale. We hope that this work, in addition to the datasets that we release, will serve to inspire interest from the community in what we believe is a challenging and important task.

Acknowledgements
This research was funded by a grant from the John Templeton Foundation provided through the Metaknowledge Network at the University of Chicago. Computational resources were provided in part by grants from Amazon and Microsoft.


6. REFERENCES
[1] R. Agrawal, S. Chakraborty, S. Gollapudi, A. Kannan, and K. Kenthapadi. Empowering authors to diagnose comprehension burden in textbooks. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 967–975. ACM, 2012.
[2] R. Agrawal, S. Gollapudi, A. Kannan, and K. Kenthapadi. Study navigator: An algorithmically generated aid for learning from electronic textbooks. JEDM: Journal of Educational Data Mining, 6(1):53–75, 2014.
[3] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library for large linear classification. JMLR, 9:1871–1874, 2008.
[4] J. Ferreira. Knewton. http://www.knewton.org.
[5] R. Grosse. Metacademy. http://www.metacademy.org.
[6] B. Jones, E. Reedy, and B. A. Weinberg. Age and scientific genius. Technical report, National Bureau of Economic Research, 2014.
[7] B. F. Jones. As science evolves, how can science policy? In Innovation Policy and the Economy, Volume 11, pages 103–131. University of Chicago Press, 2011.
[8] C. Liang, Z. Wu, W. Huang, and C. L. Giles. Measuring prerequisite relations among concepts.
[9] M. Peat, C. E. Taylor, and S. Franklin. Re-engineering of undergraduate science curricula to emphasise development of lifelong learning skills. Innovations in Education and Teaching International, 42(2):135–146, 2005.
[10] D. J. Rowley, H. D. Lujan, and M. G. Dolence. Strategic Choices for the Academy: How Demand for Lifelong Learning Will Re-Create Higher Education. The Jossey-Bass Higher and Adult Education Series. ERIC, 1998.
[11] P. P. Talukdar and W. W. Cohen. Crowdsourced comprehension: Predicting prerequisite structure in Wikipedia. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 307–315. Association for Computational Linguistics, 2012.
[12] H. Tyson-Bernstein. A Conspiracy of Good Intentions: America's Textbook Fiasco. 1988.
[13] S. M. Wilson and P. L. Peterson. Theories of Learning and Teaching: What Do They Mean for Educators? National Education Association, Washington, DC, 2006.
