metadata generation and glossary creation in elearning lothar lemnitzer review meeting, zürich, 25...

Metadata generation and glossary creation in eLearning

Lothar Lemnitzer

Review meeting, Zürich, 25 January 2008

Outline

• Demonstration of the functionalities

• Where we stand

• Evaluation of tools

• Consequences for the development of the tools in the final phase

Demo

We simulate a tutor who adds a learning object and then generates and edits additional data

Where we stand (1)

Achievements reached in the first year of the project:

• Annotated corpora of learning objects

• Stand-alone prototype of keyword extractor (KWE)

• Stand-alone prototype of glossary candidate detector (GCD)

Where we stand (2)

Achievements reached in the second year of the project:

• Quantitative evaluation of the corpora and tools

• Validation of the tools in user-centered usage scenarios for all languages

• Further development of tools in response to the results of the evaluation

Evaluation - rationale

Quantitative evaluation is needed to:

• Inform the further development of the tools (formative)

• Find the optimal setting / parameters for each language (summative)

Evaluation (1)

Evaluation is applied to:

• the corpora of learning objects

• the keyword extractor

• the glossary candidate detector

In the following, I will focus on the tool evaluation

Evaluation (2)

Evaluation of the tools comprises:

1. measuring recall and precision compared to the manual annotation

2. measuring agreement on each task between different annotators

3. measuring acceptance of keywords / definitions (rated on a scale)

KWE Evaluation – step 1

• One human annotator marked n keywords in document d

• The first n choices of the KWE for document d are extracted

• Measure the overlap between both sets

• Also measure partial matches

Language     Best method    F-Measure
Bulgarian    TFIDF/ADRIDF   0.25
Czech        TFIDF/ADRIDF   0.18
Dutch        TFIDF          0.29
English      ADRIDF         0.33
German       TFIDF          0.16
Polish       ADRIDF         0.26
Portuguese   TFIDF          0.22
Romanian     TFIDF/ADRIDF   0.15
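The step 1 comparison can be sketched as follows. This is a minimal illustration, not the project's evaluation script: the function name `keyword_overlap`, the lower-casing of keywords, and the partial-match rule (two keywords share at least one word) are all assumptions, since the slides do not say how partial matches were scored.

```python
def keyword_overlap(manual, ranked, partial=False):
    """Score the first n ranked KWE keywords against n manually marked ones.

    manual  -- keywords marked by the human annotator (n = len(manual))
    ranked  -- KWE output, best candidate first
    partial -- if True, also accept partial matches (assumed rule, see below)
    Returns (precision, recall, f_measure).
    """
    gold = [k.lower() for k in manual]
    n = len(gold)
    top_n = [k.lower() for k in ranked[:n]]

    def matches(candidate, reference):
        if candidate == reference:
            return True
        # Assumed partial-match rule: the two keywords share at least one word.
        return partial and bool(set(candidate.split()) & set(reference.split()))

    hits_system = sum(any(matches(c, g) for g in gold) for c in top_n)
    hits_gold = sum(any(matches(c, g) for c in top_n) for g in gold)

    precision = hits_system / len(top_n) if top_n else 0.0
    recall = hits_gold / n if n else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data, not taken from the project corpora.
manual = ["metadata", "learning object", "keyword extraction", "glossary"]
ranked = ["metadata", "learning", "corpus", "glossary", "tutor"]
exact_scores = keyword_overlap(manual, ranked)
partial_scores = keyword_overlap(manual, ranked, partial=True)
```

With exact matching, and when the KWE returns at least n candidates, both sets have size n, so precision, recall and F-measure coincide.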

KWE Evaluation – step 2

• Measure Inter-Annotator Agreement (IAA)

• Participants read text (Calimera „Multimedia“)

• Participants assign keywords to that text (ideally not more than 15)

• KWE produces keywords for text


1. Agreement is measured between human annotators

2. Agreement is measured between KWE and human annotators

We have tested two measures / approaches:

– kappa according to Bruce / Wiebe

– AC1, an alternative agreement weighting suggested by Debra Haley at OU, based on Gwet

Language     IAA human annotators   IAA of KWE with best settings
Bulgarian    0.63                   0.99
Czech        0.71                   0.78
Dutch        0.67                   0.72
English      0.62                   0.82
German       0.64                   0.63
Polish       0.63                   0.67
Portuguese   0.58                   0.67
Romanian     0.59                   0.61
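To make the difference between the two agreement measures concrete, here is a minimal sketch for two raters and a binary decision (candidate term is / is not a keyword). It uses the textbook two-rater forms of Cohen's kappa and Gwet's AC1; whether the project used exactly these variants, and how annotator output was mapped to 0/1 labels, is an assumption. The function name `agreement` is hypothetical.

```python
def agreement(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1).

    rater_a, rater_b -- equally long lists of 0/1 labels, one per candidate term
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Share of items each rater labels as keyword (1).
    p_a = sum(rater_a) / n
    p_b = sum(rater_b) / n

    # Cohen: chance agreement from each rater's own marginal distribution.
    p_e_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Gwet: chance agreement from the mean marginal; for two categories
    # this is 2 * pi * (1 - pi) and never exceeds 0.5.
    pi = (p_a + p_b) / 2
    p_e_ac1 = 2 * pi * (1 - pi)

    # Kappa is undefined when chance agreement is 1 (both raters constant).
    kappa = (p_o - p_e_kappa) / (1 - p_e_kappa) if p_e_kappa < 1 else float("nan")
    ac1 = (p_o - p_e_ac1) / (1 - p_e_ac1)
    return p_o, kappa, ac1


# Hypothetical labels for ten candidate terms.
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
p_o, kappa, ac1 = agreement(rater_a, rater_b)
```

When most candidates are rejected by both raters, kappa's chance term grows and kappa drops, while AC1 stays closer to the observed agreement.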


• Humans judge the adequacy of keywords

• Participants read text (Calimera „Multimedia“)

• Participants see 20 KW generated by the KWE and rate them

• Scale 1 – 4 (excellent – not acceptable)

• 5 = not sure

Language     20 kw   First 5 kw   First 10 kw
Bulgarian    2.21    2.54         2.12
Czech        2.22    1.96         1.96
Dutch        1.93    1.68         1.64
English      2.15    2.52         2.22
German       2.06    1.96         1.96
Polish       1.95    2.06         2.1
Portuguese   2.34    2.08         1.94
Romanian     2.14    1.8          2.06
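Averages over the first k keywords, as in the table above, could be computed along these lines. The sketch assumes that "not sure" answers (5) are left out of the average; the slides do not state how they were treated. The names `mean_rating` and `NOT_SURE` are hypothetical.

```python
NOT_SURE = 5  # rating value meaning "not sure"; assumed to be excluded from averages


def mean_rating(ratings_by_rank, k):
    """Average rating of the first k keywords (1 = excellent, 4 = not acceptable).

    ratings_by_rank -- list where entry i holds all participants' ratings
                       for the keyword at rank i + 1
    Returns None if no usable rating exists.
    """
    valid = [r for ratings in ratings_by_rank[:k] for r in ratings if r != NOT_SURE]
    return sum(valid) / len(valid) if valid else None


# Hypothetical ratings from two participants for four keywords.
ratings = [[1, 2], [2, 5], [4, 3], [1, 1]]
first_two = mean_rating(ratings, 2)
all_four = mean_rating(ratings, 4)
```

Lower values are better on this scale.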

GCD Evaluation - step 1

• A human annotator marked definitions in document d

• GCD extracts defining contexts from same document d

• Measure overlap between both sets

• Overlap is measured on the sentence level, partial overlap counts

Is-definitions

Language     Recall   Precision
Bulgarian    0.64     0.18
Czech        0.48     0.29
Dutch        0.92     0.21
English      0.58     0.17
German       0.55     0.37
Polish       0.74     0.22
Romanian     1.0      0.53
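The sentence-level comparison with partial overlap can be sketched as below. Representing each definition as a set of sentence indices, and the function name `definition_scores`, are assumptions made for illustration only.

```python
def definition_scores(manual, extracted):
    """Sentence-level recall and precision where any shared sentence counts.

    manual    -- list of sets of sentence indices marked by the annotator
    extracted -- list of sets of sentence indices returned by the GCD
    Returns (recall, precision).
    """
    # A manual definition is found if some extracted context shares a sentence.
    found = sum(any(m & e for e in extracted) for m in manual)
    # An extracted context is correct if it shares a sentence with a manual one.
    correct = sum(any(e & m for m in manual) for e in extracted)

    recall = found / len(manual) if manual else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical document: three manual definitions, five extracted contexts.
manual = [{3}, {10, 11}, {20}]
extracted = [{3}, {11}, {5}, {7}, {15}]
recall, precision = definition_scores(manual, extracted)
```

In this toy case the detector finds two of three definitions but three of its five suggestions are spurious, the same high-recall, low-precision pattern seen in the table.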

GCD Evaluation – step 2

• Measure Inter-Annotator Agreement

• Experiments run for Polish and Dutch

• Prevalence-adjusted version of kappa used as a measure

• Polish: 0.42; Dutch: 0.44

• IAA rather low for this task
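One widely used prevalence-adjusted variant is PABAK (prevalence-adjusted bias-adjusted kappa), which for two raters and a binary decision reduces to 2 * p_o - 1, where p_o is the observed agreement. The slides do not name the exact variant used, so treating it as PABAK is an assumption; the function name `pabak` is hypothetical.

```python
def pabak(rater_a, rater_b):
    """PABAK for two raters and binary labels: 2 * observed agreement - 1.

    rater_a, rater_b -- equally long lists of 0/1 labels
                        (1 = sentence is a definition)
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    return 2 * p_o - 1


# Hypothetical labels: definitions are rare and the raters pick different sentences.
rater_a = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rater_b = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
score = pabak(rater_a, rater_b)
```

Because definitions are rare, most sentences are "not a definition" for both annotators; the adjustment keeps this skew from distorting the chance-agreement term.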


• Judging quality of extracted definitions

• Participants read text

• Participants get definitions extracted by GCD for that text and rate quality

• Scale 1 – 4 (excellent – not acceptable)

• 5 = not sure

Language     # defin.   # testers   Av. value
Bulgarian    25         7           2.7
Czech        24         6           3.1
Dutch        14         6           2.8
English      10         4           3.3
German       5          5           2.1
Polish       11         5           2.7
Portuguese   36         6           2.2
Romanian     9          7           3.0


Further findings

• Relatively high variance (many ‚1‘ and ‚4‘)

• Disagreement between users about the quality of individual definitions

Individual user feedback - KWE

• The quality of the generated keywords remains an issue

• Variance in the responses from different language groups

• We suspect a correlation between language of the users and their satisfaction

• Performance of the KWE depends on language settings; we have to investigate them further

Individual user feedback – GCD

• Not all the suggested definitions are real definitions.

• Terms are ok, but definitions cited are often not what would be expected.

• Some terms proposed in the glossary did not make any sense.

• The ability to see the context where a definition has been found is useful.

Consequences - KWE

• Use non-distributional information to rank keywords (layout, chains)

• Present first 10 keywords to user, more keywords on demand

• For keyphrases, present most frequent attested form

• Users can add their own keywords

Consequences - GCD

• Split definitions into types and tackle the most important types

• Use machine learning alongside local grammars

• Look into the part of the grammars which extract the defined term

• Users can add their own definitions

Plans for final phase

• KWE, work with lexical chains

• GCD, extend ML experiments

• Finalize documentation of the tools

Validation

User scenarios with NLP tools embedded:

1. Content provider adds keywords and a glossary for a new learning object

2. Student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object

Validation

3. Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object

Validation

We want to get feedback about:

• The users' general attitude towards the tools

• The users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)

User feedback

• Participants appreciate the option to add their own data

• Participants found it easy to use the functions

Plans for the next phase

Improve precision of extraction results:

• KWE – implement lexical chainer

• GCD – use machine learning in combination with local grammars or as a substitute for these grammars

• Finalize documentation of the tools

Corpus statistics – full corpus

• Measuring lengths of corpora (# of documents, tokens)

• Measuring token / type ratio

• Measuring type / lemma ratio

Language     # of documents   # of tokens
Bulgarian    55               218900
Czech        1343             962103
Dutch        77               505779
English      125              1449658
German       36               265837
Polish       35               299071
Portuguese   29               244702
Romanian     69               484689

Language     Token / type   Types / Lemma
Bulgarian    9.65           2.78
Czech        18.37          1.86
Dutch        14.18          1.15
English      34.93          2.8 (tbc)
German       8.76           1.38
Polish       7.46           1.78
Romanian     12.43          1.54
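Both ratios can be derived from a lemmatised token list, roughly as sketched below. The input format (a list of word form / lemma pairs), the case-folding of word forms, and the name `corpus_ratios` are assumptions; the project's own counting conventions may differ.

```python
def corpus_ratios(tagged_tokens):
    """Return (token/type ratio, type/lemma ratio) for a lemmatised corpus.

    tagged_tokens -- list of (word_form, lemma) pairs, one per running token
    """
    # Types are distinct word forms; case is folded here (an assumption).
    types = {form.lower() for form, _ in tagged_tokens}
    lemmas = {lemma.lower() for _, lemma in tagged_tokens}
    if not types or not lemmas:
        raise ValueError("corpus is empty")
    token_type = len(tagged_tokens) / len(types)
    type_lemma = len(types) / len(lemmas)
    return token_type, type_lemma


# Hypothetical six-token corpus.
tagged = [
    ("walks", "walk"), ("walked", "walk"), ("walk", "walk"),
    ("walks", "walk"), ("dog", "dog"), ("dogs", "dog"),
]
token_type, type_lemma = corpus_ratios(tagged)
```

A low token / type ratio means each word form is seen only a few times, and a high type / lemma ratio means many inflected forms per lemma.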

Corpus statistics – full corpus

• Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with sparseness)

• English has by far the highest ratio

• Czech, Dutch, Portuguese and Romanian are in between

• The type / lemma ratio reflects the richness of inflectional paradigms

To do

• Please check / verify these numbers

• Report, for the M24 deliverable, about improvements / reanalysis of the corpora (I am aware of such activities for Bulgarian, German, and English)

Corpus statistics – annotated subcorpus

• Measuring lengths of annotated documents

• Measuring distribution of manually marked keywords over documents

• Measuring the share of keyphrases

Language     # of annotated documents   Average length (# of tokens)
Bulgarian    55                         3980
Czech        465                        672
Dutch        72                         6912
English      36                         9707
German       34                         8201
Polish       25                         4432
Portuguese   29                         8438
Romanian     41                         3375

Language     # of keywords   Average # of keywords per doc.
Bulgarian    3236            77
Czech        1640            3.5
Dutch        1706            24
English      1174            26
German       1344            39.5
Polish       1033            41
Portuguese   997             34
Romanian     2555            62

Language     Keyphrases
Bulgarian    43 %
Czech        27 %
Dutch        25 %
English      62 %
German       10 %
Polish       67 %
Portuguese   14 %
Romanian     30 %
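The two subcorpus figures, keywords per document and share of keyphrases, could be computed as in the following sketch. It assumes that a keyphrase is any manually marked keyword consisting of more than one whitespace-separated word; the slides do not define the term, and the name `keyword_stats` is hypothetical.

```python
def keyword_stats(keywords_per_doc):
    """Return (average keywords per document, share of keyphrases).

    keywords_per_doc -- one list of manually marked keywords per document
    A keyphrase is assumed to be a keyword of two or more words.
    """
    all_keywords = [kw for doc in keywords_per_doc for kw in doc]
    if not keywords_per_doc or not all_keywords:
        raise ValueError("no documents or no keywords")
    average = len(all_keywords) / len(keywords_per_doc)
    phrases = sum(len(kw.split()) > 1 for kw in all_keywords)
    return average, phrases / len(all_keywords)


# Hypothetical annotation of two documents.
docs = [
    ["metadata", "learning object"],
    ["glossary", "keyword extraction", "corpus", "local grammar"],
]
average, phrase_share = keyword_stats(docs)
```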
