Metadata generation and glossary creation in eLearning
Lothar Lemnitzer
Review meeting, Zürich, 25 January 2008
Outline
• Demonstration of the functionalities
• Where we stand
• Evaluation of tools
• Consequences for the development of the tools in the final phase
Demo
We simulate a tutor who adds a learning object and then generates and edits additional data
Where we stand (1)
Achievements reached in the first year of the project:
• Annotated corpora of learning objects
• Stand-alone prototype of keyword extractor (KWE)
• Stand-alone prototype of glossary candidate detector (GCD)
Where we stand (2)
Achievements reached in the second year of the project:
• Quantitative evaluation of the corpora and tools
• Validation of the tools in user-centered usage scenarios for all languages
• Further development of tools in response to the results of the evaluation
Evaluation - rationale
Quantitative evaluation is needed to
• Inform the further development of the tools (formative)
• Find the optimal setting / parameters for each language (summative)
Evaluation (1)
Evaluation is applied to:
• the corpora of learning objects
• the keyword extractor
• the glossary candidate detector
In the following, I will focus on the tool evaluation
Evaluation (2)
Evaluation of the tools comprises
1. measuring recall and precision compared to the manual annotation
2. measuring agreement on each task between different annotators
3. measuring acceptance of keywords / definitions (rated on a scale)
KWE Evaluation step 1
• One human annotator marked n keywords in document d
• The first n choices of the KWE for document d are extracted
• Measure the overlap between both sets
• Also measure partial matches
Language     Best method     F-measure
Bulgarian    TFIDF/ADRIDF    0.25
Czech        TFIDF/ADRIDF    0.18
Dutch        TFIDF           0.29
English      ADRIDF          0.33
German       TFIDF           0.16
Polish       ADRIDF          0.26
Portuguese   TFIDF           0.22
Romanian     TFIDF/ADRIDF    0.15
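The overlap measurement of step 1 can be sketched as follows. This is a minimal illustration, not the project's evaluation script: the function names, the token-based notion of a partial match and the weight of 0.5 for such a match are assumptions made for the example. Because both sets contain n items, precision, recall and F-measure coincide here.

```python
def normalise(keyword):
    """Lower-case a keyword and split it into a tuple of tokens."""
    return tuple(keyword.lower().split())


def overlap_scores(human_keywords, kwe_ranked, partial_credit=0.5):
    """Compare n human keywords with the first n choices of the extractor.

    Returns (precision, recall, f_measure). An extracted keyword scores 1.0
    for an exact match and `partial_credit` if it only shares a token with
    some human keyword. The partial-match rule and its weight are assumptions.
    """
    gold = [normalise(k) for k in human_keywords]
    n = len(gold)
    system = [normalise(k) for k in kwe_ranked[:n]]  # first n choices only
    gold_set = set(gold)
    gold_tokens = {tok for kw in gold for tok in kw}

    score = 0.0
    for kw in system:
        if kw in gold_set:
            score += 1.0                      # exact match
        elif gold_tokens.intersection(kw):
            score += partial_credit           # partial match

    precision = score / len(system) if system else 0.0
    recall = score / n if n else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical keywords, invented for illustration.
human = ["digital library", "metadata", "multimedia", "copyright"]
kwe = ["metadata", "library", "image", "copyright", "video"]
print(overlap_scores(human, kwe))                    # with partial matches
print(overlap_scores(human, kwe, partial_credit=0))  # exact matches only
```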
KWE Evaluation – step 2
• Measure Inter-Annotator Agreement (IAA)
• Participants read text (Calimera „Multimedia“)
• Participants assign keywords to that text (ideally not more than 15)
• KWE produces keywords for text
KWE Evaluation – step 2
1. Agreement is measured between human annotators
2. Agreement is measured between KWE and human annotators
We have tested two measures / approaches
– kappa according to Bruce / Wiebe
– AC1, an alternative agreement weighting suggested by Debra Haley at OU, based on Gwet
Language     IAA human annotators   IAA of KWE with best settings
Bulgarian    0.63                   0.99
Czech        0.71                   0.78
Dutch        0.67                   0.72
English      0.62                   0.82
German       0.64                   0.63
Polish       0.63                   0.67
Portuguese   0.58                   0.67
Romanian     0.59                   0.61
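To make the difference between the two agreement measures concrete, here is a small sketch for the simplest case: two raters and a binary decision per candidate term (keyword or not). It uses the standard two-rater formulas for Cohen's kappa and Gwet's AC1; how the project combined more than two annotators, and the exact kappa variant taken from Bruce / Wiebe, are not stated in the slides, so this is an assumption-laden illustration only.

```python
import math


def agreement(rater_a, rater_b):
    """Return (observed agreement, Cohen's kappa, Gwet's AC1) for two raters.

    Both arguments are lists of 0/1 decisions over the same candidate terms
    (1 = "this term is a keyword").
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n   # share of "keyword" decisions, rater A
    p_b = sum(rater_b) / n   # share of "keyword" decisions, rater B

    # Chance agreement under Cohen's kappa: product of the raters' marginals.
    pe_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)
    # Chance agreement under AC1 (two categories): 2 * pi * (1 - pi),
    # where pi is the mean share of "keyword" decisions.
    pi = (p_a + p_b) / 2
    pe_ac1 = 2 * pi * (1 - pi)

    # Kappa is undefined when both raters use a single category throughout.
    kappa = (p_obs - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else math.nan
    ac1 = (p_obs - pe_ac1) / (1 - pe_ac1)   # pe_ac1 <= 0.5, so no zero division
    return p_obs, kappa, ac1


# Hypothetical decisions over ten candidate terms.
a = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
b = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
p_obs, kappa, ac1 = agreement(a, b)
print(round(p_obs, 3), round(kappa, 3), round(ac1, 3))
```

In this toy case the raters agree on 8 of 10 terms, yet kappa is much lower than AC1 because most terms are non-keywords for both raters. This sensitivity to skewed category frequencies is a common reason for trying AC1 next to kappa.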
KWE Evaluation – step 3
• Humans judge the adequacy of keywords
• Participants read text (Calimera „Multimedia“)
• Participants see 20 KW generated by the KWE and rate them
• Scale 1 – 4 (excellent – not acceptable)
• 5 = not sure
Language     20 kw   First 5 kw   First 10 kw
Bulgarian    2.21    2.54         2.12
Czech        2.22    1.96         1.96
Dutch        1.93    1.68         1.64
English      2.15    2.52         2.22
German       2.06    1.96         1.96
Polish       1.95    2.06         2.1
Portuguese   2.34    2.08         1.94
Romanian     2.14    1.8          2.06
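Average ratings for all 20 keywords and for the first 5 or 10 can be computed along the following lines. The sketch assumes that "5 = not sure" answers are left out of the average; the slides do not say how these answers were treated, and the function name and data layout are invented for the example.

```python
def mean_rating(ratings_by_rank, top_k=None, not_sure=5):
    """Average rating over the first `top_k` keywords (all if top_k is None).

    ratings_by_rank[i] holds every participant's rating for the keyword at
    rank i. Ratings equal to `not_sure` are ignored (an assumption).
    Returns None if no usable rating remains.
    """
    selected = ratings_by_rank if top_k is None else ratings_by_rank[:top_k]
    valid = [r for ratings in selected for r in ratings if r != not_sure]
    return sum(valid) / len(valid) if valid else None


# Hypothetical ratings: three keywords, two participants each.
ratings = [[1, 2], [2, 5], [4, 3]]
print(mean_rating(ratings))            # all keywords
print(mean_rating(ratings, top_k=2))   # first two keywords only
```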
GCD Evaluation - step 1
• A human annotator marked definitions in document d
• GCD extracts defining contexts from same document d
• Measure overlap between both sets
• Overlap is measured on the sentence level, partial overlap counts
Is-definitions
Language     Recall   Precision
Bulgarian    0.64     0.18
Czech        0.48     0.29
Dutch        0.92     0.21
English      0.58     0.17
German       0.55     0.37
Polish       0.74     0.22
Portuguese   0.69     0.30
Romanian     1.0      0.53
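Sentence-level scoring where partial overlap counts can be sketched as below. The representation of a definition as a span of sentence indices and the rule "one shared sentence is enough" are assumptions chosen for illustration; the project's actual matching procedure may differ.

```python
def sentence_level_scores(gold_spans, extracted_spans):
    """Return (recall, precision) for definition spans.

    Each span is (first_sentence, last_sentence) with inclusive indices.
    A gold and an extracted span match as soon as they share one sentence,
    so partial overlap counts as a hit (an assumption).
    """
    def overlaps(a, b):
        return a[0] <= b[1] and b[0] <= a[1]

    # Extracted spans that touch some manually marked definition.
    hits_extracted = sum(
        any(overlaps(e, g) for g in gold_spans) for e in extracted_spans
    )
    # Manually marked definitions that were found at least partially.
    hits_gold = sum(
        any(overlaps(g, e) for e in extracted_spans) for g in gold_spans
    )
    recall = hits_gold / len(gold_spans) if gold_spans else 0.0
    precision = hits_extracted / len(extracted_spans) if extracted_spans else 0.0
    return recall, precision


# Hypothetical document: 3 marked definitions, 5 extracted contexts.
gold = [(3, 3), (10, 11), (20, 20)]
extracted = [(3, 3), (11, 11), (15, 15), (30, 30), (42, 42)]
recall, precision = sentence_level_scores(gold, extracted)
print(round(recall, 2), round(precision, 2))
```

The toy data reproduces the general pattern of the table: most marked definitions are found, but many extracted contexts are not definitions.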
GCD Evaluation – step 2
• Measure Inter-Annotator Agreement
• Experiments run for Polish and Dutch
• Prevalence-adjusted version of kappa used as a measure
• Polish: 0.42; Dutch: 0.44
• IAA rather low for this task
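The slides do not name the exact prevalence-adjusted measure. One widely used variant is PABAK (prevalence-adjusted, bias-adjusted kappa), which for k categories is (k * p_obs - 1) / (k - 1) and therefore 2 * p_obs - 1 for a binary decision. The sketch below assumes that variant, two annotators and a yes/no decision per sentence.

```python
def pabak(rater_a, rater_b, categories=2):
    """Prevalence-adjusted, bias-adjusted kappa for two raters.

    Depends only on the observed agreement p_obs:
    (categories * p_obs - 1) / (categories - 1).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
    return (categories * p_obs - 1) / (categories - 1)


# Hypothetical case: two annotators agree on 71 of 100 sentences.
a = [1] * 100
b = [1] * 71 + [0] * 29
print(round(pabak(a, b), 2))
```

If PABAK is indeed the measure that was used, a value of 0.42 corresponds to raw agreement on about 71 % of the decisions.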
GCD Evaluation – step 3
• Judging quality of extracted definitions
• Participants read text
• Participants get definitions extracted by GCD for that text and rate quality
• Scale 1 – 4 (excellent – not acceptable)
• 5 = not sure
Language     # definitions   # testers   Average value
Bulgarian    25              7           2.7
Czech        24              6           3.1
Dutch        14              6           2.8
English      10              4           3.3
German       5               5           2.1
Polish       11              5           2.7
Portuguese   36              6           2.2
Romanian     9               7           3.0
GCD Evaluation – step 3
Further findings
• Relatively high variance (many '1' and '4')
• Disagreement between users about the quality of individual definitions
Individual user feedback - KWE
• The quality of the generated keywords remains an issue
• Variance in the responses from different language groups
• We suspect a correlation between language of the users and their satisfaction
• Performance of KWE relies on language settings, we have to investigate them further
Individual user feedback – GCD
• Not all the suggested definitions are real definitions.
• Terms are ok, but definitions cited are often not what would be expected.
• Some terms proposed in the glossary did not make any sense.
• The ability to see the context where a definition has been found is useful.
Consequences - KWE
• Use non-distributional information to rank keywords (layout, chains)
• Present first 10 keywords to user, more keywords on demand
• For keyphrases, present most frequent attested form
• Users can add their own keywords
Consequences - GCD
• Split definitions into types and tackle the most important types
• Use machine learning alongside local grammars
• Look into the part of the grammars which extract the defined term
• Users can add their own definitions
Plans for final phase
• KWE, work with lexical chains
• GCD, extend ML experiments
• Finalize documentation of the tools
Validation
User scenarios with NLP tools embedded:
1. Content provider adds keywords and a glossary for a new learning object
2. Student uses keywords and definitions extracted from a learning object to prepare a presentation of the content of that learning object
Validation
3. Students use keywords and definitions extracted from a learning object to prepare a quiz / exam about the content of that learning object
Validation
We want to get feedback about
• The users' general attitude towards the tools
• The users' satisfaction with the results obtained by the tools in the particular situation of use (scenario)
User feedback
• Participants appreciate the option to add their own data
• Participants found it easy to use the functions
Plans for the next phase
Improve precision of extraction results:
• KWE – implement lexical chainer
• GCD – use machine learning in combination with local grammars or substituting these grammars
• Finalize documentation of the tools
Corpus statistics – full corpus
• Measuring lengths of corpora (# of documents, tokens)
• Measuring token / type ratio
• Measuring type / lemma ratio
Language     # of documents   # of tokens
Bulgarian    55               218900
Czech        1343             962103
Dutch        77               505779
English      125              1449658
German       36               265837
Polish       35               299071
Portuguese   29               244702
Romanian     69               484689
Language     Tokens / type   Types / lemma
Bulgarian    9.65            2.78
Czech        18.37           1.86
Dutch        14.18           1.15
English      34.93           2.8 (tbc)
German       8.76            1.38
Polish       7.46            1.78
Portuguese   12.27           1.42
Romanian     12.43           1.54
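The two ratios can be computed as in the following sketch. It assumes that types are lower-cased word forms and that lemmas come from an external lemmatiser, represented here by a plain dictionary; both choices, as well as the function name, are assumptions for the example and not a description of the project's corpus tools.

```python
def corpus_ratios(tokens, lemma_of):
    """Return (tokens per type, types per lemma) for a list of tokens.

    `lemma_of` maps a word form to its lemma; forms missing from the
    mapping are treated as their own lemma. Lower-casing is an assumption.
    """
    forms = [t.lower() for t in tokens]
    types = set(forms)
    lemmas = {lemma_of.get(form, form) for form in types}
    if not types:
        raise ValueError("empty corpus")
    return len(forms) / len(types), len(types) / len(lemmas)


# Hypothetical ten-token corpus with a hand-made lemma mapping.
tokens = "the tool extracts keywords and the tools extract a keyword".split()
lemma_of = {"tools": "tool", "extracts": "extract", "keywords": "keyword"}
token_type, type_lemma = corpus_ratios(tokens, lemma_of)
print(round(token_type, 2), round(type_lemma, 2))
```

Here 10 tokens fall into 9 types and 6 lemmas. A high types-per-lemma value means that many inflected forms share one lemma.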
Corpus statistics – full corpus
• Bulgarian, German and Polish corpora have a very low number of tokens per type (probably problems with sparseness)
• English has by far the highest ratio
• Czech, Dutch, Portuguese and Romanian are in between
• Type / lemma ratio reflects richness of inflectional paradigms
To do
• Please check / verify these numbers
• Report, for the M24 deliverable, about improvements / reanalysis of the corpora (I am aware of such activities for Bulgarian, German, and English)
Corpus statistics – annotated subcorpus
• Measuring lengths of annotated documents
• Measuring distribution of manually marked keywords over documents
• Measuring the share of keyphrases
Language     # of annotated documents   Average length (# of tokens)
Bulgarian    55                         3980
Czech        465                        672
Dutch        72                         6912
English      36                         9707
German       34                         8201
Polish       25                         4432
Portuguese   29                         8438
Romanian     41                         3375
Language     # of keywords   Average # of keywords per doc.
Bulgarian    3236            77
Czech        1640            3.5
Dutch        1706            24
English      1174            26
German       1344            39.5
Polish       1033            41
Portuguese   997             34
Romanian     2555            62
Language     Keyphrases
Bulgarian    43 %
Czech        27 %
Dutch        25 %
English      62 %
German       10 %
Polish       67 %
Portuguese   14 %
Romanian     30 %
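The keyword statistics for the annotated subcorpus can be derived as sketched below. The sketch assumes that a keyphrase is any manually marked keyword with more than one whitespace-separated token; the slides do not define the term, and the function name and sample data are invented for the example.

```python
def keyword_statistics(keywords_per_document):
    """Return (total keywords, average per document, keyphrase share in %).

    `keywords_per_document` is a list with one list of keywords per
    annotated document. A keyphrase is taken to be a keyword consisting
    of more than one token (an assumption).
    """
    all_keywords = [kw for doc in keywords_per_document for kw in doc]
    if not keywords_per_document or not all_keywords:
        raise ValueError("need at least one document and one keyword")
    keyphrases = [kw for kw in all_keywords if len(kw.split()) > 1]
    average = len(all_keywords) / len(keywords_per_document)
    share = 100 * len(keyphrases) / len(all_keywords)
    return len(all_keywords), average, share


# Hypothetical annotation of two documents.
docs = [["metadata", "learning object", "glossary"], ["keyword extractor"]]
print(keyword_statistics(docs))
```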