Post on 15-Jan-2016
LT4eL - WP1: Setting the scene
WP leader: UAIC
Univ. Al. I. Cuza of Iasi, Faculty of Computer Science
Dan Cristea, Corina Forăscu, Dan Tufiş,
Ionuţ Pistol, Diana Trandabăţ, Adrian Iftene
Contact: dcristea@info.uaic.ro
Utrecht Review Meeting, February 1, 2007
Objectives
1. inventory and classification of existing tools necessary for the development of the relevant functionalities (i.e. keyword extractor, glossary candidate detector);
2. collection and normalization of the learning material related to the use of the computer in education (Humanities, Social Sciences);
3. investigation of IPR issues;
4. adoption of relevant standards for linguistic annotation of learning objects;
5. dissemination of the results through a Web portal
Partners in WP1
• Utrecht University (UU), The Netherlands
• University of Hamburg (UHH), Germany
• University of Lisbon (FFCUL), Portugal
• Charles University Prague (CUP), Czech Republic
• Institute for Parallel Processing, Bulgarian Academy of Sciences (IPP-BAS), Bulgaria
• University of Tübingen (UTU), Germany
• Institute of Computer Science, Polish Academy of Sciences (ICS-PAS), Poland
• Zürich University of Applied Sciences Winterthur (ZHW), Switzerland
• University of Malta (UOM), Malta
[Figure: LT4eL architecture diagram. Labelled components: user documents (PDF, DOC, HTML, SCORM, XML); CONVERTOR 1 and CONVERTOR 2; documents as HTML, SCORM pseudo-structure and basic XML; linguistic processor (lemmatizer, POS tagger, partial parser); linguistically annotated XML; metadata (keywords); ontology; glossary; repository; crosslingual retrieval; LMS with user profile; one lexicon per language (BG, CZ, DT, EN, GE, MT, PL, PT, RO).]
The Portal
• A working space:
  – Repository for resources, tools, deliverables
  – Exchange of information among participants
  – Statistics
• Hosted by UAIC:
  – January 2007: 1.15 Gb (without realTimeStat, searchForm, upload/updateForm)
• Address: http://consilr.info.uaic.ro/uploads_lt4el
  – Username: guestLt4eL
  – Passwd: elearning
Demo version on CD
O1. Collection of language resources and tools (1)
• Inventory and classification of existing tools (http://consilr.info.uaic.ro/uploads_lt4el/tools/all.php?) relevant to:
  – the integration of language technology resources in eLearning (WP2)
  – the integration of semantic knowledge (WP3)
O1. Collection of language resources and tools (2)
• Inventory and classification of existing language resources
  – corpora and frequency lists: http://consilr.info.uaic.ro/uploads_lt4el/menu/all.php
  – lexica: http://www.let.uu.nl/lt4el/wiki/index.php/Lexica_Joint_Table
O2. Collection of LOs: the portal
Uploads, updates & real-time statistics at http://consilr.info.uaic.ro/uploads_lt4el/
Criteria (→ attributes):
- Subdomains relevant for beginners in IST & e-learning → Domain
- Multilingualism → Language
- Medium sized documents → Number of words
- Clear IPR status → IPR
- Uniformity in topics → keywords selected initially
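The criteria-to-attribute mapping above can be pictured as a small record type. This is only a sketch: the class name, field names, the example values and the word-count bounds are illustrative assumptions, not the portal's actual schema (the slides give no thresholds for "medium sized").

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LearningObject:
    """Hypothetical record for one LO; fields mirror the attributes on the slide."""
    title: str
    domain: str        # sub-domain label, e.g. "1.1.2.1 Using MS Word"
    language: str      # language code, e.g. "RO"
    num_words: int     # document size in words
    ipr: str           # usage category, e.g. "free", "restricted", "research"
    keywords: List[str] = field(default_factory=list)


def is_medium_sized(lo: LearningObject, low: int = 1_000, high: int = 50_000) -> bool:
    """Check the size criterion; the bounds are assumptions for illustration."""
    return low <= lo.num_words <= high


example = LearningObject(
    title="Using MS Word",
    domain="1.1.2.1 Using MS Word",
    language="RO",
    num_words=12_500,
    ipr="free",
    keywords=["document", "formatting"],
)
print(is_medium_sized(example))
```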
Collection of LOs: domains
1. Use of computers in education, with sub-domains:
  1.1 Teaching academic skills, with sub-domains:
    1.1.1 Academic skills
    1.1.2 Relevant computer skills for the above tasks (MS Word, Excel, PowerPoint, LaTeX, Web pages, XML)
    1.1.3 Basic skills (use of computer for beginners) (chats, e-mail, Internet)
  1.2 e-Learning, e-Marketing
  1.3 The I*Teach document (Leonardo project, http://i-teach.fmi.uni-sofia.bg/)
  1.4 Impact of use of computers in society
  1.5 Studies about use of computers in schools / high schools
  1.6 Impact of e-Learning on education
2. Calimera documents (parallel corpus developed in the Calimera FP5 project, http://www.calimera.org/ )
Collection of LOs: domains coverage
Domain | Total (words) | Avg | # lang
1.1 Teaching Academic skills | 66,507 | 13,301 | 4
1.1.1 Academic skills | 90,033 | 18,007 | 2
1.1.1.1 Writing a diploma paper | 119,335 | 23,867 | 7
1.1.1.2 Making a presentation | 41,303 | 8,261 | 5
1.1.1.3 Writing a scientific summary | 21,699 | 4,340 | 4
1.1.1.4 Making an interview | 22,798 | 4,560 | 3
1.1.1.5 Working out a small project | 5,450 | 1,090 | 2
1.1.2 Relevant computer skills for above tasks | 606,555 | 121,311 | 8
1.1.2.1 Using MS Word | 269,525 | 53,905 | 8
1.1.2.2 Using Excel | 136,803 | 27,361 | 7
1.1.2.3 Using PowerPoint | 70,403 | 14,081 | 7
1.1.2.4 Using LaTeX | 242,163 | 48,433 | 7
1.1.2.5 Creating Web pages | 549,233 | 109,847 | 8
1.1.2.6 Using XML | 259,120 | 51,824 | 8
1.1.3 Basic computer skills (use of computers for beginners) | 123,790 | 24,758 | 4
1.1.3.1 Using chats | 18,870 | 3,774 | 6
1.1.3.2 Using email | 102,023 | 20,405 | 8
1.1.3.3 Accessing the Internet | 189,499 | 37,900 | 8
1.2 eLearning, eMarketing | 320,537 | 64,107 | 8
1.3 The I*Teach document | 126,980 | 25,396 | 2
1.4 Impact of use of computers in society | 121,446 | 24,289 | 4
1.5 Studies about use of computers in schools / high schools | 559,362 | 111,872 | 4
1.6 Impact of eLearning on education | 215,453 | 43,091 | 7
2.1 Calimera full guidelines | 656,284 | 131,257 | 8
2.2 Calimera summaries | 36,815 | 7,363 | 5
Total | 4,971,986 | |
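As a consistency check on the coverage figures, the short script below re-adds the Total column and compares it with the grand total on the slide. It also tests one observation that the slide itself does not state: every Avg value appears to equal Total divided by 5, whatever the "# lang" value is. The variable names are illustrative.

```python
# (domain, total words, avg) copied from the coverage table
rows = [
    ("Teaching academic skills", 66507, 13301),
    ("Academic skills", 90033, 18007),
    ("Writing a diploma paper", 119335, 23867),
    ("Making a presentation", 41303, 8261),
    ("Writing a scientific summary", 21699, 4340),
    ("Making an interview", 22798, 4560),
    ("Working out a small project", 5450, 1090),
    ("Relevant computer skills", 606555, 121311),
    ("Using MS Word", 269525, 53905),
    ("Using Excel", 136803, 27361),
    ("Using PowerPoint", 70403, 14081),
    ("Using LaTeX", 242163, 48433),
    ("Creating Web pages", 549233, 109847),
    ("Using XML", 259120, 51824),
    ("Basic computer skills", 123790, 24758),
    ("Using chats", 18870, 3774),
    ("Using email", 102023, 20405),
    ("Accessing the Internet", 189499, 37900),
    ("eLearning, eMarketing", 320537, 64107),
    ("The I*Teach document", 126980, 25396),
    ("Impact of computers in society", 121446, 24289),
    ("Studies about computers in schools", 559362, 111872),
    ("Impact of eLearning on education", 215453, 43091),
    ("Calimera full guidelines", 656284, 131257),
    ("Calimera summaries", 36815, 7363),
]

# Sum of the Total column, to compare with the grand total on the slide.
grand_total = sum(total for _, total, _ in rows)

# Observation to test: Avg == Total / 5, rounded to the nearest integer.
avg_is_total_over_5 = all(round(total / 5) == avg for _, total, avg in rows)

print(grand_total, avg_is_total_over_5)
```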
The hierarchy of LOs’ formats
Collection of LOs: annotation layers
1. Initial documents: doc, pdf, html, txt → Base-XML
2. Linguistic annotation: tokens, POS, lemma, chunks → WP2 XML format (LT4ELAna.dtd)
3. Keywords, definitions and ontology links annotations
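Level 1 conversions
[Figure: Level 1 format hierarchy. Source formats: doc, pdf, latex, other. Intermediate formats: html, plain text. Target: Base-XML. Step shown: doc → html]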
Level 1 conversions: doc → html (UTF-8)
1. MS Office: Save As html
2. OpenOffice Writer SXC/ODT: Save As html
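Level 1 conversions
[Figure: Level 1 format hierarchy. Source formats: doc, pdf, latex, other. Intermediate formats: html, plain text. Target: Base-XML. Step shown: pdf → html]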
Level 1 conversions: pdf → html (UTF-8)
1. Adobe on-line conversion tool
2. pdfbox (Windows)
3. pdftohtml (Linux)
4. OpenOffice
5. Adobe Acrobat Professional
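Level 1 conversions
[Figure: Level 1 format hierarchy. Source formats: doc, pdf, latex, other. Intermediate formats: html, plain text. Target: Base-XML. Step shown: Base-XML convertor]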
Level 1 conversions: html → Base-XML
• The UAIC Java converter
  – keeps all the tags possibly useful (fixed)
  – produces a log of all the removed tags/data
• The CUP html2xml.pl converter
  – tags kept according to a DTD
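The "keep a fixed set of tags and log what is removed" idea can be sketched with Python's standard `html.parser`. This is not the UAIC Java converter or the CUP script: the set of kept tags, the class name and the function name are assumptions for illustration. The sketch drops attributes and does not repair badly nested HTML, so the output is only well-formed when the input is.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Assumed whitelist; the real converters define their own (fixed list or DTD).
KEPT_TAGS = {"html", "body", "h1", "h2", "h3", "p", "ul", "ol", "li", "table", "tr", "td"}
# Elements whose text content is dropped together with the tag.
DROP_CONTENT = {"script", "style"}


class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the filtered document
        self.removed = []    # log of removed tags, in document order
        self._skip = 0       # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in KEPT_TAGS:
            self.out.append(f"<{tag}>")   # attributes are dropped in this sketch
        else:
            self.removed.append(tag)
        if tag in DROP_CONTENT:
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in KEPT_TAGS:
            self.out.append(f"</{tag}>")
        if tag in DROP_CONTENT and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.out.append(escape(data))  # re-escape &, <, > for XML


def html_to_base_xml(html):
    """Return (filtered markup, list of removed tag names)."""
    parser = TagFilter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out), parser.removed


xml, removed = html_to_base_xml(
    '<p>Hello <font color="red">world</font> &amp; more</p><script>x=1</script>'
)
print(xml)
print(removed)
```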
Collection of LOs: second level
[Figure: Level 2 diagram. Labels: WP2 XML format; tok-pos-lemma; annotation layers morpho, tok, pos, lemma, NP. Step shown: language-specific tools]
Collection of LOs: second level
[Figure: Level 2 diagram. Labels: WP2 XML format; tok-pos-lemma; annotation layers morpho, tok, pos, lemma, NP. Step shown: scripts]
Collection of LOs: KW extractor
[Figure: Level 2 (WP2 XML format) and Level 3 (Man KD XML, Auto KD XML). Step shown: KW extractor]
Collection of LOs: KW extractor
[Figure: Level 2 (WP2 XML format) and Level 3 (Man KD XML, Auto KD XML). Step shown: KW extractor evaluation]
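One plausible way to compare the automatic keywords with the manual ones is set-based precision, recall and F1. The slides do not say which measure the project used, so the metric, the function name and the example keywords below are all assumptions.

```python
def keyword_scores(manual, automatic):
    """Precision, recall and F1 of automatic keywords against manual ones.

    Matching is exact after lower-casing; a real evaluation might match on lemmas.
    """
    manual_set = {k.lower() for k in manual}
    auto_set = {k.lower() for k in automatic}
    hits = manual_set & auto_set
    precision = len(hits) / len(auto_set) if auto_set else 0.0
    recall = len(hits) / len(manual_set) if manual_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical keywords for one document.
manual_kws = ["XML", "markup", "tag", "schema"]
auto_kws = ["xml", "tag", "document"]
p, r, f = keyword_scores(manual_kws, auto_kws)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```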
Collection of LOs: third level
[Figure: Level 3 diagram. Man KD XML (incl. km.xml, dm.xml) and Auto KD XML (incl. akw, adef). Step shown: def extractor]
km.xml: manually annotated kws
dm.xml: manually annotated defs
akw: automatically annotated kws
adef: automatically annotated defs
Collection of LOs: third level
[Figure: same Level 3 diagram and legend as on the previous slide. Step shown: def extractor evaluation]
Open issues
• Convertors
  – Tables, figures, page look…
• IPRs
  – Clarify the IPR status
    • authors & EU + national legislation
  – Define IPR categories for LOs:
    • usage (free, restricted, for research...)
WP1 over time
[Figure: timeline from the beginning of the project to now. Dates marked: December 05, February 06, May 06, Now. Milestones marked: D1.1, Official end of WP1, Evaluation.]
Activities shown along the timeline:
• Initial collection on Portal
• Structure & functionalities to the portal; BaseXML convertors; new LOs
• Levels 2&3 additions; new tools; grammars; guides, docs; ontology, TermLex
Proposal: the hierarchy seen as a processing environment
[Figure: the format hierarchy across the three levels.
Level 1: doc, pdf, latex, other; html, txt; sxml
Level 2: morpho, tok, pos, lemma, NP; tpl; wp2xml
Level 3: akw, adef; axml]
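Read as a processing environment, the hierarchy becomes a chain of conversions that a document follows from its upload format to the fully annotated form. The sketch below encodes one such reading. The edges, the step descriptions and the assumption that sxml is the Base-XML stage are inferred from the figure labels, not confirmed by the slides.

```python
# format -> (next format, step that produces it); all edges are assumptions
NEXT_STEP = {
    "doc": ("html", "Save As html"),
    "pdf": ("html", "pdf to html tool"),
    "html": ("sxml", "Base-XML convertor"),
    "txt": ("sxml", "Base-XML convertor"),
    "sxml": ("wp2xml", "language-specific tools (tok, pos, lemma, ...)"),
    "wp2xml": ("axml", "KW and def extractors"),
}


def conversion_chain(fmt, target="axml"):
    """List the formats a document passes through from fmt up to target."""
    chain = [fmt]
    while chain[-1] != target:
        if chain[-1] not in NEXT_STEP:
            raise ValueError(f"no known route from {fmt!r} to {target!r}")
        chain.append(NEXT_STEP[chain[-1]][0])
    return chain


print(" -> ".join(conversion_chain("pdf")))
```

A portal integrated with the LMS could use such a table to decide which tools still have to run on a newly uploaded document.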
Conclusions
• LOs, resources and tools collected
• Initially: portal seen as a repository
• Now: portal potentially integrated with the LMS as a processing environment