
Page 1: Text Mining Services for Trialogical Learning

21. - 23. 2. 2007

VŠB - Technická univerzita Ostrava

Pavel Smrž1, Ján Paralič2, Peter Smatana2, Karol Furdík2

1: Brno University of Technology, FIT, Božetěchova 2, 612 66 Brno,

University of Economics, Prague, W.Churchill Sq.4, 130 67 Praha, Czech Republic,

[email protected]

2: Technical University of Košice, Centre for Information Technologies,

Letná 9, 040 01 Košice, Slovakia

{Jan.Paralic, Peter.Smatana, Karol.Furdik}@tuke.sk

Page 2: Contents

KP-Lab project

Trialogical Learning and Activity Theory

Semantic Web Knowledge Middleware

Text Mining Services

• Pre-processing

• Learning Ontologies

• Classification

Future work

Page 3: KP-Lab Project

Full title: Knowledge Practices Laboratory

www.kp-lab.org

• Integrated EU-funded FP6 IST project No. 27490

• Starting date: February 1st, 2006

• Duration: 5 years

• 22 partners from 14 countries

Main goal: creating a learning system aimed at facilitating innovative practices of sharing, creating and working with knowledge in education and workplaces.

Page 4: Trialogical Learning

Challenge - to capture innovative practices of both learning and working with knowledge, so-called knowledge practices.

Trialogical Learning focuses on the social processes by which learners collectively enrich/transform their individual and shared cognition.

Activity theory:

• the object-orientedness of human activity,

• mediation through cultural-historically developed tools of intelligent activity,

• contradictions emerging between the elements of activity systems.

Page 5: Knowledge Artefacts

Knowledge artefacts (KA) - a central notion of Trialogical Learning:

• Mediators of all activities and tasks among learners;

• Capture and preserve the shared knowledge within a community.

Forms:

• Physical resources / tools (documents, SW code, ...);

• Concept maps, taxonomies, ontologies, domain models;

• Plans, scientific theories, languages.

Goal of the KP-Lab project: to provide a platform (tools & methodology) for the creation and transformation of KAs in the trialogical manner.

Page 6: Scientific Challenges

1. Facilitating knowledge-creating learning beyond knowledge acquisition and social participation

2. Expanding and elaborating the "trialogical" object of educational activity

3. Eliciting the development of trialogical agencies

4. Facilitating horizontal and vertical boundary crossing

5. Developing tools for deliberate transformation of knowledge practices

6. Specifying design principles of trialogical technologies

7. Developing methods regarding research on longitudinal transformation of knowledge practices

8. Creating an open, developing community of trialogical technologies

Page 7: Semantic Web Knowledge Middleware

SWKM goal - to facilitate knowledge creation processes by supporting advanced interactions of collaborating learners with knowledge artefacts, i.e. discovery, access, evolution, recommendation, and mining.

Generic modules:

• Knowledge Repository - scalable persistent services for large volumes of knowledge artefacts' descriptions and ontologies;

• Knowledge Mediator - services for handling the main registry, discovery, and evolution for KP-Lab knowledge artefacts;

• Knowledge Matchmaker - services supporting interactions of KP-Lab users with knowledge artefacts employing their semantic descriptions.

Page 8: SWKM Architecture

Features:

• adopts SOA (service-oriented architecture) principles;

• built upon the RDFSuite OS platform;

• data: RDF, accessed through the RQL query language and the RUL update language.
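The slide only states that SWKM data are RDF accessed via RQL / RUL. As a rough illustration of the underlying data model, the sketch below keeps subject-predicate-object triples in memory and answers wildcard pattern queries. It is plain Python, not RQL syntax and not the RDFSuite API; the artefact identifiers and the `kplab:` names are hypothetical.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class TripleStore:
    """Toy in-memory store of RDF-like triples (illustration only)."""

    def __init__(self) -> None:
        self._triples: Set[Triple] = set()

    def add(self, s: str, p: str, o: str) -> None:
        self._triples.add((s, p, o))

    def match(self, s: Optional[str] = None, p: Optional[str] = None,
              o: Optional[str] = None) -> Iterator[Triple]:
        """Yield triples matching the pattern; None acts as a wildcard."""
        for triple in sorted(self._triples):
            if all(want is None or want == got
                   for want, got in zip((s, p, o), triple)):
                yield triple


store = TripleStore()
# Hypothetical artefact descriptions.
store.add("artefact:essay1", "rdf:type", "kplab:Document")
store.add("artefact:essay1", "dc:subject", "concept:TextMining")
store.add("artefact:map7", "rdf:type", "kplab:ConceptMap")
store.add("artefact:map7", "dc:subject", "concept:TextMining")

# Which artefacts are annotated with the TextMining concept?
about_text_mining = [s for s, _, _ in
                     store.match(p="dc:subject", o="concept:TextMining")]
print(about_text_mining)
```

A real repository would delegate such pattern matching to the query language and add persistence and schema support; the sketch only shows the shape of the data being queried.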

Page 9: Text Mining in the KP-Lab

Text mining services (TMS) provide intelligent access to and manipulation of the knowledge artefacts; they assist users in creating or updating the semantic descriptions of KP-Lab knowledge artefacts.

TMS fundamental tasks:

• Ontology learning - extraction of conceptual maps (clustering), i.e. automatic extraction of significant terms from KAs' textual descriptions and their conversion into a structure of concepts and their relationships.

• Classification of knowledge artefacts - grouping a given set of artefacts into predefined or ad hoc categories.

Page 10: Schema of Text Mining Services

Page 11: Pre-processing

Pre-processing phase - transforming data into the appropriate form. It consists of several language-dependent NLP steps that provide annotations of the plain-text resources.

Unified modules:

• tokenization, stemming (or lemmatization, e.g. in CZ/SK), elimination of stop words, POS (part-of-speech) tagging.

Individual modules (crucial for some methods of ontology learning):

• chunking, WSD (word-sense disambiguation), full syntactic analysis.

GATE (http://www.gate.ac.uk/) - a platform for NLP, provides:

• an architecture, or organisational structure, for NLP software;

• a framework, or class library, which implements the architecture;

• a development environment built on top of the framework.
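The unified modules listed above can be pictured as a small pipeline. The sketch below chains tokenization, stop-word elimination and a deliberately crude suffix-stripping stemmer in plain Python. It does not use GATE; the stop-word list and suffix rules are toy assumptions for English only, and Czech or Slovak text would need a proper lemmatizer instead.

```python
import re
from typing import List

# Toy resources (assumptions for illustration, not real linguistic data).
STOP_WORDS = {"a", "an", "and", "are", "in", "of", "the", "to"}
SUFFIXES = ("ing", "es", "s")


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into ASCII alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def stem(token: str) -> str:
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The learners are sharing ontologies and documents.")
print(tokens)
```

The stems produced this way are not always dictionary words (for example "shar"), which is acceptable for indexing but is one reason the slides prefer lemmatization for morphologically rich languages.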

Page 12: Ontology Learning (1)

1. Conversion to a plain text format

• Structural information in the source file is used as metainformation in the next steps.

2. Processing by GATE

• Tokenization, sentence boundaries, POS tagging (Brill's tagger), named entity recognition, Charniak's syntactic analyser.

3. Significant terms (concepts) identification

• A background domain model, created from additional textual resources, is used.

4. Semantic relations identification

• A set of pre-defined (or automatically identified) patterns and co-occurrence statistics are used.
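Steps 3 and 4 can be illustrated with two small functions. The slides do not give the scoring formula, so the sketch assumes one common choice: a term is significant when its smoothed relative frequency in the domain text is high compared with a background corpus. Sentence-level co-occurrence counts then serve as raw evidence for candidate relations. The corpora are toy data.

```python
import math
from collections import Counter
from itertools import combinations
from typing import Dict, List


def term_scores(domain: List[str], background: List[str]) -> Dict[str, float]:
    """Log ratio of add-one-smoothed relative frequencies (domain vs. background)."""
    d, b = Counter(domain), Counter(background)
    vocab = set(d) | set(b)
    d_total = len(domain) + len(vocab)
    b_total = len(background) + len(vocab)
    return {t: math.log(((d[t] + 1) / d_total) / ((b[t] + 1) / b_total))
            for t in d}


def cooccurrence(sentences: List[List[str]]) -> Counter:
    """Count unordered term pairs that appear in the same sentence."""
    pairs: Counter = Counter()
    for sentence in sentences:
        for a, b in combinations(sorted(set(sentence)), 2):
            pairs[(a, b)] += 1
    return pairs


# Toy domain text (already tokenized) and toy background corpus.
domain_sentences = [["ontology", "concept", "relation"],
                    ["ontology", "concept", "the"],
                    ["the", "relation", "ontology"]]
background = ["the"] * 7 + ["concept", "weather", "market"]

scores = term_scores([t for s in domain_sentences for t in s], background)
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
pairs = cooccurrence(domain_sentences)
print(top_terms, pairs[("concept", "ontology")])
```

Frequent function words such as "the" receive a negative score because they are even more frequent in the background model, which is the effect a large background corpus is meant to provide.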

Page 13: Ontology Learning (2)

5. Ontology merging

• The extracted structure is combined with the global domain ontology (stored in the KP-Lab knowledge repository). The mechanism of explicit uncertain knowledge representation is used in this step.

6. Visualisation

• Combination of the gained qualitative data and the relevance weights.

• The selection of the most suitable visualisation form depends on the needs of KP-Lab users; a simple view in graphical form is proposed.
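The merging step (item 5) mentions explicit uncertain knowledge representation without saying how confidences are combined. Purely as an assumption, the sketch below attaches a confidence in [0, 1] to every relation and combines evidence for a relation found in both sources with a noisy-OR rule. The relation names and weights are invented for illustration.

```python
from typing import Dict, Tuple

Relation = Tuple[str, str, str]            # (source concept, relation type, target concept)
WeightedOntology = Dict[Relation, float]   # relation -> confidence in [0, 1]


def merge(global_onto: WeightedOntology,
          extracted: WeightedOntology) -> WeightedOntology:
    """Union of both relation sets; shared relations get a noisy-OR confidence."""
    merged = dict(global_onto)
    for relation, confidence in extracted.items():
        prior = merged.get(relation, 0.0)
        # Noisy-OR: the relation holds unless both sources are wrong.
        merged[relation] = 1.0 - (1.0 - prior) * (1.0 - confidence)
    return merged


global_onto = {("ontology", "is_a", "knowledge artefact"): 0.9}
extracted = {("ontology", "is_a", "knowledge artefact"): 0.5,
             ("concept map", "is_a", "knowledge artefact"): 0.6}

merged = merge(global_onto, extracted)
for relation, confidence in sorted(merged.items()):
    print(relation, round(confidence, 2))
```

The resulting weights could also feed the visualisation step, for example by drawing only relations above a confidence threshold.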

Page 14: Ontology Learning (3)

7. Export to other formats

• Standard OWL export routines are currently supported. Export to the emerging BayesOWL and FuzzyOWL formats is under development.
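To give an idea of what a plain OWL export of a learned concept hierarchy might look like, the sketch below writes classes and subclass links in Turtle syntax by hand. It is not the project's export routine and does not cover BayesOWL or FuzzyOWL; the base IRI and class names are hypothetical, and class names are assumed to be valid Turtle local names (no spaces).

```python
from typing import Dict, Optional


def to_turtle(classes: Dict[str, Optional[str]], base: str) -> str:
    """Serialize {class name: parent name or None} as OWL classes in Turtle."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for name, parent in classes.items():
        if parent is None:
            lines.append(f":{name} a owl:Class .")
        else:
            lines.append(f":{name} a owl:Class ; rdfs:subClassOf :{parent} .")
    return "\n".join(lines)


# Hypothetical two-level hierarchy and base IRI.
turtle_doc = to_turtle(
    {"KnowledgeArtefact": None, "ConceptMap": "KnowledgeArtefact"},
    base="http://example.org/kplab-sketch#",
)
print(turtle_doc)
```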

Creation of the training set - background model:

• 2-billion-word GigaCorpus for English;

• 600-million-word corpus for Czech;

• additional relevant documents provided by users.

Data simulation - using Wikiversity & Wikipedia texts.

Scenarios:

1. Collaborative acquisition of knowledge in a company.

2. Description of a field of interest: creation of an essay on given topic(s) in an academic environment.

Page 15: Classification

The task is to automatically organize a set of knowledge artefacts into predefined or ad hoc categories - existing or new concepts of an ontology.

Classification is supervised by a model, created from a training set of semantically annotated artefacts. The model contains a set of parameters (weights, rules, etc.) created in the process of training and used in the classification of unknown examples.

Algorithms to be used:

• simple term matching, kNN, SVM, Winnow, Perceptron, Naive Bayes (multinomial and binomial), boosting, decision rules, and decision trees (various combinations of growing and pruning methods).

Implementation platform: JBowl library
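As an illustration of the supervised setting described above, here is a minimal multinomial Naive Bayes classifier, one of the listed algorithms, written in plain Python. It is a sketch under simplifying assumptions (pre-tokenized input, Laplace smoothing, unseen terms ignored) and does not use or imitate the JBowl API; the category names and training texts are invented.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def train(examples: List[Tuple[List[str], str]]):
    """Collect document counts per class, term counts per class, and the vocabulary."""
    class_docs: Counter = Counter()
    term_counts: Dict[str, Counter] = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, term_counts, vocab


def classify(tokens: List[str], model) -> str:
    """Return the class with the highest log posterior for the token list."""
    class_docs, term_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = "", -math.inf
    for label in sorted(class_docs):
        counts = term_counts[label]
        denom = sum(counts.values()) + len(vocab)   # Laplace smoothing
        score = math.log(class_docs[label] / total_docs)
        for token in tokens:
            if token in vocab:                      # terms never seen in training are skipped
                score += math.log((counts[token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical annotated artefacts: (tokens, ontology concept used as category).
training_set = [
    ("ontology concept relation".split(), "Semantics"),
    ("rdf ontology query".split(), "Semantics"),
    ("lesson student essay".split(), "Learning"),
    ("student course essay".split(), "Learning"),
]
model = train(training_set)
print(classify("student ontology essay".split(), model))
```

The trained model here is exactly the kind of parameter set the slide mentions: class priors and per-class term weights estimated from annotated artefacts and reused on unseen ones.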

Page 16: JBowl Library

JBowl - Open Source library in Java, provides support for:

• intelligent information retrieval, summarization, and information extraction from textual documents;

• text mining, clustering, categorization, classification tasks.

Main characteristics:

• extendable modular architecture;

• platform for pre-processing (incl. NLP methods) and indexing of large textual collections;

• functions for creation and evaluation of text mining models (for both supervised and unsupervised algorithms).

Web: http://sourceforge.net/projects/jbowl/

Page 17: JBowl Library - Architecture

[Figure: JBowl architecture diagram. Component groups: models, data, analysis, utils, documents. Labelled elements: tokenization, sentence chunking, NP chunking, POS tagging; statistics, TF-IDF, term selection; categorization, clustering, keyword extraction / summarization, information extraction; BLAS, matrices, collections; Lucene index, thesaurus, XML.]
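The architecture diagram names TF-IDF statistics and term selection among the components. The sketch below shows one standard formulation (raw term frequency times the logarithm of inverse document frequency) followed by a simple selection of the highest-weighted terms. It is plain Python for illustration, not JBowl code, and the exact weighting variant used by the library is not stated in the slides.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight of a term in a document = term frequency * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{term: tf * math.log(n / df[term])
             for term, tf in Counter(doc).items()}
            for doc in docs]


def select_terms(weights: List[Dict[str, float]], k: int) -> List[str]:
    """Keep the k terms with the highest total weight (ties broken alphabetically)."""
    totals: Counter = Counter()
    for doc_weights in weights:
        totals.update(doc_weights)
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


docs = [["learning", "ontology", "ontology"],
        ["learning", "classification"],
        ["learning", "ontology", "clustering"]]

weights = tf_idf(docs)
selected = select_terms(weights, k=2)
print(selected)
```

A term that occurs in every document, such as "learning" here, gets weight zero, which is why term selection on TF-IDF weights discards it.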

Page 18: JBowl Library - Usage

JBowl provides:

• A text categorization method for active learning, which makes it possible to reduce the number of training examples.

• A heuristic that selects examples according to the confidence of the classifier prediction for the given example. This heuristic does not require a validation set and can be used effectively to select a small set of labeled examples.

• Integration of several classification methods, evaluation.

• Tools for NLP (incl. Slovak linguistic resources and tools).

Scenario for use of the classification service:

• Annotation of new or updated artefacts - the system can suggest suitable concepts from one or more ontologies to be assigned as metadata or a conceptual description to the artefact.
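The confidence-based selection heuristic mentioned above is not specified further in the slides. One common reading is uncertainty sampling: ask a human to label the examples about which the current classifier is least confident. The sketch below assumes a hypothetical `predict_proba` callback that returns class probabilities for an item; it is not the JBowl interface, and the probabilities are made up.

```python
from typing import Callable, Dict, List, Sequence


def least_confident(pool: Sequence[str],
                    predict_proba: Callable[[str], Dict[str, float]],
                    batch_size: int) -> List[str]:
    """Return the pool items whose top predicted class has the lowest probability."""
    def confidence(item: str) -> float:
        return max(predict_proba(item).values())

    return sorted(pool, key=confidence)[:batch_size]


# Hypothetical classifier output for three unlabelled artefacts.
fake_predictions = {
    "artefact-a": {"Semantics": 0.95, "Learning": 0.05},
    "artefact-b": {"Semantics": 0.55, "Learning": 0.45},
    "artefact-c": {"Semantics": 0.20, "Learning": 0.80},
}

to_label = least_confident(list(fake_predictions),
                           fake_predictions.__getitem__,
                           batch_size=2)
print(to_label)
```

Because the ranking uses only the classifier's own predictions, no separate validation set is needed, which matches the property claimed on the slide.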

Page 19: Future Work

Solving multilinguality - find a minimal set of NLP resources that are satisfactory for the (basic) functionality of the text-mining services.

Increasing efficiency: requirement of synchronous SOA system - e.g. by the use of the Extensible Messaging and Presence Protocol (XMPP).

Classification: selection of the most appropriate algorithms for the automatic annotation of artefacts according to the semantics codified in several ontologies, given the limited availability of training data.

Ontology learning: concentrate on better ways of ontology merging (incl. the need to combine extracted relations with the ones from existing domain ontologies).

Implementation of the first prototype of the SWKM (M24), testing and evaluation.

Page 20: Thank you! Questions?

Further information: http://www.kp-lab.org