pattern analysis & machine intelligence research group university of waterloo

Pattern Analysis & Machine Intelligence Research Group

UNIVERSITY OF WATERLOO

LORNET Theme 4: Data Mining and Knowledge Extraction for LO

Theme Leader: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco

PAMI Research Group, University of Waterloo

Theme 4 Team

Leader: M. Kamel
PIs: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc. PIs: Wong, DiMarco
Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makhreshi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury
Funding: CRC/CFI/OIT, NSERC, PAMI Lab, PDS, Vestech, Desire2Learn

Graduated:
• R. Khoury, PhD 07
• L. Chen, PhD 07
• M. Makhreshi, PhD 07
• K. Hammouda, PhD 07
• R. Dara, PhD 07
• Y. Sun, PhD 07
• K. Shaban, PhD 06
• Y. Sun, PhD 06
• M. Hussin, PhD 05
• Jan Bakus, PhD 05
• A. Adegorite, MASc 04
• A. Khandani, MASc 05
• S. Podder, MASc 04


Data and Knowledge Mining

Knowledge extraction and discovery of patterns from data.

Labeling and categorization, summarization, classification, prediction, association rules, clustering


Theme Overview

Theme areas: Knowledge Extraction, Tagging and Organizing, Matching and Ranking, LO Mining

• Classification (MCS, Data Partitioning, Imbalanced Classes)
• Clustering (Parallel/Distributed Clustering, Cluster Aggregation)

• From Text
  - Syntactic: keyword-based, keyphrase-based
  - Semantic: concept-based
• From Images
  - Image features, shape features
• From Text + Images
  - Describing images with text
  - Enriching text with images

• LO Similarity and Ranking
• Association Rules / Social Networks

• Reinforcement Learning
• Specialized / Personalized Search


Types of Data in LORNET

• LCMS: Course > Module > Lesson > LO
  - Subject matter: Text, Images, Flash, Applets, Metadata, Interaction Logs
• Discussion Board: Board > Thread > Post
  - Discussions: Text, Interaction Logs
• LOR: Metadata Record
  - LO descriptors: Metadata
• TELOS: Semantic Layer, Resource
  - Resources: Metadata, Semantic References


Abstract View of Data for Mining

Text (Plain or Markup)
• Any resource that contains text is viewed as an abstract text document (some markup can be preserved to indicate different weights); e.g. HTML page, Word document, email message, discussion post, even metadata records.
• Suitable for text mining, information/metadata extraction, summarization, natural language processing, semantic/concept analysis, social network analysis.

Numeric Matrix (Vector Space Model)
• Requires text mining algorithms to convert the original text to numeric form through feature extraction and statistical weighting.
• Suitable for machine learning algorithms that expect numeric input, especially classification and clustering algorithms.

Feature Vectors
• Suitable for mining images: description, indexing, and retrieval (CBIR). Requires image processing algorithms to extract image features.
• Also suitable for mining and learning from interaction logs, where each vector describes an event.

Relationship
• Provides domain knowledge about data, such as containment (e.g. LO within Course, Post within Thread) and relatedness (collection of resources, cross-referenced LOs).
• The extra knowledge could be exploited to improve accuracy, or to apply the same algorithm to different parts of the data (e.g. generating one summary for an entire course, or one summary per lesson).


Data Representation

• What level of granularity
• One representation or multiple
• Feature representation
• Dimensionality issues


Document Modeling

• A document is represented by a set of concepts called “indexing terms”
• Document segmentation:
  - sub-word level (decomposition of words and their morphology)
  - word level (words and lexical information)
  - multi-word level (phrases and syntactic information)
  - semantic level (the meaning of the text)
  - pragmatic level (the meaning of the text with respect to the context and situation; ontology?)


Document Modeling

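[Figure: the five representation levels (sub-word, word, multiword, semantic, pragmatic) plotted against required domain knowledge, noise & redundancy, dimensionality, and algorithm complexity, with content-based and context-based regions marked.]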


Document Modeling

[Figure: research activity across the five representation levels (sub-word, word, multiword, semantic, pragmatic). Labels: Term-level (most popular), Emerging, Not explored, Not usual.]


Document Modeling

Bag-of-words (VSM): the most popular document representation model
• word sequence
• terms are weighted by their importance (based on frequency)
• terms are assumed independent and uncorrelated

Bag-of-words (VSM): drawbacks
• ignores term dependencies and correlations
• ignores text structure
• ignores the ordering of the words in the document
  - IR research shows that word ordering is not important.
• ignores grammar
• language independent

Solutions: generalized VSM, LSI, phrase-based model, concept-based representation
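To make the bag-of-words model concrete, here is a minimal sketch that builds frequency-weighted term vectors. The tokenizer, the tf-idf weighting formula, and the sample sentences are illustrative assumptions, not taken from the slides. The first two documents differ only in word order, so they receive identical vectors, which is the "ignores ordering" drawback listed above.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one tf-idf weighted vector per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        # Weight = raw term frequency * log inverse document frequency.
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, vectors


# Hypothetical sample documents.
docs = [
    "the cat chased the dog",
    "the dog chased the cat",
    "stock prices fell",
]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0] == vectors[1])  # same bag of words, different word order
```

Each term is one dimension of the vector, so the dimensionality of the representation equals the vocabulary size.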


Curse of Dimensionality

• The number of training samples required grows exponentially with the number of features.
• For a fixed sample size, increasing the number of features may degrade performance (the peaking phenomenon).
• A limited sample size leads to overfitting, which implies a lack of generalization and low performance.


Dimensionality Reduction

Feature extraction
• Employs all dimensions of the measurement space to obtain a new transformed space (compacting the feature space without removing any features)
  - identifying important combinations of the features (PCA, manifold learning, SVD and factor analysis)
  - low-dimensional embeddings (random projections)

Pros and Cons
+ promising results
+ solid mathematical background
- high complexity (time and space)
- lack of scalability
- fails in high-dimensional data mining problems
- extracted features usually have no meaning


Dimensionality Reduction

Feature selection
• Reduces the dimensionality of the feature space by removing useless, redundant, irrelevant and noisy features.
• It is a problem of searching for a subset of features among the total set of features, based on one or more performance indices (objective functions).

Makrehchi and Kamel, IEEE SMC 07.
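As an illustration of feature selection framed as scoring against an objective function, the sketch below ranks terms with the chi-square statistic and keeps the top k. Chi-square is a common filter criterion chosen here for illustration; it is not the method of the cited paper, and the documents, labels, and function names are hypothetical.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class contingency table.

    n11: in-class docs containing the term   n10: out-of-class docs containing it
    n01: in-class docs without the term      n00: out-of-class docs without it
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    return numerator / denominator if denominator else 0.0


def select_features(docs, labels, target, k):
    """Rank terms by chi-square against class `target` and keep the top k."""
    vocab = sorted({t for d in docs for t in d})
    scores = {}
    for term in vocab:
        pairs = list(zip(docs, labels))
        n11 = sum(1 for d, y in pairs if term in d and y == target)
        n10 = sum(1 for d, y in pairs if term in d and y != target)
        n01 = sum(1 for d, y in pairs if term not in d and y == target)
        n00 = sum(1 for d, y in pairs if term not in d and y != target)
        scores[term] = chi_square(n11, n10, n01, n00)
    # Stable sort: ties keep alphabetical order.
    return sorted(vocab, key=lambda t: -scores[t])[:k]


# Hypothetical documents, each reduced to its set of terms.
docs = [
    {"goal", "match", "the"},
    {"match", "team", "the"},
    {"stock", "market", "the"},
    {"market", "price", "the"},
]
labels = ["sports", "sports", "finance", "finance"]
print(select_features(docs, labels, "sports", 2))
```

The term "the" occurs in every document, so it carries no class information and scores zero; terms that occur in exactly one class score highest.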


New Representation Models

• Phrase-Based Representation: Document Index Graph (DIG)
  Hammouda and Kamel, KIS 2004, IEEE KDE 2004
• Concept-Based Representation
  Shehata, Karray and Kamel, ICDM 2006, KDD 07, WI07


[Figure: Concept-based Mining Model. Components shown: Text; Text Pre-processor and Sentence Separator (language independent); Natural Language Processing (language dependent): POS Tagger, Syntax Parser, Semantic Role Labeler; Concept-based Statistical Analyzer (tf: term frequency; ctf: conceptual term frequency); Conceptual Ontological Graph (COG) Representation; Concepts.]



Concept-based Document Similarity

[Figure: processing pipeline from text documents to clusters (Cluster 1, Cluster 2, Cluster 3).]

• Text Preprocessing
  - Separate sentences
  - Label terms
  - Remove stop-words
  - Stem words
• Concept-based Term Analysis
  - Term frequency (tf)
  - Conceptual term frequency (ctf)
• Clustering Techniques
  - Single Pass
  - HAC (Ward)
  - HAC (complete)
  - k-NN


Evaluation

F-measure of the HAC (Ward) (higher is better)

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.723         0.925           +27.94%
ACM       0.697         0.918           +31.70%
Brown     0.581         0.906           +55.93%

Entropy of the HAC (Ward) (lower is better)

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.251         0.012           -95.21%
ACM       0.317         0.043           -86.43%
Brown     0.385         0.018           -95.32%
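The Improvement column in the HAC (Ward) tables is consistent with relative change, (concept-based - single-term) / single-term. The short sketch below recomputes it from the two score columns; the function name is an assumption, and the results agree with the slide values up to rounding in the last digit.

```python
def relative_change(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return 100.0 * (new - baseline) / baseline


# (single-term, concept-based) scores copied from the HAC (Ward) tables.
f_measure = {"Reuters": (0.723, 0.925), "ACM": (0.697, 0.918), "Brown": (0.581, 0.906)}
entropy = {"Reuters": (0.251, 0.012), "ACM": (0.317, 0.043), "Brown": (0.385, 0.018)}

for title, table in (("F-measure", f_measure), ("Entropy", entropy)):
    for dataset, (single_term, concept_based) in table.items():
        change = relative_change(single_term, concept_based)
        print(f"{title:10s} {dataset:8s} {change:+.2f}%")
```

For F-measure a positive change is an improvement; for entropy a negative change is.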


Evaluation (cont.)

F-measure of the k-NN

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.511         0.917           +79.45%
ACM       0.491         0.891           +81.46%
Brown     0.462         0.902           +95.23%

Entropy of the k-NN

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.348         0.015           -95.68%
ACM       0.402         0.111           -29.1%
Brown     0.316         0.023           -23.03%


Classification

• A function that assigns an object to a class
• Infer that “object X is about sports”
• Automatically learn the function from a set of examples

[Figure: a set of objects is passed to a classifier, which assigns each object to one of the known classes (sports, farming, finance).]


Classifiers

• Template Matching: the user needs to supply a template and a metric
• NMC: nearest class mean; simple, no training
• K-NN: asymptotically optimal, slow in testing
• Bayes: yields a simple classifier for Gaussian distributions
• NN: nonlinear, sensitive to parameters, slow training
• DT: binary, transparent, sensitive to overtraining
• SVM: nonlinear, insensitive to overtraining, slow, good generalization
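As a concrete instance of the simplest entry in this list, the sketch below implements a nearest class mean classifier (NMC): the only fitting step is averaging the samples of each class, and prediction picks the class whose mean is closest. The Euclidean metric, the function names, and the toy 2-D data are assumptions for illustration.

```python
import math


def fit_class_means(samples, labels):
    """Average the feature vectors of each class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            acc[i] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in total] for y, total in sums.items()}


def predict(means, x):
    """Return the class whose mean is nearest to x (Euclidean distance)."""
    return min(means, key=lambda y: math.dist(means[y], x))


# Hypothetical 2-D feature vectors for two classes.
samples = [[1, 1], [2, 1], [8, 9], [9, 8]]
labels = ["sports", "sports", "finance", "finance"]
means = fit_class_means(samples, labels)
print(predict(means, [2, 2]), predict(means, [7, 9]))
```

Because each class is summarized by a single point, the decision boundary between two classes is linear, which is why the method is simple but limited.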


Multiple Classifier Systems

Multiple classifier systems consist of a set of classifiers and a combination strategy.

Motivations:
• Many alternative classifiers exist, each with its own feature and representation space.
• Different training sets exist, collected at different times and possibly with different features.
• Each classifier may perform well in its own region of the feature space.
• Classifiers may have different patterns of making mistakes, even when they are trained on the same data.


Multiple Classifier Systems Design

Design of MCS can be accomplished at 4 levels [Kuncheva 04]:
• Aggregation level
• Classifier level
• Feature level
• Data level

[Figure: training data is split into D1, D2, ..., Dn; features X1, X2, ..., Xm feed Classifier 1, Classifier 2, ..., Classifier n, whose outputs are combined by an aggregation rule.]


Combining Schemes

• Static vs. adaptive, fixed vs. trainable
• Voting methods: max, average, majority, Borda
• Weighted average, fuzzy integrals, belief theory
• Decision templates, behavior knowledge space
• Feature-Based Architecture (adaptive) (Wanas and Kamel 99-02): the aggregation is trained and adapts to the data rather than acting as postprocessing.
• Data-level combining: a partitioning technique for training multiple classifiers (Dara, .. and Kamel IF04, PR 06) that generates nearly optimal training partitions
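Two of the fixed voting rules named above can be sketched in a few lines. Majority voting combines one label per classifier; Borda count combines full rankings by awarding points according to rank position. The function names and the sample votes are hypothetical, and ties are broken by first occurrence, which is an arbitrary choice.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]


def borda_count(rankings):
    """Combine ranked label lists; a label at position p of n gets n - 1 - p points."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    return scores.most_common(1)[0][0]


# Hypothetical outputs of three classifiers for one document.
votes = ["sports", "finance", "sports"]
rankings = [
    ["sports", "finance", "farming"],
    ["finance", "sports", "farming"],
    ["finance", "farming", "sports"],
]
print(majority_vote(votes))
print(borda_count(rankings))
```

Both rules are static and fixed: they need no training, in contrast to the trainable, adaptive schemes in the list.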


Imbalanced Classes

(Sun and Kamel, ICDM 2006, PR 2007)

Data set: 20-Newsgroup
Class size ratio: 1/15
Performance measure: F-measure
Base classifier: Naïve Bayesian

                  NB      AdaBoost   AdaC1   AdaC2   AdaC3
F (small class)   58.25   59.26      64.11   69.08   68.91
Acc               97.13   97.98      98.28   98.31   98.42
F (large class)   94.63   96.15      96.73   96.80   97.00

Data set: SchoolNet
Class size ratio: 1/12
Performance measure: F-measure
Base classifier: Decision Trees (C4.5)

                  C4.5    AdaBoost   AdaC1   AdaC2   AdaC3
F (small class)   22.78   31.58      35.16   52.73   53.85
Acc               92.50   93.63      92.63   93.35   93.91
F (large class)   86.32   88.34      86.77   88.34   89.24

Observations:
1. Performance of the base classifier on the small class is poor.
2. AdaBoost is capable of improving classification accuracy.
3. AdaBoost does not guarantee improved performance on the small class.
4. AdaC2 and AdaC3 are effective in increasing the identification performance of the small class.
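The gap between accuracy and small-class F-measure in these tables follows from how the two measures are defined. The sketch below uses hypothetical confusion counts (not data from the slides) for a 1/15 class ratio: overall accuracy is high even though most small-class documents are missed.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure (harmonic mean) for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical counts: 100 small-class and 1500 large-class documents.
tp, fn = 40, 60      # small-class documents found / missed
fp, tn = 20, 1480    # large-class documents mislabeled / correctly rejected

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision, recall, f_small = precision_recall_f(tp, fp, fn)
print(f"accuracy={accuracy:.2%} precision={precision:.2%} "
      f"recall={recall:.2%} F={f_small:.2%}")
```

This is why the slides report F-measure per class alongside accuracy when the class sizes are imbalanced.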


Dealing with time-dependent data

• Time series data contain dynamic information and are difficult to model with any single representation method.
• Traditional classifiers for time series data, such as Dynamic Time Warping (DTW), are not robust.
• Aggregating decisions based on different representations could provide better and more reliable performance (Chen and Lei 2004-2006).
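For readers unfamiliar with Dynamic Time Warping, the baseline mentioned above, here is a minimal sketch of the classic dynamic-programming distance between two numeric sequences. The absolute-difference local cost and the absence of a warping window are simplifying assumptions; this illustrates DTW itself, not the aggregation approach of the cited work.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# A series and a time-stretched copy align perfectly under DTW.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([0, 0, 0], [1, 1, 1]))
```

A nearest-neighbor classifier for time series can be built on this distance by labeling a query with the class of its closest training series.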


Architecture


Experimental Results


Clustering

Finding groups of objects such that objects in a group are similar to one another and different from (dissimilar to) objects in other groups.

• Intra-cluster distances are minimized
• Inter-cluster distances are maximized


Clustering Approaches

• Hierarchical: single link
• Partitional: K-means, Fuzzy K-means, Bisecting, VQ
• Density-based: DBSCAN, Chameleon
• Agglomerative: starts from individual clusters, then merges
• Divisive: starts from one cluster and divides
• Connectionist: SOM, ART


[Figure: Overview of Combining Cluster Ensembles. Multi-clustering produces several partial/local clusterings, which form the generated cluster ensemble. The combining method (ensemble representation, clusters mapping method, ensemble combination scheme: summarization/voting) yields the combined/global clustering.]


Cluster Ensemble

Developed a prototype for cluster ensemble methods (Ayad and Kamel 2005-2007), including:
• Generation of cluster ensembles based on:
  (1) multiple feature subsets,
  (2) statistical sampling techniques, and
  (3) a variable number of clusters (multi-resolution ensembles).
• Combiners of cluster ensembles based on:
  (1) shared nearest neighbors,
  (2) different representations and distance measures between clusters, and
  (3) voting.

Positive experimental results on text data, in addition to a variety of benchmark data for machine learning algorithms.
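To show what combining a cluster ensemble can look like, the sketch below uses a co-association matrix: the fraction of clusterings in which two objects share a cluster. Objects co-clustered in more than half of the ensemble are linked, and connected components become the consensus clusters. This is a generic, well-known combiner used here for illustration; it is not the prototype's shared-nearest-neighbor or voting combiner, and all names and data are hypothetical.

```python
def co_association(clusterings):
    """matrix[i][j] = fraction of clusterings that put objects i and j together."""
    n = len(clusterings[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels in clusterings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / len(clusterings)
    return matrix


def consensus_clusters(clusterings, threshold=0.5):
    """Link objects with co-association above `threshold`; label connected components."""
    matrix = co_association(clusterings)
    n = len(matrix)
    consensus = [-1] * n
    next_label = 0
    for start in range(n):
        if consensus[start] != -1:
            continue
        consensus[start] = next_label
        stack = [start]
        while stack:  # depth-first traversal of the thresholded graph
            i = stack.pop()
            for j in range(n):
                if consensus[j] == -1 and matrix[i][j] > threshold:
                    consensus[j] = next_label
                    stack.append(j)
        next_label += 1
    return consensus


# Three hypothetical clusterings of five objects; label names are arbitrary per run.
ensemble = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(consensus_clusters(ensemble))
```

Working with pairwise co-occurrence sidesteps the problem that cluster labels from different runs do not correspond to each other.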


Categorization using cluster ensemble

Dataset                     # samples   # attributes   # classes   K-means mean error rate (%)   Ensemble mean error rate (%)
Synthetic1                  1000        8              5           17.41                         0
Yahoo! (text)               2340        1458           6           38.23                         16.24
Texture (image)             5500        40             11          37.99                         11.54
Optical Digit Recognition   500         64             10          27.31                         16.40


Projects Overview

Data sources: Text Document, Image, Interaction Logs

• Information Extraction: analyzing content to extract relevant information
  - Keyword Extraction
  - Summarization
  - Concept Extraction
  - Social Network Analysis
• Categorization: organizing LOs according to their content
  - Classification: Traditional, MCS, Imbalanced
  - Clustering: Traditional, Ensembles, Distributed
• Personalization: providing user-specific results
  - Reinforcement Learning: Traditional, Opposition-based
• Image Mining: describing and finding relevant images
  - CBIR: Traditional, Fusion-based
• Integration and Applications
  - In Progress
  - Publications
  - Theme and Industry Collaboration
  - Software Components


Information Extraction: Summarization

LO Content Package Summarization

• Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
• Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
• The top relevant sentences are extracted for each document.
• Planned functionality: summarization of whole modules or lessons (as opposed to single documents).

Benefits
• Provides a summarized overview of learning objects for quick browsing and access to learning material.

Scenarios
• Learning Management Systems can call the summarization component to produce summaries for content packages.

Data is courtesy of the University of Saskatchewan.
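The term-weighting and sentence-ranking steps of extractive summarization can be sketched as follows. The scoring used here (average raw term frequency per sentence, no stop-word removal) is a deliberately simple assumption and not the component's actual weighting; the function names and sample text are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())


def summarize(text, top_n=1):
    """Return the top_n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    # Term weight = frequency of the term in the whole document.
    weights = Counter(w for s in sentences for w in words(s))

    def score(sentence):
        tokens = words(sentence)
        return sum(weights[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: -score(sentences[i]))
    return [sentences[i] for i in sorted(ranked[:top_n])]


text = (
    "Clustering groups similar documents. "
    "Clustering ensembles combine many clustering results. "
    "The weather was nice."
)
print(summarize(text, top_n=1))
```

Summarizing a whole collection, as planned for modules and lessons, would amount to computing the term weights over all documents instead of one.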


Information Extraction: Social Network Analysis

Social Network Builder

Tasks
• Finding relationships between people based on their web pages

Progress
• Modeling
  - Actors are represented by their associated documents.
  - Links are modeled by:
    • pair-wise similarity of the actors’ documents
    • merging the actors’ documents, so that relations are also modeled by documents
• Learning
  - Some links are known: learning the social network is translated into a text classification problem.
  - No link is revealed: a clustering problem with very low performance.


Information Extraction: Concept Extraction

[Figure: Concept-based Model. Components shown: Text; Text Pre-processor and Sentence Separator (language independent); Natural Language Processing (language dependent): POS Tagger, Syntax Parser, Semantic Parser; statistical analysis (tf: term frequency; ctf: conceptual term frequency); Conceptual Ontological Graph (COG) Representation; Concepts.]

F-measure of Hierarchical Clustering

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.723         0.925           +27.94%
ACM       0.697         0.918           +31.70%
Brown     0.581         0.906           +55.93%

Entropy of Hierarchical Clustering

Dataset   Single-Term   Concept-based   Improvement
Reuters   0.251         0.012           -95.21%
ACM       0.317         0.043           -86.43%
Brown     0.385         0.018           -95.32%

Precision of Search

Dataset   Single-Term   Concept-based   Improvement
Cran      0.536         0.901           +68.09%
Reuters   0.591         0.897           +51.77%

Recall of Search Result

Dataset   Single-Term   Concept-based   Improvement
Cran      0.486         0.827           +70.16%
Reuters   0.452         0.841           +86.06%

Concept-Based Statistical Analyser

Conceptual Ontological Graph (COG) Ranking


Information Extraction: Keyword Extraction

Semantic Keyword Extraction

Tasks
• Developing tools and techniques to extract semantic keywords to facilitate metadata generation
• Developing algorithms to enrich metadata (tags) which can be applied in index-based multimedia retrieval

Progress
• Proposed a new information-theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo ontology)

Makrehchi, M. and Kamel, ICDM07, WI 07


Information Extraction: Keyword Extraction

Rule-based Keyword Extraction

Learn rules to find keywords in English sentences
• Rules represent sentence fragments
  - Specific enough for reliable keyword extraction
  - General enough to be applied to unseen sentences
• Rule generalization
  - Begin with an exact sentence fragment
  - Merge with another by moving different words to the lowest common level in the part-of-speech hierarchy
  - Keep the merged rule if it does not reduce precision and recall of keyword extraction; keep the original rules otherwise
• Keyword extraction
  - Find the sequence of rules that best covers an unseen sentence
  - Extract keywords according to the rules

Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
• Learns 20 rules from the first 50 training rules
• Learns 13 additional rules from the next 220 training rules

Both precision and recall values increase during training
• Precision (blue) increases 10%
• Recall (red) shows a slight upward trend


Categorization: Ensemble-based Clustering

Consensus Clustering
• Categorization of learning objects using the proposed consensus clustering algorithms.
• The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
• Consensus clustering can offer several advantages over a single data clustering, such as improved clustering accuracy, better scalability of clustering algorithms to large volumes of data objects, and greater robustness through reduced sensitivity to outlier data objects or noisy attributes.

Tasks
• Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.
• Development of consensus algorithms to aggregate the individual clusterings.
• Development of solutions for the cluster symbolic-label matching problem.
• Empirical analysis on real-world data and validation of the proposed method.




Distributed Environments

Distributed Data Mining
Applying data mining in an environment where the data, the mining process, or both are distributed.

Motivation
• Natural distribution of data on the Web.
• Scenarios that require the integration of disparate data and mining results are emerging (e.g. federation of repositories, news feed aggregation, digital libraries, business intelligence gathering, etc.)
• Emerging technologies, such as the Semantic Web, Web Services, and Grid Computing, make it feasible to build distributed mining systems.
• Availability of cheap low-end hardware that could be utilized in a distributed environment to achieve high-end goals (e.g. Google, SETI@Home, Folding@Home, etc.)


Categorization: Distributed Clustering

• Peer nodes are arranged into groups called “neighborhoods”.
• Multiple neighborhoods are formed at each level of the hierarchy.
• The size of each neighborhood is determined through a network partitioning factor.
• Each neighborhood has a designated supernode.
• Supernodes of level h form the neighborhoods for level h+1.
• Clustering is done within neighborhood boundaries, then merged up the hierarchy through the supernodes.

Benefits
• Significant speedup over centralized clustering and flat peer-to-peer clustering.
• Multiple levels of clusters.
• Distributed summarization of clusters using CorePhrase keyphrase extraction.

Scenarios
• Distributed knowledge discovery in hierarchical organizations.

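Hierarchical P2P Document Clustering

[Figure: HP2PC Architecture. Neighborhoods (Q) and supernodes (S) at hierarchy levels h = 0, 1, 2, ..., H-1, H, up to the root.]

[Figure: HP2PC Example, 3-level network, 16 nodes.]
h = 0, β = 0.2:   $P^{(0)} = \{p_1^{(0)}, \ldots, p_{16}^{(0)}\}$,  $Q^{(0)} = \{Q_1^{(0)}, \ldots, Q_4^{(0)}\}$
h = 1, β = 0.33:  $P^{(1)} = \{p_1^{(1)}, p_2^{(1)}, p_3^{(1)}, p_4^{(1)}\}$,  $Q^{(1)} = \{Q_1^{(1)}, Q_2^{(1)}\}$
h = 2, β = 0:     $P^{(2)} = \{p_1^{(2)}, p_2^{(2)}\}$,  $Q^{(2)} = \{Q_1^{(2)}\}$
h = 3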


Categorization: Multiple Classifier Systems

Tasks
• To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)
• To develop evaluation measures in order to estimate various types of cooperation in the system
• To gain insight into the impact of changes in the cooperative components with respect to system performance using the proposed evaluation measures
• To apply these findings to optimize existing ensemble methods
• To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computational complexity

Progress
• Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
• Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
• Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
• Investigated the applications of the proposed training methods (CDS and CO-CDS) to LO classification.


Categorization: Imbalanced Class Distribution

Objective
• Advance classification of multi-class imbalanced data

Tasks
• To develop the cost-sensitive boosting algorithm AdaC2.M1
• To improve the identification performance on the important classes
• To balance classification performance among several classes


Categorization: Imbalanced Class Distribution

Class Distribution

Ind.   Size   Dist.
C1     49     7.84%
C2     288    46.08%
C3     288    46.08%

Performance of Base Classification and AdaBoost

                  C4.5                 HPWR (Od=3)
Class   Meas.     Base     AdaBoost    Base     AdaBoost
C1      R         0        5.11        10.70    44.06
        P         N/A      6.5         11.82    32.89
        F         N/A      5.84        10.83    35.84
C2      R         73.21    92.28       88.31    87.43
        P         69.53    88.75       86.79    91.99
        F         72.29    90.38       87.43    89.64
C3      R         67.94    91.36       87.63    88.42
        P         73.89    87.88       87.07    89.91
        F         71.91    89.42       86.99    89.03
G-measure         0        11.46       33.32    68.50

Balanced performance among classes, evaluated by G-mean

                  C4.5                             HPWR (Od=3)
Class   Meas.     Base     AdaBoost    AdaC2.M1    Base     AdaBoost    AdaC2.M1
C1      R         0        5.11        77.58       10.70    44.06       65.72
        P         N/A      6.50        14.12       11.82    32.89       30.83
C2      R         73.21    92.28       64.73       88.31    87.43       83.12
        P         69.53    88.75       97.24       86.79    91.99       91.38
C3      R         67.94    91.36       65.23       87.63    88.42       83.95
        P         73.89    87.88       93.22       87.07    89.91       90.81
G-mean            0        11.46       68.42       33.32    68.50       76.08


Personalization

Opposition-based Reinforcement Learning for Personalizing Image Search

• Developing a reliable technique to assist users and to facilitate and enhance the learning process
• The personalized ORL tool helps the user find the desired images among the search results
• The tool gathers the images from the search results and selects a sample of them
• By presenting the sample and interacting with the user, it learns the user’s preferences


Personalization


Personalization

Opposition-based RL algorithms:
• OQ(lambda) (International Joint Conference on Neural Networks, 2006)
• NOQ(lambda) (IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007)


Image Mining: CBIR

Content-based image retrieval
• Build an IR system that can retrieve images based on: textual cues, image content, NL queries

[Figure: Image Retrieval Tool Set operating over image-rich documents. Inputs: query image (QI), query text (QT), query document. Outputs: documents that contain QI, images that match QI, NL description of an image, images that contain QT, automated image tagging.]


Illustrative Example

[Figure: retrieval results for IZM, FD, MTAR, and the proposed approach; the accuracies reported in the panels are 70%, 55%, 60%, and 95%.]

Experimental Results (Cont’d)

[Figure: The performance of the proposed approach.]


Image Mining: CBIR

Interface Module to TELOS

[Figure: interface between TELOS and the image retrieval (IR) module. Components shown: TELOS, IKB-BLDR, LOR, Image Admission Interface, LO Image Repository. Data exchanged: compound documents, images, text queries, and responses.]



Progress

Finished core parts of the common data mining framework.

Built components and services from theme researchers’ work around the data mining framework.

Provided documentation for the data mining framework and software components.

Launched web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/





Progress

Core parts of the common data mining framework are available, including:
• Vector and matrix manipulation.
• Document parsing and tokenization.
• Statistical term and sentence analysis.
• Similarity calculation using multiple distance functions.
• IMS Content Package compliant parser.

Components and tools built around the common data mining framework:
• Metadata extraction from single documents; supports Dublin Core encoding.
• Document similarity calculation using cosine similarity.
• Single document and content package summarization.
• Building of standard text datasets from large document collections.

Integration with TELOS:
• Developed C# TELOS connector for integrating Theme 4 components.
• Worked on component manifest specification with Theme 6.
• Provided metadata extraction as part of a complete scenario for TELOS components integration.
• The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.
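The document similarity component is described as using cosine similarity. As a minimal sketch of that measure (not the framework's actual API, which is not shown in the slides), the function below compares two texts through their raw term-count vectors; the whitespace tokenizer and the function name are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


print(cosine_similarity("data mining tools", "data mining"))
print(cosine_similarity("data mining", "image retrieval"))
```

The score ranges from 0 (no shared terms) to 1 (identical term distributions) and does not depend on document length, which makes it a common default for comparing documents of different sizes.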


Theme and Industry Collaboration

Other LORNET themes
• Providing tools for concept-based metadata extraction to SFU and U of Saskatchewan.
• Providing tools for semantic-based ontology representation to SFU.
• Providing tools for searching course content and discussion data provided by U of Saskatchewan.
• Providing tools for comparing course content and discussion board data provided by U of Saskatchewan.

Industry
• Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
• Vestech provided opportunities for researchers to work on speech technologies.
• Desire2Learn opened job opportunities for LORNET researchers.


Software Components

Scenarios for Use of Software Components

Learning Object Repository
• Data types: Metadata, Structured Text, Categorical
• Tasks:
  - Automatic metadata extraction
  - LO automatic classification
  - LO organization through clustering
  - Multiple organization strategies through cluster ensembles

e-Learning Environment
• Data types: Structured Text, Images, Object Relationships, Context
• Tasks:
  - Extracting concepts from LO
  - Summarizing documents
  - Grouping LOs
  - Tagging LOs
  - Discovering similar topics
  - Discovering similar peers
  - Building social networks
  - Detecting plagiarism
  - LO recommendation using similarity ranking
  - Personalization / specialization through reinforcement learning

TELOS
• Data types: Metadata, Ontology
• Tasks:
  - Ontology construction and unification
  - Finding relations between components
  - Ranking components
  - Grouping components
  - Tagging components

Legend: Integrated, Ready, In Progress, Year 5

Overview of Components

General Tools
• C# Connector for TELOS
• Common Data Mining Framework

Standard Text Mining Tools
• Metadata Extractor
• Document Summarizer
• Content Package Summarizer
• Document Similarity
• LO Recommender
• Metadata Harvester
• Keyword Extractor
• Taxonomy Extractor
• Metadata Enrichment Tools

Concept-based and Semantic Text Mining Tools
• Metadata Extractor
• LO Search Engine
• Document Similarity
• Document Classifier
• Document Clusterer
• Semantic-based Ontology Representation
• Semantic Metadata Matching
• POS Rule-Learning System
• Triplet Representation System

Categorization Tools
• LO Classifier
• LO Multiple Classifier
• LO Clusterer
• LO Ensemble Clusterer
• LO Consensus Clusterer
• LO Distributed Clusterer

User-centric Tools
• Personalized Search Engine
• Social Network Learner

Image Mining Tools
• Content-based Image Search
• Personalized Image Search
• Consensus-based Fusion for Image Retrieval


Publications

Project                                          Papers (accepted / published)   Papers (submitted / in prep)   Theses (completed / in progress)
4.1 Information Extraction from Text             11                              7                              3/2
4.2 Semantic Knowledge Synthesis from Text       10                              4                              4/1
4.3 Knowledge Discovery through Categorization   12                              10                             4/1
4.4 Knowledge from Interaction                   8                               3                              1/2
4.5 Knowledge from Image Mining                  10                              3                              2/1
Total                                            51                              27                             14/7 = 21




Pattern Analysis and Machine Intelligence Lab

Electrical and Computer Engineering
University of Waterloo
Canada

www.pami.uwaterloo.ca

www.pami.uwaterloo.ca/projects/lornet/software/

www.pami.uwaterloo.ca/kamel.html publications
