Pattern Analysis & Machine Intelligence Research Group
UNIVERSITY OF WATERLOO
LORNET Theme 4: Data Mining and Knowledge Extraction for LO
TL: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
PAMI Research Group, University of Waterloo
Theme 4 Team
Leader: M. Kamel
PIs: Dr. Basir, Dr. Tizhoosh, Dr. Karray
Assoc. PIs: Wong, DiMarco
Researchers: H. Ayad, R. Kashef, A. Ghazel, Dr. Makrehchi, M. Shokri, S. Hassan, A. Farahat, Dr. R. Khoury
Funding: CRC/CFI/OIT, NSERC, PAMI Lab, PDS, Vestech, Desire2Learn
Graduated: R. Khoury, PhD 07; L. Chen, PhD 07; M. Makrehchi, PhD 07; K. Hammouda, PhD 07; R. Dara, PhD 07; Y. Sun, PhD 07; K. Shaban, PhD 06; Y. Sun, PhD 06; M. Hussin, PhD 05; Jan Bakus, PhD 05; A. Adegorite, MASc 04; A. Khandani, MASc 05; S. Podder, MASc 04
Data and Knowledge Mining
Knowledge extraction and discovery of patterns from data.
Labeling and categorization, summarization, classification, prediction, association rules, clustering
Theme Overview
- Knowledge Extraction
  • From Text: syntactic (keyword, keyphrase-based); semantic (concept-based)
  • From Images: image features, shape features
  • From Text + Images: describing images with text; enriching text with images
- Tagging and Organizing
  • Classification (MCS, Data Partitioning, Imbalanced Classes)
  • Clustering (Parallel/Distributed Clustering, Cluster Aggregation)
- Matching and Ranking
  • LO Similarity and Ranking
  • Association Rules / Social Networks
- LO Mining
  • Reinforcement Learning
  • Specialized / Personalized Search
Types of Data in LORNET
- LCMS (Course > Module > Lesson > LO). Subject Matter: Text, Images, Flash, Applets, Metadata, Interaction Logs
- Discussion Board (Board > Thread > Post). Discussions: Text, Interaction Logs
- LOR (Metadata Records). LO Descriptors: Metadata
- TELOS (Semantic Layer > Resource). Resources: Metadata, Semantic References
Abstract View of Data for Mining
Text (Plain or Markup)
- Any resource that contains text is viewed as an abstract text document (some markup can be preserved to indicate different weights); e.g. HTML page, Word document, email message, discussion post, even metadata records.
- Suitable for text mining, information/metadata extraction, summarization, natural language processing, semantic/concept analysis, social network analysis.
Numeric Matrix (Vector Space Model)
- Requires text mining algorithms to convert the original text to numeric form through feature extraction and statistical weighting.
- Suitable for machine learning algorithms that expect numeric input, especially classification and clustering algorithms.
Feature Vectors
- Suitable for mining images: description, indexing, and retrieval (CBIR). Requires image processing algorithms to extract image features.
- Also suitable for mining and learning from interaction logs, where each vector describes an event.
Relationship
- Provides domain knowledge about data, such as containment (e.g. LO within Course, Post within Thread) and relatedness (collection of resources, cross-referenced LOs).
- The extra knowledge could be exploited to improve accuracy, or to apply the same algorithm to different parts of the data (e.g. generating one summary for an entire course, or one summary per lesson).
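The conversion from text to a numeric matrix mentioned above can be illustrated with a minimal sketch. It assumes a plain whitespace tokenizer and the common tf × log(N/df) weighting; the function names and the sample documents are hypothetical and are not taken from the Theme 4 framework.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokens, alphabetic only.
    return [w for w in text.lower().split() if w.isalpha()]

def tfidf_matrix(docs):
    """Return (vocabulary, matrix) with one row per document."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    matrix = []
    for doc in tokenized:
        tf = Counter(doc)
        matrix.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vocab, matrix

docs = [
    "clustering groups similar documents",
    "classification assigns documents to classes",
    "clustering and classification are mining tasks",
]
vocab, matrix = tfidf_matrix(docs)
```

Each row of `matrix` can then be passed to a classifier or clustering algorithm that expects numeric input.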
Data Representation
- What level of granularity
- One representation or multiple
- Feature representation
- Dimensionality issues
Document Modeling
Document is represented by a set of concepts called “indexing terms”
Document segmentation
- sub-word level (decomposition of words and their morphology)
- word level (words and lexical information)
- multi-word level (phrases and syntactic information)
- semantic level (the meaning of the text)
- pragmatic level (the meaning of the text with respect to the context and situation; ontology?)
Document Modeling
[Figure: the five levels (sub-word, word, multi-word, semantic, pragmatic) compared along these axes: required domain knowledge, noise & redundancy, dimensionality, complex algorithms, content-based vs. context-based]
Document Modeling
[Figure: the five levels (sub-word, word, multi-word, semantic, pragmatic) annotated with the labels: Term-level (most popular), Emerging, Not explored, Not usual]
Document Modeling
Bag-of-words (VSM)
- most popular document representation model
- word sequence
- weighting terms by their importance (based on frequency)
- terms are independent and uncorrelated
Bag-of-words (VSM): Drawbacks
- ignoring term dependencies and correlations
- ignoring text structure
- ignoring ordering of the words in the document
  • IR research shows that word ordering is not important.
- ignoring grammar
- language independent
Solutions: generalized VSM, LSI, phrase-based model, concept-based representation
Curse of Dimensionality
- The number of training samples required is an exponential function of the number of features.
- For a fixed sample size, increasing the number of features may degrade performance (Peaking Phenomenon).
- Limited sample size leads to the overfitting problem, which implies a lack of generalization and low performance.
Dimensionality Reduction
Feature extraction
- employing all dimensions of the measurement space to obtain a new transformed space (compacting the feature space without removing any features)
  • identifying important combinations of the features (PCA, manifold learning, SVD and factor analysis)
  • low dimensional embeddings (random projections)
Pros and Cons
+ promising results
+ solid mathematical background
- high complexity (time and space)
- lack of scalability
- fails in high dimensional problems of data mining
- extracted features usually have no meaning
Dimensionality Reduction
Feature selection
- reducing the feature space dimensionality by removing useless, redundant, irrelevant and noisy features
- it is a problem of searching for a subset of features among the total number of features based on one or more performance indices (objective functions)
Makrehchi and Kamel, IEEE SMC 07.
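As an illustration of feature selection as a search driven by a performance index, here is a minimal filter-style sketch. It uses the chi-square statistic between term presence and class membership as the index; this is a standard stand-in chosen for the example, not the method of the cited paper, and the names and toy data are hypothetical.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic of a 2x2 table: term presence vs. class membership."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == target
        if has_term and in_class:
            a += 1
        elif has_term:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Keep the k terms with the highest score over all classes."""
    vocab = sorted({t for tokens in docs for t in tokens})
    classes = sorted(set(labels))
    scores = {t: max(chi_square(docs, labels, t, c) for c in classes) for t in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda t: (-scores[t], t))[:k]

# Each document is a set of terms (assumption: already tokenized).
docs = [{"ball", "team", "the"}, {"goal", "team", "the"},
        {"crop", "soil", "the"}, {"crop", "farm", "the"}]
labels = ["sports", "sports", "farming", "farming"]
selected = select_features(docs, labels, 2)
```

The term "the" occurs in every document, so it scores zero and is dropped, while the class-specific terms are kept.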
New Representation Models
- Phrase Based Representation: Document Index Graph (DIG). Hammouda and Kamel, KIS 2004, IEEE KDE 2004
- Concept Based Representation. Shehata, Karray and Kamel, ICDM 2006, KDD 07, WI07
Concept-based Mining Model
[Figure: Text is passed to a text pre-processor. Language-dependent part: natural language processing (POS tagger, syntax parser, semantic role labeler). Language-independent part: concept-based model (sentence separator; concept-based statistical analyzer using tf: term frequency and ctf: conceptual term frequency; Conceptual Ontological Graph (COG) representation). Output: concepts.]
Concept-based Document Similarity
[Figure: processing pipeline with the following stages]
- Text docs
- Text preprocessing: separate sentences, label terms, remove stop-words, stem words
- Concept-based term analysis: term frequency (tf), conceptual term frequency (ctf)
- Concept-based statistical analyzer
- Concept-based document similarity
- Clustering techniques: single pass, HAC (Ward), HAC (complete), k-NN
- Output: Cluster1, Cluster2, Cluster3
Evaluation
F-measure of the HAC (Ward) (higher is better)
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.723         0.925           +27.94%
ACM       0.697         0.918           +31.70%
Brown     0.581         0.906           +55.93%
Entropy of the HAC (Ward) (lower is better)
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.251         0.012           -95.21%
ACM       0.317         0.043           -86.43%
Brown     0.385         0.018           -95.32%
Evaluation (cont.)
F-measure of the k-NN
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.511         0.917           +79.45%
ACM       0.491         0.891           +81.46%
Brown     0.462         0.902           +95.23%
Entropy of the k-NN
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.348         0.015           -95.68%
ACM       0.402         0.111           -29.1%
Brown     0.316         0.023           -23.03%
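For readers who want to reproduce this kind of evaluation, the sketch below shows one common way to compute clustering entropy (size-weighted class entropy per cluster, normalized by the log of the number of classes) and a relative improvement percentage. The slides do not state the exact formulas used, so both definitions are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter

def clustering_entropy(cluster_ids, class_labels):
    """Size-weighted average of per-cluster class entropy, scaled to [0, 1]."""
    n = len(cluster_ids)
    n_classes = len(set(class_labels))
    total = 0.0
    for cid in set(cluster_ids):
        members = [c for k, c in zip(cluster_ids, class_labels) if k == cid]
        counts = Counter(members)
        h = -sum((m / len(members)) * math.log(m / len(members))
                 for m in counts.values())
        if n_classes > 1:
            h /= math.log(n_classes)  # normalize so a uniform mix gives 1.0
        total += len(members) / n * h
    return total

def relative_change(baseline, new):
    """Relative change in percent, with the baseline as reference."""
    return (new - baseline) / baseline * 100.0

pure = clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"])
mixed = clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"])
reuters_gain = round(relative_change(0.723, 0.925), 2)
```

Under this definition a perfectly pure clustering scores 0 and a fully mixed one scores 1, which matches the "lower is better" reading of the entropy tables.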
Classification
- Function that assigns an object to a class
- Infer that “object X is about sports”
- Automatically learn the function from a set of examples
[Figure: a classifier maps a set of objects to known classes (sports, farming, finance)]
Classifiers
- Template Matching: user needs to supply template and metric
- NMC: nearest class mean, simple, no training
- K-NN: asymptotically optimal, slow in testing
- Bayes: yields simple classifier for Gaussian distributions
- NN: nonlinear, sensitive to parameters, slow training
- DT: binary, transparent, sensitive to overtraining
- SVM: nonlinear, insensitive to overtraining, slow, good generalization
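The simplest entry in the list, the nearest mean classifier (NMC), can be sketched in a few lines. This is an illustrative version under the assumption of numeric feature vectors and squared Euclidean distance; the function names and data are hypothetical.

```python
def train_nmc(samples, labels):
    """Compute one mean vector per class."""
    sums, counts = {}, {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict_nmc(means, x):
    """Assign x to the class whose mean is closest (squared Euclidean distance)."""
    def sqdist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(means, key=lambda y: sqdist(means[y]))

samples = [[0, 0], [0, 2], [10, 10], [12, 10]]
labels = ["a", "a", "b", "b"]
means = train_nmc(samples, labels)
near_a = predict_nmc(means, [1, 1])
near_b = predict_nmc(means, [9, 9])
```

The only "training" is averaging the samples of each class, which is why the method is listed as simple.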
Multiple Classifier Systems
Multiple classifier systems consist of a set of classifiers and a combination strategy.
Motivations:
- Existence of many alternative classifiers, each with its own feature and representation space
- Existence of different training sets, collected at different times and possibly with different features
- Each classifier may have good performance in its own region of the feature space
- Classifiers may have different patterns for making mistakes, even when they are trained on the same data
Multiple Classifier Systems Design
Design of MCS can be accomplished at 4 levels [Kuncheva 04]:
- Aggregation level
- Classifier level
- Feature level
- Data level
[Figure: training data D1, D2, ..., Dn and X1, X2, ..., Xm feed Classifier 1, Classifier 2, ..., Classifier n, whose outputs are combined by an aggregation rule]
Combining Schemes
- Static vs Adaptive, Fixed vs Trainable
- Voting methods: max, average, majority, Borda
- Weighted average, fuzzy integrals, belief theory
- Decision Template, Behavior Knowledge Space
- Feature Based Architecture (Adaptive) (Wanas and Kamel 99-02): aggregation is trained and adapts to the data rather than postprocessing
- Data Level combining: partitioning technique for training multiple classifiers (Dara, .. and Kamel IF04, PR 06) that generates nearly optimal training partitions
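Two of the fixed voting rules named above, majority voting and Borda count, can be sketched as follows. This is a minimal illustration; the function names, the alphabetical tie-breaking, and the sample votes are assumptions made for the example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by most classifiers (ties: alphabetical)."""
    counts = Counter(predictions)
    best = max(counts.values())
    return sorted(label for label, c in counts.items() if c == best)[0]

def borda_count(rankings):
    """Each classifier ranks all labels; a label at position p of n gets n-1-p points."""
    scores = Counter()
    for ranking in rankings:
        n = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += n - 1 - position
    # Iterating in sorted order makes ties resolve alphabetically.
    return max(sorted(scores), key=lambda label: scores[label])

voted = majority_vote(["sports", "finance", "sports"])
ranked = borda_count([
    ["sports", "finance", "farming"],
    ["finance", "sports", "farming"],
    ["finance", "farming", "sports"],
])
```

Majority voting only needs each classifier's top label, while Borda count uses the full ranking, so it can prefer a label that is consistently ranked high even if it is rarely ranked first.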
Imbalanced Classes
(Sun and Kamel, ICDM 2006, PR 2007)
Data Set: 20-Newsgroup. Class size ratio: 1/15. Performance measure: F-measure. Base classifier: Naïve Bayesian.
Measure           NB      AdaBoost   AdaC1   AdaC2   AdaC3
F (small class)   58.25   59.26      64.11   69.08   68.91
Acc               97.13   97.98      98.28   98.31   98.42
F (large class)   94.63   96.15      96.73   96.80   97.00
Data Set: SchoolNet. Class size ratio: 1/12. Performance measure: F-measure. Base classifier: Decision Trees (C4.5).
Measure           C4.5    AdaBoost   AdaC1   AdaC2   AdaC3
F (small class)   22.78   31.58      35.16   52.73   53.85
Acc               92.50   93.63      92.63   93.35   93.91
F (large class)   86.32   88.34      86.77   88.34   89.24
Observations:
1. Performance of the base classifier on the small class is poor
2. AdaBoost can improve classification accuracy
3. AdaBoost does not guarantee improved performance on the small class
4. AdaC2 and AdaC3 are effective in increasing the identification performance of the small class
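The gap between accuracy and small-class F-measure seen in these results can be reproduced with a toy example. The sketch below assumes a 1/15 class ratio and a degenerate classifier that always predicts the large class; the data and function names are hypothetical.

```python
def precision_recall_f(y_true, y_pred, positive):
    """Precision, recall and F-measure for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# One small-class sample for every fifteen large-class samples.
y_true = ["small"] + ["large"] * 15
y_pred = ["large"] * 16  # classifier that ignores the small class

acc = accuracy(y_true, y_pred)
_, _, f_small = precision_recall_f(y_true, y_pred, "small")
_, _, f_large = precision_recall_f(y_true, y_pred, "large")
```

Accuracy stays high even though the small class is never identified, which is why the slides report F-measure per class alongside accuracy.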
Dealing with time-dependent data
- Time series data contains dynamic information and is difficult to model with any individual representation method
- Traditional classifiers for time series data, like Dynamic Time Warping (DTW), are not robust
- Aggregating the decisions based on different representations could provide better and more reliable performance (Chen and Lei 2004-2006)
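For reference, the DTW distance mentioned above can be computed with the textbook dynamic program below. This is a minimal sketch assuming one-dimensional series and absolute difference as the local cost; it is not the aggregation approach of the cited work.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] matched again
                                 cost[i][j - 1],      # b[j-1] matched again
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

A nearest-neighbor classifier over this distance is a typical baseline for time series; the warping absorbs local stretching, as the first example shows.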
Clustering
Finding groups of objects such that objects in a group are similar to one another and different from (dissimilar to) objects in other groups
- Intra-cluster distances are minimized
- Inter-cluster distances are maximized
Clustering Approaches
- Hierarchical: single link
- Partitional: K-means, Fuzzy K-means, Bisecting, VQ
- Density based: DBScan, Chameleon
- Agglomerative: starts from individual clusters, then merges
- Divisive: starts from one cluster and divides
- Connectionist: SOM, ART
Overview of Combining Cluster Ensembles
[Figure: multi-clustering produces several partial/local clusterings, which form the generated cluster ensemble. The combining method (ensemble representation, clusters mapping method, and ensemble combination scheme by summarization/voting) produces the combined/global clustering.]
Cluster Ensemble
Developed a prototype for cluster ensemble methods (Ayad and Kamel 2005-2007), including:
- Generation of cluster ensembles based on: (1) multiple feature subsets, (2) statistical sampling techniques, and (3) variable number of clusters (multi-resolution ensembles).
- Combiners of cluster ensembles based on: (1) shared nearest neighbors, (2) different representations and distance measures between clusters, and (3) voting.
Positive experimental results on text data, in addition to a variety of benchmark data for machine learning algorithms
Categorization using cluster ensemble
Dataset                     # samples   # attributes   # classes   K-means mean error rate (%)   Ensemble mean error rate (%)
Synthetic1                  1000        8              5           17.41                         0
Yahoo! (text)               2340        1458           6           38.23                         16.24
Texture (image)             5500        40             11          37.99                         11.54
Optical Digit Recognition   500         64             10          27.31                         16.40
Projects Overview
Data types: Text Document, Image, Interaction Logs
- Information Extraction: analyzing content to extract relevant information
  • Keyword Extraction
  • Summarization
  • Concept Extraction
  • Social Network Analysis
- Categorization: organizing LOs according to their content
  • Text Document Classification: Traditional, MCS, Imbalanced
  • Clustering: Traditional, Ensembles, Distributed
- Personalization: providing user-specific results
  • Reinforcement Learning: Traditional, Opposition-based
- Image Mining: describing and finding relevant images
  • CBIR: Traditional, Fusion-based
- Integration and Applications
  • In Progress
  • Publications
  • Theme and Industry Collaboration
  • Software Components
Information Extraction: Summarization
LO Content Package Summarization
- Learning objects stored in IMS content packages are loaded and parsed. Textual content files are extracted for analysis.
- Statistical term weighting and sentence ranking are performed on each document and on the whole collection.
- Top relevant sentences are extracted for each document.
- Planned functionality: summarization of whole modules or lessons (as opposed to single documents).
Benefits
- Provide summarized overview of learning objects for quick browsing and access to learning material.
Scenarios
- Learning Management Systems can call the summarization component to produce summaries for content packages.
Data is courtesy of the University of Saskatchewan
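The term weighting and sentence ranking steps described for the summarizer can be illustrated with a small extractive sketch. It assumes raw term frequency as the weight, average term weight as the sentence score, and a tiny hypothetical stop-word list; the actual Theme 4 component may differ on all three points.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept short for the example.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in", "for", "it"}

def content_terms(text):
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

def split_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    weights = Counter(content_terms(text))  # term weight = frequency in document

    def score(sentence):
        terms = content_terms(sentence)
        return sum(weights[w] for w in terms) / len(terms) if terms else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

text = ("Learning objects are stored in content packages. "
        "Summaries of learning objects help quick browsing of learning objects. "
        "It rained today.")
summary = summarize(text, 1)
```

Summarizing a whole module or lesson, as planned in the slide, could reuse the same scoring with weights computed over the concatenated documents.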
Information Extraction: Social Network Analysis
Social Network Builder
Tasks
- Finding relationships between people based on their web pages
Progress
- Modeling
  • Actors are represented by their associated documents
  • Links are modeled by pair-wise similarity of the actors’ documents, or by merging the actors’ documents (relations are also modeled by documents)
- Learning
  • Some links are known: learning the social network is translated into a text classification problem
  • No link is revealed: a clustering problem with very low performance
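The pair-wise similarity model of links can be sketched as follows. The example assumes bag-of-words vectors, cosine similarity, and a fixed threshold for declaring a link; the threshold, names, and documents are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_links(actor_docs, threshold):
    """Link two actors when their documents are similar enough."""
    vectors = {name: Counter(text.lower().split())
               for name, text in actor_docs.items()}
    links = []
    for u, v in combinations(sorted(vectors), 2):
        s = cosine(vectors[u], vectors[v])
        if s >= threshold:
            links.append((u, v, round(s, 3)))
    return links

actor_docs = {
    "alice": "text mining clustering",
    "bob": "clustering text documents",
    "carol": "image retrieval shape",
}
links = build_links(actor_docs, 0.5)
```

When some links are already known, the same pair-wise scores could instead serve as features for a classifier, in line with the text classification formulation above.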
Information Extraction: Concept Extraction
[Figure: Concept-based model. Text is passed to a text pre-processor. Language-dependent part: natural language processing (POS tagger, syntax parser, semantic role labeler / semantic parser). Language-independent part: sentence separator; Concept-Based Statistical Analyser (tf: term frequency, ctf: conceptual term frequency); Conceptual Ontological Graph (COG) representation and COG ranking. Output: concepts.]
F-measure of Hierarchical Clustering
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.723         0.925           +27.94%
ACM       0.697         0.918           +31.70%
Brown     0.581         0.906           +55.93%
Entropy of Hierarchical Clustering
Dataset   Single-Term   Concept-based   Improvement
Reuters   0.251         0.012           -95.21%
ACM       0.317         0.043           -86.43%
Brown     0.385         0.018           -95.32%
Precision of Search
Dataset   Single-Term   Concept-based   Improvement
Cran      0.536         0.901           +68.09%
Reuters   0.591         0.897           +51.77%
Recall of Search Result
Dataset   Single-Term   Concept-based   Improvement
Cran      0.486         0.827           +70.16%
Reuters   0.452         0.841           +86.06%
Information Extraction: Keyword Extraction
Semantic Keyword Extraction
Tasks
- Developing tools and techniques to extract semantic keywords toward facilitating metadata generation
- Developing algorithms to enrich metadata (tags) which can be applied in index-based multimedia retrieval
Progress
- Proposed a new information theoretic inclusion index to measure the asymmetric dependency between terms (and concepts), which can be used in term selection (keyword extraction) and taxonomy extraction (pseudo ontology)
Makrehchi, M. and Kamel, ICDM07, WI 07
Information Extraction: Keyword Extraction
Rule-based Keyword Extraction
Learn rules to find keywords in English sentences
- Rules represent sentence fragments
- Specific enough for reliable keyword extraction
- General enough to be applied to unseen sentences
Rule generalization
- Begin with an exact sentence fragment
- Merge with another by moving different words to the lowest common level in the part-of-speech hierarchy
- Keep merged rule if it does not reduce precision and recall of keyword extraction; keep original rules otherwise
Keyword extraction
- Find sequence of rules that best cover an unseen sentence
- Extract keywords according to rules
Rule base size shows quick initial growth, followed by slow and irregular growth and rule elimination
- Learns 20 rules from the first 50 training rules
- Learns 13 additional rules from the next 220 training rules
Both precision and recall values increase during training
- Precision (blue) increases 10%
- Recall (red) shows slight upward trend
Categorization: Ensemble-based Clustering
Consensus Clustering
- Categorization of learning objects using proposed consensus clustering algorithms.
- The goal of consensus clustering is to find a clustering of the data objects that optimally summarizes an ensemble of multiple clusterings.
- Consensus clustering can offer several advantages over a single data clustering, such as the improvement of clustering accuracy, enhancing the scalability of clustering algorithms to large volumes of data objects, and enhancing the robustness by reducing the sensitivity to outlier data objects or noisy attributes.
Tasks
- Development of techniques for producing ensembles of multiple data clusterings where diverse information about the structure of the data is likely to occur.
- Development of consensus algorithms to aggregate the individual clusterings.
- Develop solutions for the cluster symbolic-label matching problem.
- Empirical analysis on real-world data and validation of proposed method.
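The label matching problem and a voting-based consensus can be illustrated together. The sketch below assumes that every clustering uses labels 0..k-1 and that k is small enough to try every label permutation; it is a simple illustration of the idea, not the consensus algorithms developed in the project, and all names are hypothetical.

```python
from collections import Counter
from itertools import permutations

def align_labels(reference, clustering, k):
    """Relabel `clustering` with the permutation that best agrees with `reference`."""
    best_perm, best_agree = None, -1
    for perm in permutations(range(k)):
        agree = sum(1 for r, c in zip(reference, clustering) if perm[c] == r)
        if agree > best_agree:
            best_perm, best_agree = perm, agree
    return [best_perm[c] for c in clustering]

def consensus_by_voting(ensemble, k):
    """Align every clustering to the first one, then vote per object."""
    reference = ensemble[0]
    aligned = [reference] + [align_labels(reference, c, k) for c in ensemble[1:]]
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*aligned)]

ensemble = [
    [0, 0, 1, 1, 2, 2],
    [1, 1, 0, 0, 2, 2],  # same grouping as the first, labels 0 and 1 swapped
    [0, 0, 1, 2, 2, 2],  # disagrees on the fourth object
]
consensus = consensus_by_voting(ensemble, 3)
```

Brute-force matching grows factorially with k; for larger k an assignment solver such as the Hungarian algorithm would be the usual replacement.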
Distributed Environments
Distributed Data Mining: applying data mining in an environment where the data, the mining process, or both are distributed.
Motivation Natural distribution of data on the Web.
Scenarios that require the integration of disparate data and mining results are emerging (e.g. federation of repositories, news feed aggregation, digital libraries, business intelligence gathering, etc.)
Emerging technologies, such as Semantic Web, Web Services, Grid Computing, make it feasible to build distributed mining systems.
Availability of cheap low-end hardware that could be utilized in a distributed environment to achieve high-end goals (e.g. Google, SETI@Home, Folding@Home, etc.)
Categorization: Distributed Clustering
- Peer nodes are arranged into groups called “neighborhoods”.
- Multiple neighborhoods are formed at each level of the hierarchy.
- The size of each neighborhood is determined through a network partitioning factor.
- Each neighborhood has a designated supernode.
- Supernodes of level h form the neighborhoods for level h+1.
- Clustering is done within neighborhood boundaries, then is merged up the hierarchy through the supernodes.
Benefits
- Significant speedup over centralized clustering and flat peer-to-peer clustering.
- Multiple levels of clusters.
- Distributed summarization of clusters using CorePhrase keyphrase extraction.
Scenarios
- Distributed knowledge discovery in hierarchical organizations.
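Hierarchical P2P Document Clustering
[Figure: HP2PC Architecture. Neighborhoods (Q) and supernodes (S) at levels h = 0, h = 1, h = 2, ..., h = H-1, h = H (root).]
HP2PC Example: 3-level network, 16 nodes
- h = 0, β = 0.2: $P^{(0)} = \{p_1^{(0)}, \ldots, p_{16}^{(0)}\}$, $Q^{(0)} = \{Q_1^{(0)}, \ldots, Q_4^{(0)}\}$
- h = 1, β = 0.33: $P^{(1)} = \{p_1^{(1)}, p_2^{(1)}, p_3^{(1)}, p_4^{(1)}\}$, $Q^{(1)} = \{Q_1^{(1)}, Q_2^{(1)}\}$
- h = 2, β = 0: $P^{(2)} = \{p_1^{(2)}, p_2^{(2)}\}$, $Q^{(2)} = \{Q_1^{(2)}\}$
- h = 3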
Categorization: Multiple Classifier Systems
Tasks
- To investigate various aspects of cooperation in Multiple Classifier Systems (Classifier Ensembles)
- To develop evaluation measures in order to estimate various types of cooperation in the system
- To gain insight into the impact of changes in the cooperative components with respect to system performance using the proposed evaluation measures
- To apply these findings to optimize existing ensemble methods
- To apply these findings to develop novel ensemble methods with the goal of improving classification accuracy and reducing computation complexity
Progress
- Proposed a set of evaluation measures to select sub-optimal training partitions for training classifier ensembles.
- Proposed an ensemble training algorithm called Clustering, De-clustering, and Selection (CDS).
- Proposed and optimized a cooperative training algorithm called Cooperative Clustering, De-clustering, and Selection (CO-CDS).
- Investigated the applications of proposed training methods (CDS and CO-CDS) on LO classification.
Categorization: Imbalanced Class Distribution
Objective Advance classification of multi-class imbalanced data
Tasks
To develop cost-sensitive boosting algorithm AdaC2.M1
To improve the identification performance on the important classes
To balance classification performance among several classes
Categorization: Imbalanced Class Distribution
Class Distribution
Ind.   Size   Dist.
C1     49     7.84%
C2     288    46.08%
C3     288    46.08%
Performance of Base Classification and AdaBoost
                  C4.5               HPWR (Od=3)
Class   Meas.     Base    AdaBoost   Base    AdaBoost
C1      R         0       5.11       10.70   44.06
        P         N/A     6.5        11.82   32.89
        F         N/A     5.84       10.83   35.84
C2      R         73.21   92.28      88.31   87.43
        P         69.53   88.75      86.79   91.99
        F         72.29   90.38      87.43   89.64
C3      R         67.94   91.36      87.63   88.42
        P         73.89   87.88      87.07   89.91
        F         71.91   89.42      86.99   89.03
G-measure         0       11.46      33.32   68.50
Balanced performance among classes, evaluated by G-mean
                  C4.5                          HPWR (Od=3)
Class   Meas.     Base    AdaBoost   AdaC2.M1   Base    AdaBoost   AdaC2.M1
C1      R         0       5.11       77.58      10.70   44.06      65.72
        P         N/A     6.50       14.12      11.82   32.89      30.83
C2      R         73.21   92.28      64.73      88.31   87.43      83.12
        P         69.53   88.75      97.24      86.79   91.99      91.38
C3      R         67.94   91.36      65.23      87.63   88.42      83.95
        P         73.89   87.88      93.22      87.07   89.91      90.81
G-mean            0       11.46      68.42      33.32   68.50      76.08
Personalization
Opposition-based Reinforcement Learning for Personalizing Image Search
- Developing a reliable technique to assist users and to facilitate and enhance the learning process
- The personalized ORL tool helps the user find the desired images among the search results
- The tool gathers the images returned by the search and selects a sample of them
- By presenting the sample and interacting with the user, it learns the user’s preferences
Personalization
Opposition-based RL algorithms:
- OQ(lambda) (International Joint Conference on Neural Networks, 2006)
- NOQ(lambda) (IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning, 2007)
Image Mining: CBIR
Content based image retrieval
- Build an IR system that can retrieve images based on: textual cues, image content, NL queries
[Figure: Image Retrieval Tool Set. Inputs: query image (QI), query text (QT), query document. Outputs: images that match QI, images that contain QT, documents that contain QI, image-rich documents, NL description of image, automated image tagging.]
Illustrative Example
[Figure: retrieval results compared for IZM, FD, MTAR, and the proposed approach. Accuracies shown in the figure: 70%, 55%, 60%, 95%.]
The Performance of the proposed approach
Experimental Results (Cont’d)
Image Mining: CBIR
Interface Module to TELOS
[Figure: components shown: TELOS IKB-BLDR, LOR, Image Admission Interface, LO Image Repository, TELOS IR. Data flows are labeled Compound Document, Image, Text Query, and Response.]
Integration and Applications
Progress
- Finished core parts of the common data mining framework.
- Built components and services from theme researchers’ work around the data mining framework.
- Provided documentation for the data mining framework and software components.
- Launched web site to host components and documentation from Theme 4: http://pami.uwaterloo.ca/projects/lornet/software/
Integration and Applications
Progress
Core parts of the common data mining framework are available, including:
• Vector and matrix manipulation.
• Document parsing and tokenization.
• Statistical term and sentence analysis.
• Similarity calculation using multiple distance functions.
• IMS Content Package compliant parser.
Components and tools built around the common data mining framework:
• Metadata extraction from single documents; supports Dublin Core encoding.
• Document similarity calculation using cosine similarity.
• Single document and content package summarization.
• Building of standard text datasets from large document collections.
Integration with TELOS:
• Developed C# TELOS connector for integrating Theme 4 components.
• Worked on component manifest specification with Theme 6.
• Provided metadata extraction as part of a complete scenario for TELOS components integration.
• The following components were wrapped for use by TELOS through the C# connector: Automatic Metadata Extractor, Document Similarity, and Document Summarizer.
Theme and Industry Collaboration
Other LORNET themes
- Providing tools for concept-based metadata extraction to SFU and U of Saskatchewan.
- Providing tools for semantic-based ontology representation to SFU.
- Providing tools for searching course content and discussion data provided by U of Saskatchewan.
- Providing tools for comparing between course content and discussion board data provided by U of Saskatchewan.
Industry
- Pattern Discovery Software (PDS) provided data mining software tools for use by researchers.
- Vestech provided opportunities for researchers to work on speech technologies.
- Desire2Learn opened job opportunities for LORNET researchers.
Software Components
Scenarios for Use of Software Components (Environment / Data Types / Tasks)
- Learning Object Repository
  • Data types: Metadata, Structured Text, Categorical
  • Tasks: automatic metadata extraction; LO automatic classification; LO organization through clustering; multiple organization strategies through cluster ensembles
- e-Learning Environment
  • Data types: Structured Text, Images, Object Relationships, Context
  • Tasks: extracting concepts from LO; summarizing documents; grouping LOs; tagging LOs; discovering similar topics; discovering similar peers; building social networks; detecting plagiarism; LO recommendation using similarity ranking; personalization / specialization through reinforcement learning
- TELOS
  • Data types: Metadata, Ontology
  • Tasks: ontology construction and unification; finding relations between components; ranking components; grouping components; tagging components
Overview of Components (Legend: Integrated, Ready, In Progress, Year 5)
- General Tools: C# Connector for TELOS; Common Data Mining Framework
- Standard Text Mining Tools: Metadata Extractor; Document Summarizer; Content Package Summarizer; Document Similarity; LO Recommender; Metadata Harvester; Keyword Extractor; Taxonomy Extractor; Metadata Enrichment Tools
- Concept-based and Semantic Text Mining Tools: Metadata Extractor; LO Search Engine; Document Similarity; Document Classifier; Document Clusterer; Semantic-based Ontology Representation; Semantic Metadata Matching; POS Rule-Learning System; Triplet Representation System
- Categorization Tools: LO Classifier; LO Multiple Classifier; LO Clusterer; LO Ensemble Clusterer; LO Consensus Clusterer; LO Distributed Clusterer
- User-centric Tools: Personalized Search Engine; Social Network Learner
- Image Mining Tools: Content-based Image Search; Personalized Image Search; Consensus-based Fusion for Image Retrieval
Publications
Subtheme                                         Papers (accepted / published)   Papers (submitted / in prep)   Theses (completed / in progress)
4.1 Information Extraction from Text             11                              7                              3/2
4.2 Semantic Knowledge Synthesis from Text       10                              4                              4/1
4.3 Knowledge Discovery through Categorization   12                              10                             4/1
4.4 Knowledge from Interaction                   8                               3                              1/2
4.5 Knowledge from Image Mining                  10                              3                              2/1
Total                                            51                              27                             14/7 (21 in total)