Standardized Descriptions of Taxa in Paleontology
By Vincent Yu, Francisca Oboh-Ikuenobe, Bih-Ru Lea
Application of Text Mining in Developing Standardized Descriptions of Taxa in
Paleontology: A Framework
Francisca E. Oboh-Ikuenobe, Geological Science & Engineering, [email protected]
Wen-Bin (Vincent) Yu, Information Science & Technology, [email protected]
Bih-Ru Lea, Business Administration, [email protected]
University of Missouri – Rolla
Rolla, MO 65401, USA
Problems in Palynomorph Classifications and Interpretations
Description #1: Proximate, acavate cyst, large (75-95 µm) with irregular perforated parasutural crests, well developed paratabulation (including parasulcal paraplates), and precingular archaeopyle.
Description #2: Large proximate, cyst with irregular perforated crests, well developed paratabulation.
[Figure: specimen image; scale bar 40 µm]
Text Mining or Information Retrieval?
Google search / online article database
– Keyword search (existing knowledge)
Web mining / Semantic Web
– Structured ontology for information extraction
Text mining
– A special case of data mining
– Discovers new knowledge by finding interesting and non-trivial patterns in textual documents
– Usually starts with clustering similar documents and extracting key terms associated with each cluster
Framework
[Figure: framework diagram. Recoverable labels:]
– Data sources: Sample Database, CHRONOS, DinoSys, …
– User Input
– Text mining: Text Preprocessing & Pattern Identification Module, Variable Reduction Module, Clustering Module
– Data mining: Model Creation Module, Model Evaluation Module, Model Selection Module
– Descriptive Model, Predictive Model
– Standardized Taxon Recommendation Modeling Module
– Data Mining / Software Agent
Framework – Text Mining Process
Text mining finds hidden patterns in textual documents by grouping similar descriptions.
Example text: "… portable and mobile video services could be ported to other locations, other devices or other media…." [IPTV, WiMAX get star billing at Supercomm 2005]
Data (document) cleansing
Text processing
– Parsing
– Stop words and start words
– Parts of speech
– Stemming
– Synonyms, jargon, abbreviations
Term frequency matrix
Weighting scheme
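The text-processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the stop-word list, the synonym table, and the naive suffix-stripping `stem` function are all hypothetical stand-ins, and part-of-speech tagging is left out.

```python
import re

# Hypothetical word lists, for illustration only.
STOP_WORDS = {"a", "an", "and", "of", "the", "with"}
SYNONYMS = {"big": "large", "huge": "large"}
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Naive suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in SUFFIXES:
        # Only strip when at least three letters of the word remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, start_words=None):
    """Parse, filter, normalize, and stem one description."""
    # Parsing: lower-case the text and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stop words: drop tokens that carry no descriptive meaning.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Start words: if a list is given, keep only tokens on that list.
    if start_words is not None:
        tokens = [t for t in tokens if t in start_words]
    # Synonyms: map variants to one preferred term, then stem.
    tokens = [SYNONYMS.get(t, t) for t in tokens]
    return [stem(t) for t in tokens]


description = ("Large proximate, cyst with irregular perforated crests, "
               "well developed paratabulation.")
print(preprocess(description))
```

Applied to Description #2 from the earlier slide, the sketch drops "with", reduces "crests" to "crest", and truncates "perforated" and "developed" to crude stems; a production system would use a proper stemmer and a domain-specific synonym list.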
Text Mining Process II
Term frequency matrix
– Represent each document with selected terms
– A note on discriminating power: terms with the most discriminating power are those that occur frequently, but only in a few documents
– Determine term weights and frequency weights
         Term 1: process   Term 2: web   Term 3: voice   Term 4: id
Doc 1    1                 1             0               1
Doc 2    0                 0             1               1
Doc 3    1                 1             1               0
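As a rough sketch of how such a matrix might be built, the function below counts each selected term in each tokenized document. The token lists are toy data invented for illustration; only the four term names come from the slide.

```python
from collections import Counter


def term_frequency_matrix(docs, terms):
    """Return one row per document; row[i] is the count of terms[i] in it."""
    rows = []
    for tokens in docs:
        counts = Counter(tokens)          # missing terms count as 0
        rows.append([counts[t] for t in terms])
    return rows


terms = ["process", "web", "voice", "id"]
# Hypothetical, already-preprocessed documents.
docs = [
    ["process", "web", "id"],
    ["voice", "id"],
    ["process", "web", "voice"],
]
for name, row in zip(["Doc 1", "Doc 2", "Doc 3"], term_frequency_matrix(docs, terms)):
    print(name, row)
```

Each row is one document and each column one selected term, so later steps (weighting, variable reduction, clustering) can treat the descriptions as numeric vectors.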
Term Weights: Entropy Method
Term weight:
$$G_i = 1 + \sum_{j} \frac{p_{ij}\,\log_2(p_{ij})}{\log_2(n)}$$
where
– $p_{ij} = a_{ij} / g_i$
– $a_{ij}$: frequency with which term i appears in document j
– $g_i$: frequency with which term i appears in the document collection
– $n$: number of documents in the collection
Weighted frequency: $L_{ij} = \log_2(a_{ij} + 1)$
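The entropy weight and the log frequency weight can be computed directly from the term frequency matrix. The sketch below follows the definitions on this slide; the function names are hypothetical, the convention that a zero frequency contributes nothing to the sum (0 · log 0 = 0) is an assumption, and the collection must contain more than one document so that log2(n) is non-zero.

```python
import math


def entropy_term_weights(freq):
    """freq[j][i] is a_ij, the frequency of term i in document j.

    Returns one weight G_i per term.
    """
    n = len(freq)                 # number of documents, must be > 1
    n_terms = len(freq[0])
    weights = []
    for i in range(n_terms):
        g_i = sum(freq[j][i] for j in range(n))   # collection frequency
        total = 0.0
        for j in range(n):
            if freq[j][i] > 0:    # assumption: 0 * log(0) is taken as 0
                p_ij = freq[j][i] / g_i
                total += p_ij * math.log2(p_ij) / math.log2(n)
        weights.append(1.0 + total)
    return weights


def weighted_frequency(a_ij):
    """Log frequency weight L_ij = log2(a_ij + 1)."""
    return math.log2(a_ij + 1)


# Toy matrix: term 0 is spread evenly over four documents,
# term 1 occurs in the first document only.
freq = [[1, 3], [1, 0], [1, 0], [1, 0]]
print(entropy_term_weights(freq))
print(weighted_frequency(3))
```

In the toy matrix the evenly spread term receives weight 0 and the concentrated term receives weight 1, which matches the earlier note that terms occurring in only a few documents have the most discriminating power.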
Variable Reduction & Transformation
Text Mining Process III
Variable reduction (transformations)
– SVD (Singular Value Decomposition)
  • Principal components decomposition
– Roll-up terms
  • Use the n terms with the largest term weights
Application in data analysis
– Cluster analysis
– Classification
– Predictions
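Both reduction options can be illustrated with a small, dependency-free sketch. `roll_up_terms` keeps the n highest-weighted terms. `first_svd_variable` uses power iteration to approximate only the leading singular component of a document-by-term matrix; this is a simplification chosen to keep the example short, since a full SVD would normally come from a numerical library. Function names and data are hypothetical, and the matrix is assumed to contain at least one non-zero entry.

```python
import math


def roll_up_terms(terms, weights, n):
    """Keep the n terms with the largest term weights."""
    ranked = sorted(zip(terms, weights), key=lambda tw: tw[1], reverse=True)
    return [term for term, _ in ranked[:n]]


def first_svd_variable(matrix, iterations=200):
    """Approximate the leading singular component by power iteration.

    Returns (sigma, v, scores): the largest singular value, the matching
    right singular vector over terms, and each document's score on it.
    """
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        # One step of power iteration on A^T A.
        av = [sum(row[k] * v[k] for k in range(n_terms)) for row in matrix]
        w = [sum(matrix[j][k] * av[j] for j in range(n_docs))
             for k in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(row[k] * v[k] for k in range(n_terms)) for row in matrix]
    sigma = math.sqrt(sum(s * s for s in scores))
    return sigma, v, scores


# Hypothetical weights for the four example terms.
print(roll_up_terms(["process", "web", "voice", "id"], [0.2, 0.9, 0.5, 0.1], 2))
sigma, v, scores = first_svd_variable([[3.0, 0.0], [0.0, 1.0]])
print(round(sigma, 6), [round(x, 6) for x in v])
```

The document scores returned by `first_svd_variable` play the role of a single SVD variable; repeating the procedure on the residual matrix would give further components.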
Text Mining Process – Example 1
Clustering
Clusters with SVD 1
[Figure: dinocyst image; scale bar 40 µm]
Four different descriptions for the same dinocyst
* COL1, COL2 & COL3 represent the principal components (SVD variables) of the descriptive terms
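To show how descriptions might be grouped once they are expressed as SVD variables, here is a minimal k-means sketch. The slides do not name the clustering algorithm, so k-means is an assumption, and the COL1 to COL3 values are invented. Initial centroids are simply the first k points, which keeps the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (labels, centroids)."""
    centroids = [list(p) for p in points[:k]]   # deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical (COL1, COL2, COL3) values for four descriptions.
points = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0), (0.1, 0.0, 0.1), (5.1, 4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Descriptions that land in the same cluster are candidates for being alternative wordings of the same taxon.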
Clusters with SVD 2
Text Mining Process – Example 2
[Figure: cluster plot. Labels: Ifecysta sp. 1, Ifecysta sp. 2, Achomosphaera sp., Diphyes sp.]
Framework – Data Mining Process
Input (independent variables)
– The SVD variables
Output (dependent variable)
– Clusters
Descriptive models
– Regression
– Neural network
– Decision tree
Model evaluation
Model selection
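The create, evaluate, and select loop can be sketched as follows. The two candidate models here, a majority-class baseline and a nearest-centroid rule, are deliberately simple stand-ins and are not the regression, neural network, or decision tree models named on the slide. Validation accuracy is assumed as the evaluation criterion, and all data values are invented.

```python
import math
from collections import Counter, defaultdict


def fit_majority(xs, ys):
    """Baseline: always predict the most common training cluster."""
    label = Counter(ys).most_common(1)[0][0]
    return lambda x: label


def fit_nearest_centroid(xs, ys):
    """Predict the cluster whose training centroid is closest."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    centroids = {y: [sum(col) / len(g) for col in zip(*g)]
                 for y, g in groups.items()}
    return lambda x: min(centroids, key=lambda y: math.dist(x, centroids[y]))


def accuracy(model, xs, ys):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def select_model(candidates, train, valid):
    """Fit every candidate on train, score on valid, return the best name."""
    fitted = {name: fit(*train) for name, fit in candidates.items()}
    scores = {name: accuracy(model, *valid) for name, model in fitted.items()}
    return max(scores, key=scores.get), scores


# Inputs are SVD variables, outputs are cluster labels (toy values).
train = ([(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)], [0, 0, 0, 1, 1])
valid = ([(0.5, 0.5), (5.5, 5.5)], [0, 1])
candidates = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}
print(select_model(candidates, train, valid))
```

Swapping in real regression, neural network, or decision tree learners only requires that each candidate expose the same fit-then-predict shape.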
Modelling: ANN
Modelling: Regression
Modelling: Decision Tree
Dinocyst Description Modeling
Model Evaluation
Framework – Descriptive Model
Scoring Dinocyst Description
Assume a regression model is selected in the model selection process.
A description is entered, analyzed by the text mining process, and fed into the selected model (i.e., regression in this example).
Classification is proposed.
Standardized Taxon Recommendation Modeling Module
Standardized description is proposed.
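A scoring step of this kind might look like the sketch below, which assumes the selected model is a multinomial logistic regression over the SVD variables. The coefficient values, cluster names, and placeholder standardized descriptions are all hypothetical; in the framework they would come from the fitted model and the recommendation module.

```python
import math

# Hypothetical fitted coefficients: cluster -> (intercept, weights for COL1..COL3).
COEFFICIENTS = {
    "cluster_1": (0.0, [2.0, -1.0, 0.5]),
    "cluster_2": (0.0, [-2.0, 1.0, 0.0]),
}
# Placeholder texts; real standardized descriptions are not given in the slides.
STANDARD_DESCRIPTIONS = {
    "cluster_1": "standardized description for cluster 1 (placeholder)",
    "cluster_2": "standardized description for cluster 2 (placeholder)",
}


def score(svd_vars):
    """Return (proposed cluster, probability, proposed standardized description)."""
    logits = {c: b + sum(w * x for w, x in zip(ws, svd_vars))
              for c, (b, ws) in COEFFICIENTS.items()}
    top = max(logits.values())
    # Softmax, shifted by the largest logit for numerical stability.
    exps = {c: math.exp(z - top) for c, z in logits.items()}
    total = sum(exps.values())
    probs = {c: e / total for c, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs[best], STANDARD_DESCRIPTIONS[best]


# SVD variables of a newly entered description (toy values).
print(score([1.0, 0.0, 0.0]))
```

The probability attached to the proposed cluster gives the user a rough indication of how confident the classification is before accepting the standardized description.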
User/Human Interaction
– The number of clusters
– The validity of key terms
– Possible start/stop word lists
– …
Adaptive Model
Dinocyst descriptions are collected and mined periodically to update the model.
The End