text mining, association rules and decision tree learning
Post on 16-Aug-2015
Decision Tree Learning
Supervised Learning
Adrian Cuyugan, Information Analytics
Multidisciplinary Subject
Statistics, Data Mining, Machine Learning, AI, Text Mining, Business Process Mining, Natural Language Processing, Database Management, Library Science, Mathematics, Computer Science
Supervised vs Unsupervised Learning
• Supervised learning assumes labeled data, i.e. there is a response variable that labels each record.
• Unsupervised learning, on the other hand, does not expect a response variable because the algorithm learns from the distinct patterns within the data itself. Examples are clustering and pattern discovery.
Supervised Learning Techniques
• Regression techniques assume a numerical response variable. The most frequently used is linear regression, which minimizes the sum of squared errors.
• Classification techniques assume a categorical response variable. The foundation of classification techniques is the decision tree algorithm.
Entropy
In other words, the algorithm splits the set of instances into subsets such that the variation within each subset becomes smaller.
Entropy is an information-theoretic measure for the uncertainty in a multi-set of elements.
If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many bits to encode the individual elements; hence, the entropy is high.
If all elements, on the other hand, are the same, then no bits are needed to encode the individual elements; hence, the entropy is low.
[Figure: a set of Y/N decisions split three ways. The unsplit set has high entropy; candidate splits produce subsets of progressively lower entropy, down to pure, low-entropy leaves.]
Entropy

E = − Σ_{i=1..k} (c_i / n) · log₂(c_i / n)

where c_i is the number of instances of class i and n is the total number of instances.
Weighted Average Entropy

Ê = Σ_{j=1..k} (n_j / n) · E_j

where n_j is the number of instances in subset j and E_j is that subset's entropy.

Ê₁ = (6/6) × 1 = 1
Ê₂ = (2/6) × 0 + (4/6) × 0.811 = 0.54
Ê₃ = (2/6) × 0 + (1/6) × 0 + (3/6) × 0 = 0
Information Gain

IG = E_μ(T) − E_μ(T, a)

where E_μ(T) is the entropy of the parent node and E_μ(T, a) is the weighted average entropy after splitting on attribute a.

From the example splits: E_μ1 = 1, E_μ2 = 0.54, E_μ3 = 0.

Stop! The third split yields the maximum information gain (1 − 0 = 1); its subsets are pure, so no further splitting is needed.
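The entropy and information gain numbers above can be reproduced with a short sketch (the class labels and subset sizes are taken from the 6-instance Y/N example; the function names are mine):

```python
from math import log2

def entropy(labels):
    """E = -sum((c_i/n) * log2(c_i/n)) over the classes in a multi-set."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def weighted_entropy(subsets):
    """Weighted average entropy of a split: sum((n_j/n) * E_j)."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * entropy(s) for s in subsets)

parent = ["Y", "Y", "Y", "N", "N", "N"]       # 3 Y, 3 N: entropy 1
split2 = [["N", "N"], ["Y", "Y", "Y", "N"]]   # one pure subset, one mixed
split3 = [["N", "N"], ["N"], ["Y", "Y", "Y"]] # all subsets pure

print(entropy(parent))                             # 1.0
print(round(weighted_entropy(split2), 2))          # 0.54
print(weighted_entropy(split3))                    # 0.0
print(entropy(parent) - weighted_entropy(split3))  # information gain = 1.0
```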
Different Variations

Additional Settings
• Minimal size of the nodes
• Maximum depth of the tree
• Bootstrapping at nodes
• Setting a minimal threshold of IG
• Using the Gini index instead of information gain
• Post-pruning of the tree
Different Algorithms
• ID3 (Iterative Dichotomiser 3): the first decision tree classifier.
• CART (Classification and Regression Trees): a binary classifier; the generic decision tree learning algorithm, as in the example.
• C4.5 and C5.0: can handle numerical independent variables. The latter offers more computational speed and a different splitting rule.
• CHAID (Chi-square Automatic Interaction Detector): uses significance testing in splitting.
• Ensembles, i.e. Random Forest, AdaBoost, Gradient Boosting: use bagging, bootstrapping, and weighting. Very flexible and the most recent innovations in decision tree learning.
Suggested Topics to Read
1. Dividing datasets for model evaluation
   a) Training and testing sets
   b) Cross-validation
2. Confusion matrix for binary classifiers
   a) True Positive and True Negative
   b) False Positive and False Negative
3. Quality measures in evaluating classification models
   a) Error and Accuracy
   b) Precision and Recall
   c) F1 score (harmonic mean)
   d) ROC Chart
   e) Area Under the Curve
4. Ensemble methods
5. Bootstrapping and resampling statistics
Text Mining and Analytics
Unsupervised Learning
Adrian Cuyugan, Information Analytics
Text Mining Overview

Data Extraction
• File Types and Sources (Spreadsheet, Word Documents, HTML, JSON, API, etc.)
• Regular expressions
• Data File Systems (RDBMS, Google File System, Hadoop, MapReduce)

Information Retrieval
• Intro to Natural Language Analysis
• Vector Space Model: Bag of Words
• Term Frequency Matrix
• Inverse Document Frequency Matrix
• TF-IDF Matrix
• Stop words and Stemming
• Document Length Normalization (PL2, Okapi/BM25)
• Evaluation (Average Precision, Reciprocal Rank, F-measure and nDCG)
• Query Likelihood, Statistical Language Probability, Unigram Language Model
• Rocchio Feedback and KL Divergence
• Recommender Systems

Pattern Analysis
• Pattern Discovery Concepts (Frequent, Closed and Max)
• Association Rules
• Quantitative Measures (Support, Confidence and Lift)
• Other measures
• Apriori, ECLAT and FP-Growth Algorithms
• Multi-level and Multi-dimensional, Compressed and Colossal Patterns
• Sequential Patterns
• Graph Patterns
• Topic Modelling for Text Data

Clustering
• Partitioning, Hierarchical and Density-based methods
• Spectral Clustering
• Probabilistic Models and EM Algorithm
• Evaluating Clustering Models
• Clustering streaming data
• Graph Theory
• Social Network Analysis

Analytics
• Text clustering, categorization and summarization
• Topic-based modelling
• Sentiment analysis
• Integration of free-form text and structured data

Visualization
• Basic charts and graphs
• Animation and interactivity
• Visualizing relationships (hierarchies, clusters and networks)
• Visualizing text
Text Retrieval
Text Mining and Analytics
Natural Language Analysis
The quick brown fox jumped over the lazy dog.

Lexical Analysis (part-of-speech tagging): article, adjective, adjective, noun, verb, preposition, article, adjective, noun
Syntactic Analysis (parsing): noun phrase and prepositional phrase; subject and predicate
Semantic Analysis: fox(f1), dog(d1), jump(f1, d1)
Pragmatic Analysis: How quick was the fox that it jumped over the dog? Could the dog have escaped the quick fox if it wasn't lazy? Why did the fox jump over the dog?
Vector Space Model

Document (d): The quick brown fox jumped over the lazy dog.
Query (q): How many times does “dog” occur in the document?
Term frequency (tf): count of a query term in a document. Example: count(“dog”, d)
Document length |d|: how long is the document?
Document frequency (df): how often do we see “dog” in the entire collection? Example: df(“dog”) = p(“dog” | collection)
Simplest VSM: Bag of Words

VSM(q, d) = q · d = x₁y₁ + … + xₙyₙ = Σ_{i=1..n} x_i y_i
Query terms: quick fox over dog

d1: … The quick brown …
d2: The quick brown and over cunny fox…
d3: … the fox is brown and quick…
d4: The quick brown fox … fox … over…
d5: The quick fox … the … the … over brown fox ... fox

How would you rank the documents based on bit-vector term frequency?
(1 = word is present, 0 = word is absent)
Bit-Vector Term Frequency Matrix
(documents d1 to d5 as above; 1 = word is present, 0 = word is absent; only the query terms quick, fox, over, dog are marked)

      the  quick  brown  and  is  cunny  fox  over  dog
q      0     1      0     0    0    0     1     1    1
d1     0     1      0     0    0    0     0     0    0
d2     0     1      0     0    0    0     1     1    0
d3     0     1      0     0    0    0     1     0    0
d4     0     1      0     0    0    0     1     1    0
d5     0     1      0     0    0    0     1     1    0

f(q, d1) = 1∗1 = 1
f(q, d3) = 1∗1 + 1∗1 = 2
f(q, d5) = 1∗1 + 1∗1 + 1∗1 = 3
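The bit-vector scoring is just a dot product; a quick sketch (vectors copied from the matrix, variable names are mine):

```python
# Bit vectors over the vocabulary (the, quick, brown, and, is, cunny, fox, over, dog)
q  = [0, 1, 0, 0, 0, 0, 1, 1, 1]
d1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 1, 0, 0]
d5 = [0, 1, 0, 0, 0, 0, 1, 1, 0]

def dot(x, y):
    """f(q, d) = sum of x_i * y_i, the bag-of-words similarity."""
    return sum(a * b for a, b in zip(x, y))

print(dot(q, d1), dot(q, d3), dot(q, d5))  # 1 2 3
```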
Raw Term Frequency Matrix
(documents d1 to d5 as above; entries are term counts, last column is the sum of query-term frequencies)

      the  quick  brown  and  is  cunny  fox  over  dog   f(q,d)
q      0     1      0     0    0    0     1     1    1
d1     0     1      0     0    0    0     0     0    0       1
d2     0     1      0     0    0    0     1     1    0       3
d3     0     1      0     0    0    0     1     0    0       2
d4     0     1      0     0    0    0     2     1    0       4
d5     0     1      0     0    0    0     3     1    0       5
Limitation of Term Frequency
• fox deserves more credit in the matrix.
• fox is perceived to have higher importance than over.
Raw counts alone cannot express this: every occurrence of every term contributes equally.
TF Weighting Matrix
(w = weight of each term; last column is the sum of tf × w over the query terms)

      the  quick  brown  and  is  cunny  fox  over  dog   score
q      0     1      0     0    0    0     1     1    1
w      0    2.0     0     0    0    0    5.0   1.0  5.0
d1     0     1      0     0    0    0     0     0    0      2.0
d2     0     1      0     0    0    0     1     1    0      8.0
d3     0     1      0     0    0    0     1     0    0      7.0
d4     0     1      0     0    0    0     2     1    0     13.0
d5     0     1      0     0    0    0     3     1    0     18.0
Inverse Document Frequency w/ Smoothing

IDF = log[(M + 1) / k]

where M is the total number of docs in the collection and k is the document frequency of the term. TF-IDF multiplies a term's TF by its IDF.
Term Frequency-Inverse Document Frequency

       the  quick  brown   and    is   cunny   fox   over   dog
M       5     5      5      5      5     5      5      5     5
k       5     5      5      1      1     1      4      3     0
IDF   0.08  0.08   0.08   0.78   0.78  0.78   0.18   0.30  0.00

TF-IDF(q, d) = Σ_{i=1..n} x_i y_i · log[(M + 1) / k_i]

       the   quick  brown   and    is   cunny   fox    over   dog   score
d1   0.000  0.079  0.000  0.000  0.000  0.000  0.000  0.000  0.000   0.08
d2   0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.301  0.000   0.56
d3   0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.000  0.000   0.26
d4   0.000  0.079  0.000  0.000  0.000  0.000  0.352  0.301  0.000   0.73
d5   0.000  0.079  0.000  0.000  0.000  0.000  0.528  0.301  0.000   0.91
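The TF-IDF scores can be reproduced with a short script (term counts per document taken from the raw term frequency matrix; base-10 log and the (M+1)/k smoothing as in the slide; function and variable names are mine):

```python
from math import log10

# Term counts for the query terms in each document (from the raw TF matrix)
docs = {
    "d1": {"quick": 1},
    "d2": {"quick": 1, "fox": 1, "over": 1},
    "d3": {"quick": 1, "fox": 1},
    "d4": {"quick": 1, "fox": 2, "over": 1},
    "d5": {"quick": 1, "fox": 3, "over": 1},
}
query = ["quick", "fox", "over", "dog"]
M = len(docs)  # 5 documents in the collection

def doc_freq(term):
    """Number of documents containing the term."""
    return sum(1 for counts in docs.values() if term in counts)

def tfidf_score(doc):
    """Sum of tf * log10((M + 1) / k) over the query terms."""
    total = 0.0
    for term in query:
        k = doc_freq(term)
        if k == 0:          # "dog" never occurs; the slide assigns it IDF 0
            continue
        total += docs[doc].get(term, 0) * log10((M + 1) / k)
    return round(total, 2)

print({d: tfidf_score(d) for d in docs})
# {'d1': 0.08, 'd2': 0.56, 'd3': 0.26, 'd4': 0.73, 'd5': 0.91}
```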
Comparing Matrices
(documents d1 to d5 as above)

      Bit-Vector TF   Term Frequency   TF Weighting   TF-IDF
d1          1                1              2.0        0.08
d2          3                3              8.0        0.56
d3          2                2              7.0        0.26
d4          3                4             13.0        0.73
d5          3                5             18.0        0.91
Stop Words

Pronouns
• First person: I, me, myself; we, us, ourselves
• Second person: you, yours, yourself, yourselves
• Third person: he, him, his, himself; she, her, hers, herself; it, its, itself; they, them, themselves
• Interrogatives and demonstratives: what, which, who, whom; this, that, those, these

Verbs
• Be: am, is, are, were; be, been, being
• Have: have, has, had, having
• Do: do, does, did, doing
• Auxiliary: will, would, shall, should, can, could; may, might, must, ought

Compound
• Pronoun + verb: I’m, you’re, she’s, they’d, we’ll
• Verb + negation: isn’t, aren’t, haven’t, doesn’t, didn’t
• Auxiliary + negation: won’t, wouldn’t, can’t, cannot, mustn’t; daren’t, oughtn’t
• Miscellaneous: let’s, there’s, how’s, what’s, here’s

Articles / Determiners: a, an, the
Conjunctions: for, and, nor, but, or, yet, so
Stemming

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation.
Lovins: such an analys can reve featur ar not eas vis from th vari in the individu gen and can lead to a pictur of expres that is mor biolog transpar and access to interpres
Paice: such an analys can rev feat that are not easy vis from the vary in the invdivid gen and can lead to a pict of express that is mor biolog transp and access to interpret
Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individ gene and can lead to a pictur of express that is more biolog transpar and access to interpret
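The idea behind these stemmers can be illustrated with a toy suffix stripper (a naive sketch of my own, not the Lovins, Paice, or Porter algorithm, which apply ordered rule sets with conditions):

```python
# Suffixes checked longest-first; this crude rule set only demonstrates
# the principle of reducing inflected words to a common stem.
SUFFIXES = ["ations", "ation", "ness", "ing", "es", "ed", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "such an analysis can reveal features that are not easily visible".split()
print([crude_stem(w) for w in words])
```

Note how even this toy version maps "analysis" to "analysi" and "features" to "featur", close to the Porter output above, while leaving short function words alone.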
Association Rules
Text Mining and Analytics
Pattern Discovery
What is Pattern Discovery?
• A pattern is a set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set.
• Patterns represent intrinsic and important properties of data sets.
• Pattern discovery uncovers these patterns from massive data sets.
Why do Pattern Discovery?
• Foundation for many essential data mining tasks
  • Association, correlation, and causality analysis
  • Mining sequential and structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative pattern-based analysis
  • Cluster analysis: pattern-based subspace clustering
Pattern Discovery
Motivation and Use
• Which products were often purchased together?
• What are the subsequent purchases after buying an iPhone?
• Which software scripts likely contain copy-and-paste expressions?
• Which word sequences likely form phrases in the corpus?

Applications
• Market basket analysis, cross-marketing, sale campaign analysis, Web log analysis, biochemistry sequence analysis.
Frequent Itemsets

ID | Product Names
10 | Outlook, SAP, Active Directory
20 | Outlook, Desktop, Active Directory
30 | Outlook, Active Directory, Sharepoint
40 | SAP, Sharepoint, Voicemail
50 | SAP, Desktop, Active Directory, Sharepoint, Voicemail

Itemset: a set of one or more items.
k-itemset: an itemset containing k items.
Absolute support: the frequency of occurrences of an itemset X.
Relative support: the fraction of transactions that contain X.
An itemset X is frequent if the support of X is no less than a minimum support threshold σ.

Let σ = 60% (3 of 5 transactions).

Frequent 1-itemsets: Outlook: 3 (60%), SAP: 3 (60%), Active Directory: 4 (80%), Sharepoint: 3 (60%)
Frequent 2-itemsets: {Outlook, Active Directory}: 3 (60%)
Association Rules
(using the transaction table above)

Rule: Outlook ⇒ Active Directory
{Outlook} ∪ {Active Directory} = {Outlook, Active Directory}

Support (s): the probability that a transaction contains X ∪ Y.
Confidence (c): the conditional probability that a transaction containing X also contains Y.
Association Rule Mining
(using the transaction table above)

Frequent itemset mining: finding all itemsets whose support meets the minimum support threshold.
Association rule mining: finding all rules X ⇒ Y that meet both minimum support and minimum confidence.

Frequent 1-itemsets: Outlook: 3 (60%), SAP: 3 (60%), Active Directory: 4 (80%), Sharepoint: 3 (60%)
Frequent 2-itemsets: {Outlook, Active Directory}: 3 (60%)

Association rules (support, confidence):
Outlook ⇒ Active Directory: (60%, 100%)
Active Directory ⇒ Outlook: (60%, 75%)
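The support and confidence figures can be checked directly against the transaction table (a minimal sketch; the function names are mine):

```python
# The five transactions from the slides
transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    """c(X => Y) = s(X ∪ Y) / s(X)."""
    return support(X | Y) / support(X)

print(support({"Outlook", "Active Directory"}))                     # 0.6
print(confidence({"Outlook"}, {"Active Directory"}))                # 1.0
print(round(confidence({"Active Directory"}, {"Outlook"}), 2))      # 0.75
```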
Downward Closure of Frequent Patterns

Scenario:
• A long frequent itemset implies an enormous number of frequent sub-itemsets, because every subset of a frequent itemset is itself frequent; enumerating all of them quickly becomes infeasible.

Efficient mining:
• If {Outlook, SAP, Active Directory} is frequent, so is {Outlook, Active Directory}, since every transaction containing {Outlook, SAP, Active Directory} also contains {Outlook, Active Directory}.
• Any subset of a frequent itemset must be frequent.
• Conversely, if any subset of an itemset X is infrequent, then there is no chance for X to be frequent.
Limitation of Support-Confidence Framework

Scenario: {Active Directory} ⇒ {Password Reset}

                   Active Directory   ¬Active Directory   Row total
Password Reset           400                 350              750
¬Password Reset          200                  50              250
Column total             600                 400             1000

s = 400/1000 = 40%, c = 400/600 ≈ 66.7%. The rule looks strong, yet 75% of all transactions contain Password Reset, so Active Directory actually lowers the chance of Password Reset.
Lift
(same contingency table as above)

{Active Directory, Password Reset}

lift(X, Y) = c(X ⇒ Y) / s(Y) = s(X ∪ Y) / (s(X) × s(Y))

lift = 1: X and Y are independent
lift > 1: X and Y are positively correlated
lift < 1: X and Y are negatively correlated

Here, lift = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1, so Active Directory and Password Reset are negatively correlated.
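The lift for this pair follows directly from the contingency table (a quick sketch; the variable names are mine):

```python
# Probabilities read off the contingency table
s_x  = 600 / 1000   # s(Active Directory)
s_y  = 750 / 1000   # s(Password Reset)
s_xy = 400 / 1000   # s(Active Directory ∪ Password Reset)

lift = s_xy / (s_x * s_y)
print(round(lift, 3))  # 0.889, below 1, so negatively correlated
```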
Expected Value for Chi-Square
(same contingency table as above)

χ² = Σ (Observed − Expected)² / Expected

E_{i,j} = (T_i × T_j) / Total, where T_i is the total of the ith row and T_j is the total of the jth column.

E_{1,1} = 750 × 600 / 1000 = 450
E_{1,2} = 750 × 400 / 1000 = 300
E_{2,1} = 250 × 600 / 1000 = 150
E_{2,2} = 250 × 400 / 1000 = 100

χ² = 0 means the variables are independent; χ² > 0 means they are correlated, either positively or negatively, therefore it needs more tests.
Chi-Square
(expected values in parentheses)

                   Active Directory   ¬Active Directory   Row total
Password Reset        400 (450)           350 (300)          750
¬Password Reset       200 (150)            50 (100)          250
Column total             600                 400             1000

χ² = (400−450)²/450 + (350−300)²/300 + (200−150)²/150 + (50−100)²/100 = 55.56

χ² = 55.56 shows Active Directory and Password Reset are correlated, and negatively so, since the expected value (450) is higher than the observed value (400).
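The χ² statistic can be verified with a few lines (the observed table is the one from the slides; variable names are mine):

```python
# Observed contingency table
# rows: Password Reset yes/no; columns: Active Directory yes/no
obs = [[400, 350],
       [200, 50]]
total = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(row[j] for row in obs) for j in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # 55.56
```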
Apriori Algorithm Pseudo Code

C_k: candidate itemsets of size k
F_k: frequent itemsets of size k

k = 1;
F_1 = {frequent items};                         // frequent 1-itemsets
while (F_k is not empty) do {                   // while level k is non-empty
    C_{k+1} = candidates generated from F_k;    // candidate generation
    derive F_{k+1} by counting the candidates in C_{k+1} against the database
        and keeping those with support ≥ minsup;
    k = k + 1;
}
return ∪_k F_k;                                 // frequent itemsets from each level
Apriori Algorithm

ID | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

Let minsup = 2.

1st scan, C1:          F1:
{A}: 2                 {A}: 2
{B}: 3                 {B}: 3
{C}: 3                 {C}: 3
{D}: 1                 {E}: 3
{E}: 3

2nd scan, C2:          F2:
{A, B}: 1              {A, C}: 2
{A, C}: 2              {B, C}: 2
{A, E}: 1              {B, E}: 3
{B, C}: 2              {C, E}: 2
{B, E}: 3
{C, E}: 2

3rd scan, C3:          F3:
{B, C, E}: 2           {B, C, E}: 2
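A compact level-wise Apriori sketch reproduces the A/B/C/E example above (the structure and names are mine; candidate generation joins frequent k-itemsets and prunes by downward closure):

```python
from itertools import combinations

transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
MINSUP = 2  # absolute minimum support, as in the example

def count(candidates):
    """Scan the database and count each candidate itemset."""
    return {c: sum(1 for t in transactions if set(c) <= t)
            for c in candidates}

def apriori():
    items = sorted({i for t in transactions for i in t})
    freq = {}
    current = [(i,) for i in items]   # C1: candidate 1-itemsets
    k = 1
    while current:
        counts = {c: n for c, n in count(current).items() if n >= MINSUP}
        freq.update(counts)           # Fk: frequent k-itemsets
        prev = sorted(counts)
        # Ck+1: join frequent k-itemsets that differ by one item
        current = sorted({tuple(sorted(set(a) | set(b)))
                          for a in prev for b in prev
                          if len(set(a) | set(b)) == k + 1})
        # prune candidates having any infrequent k-subset (downward closure)
        current = [c for c in current
                   if all(s in counts for s in combinations(c, k))]
        k += 1
    return freq

result = apriori()
print(result[("B", "C", "E")])  # 2
```

Note that {A, B, C} is never counted: its subset {A, B} has support 1, so downward closure prunes it before the third scan, exactly as in the worked example.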
Transactions Sparse Matrix

ID | Product Names                                         | Outlook | SAP | Active Directory | Desktop | Sharepoint | Voicemail
10 | Outlook, SAP, Active Directory                        |    1    |  1  |        1         |    0    |     0      |     0
20 | Outlook, Desktop, Active Directory                    |    1    |  0  |        1         |    1    |     0      |     0
30 | Outlook, Active Directory, Sharepoint                 |    1    |  0  |        1         |    0    |     1      |     0
40 | SAP, Sharepoint, Voicemail                            |    0    |  1  |        0         |    0    |     1      |     1
50 | SAP, Desktop, Active Directory, Sharepoint, Voicemail |    0    |  1  |        1         |    1    |     1      |     1