text mining, association rules and decision tree learning
Post on 16-Aug-2015
Decision Tree Learning
Supervised Learning
Adrian Cuyugan, Information Analytics
Multidisciplinary Subject
Statistics, Data Mining, Machine Learning, AI, Text Mining, Business Process Mining, Natural Language Processing, Database Management, Library Science, Mathematics, Computer Science
Supervised vs Unsupervised Learning
• Supervised learning assumes labeled data, i.e. there is a response variable that labels each record.
• Unsupervised learning, on the other hand, does not expect a response variable because the algorithm learns from the distinct patterns within the data itself. Examples are clustering and pattern discovery.
Supervised Learning Techniques
• Regression techniques assume a numerical response variable. The most frequently used is linear regression, which minimizes the sum of squared errors.
• Classification techniques assume a categorical response variable. The foundation of classification techniques is the decision tree algorithm.
Entropy
In other words, the algorithm splits the set of instances into subsets such that the variation within each subset becomes smaller.
Entropy is an information-theoretic measure for the uncertainty in a multi-set of elements.
If the multi-set contains many different elements and each element is unique, then variation is maximal and it takes many bits to encode the individual elements; hence, the entropy is high.
If all elements, on the other hand, are the same, then no bits are needed to encode the individual elements; hence, the entropy is low.
[Figure: a set of Y/N decisions split three ways. The unsplit set has high entropy; candidate splits produce subsets of progressively lower entropy, down to pure, low-entropy leaves.]
Entropy

E = − Σ_{i=1..k} (c_i / n) · log₂(c_i / n)

where c_i is the number of instances of class i and n is the total number of instances.
Weighted Average Entropy

Ê = Σ_{j=1..k} (n_j / n) · E_j

where n_j is the number of instances in subset j and E_j is that subset's entropy.

Ê₁ = (6/6) × 1 = 1
Ê₂ = (2/6) × 0 + (4/6) × 0.811 = 0.54
Ê₃ = (2/6) × 0 + (1/6) × 0 + (3/6) × 0 = 0
Information Gain

IG = E_μ(T) − E_μ(T, a)

where E_μ(T) is the entropy of the parent node and E_μ(T, a) is the weighted average entropy after splitting on attribute a.

From the example splits: E_μ1 = 1, E_μ2 = 0.54, E_μ3 = 0.

Stop! The third split yields the maximum information gain (1 − 0 = 1); its subsets are pure, so no further splitting is needed.
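The entropy and information gain numbers above can be reproduced with a short sketch (the class labels and subset sizes are taken from the 6-instance Y/N example; the function names are mine):

```python
from math import log2

def entropy(labels):
    """E = -sum((c_i/n) * log2(c_i/n)) over the classes in a multi-set."""
    n = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum(c / n * log2(c / n) for c in counts.values())

def weighted_entropy(subsets):
    """Weighted average entropy of a split: sum((n_j/n) * E_j)."""
    n = sum(len(s) for s in subsets)
    return sum(len(s) / n * entropy(s) for s in subsets)

parent = ["Y", "Y", "Y", "N", "N", "N"]       # 3 Y, 3 N: entropy 1
split2 = [["N", "N"], ["Y", "Y", "Y", "N"]]   # one pure subset, one mixed
split3 = [["N", "N"], ["N"], ["Y", "Y", "Y"]] # all subsets pure

print(entropy(parent))                             # 1.0
print(round(weighted_entropy(split2), 2))          # 0.54
print(weighted_entropy(split3))                    # 0.0
print(entropy(parent) - weighted_entropy(split3))  # information gain = 1.0
```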
Different Variations

Additional Settings
• Minimal size of the nodes
• Maximum depth of the tree
• Bootstrapping at nodes
• Setting a minimal threshold of IG
• Using the Gini index instead of information gain
• Post-pruning of the tree
Different Algorithms
• ID3 (Iterative Dichotomiser 3): the first decision tree classifier.
• CART (Classification and Regression Trees): a binary classifier; the generic decision tree learning algorithm, as in the example.
• C4.5 and C5.0: can handle numerical independent variables. The latter offers more computational speed and a different splitting rule.
• CHAID (Chi-square Automatic Interaction Detector): uses significance testing in splitting.
• Ensembles, i.e. Random Forest, AdaBoost, Gradient Boosting: use bagging, bootstrapping, and weighting. Very flexible and the most recent innovations in decision tree learning.
Suggested Topics to Read
1. Dividing datasets for model evaluation
   a) Training and testing sets
   b) Cross-validation
2. Confusion matrix for binary classifiers
   a) True Positive and True Negative
   b) False Positive and False Negative
3. Quality measures in evaluating classification models
   a) Error and Accuracy
   b) Precision and Recall
   c) F1 score (harmonic mean)
   d) ROC Chart
   e) Area Under the Curve
4. Ensemble methods
5. Bootstrapping and resampling statistics
Text Mining and Analytics
Unsupervised Learning
Adrian Cuyugan, Information Analytics
Text Mining Overview

Data Extraction
• File Types and Sources (Spreadsheet, Word Documents, HTML, JSON, API, etc.)
• Regular expressions
• Data File Systems (RDBMS, Google File System, Hadoop, MapReduce)

Information Retrieval
• Intro to Natural Language Analysis
• Vector Space Model: Bag of Words
• Term Frequency Matrix
• Inverse Document Frequency Matrix
• TF-IDF Matrix
• Stop words and Stemming
• Document Length Normalization (PL2, Okapi/BM25)
• Evaluation (Average Precision, Reciprocal Rank, F-measure and nDCG)
• Query Likelihood, Statistical Language Probability, Unigram Language Model
• Rocchio Feedback and KL Divergence
• Recommender Systems

Pattern Analysis
• Pattern Discovery Concepts (Frequent, Closed and Max)
• Association Rules
• Quantitative Measures (Support, Confidence and Lift)
• Other measures
• Apriori, ECLAT and FP-Growth Algorithms
• Multi-level and Multi-dimensional, Compressed and Colossal Patterns
• Sequential Patterns
• Graph Patterns
• Topic Modelling for Text Data

Clustering
• Partitioning, Hierarchical and Density-based methods
• Spectral Clustering
• Probabilistic Models and EM Algorithm
• Evaluating Clustering Models
• Clustering streaming data
• Graph Theory
• Social Network Analysis

Analytics
• Text clustering, categorization and summarization
• Topic-based modelling
• Sentiment analysis
• Integration of free-form text and structured data

Visualization
• Basic charts and graphs
• Animation and interactivity
• Visualizing relationships (hierarchies, clusters and networks)
• Visualizing text
Text Retrieval
Text Mining and Analytics
Natural Language Analysis
The quick brown fox jumped over the lazy dog.

Lexical Analysis (part-of-speech tagging): article, adjective, adjective, noun, verb, preposition, article, adjective, noun
Syntactic Analysis (parsing): noun phrase and prepositional phrase; subject and predicate
Semantic Analysis: fox(f1), dog(d1), jump(f1, d1)
Pragmatic Analysis: How quick was the fox that it jumped over the dog? Could the dog have escaped the quick fox if it wasn't lazy? Why did the fox jump over the dog?
Vector Space Model

Document (d): The quick brown fox jumped over the lazy dog.
Query (q): How many times does “dog” occur in the document?
Term frequency (tf): count of a query term in a document. Example: count(“dog”, d)
Document length |d|: how long is the document?
Document frequency (df): how often do we see “dog” in the entire collection? Example: df(“dog”) = p(“dog” | collection)
Simplest VSM: Bag of Words

VSM(q, d) = q · d = x₁y₁ + … + xₙyₙ = Σ_{i=1..n} x_i y_i
Query terms: quick fox over dog

d1: … The quick brown …
d2: The quick brown and over cunny fox…
d3: … the fox is brown and quick…
d4: The quick brown fox … fox … over…
d5: The quick fox … the … the … over brown fox ... fox

How would you rank the documents based on bit-vector term frequency?
(1 = word is present, 0 = word is absent)
Bit-Vector Term Frequency Matrix
(documents d1 to d5 as above; 1 = word is present, 0 = word is absent; only the query terms quick, fox, over, dog are marked)

      the  quick  brown  and  is  cunny  fox  over  dog
q      0     1      0     0    0    0     1     1    1
d1     0     1      0     0    0    0     0     0    0
d2     0     1      0     0    0    0     1     1    0
d3     0     1      0     0    0    0     1     0    0
d4     0     1      0     0    0    0     1     1    0
d5     0     1      0     0    0    0     1     1    0

f(q, d1) = 1∗1 = 1
f(q, d3) = 1∗1 + 1∗1 = 2
f(q, d5) = 1∗1 + 1∗1 + 1∗1 = 3
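The bit-vector scoring is just a dot product; a quick sketch (vectors copied from the matrix, variable names are mine):

```python
# Bit vectors over the vocabulary (the, quick, brown, and, is, cunny, fox, over, dog)
q  = [0, 1, 0, 0, 0, 0, 1, 1, 1]
d1 = [0, 1, 0, 0, 0, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 0, 0, 1, 0, 0]
d5 = [0, 1, 0, 0, 0, 0, 1, 1, 0]

def dot(x, y):
    """f(q, d) = sum of x_i * y_i, the bag-of-words similarity."""
    return sum(a * b for a, b in zip(x, y))

print(dot(q, d1), dot(q, d3), dot(q, d5))  # 1 2 3
```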
Raw Term Frequency Matrix
(documents d1 to d5 as above; entries are term counts, last column is the sum of query-term frequencies)

      the  quick  brown  and  is  cunny  fox  over  dog   f(q,d)
q      0     1      0     0    0    0     1     1    1
d1     0     1      0     0    0    0     0     0    0       1
d2     0     1      0     0    0    0     1     1    0       3
d3     0     1      0     0    0    0     1     0    0       2
d4     0     1      0     0    0    0     2     1    0       4
d5     0     1      0     0    0    0     3     1    0       5
Limitation of Term Frequency
• fox deserves more credit in the matrix.
• fox is perceived to have higher importance than over.
Raw counts alone cannot express this: every occurrence of every term contributes equally.
TF Weighting Matrix
(w = weight of each term; last column is the sum of tf × w over the query terms)

      the  quick  brown  and  is  cunny  fox  over  dog   score
q      0     1      0     0    0    0     1     1    1
w      0    2.0     0     0    0    0    5.0   1.0  5.0
d1     0     1      0     0    0    0     0     0    0      2.0
d2     0     1      0     0    0    0     1     1    0      8.0
d3     0     1      0     0    0    0     1     0    0      7.0
d4     0     1      0     0    0    0     2     1    0     13.0
d5     0     1      0     0    0    0     3     1    0     18.0
Inverse Document Frequency w/ Smoothing

IDF = log[(M + 1) / k]

where M is the total number of docs in the collection and k is the document frequency of the term. TF-IDF multiplies a term's TF by its IDF.
Term Frequency-Inverse Document Frequency

       the  quick  brown   and    is   cunny   fox   over   dog
M       5     5      5      5      5     5      5      5     5
k       5     5      5      1      1     1      4      3     0
IDF   0.08  0.08   0.08   0.78   0.78  0.78   0.18   0.30  0.00

TF-IDF(q, d) = Σ_{i=1..n} x_i y_i · log[(M + 1) / k_i]

       the   quick  brown   and    is   cunny   fox    over   dog   score
d1   0.000  0.079  0.000  0.000  0.000  0.000  0.000  0.000  0.000   0.08
d2   0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.301  0.000   0.56
d3   0.000  0.079  0.000  0.000  0.000  0.000  0.176  0.000  0.000   0.26
d4   0.000  0.079  0.000  0.000  0.000  0.000  0.352  0.301  0.000   0.73
d5   0.000  0.079  0.000  0.000  0.000  0.000  0.528  0.301  0.000   0.91
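The TF-IDF scores can be reproduced with a short script (term counts per document taken from the raw term frequency matrix; base-10 log and the (M+1)/k smoothing as in the slide; function and variable names are mine):

```python
from math import log10

# Term counts for the query terms in each document (from the raw TF matrix)
docs = {
    "d1": {"quick": 1},
    "d2": {"quick": 1, "fox": 1, "over": 1},
    "d3": {"quick": 1, "fox": 1},
    "d4": {"quick": 1, "fox": 2, "over": 1},
    "d5": {"quick": 1, "fox": 3, "over": 1},
}
query = ["quick", "fox", "over", "dog"]
M = len(docs)  # 5 documents in the collection

def doc_freq(term):
    """Number of documents containing the term."""
    return sum(1 for counts in docs.values() if term in counts)

def tfidf_score(doc):
    """Sum of tf * log10((M + 1) / k) over the query terms."""
    total = 0.0
    for term in query:
        k = doc_freq(term)
        if k == 0:          # "dog" never occurs; the slide assigns it IDF 0
            continue
        total += docs[doc].get(term, 0) * log10((M + 1) / k)
    return round(total, 2)

print({d: tfidf_score(d) for d in docs})
# {'d1': 0.08, 'd2': 0.56, 'd3': 0.26, 'd4': 0.73, 'd5': 0.91}
```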
Comparing Matrices
(documents d1 to d5 as above)

      Bit-Vector TF   Term Frequency   TF Weighting   TF-IDF
d1          1                1              2.0        0.08
d2          3                3              8.0        0.56
d3          2                2              7.0        0.26
d4          3                4             13.0        0.73
d5          3                5             18.0        0.91
Stop Words

Pronouns
• First person: I, me, myself; we, us, ourselves
• Second person: you, yours, yourself, yourselves
• Third person: he, him, his, himself; she, her, hers, herself; it, its, itself; they, them, themselves
• Interrogatives and demonstratives: what, which, who, whom; this, that, those, these

Verbs
• Be: am, is, are, were; be, been, being
• Have: have, has, had, having
• Do: do, does, did, doing
• Auxiliary: will, would, shall, should, can, could; may, might, must, ought

Compound
• Pronoun + verb: I’m, you’re, she’s, they’d, we’ll
• Verb + negation: isn’t, aren’t, haven’t, doesn’t, didn’t
• Auxiliary + negation: won’t, wouldn’t, can’t, cannot, mustn’t; daren’t, oughtn’t
• Miscellaneous: let’s, there’s, how’s, what’s, here’s

Articles / Determiners: a, an, the
Conjunctions: for, and, nor, but, or, yet, so
Stemming

Original: Such an analysis can reveal features that are not easily visible from the variations in the individual genes and can lead to a picture of expression that is more biologically transparent and accessible to interpretation.
Lovins: such an analys can reve featur ar not eas vis from th vari in the individu gen and can lead to a pictur of expres that is mor biolog transpar and access to interpres
Paice: such an analys can rev feat that are not easy vis from the vary in the invdivid gen and can lead to a pict of express that is mor biolog transp and access to interpret
Porter: such an analysi can reveal featur that ar not easili visibl from the variat in the individ gene and can lead to a pictur of express that is more biolog transpar and access to interpret
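The idea behind these stemmers can be illustrated with a toy suffix stripper (a naive sketch of my own, not the Lovins, Paice, or Porter algorithm, which apply ordered rule sets with conditions):

```python
# Suffixes checked longest-first; this crude rule set only demonstrates
# the principle of reducing inflected words to a common stem.
SUFFIXES = ["ations", "ation", "ness", "ing", "es", "ed", "s"]

def crude_stem(word):
    for suffix in SUFFIXES:
        # keep at least a 3-letter stem so short words survive intact
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = "such an analysis can reveal features that are not easily visible".split()
print([crude_stem(w) for w in words])
```

Note how even this toy version maps "analysis" to "analysi" and "features" to "featur", close to the Porter output above, while leaving short function words alone.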
Association Rules
Text Mining and Analytics
Pattern Discovery
What is Pattern Discovery?
• A pattern is a set of items, subsequences, or substructures that occur frequently together (or are strongly correlated) in a data set.
• Patterns represent intrinsic and important properties of data sets.
• Pattern discovery uncovers these patterns from massive data sets.
Why do Pattern Discovery?
• Foundation for many essential data mining tasks
  • Association, correlation, and causality analysis
  • Mining sequential and structural (e.g., sub-graph) patterns
  • Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  • Classification: discriminative pattern-based analysis
  • Cluster analysis: pattern-based subspace clustering
Pattern Discovery
Motivation and Use
• Which products were often purchased together?
• What are the subsequent purchases after buying an iPhone?
• Which software scripts likely contain copy-and-paste expressions?
• Which word sequences likely form phrases in the corpus?

Applications
• Market basket analysis, cross-marketing, sale campaign analysis, Web log analysis, biochemistry sequence analysis.
Frequent Itemsets

ID | Product Names
10 | Outlook, SAP, Active Directory
20 | Outlook, Desktop, Active Directory
30 | Outlook, Active Directory, Sharepoint
40 | SAP, Sharepoint, Voicemail
50 | SAP, Desktop, Active Directory, Sharepoint, Voicemail

Itemset: a set of one or more items.
k-itemset: an itemset containing k items.
Absolute support: the frequency of occurrences of an itemset X.
Relative support: the fraction of transactions that contain X.
An itemset X is frequent if the support of X is no less than a minimum support threshold σ.

Let σ = 60% (3 of 5 transactions).

Frequent 1-itemsets: Outlook: 3 (60%), SAP: 3 (60%), Active Directory: 4 (80%), Sharepoint: 3 (60%)
Frequent 2-itemsets: {Outlook, Active Directory}: 3 (60%)
Association Rules
(using the transaction table above)

Rule: Outlook ⇒ Active Directory
{Outlook} ∪ {Active Directory} = {Outlook, Active Directory}

Support (s): the probability that a transaction contains X ∪ Y.
Confidence (c): the conditional probability that a transaction containing X also contains Y.
Association Rule Mining
(using the transaction table above)

Frequent itemset mining: finding all itemsets whose support meets the minimum support threshold.
Association rule mining: finding all rules X ⇒ Y that meet both minimum support and minimum confidence.

Frequent 1-itemsets: Outlook: 3 (60%), SAP: 3 (60%), Active Directory: 4 (80%), Sharepoint: 3 (60%)
Frequent 2-itemsets: {Outlook, Active Directory}: 3 (60%)

Association rules (support, confidence):
Outlook ⇒ Active Directory: (60%, 100%)
Active Directory ⇒ Outlook: (60%, 75%)
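The support and confidence figures can be checked directly against the transaction table (a minimal sketch; the function names are mine):

```python
# The five transactions from the slides
transactions = [
    {"Outlook", "SAP", "Active Directory"},
    {"Outlook", "Desktop", "Active Directory"},
    {"Outlook", "Active Directory", "Sharepoint"},
    {"SAP", "Sharepoint", "Voicemail"},
    {"SAP", "Desktop", "Active Directory", "Sharepoint", "Voicemail"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(X, Y):
    """c(X => Y) = s(X ∪ Y) / s(X)."""
    return support(X | Y) / support(X)

print(support({"Outlook", "Active Directory"}))                     # 0.6
print(confidence({"Outlook"}, {"Active Directory"}))                # 1.0
print(round(confidence({"Active Directory"}, {"Outlook"}), 2))      # 0.75
```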
Downward Closure of Frequent Patterns

Scenario:
• A long frequent itemset implies an enormous number of frequent sub-itemsets, because every subset of a frequent itemset is itself frequent; enumerating all of them quickly becomes infeasible.

Efficient mining:
• If {Outlook, SAP, Active Directory} is frequent, so is {Outlook, Active Directory}, since every transaction containing {Outlook, SAP, Active Directory} also contains {Outlook, Active Directory}.
• Any subset of a frequent itemset must be frequent.
• Conversely, if any subset of an itemset X is infrequent, then there is no chance for X to be frequent.
Limitation of Support-Confidence Framework

Scenario: {Active Directory} ⇒ {Password Reset}

                   Active Directory   ¬Active Directory   Row total
Password Reset           400                 350              750
¬Password Reset          200                  50              250
Column total             600                 400             1000

s = 400/1000 = 40%, c = 400/600 ≈ 66.7%. The rule looks strong, yet 75% of all transactions contain Password Reset, so Active Directory actually lowers the chance of Password Reset.
Lift
(same contingency table as above)

{Active Directory, Password Reset}

lift(X, Y) = c(X ⇒ Y) / s(Y) = s(X ∪ Y) / (s(X) × s(Y))

lift = 1: X and Y are independent
lift > 1: X and Y are positively correlated
lift < 1: X and Y are negatively correlated

Here, lift = 0.40 / (0.60 × 0.75) ≈ 0.89 < 1, so Active Directory and Password Reset are negatively correlated.
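The lift for this pair follows directly from the contingency table (a quick sketch; the variable names are mine):

```python
# Probabilities read off the contingency table
s_x  = 600 / 1000   # s(Active Directory)
s_y  = 750 / 1000   # s(Password Reset)
s_xy = 400 / 1000   # s(Active Directory ∪ Password Reset)

lift = s_xy / (s_x * s_y)
print(round(lift, 3))  # 0.889, below 1, so negatively correlated
```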
Expected Value for Chi-Square
(same contingency table as above)

χ² = Σ (Observed − Expected)² / Expected

E_{i,j} = (T_i × T_j) / Total, where T_i is the total of the ith row and T_j is the total of the jth column.

E_{1,1} = 750 × 600 / 1000 = 450
E_{1,2} = 750 × 400 / 1000 = 300
E_{2,1} = 250 × 600 / 1000 = 150
E_{2,2} = 250 × 400 / 1000 = 100

χ² = 0 means the variables are independent; χ² > 0 means they are correlated, either positively or negatively, therefore it needs more tests.
Chi-Square
(expected values in parentheses)

                   Active Directory   ¬Active Directory   Row total
Password Reset        400 (450)           350 (300)          750
¬Password Reset       200 (150)            50 (100)          250
Column total             600                 400             1000

χ² = (400−450)²/450 + (350−300)²/300 + (200−150)²/150 + (50−100)²/100 = 55.56

χ² = 55.56 shows Active Directory and Password Reset are correlated, and negatively so, since the expected value (450) is higher than the observed value (400).
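The χ² statistic can be verified with a few lines (the observed table is the one from the slides; variable names are mine):

```python
# Observed contingency table
# rows: Password Reset yes/no; columns: Active Directory yes/no
obs = [[400, 350],
       [200, 50]]
total = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(row[j] for row in obs) for j in range(2)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        expected = row_totals[i] * col_totals[j] / total
        chi2 += (obs[i][j] - expected) ** 2 / expected

print(round(chi2, 2))  # 55.56
```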
Apriori Algorithm Pseudo Code

C_k: candidate itemsets of size k
F_k: frequent itemsets of size k

k = 1;
F_1 = {frequent items};                         // frequent 1-itemsets
while (F_k is not empty) do {                   // while level k is non-empty
    C_{k+1} = candidates generated from F_k;    // candidate generation
    derive F_{k+1} by counting the candidates in C_{k+1} against the database
        and keeping those with support ≥ minsup;
    k = k + 1;
}
return ∪_k F_k;                                 // frequent itemsets from each level
Apriori Algorithm

ID | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

Let minsup = 2.

1st scan, C1:          F1:
{A}: 2                 {A}: 2
{B}: 3                 {B}: 3
{C}: 3                 {C}: 3
{D}: 1                 {E}: 3
{E}: 3

2nd scan, C2:          F2:
{A, B}: 1              {A, C}: 2
{A, C}: 2              {B, C}: 2
{A, E}: 1              {B, E}: 3
{B, C}: 2              {C, E}: 2
{B, E}: 3
{C, E}: 2

3rd scan, C3:          F3:
{B, C, E}: 2           {B, C, E}: 2
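A compact level-wise Apriori sketch reproduces the A/B/C/E example above (the structure and names are mine; candidate generation joins frequent k-itemsets and prunes by downward closure):

```python
from itertools import combinations

transactions = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
]
MINSUP = 2  # absolute minimum support, as in the example

def count(candidates):
    """Scan the database and count each candidate itemset."""
    return {c: sum(1 for t in transactions if set(c) <= t)
            for c in candidates}

def apriori():
    items = sorted({i for t in transactions for i in t})
    freq = {}
    current = [(i,) for i in items]   # C1: candidate 1-itemsets
    k = 1
    while current:
        counts = {c: n for c, n in count(current).items() if n >= MINSUP}
        freq.update(counts)           # Fk: frequent k-itemsets
        prev = sorted(counts)
        # Ck+1: join frequent k-itemsets that differ by one item
        current = sorted({tuple(sorted(set(a) | set(b)))
                          for a in prev for b in prev
                          if len(set(a) | set(b)) == k + 1})
        # prune candidates having any infrequent k-subset (downward closure)
        current = [c for c in current
                   if all(s in counts for s in combinations(c, k))]
        k += 1
    return freq

result = apriori()
print(result[("B", "C", "E")])  # 2
```

Note that {A, B, C} is never counted: its subset {A, B} has support 1, so downward closure prunes it before the third scan, exactly as in the worked example.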
Transactions Sparse Matrix

ID | Product Names                                         | Outlook | SAP | Active Directory | Desktop | Sharepoint | Voicemail
10 | Outlook, SAP, Active Directory                        |    1    |  1  |        1         |    0    |     0      |     0
20 | Outlook, Desktop, Active Directory                    |    1    |  0  |        1         |    1    |     0      |     0
30 | Outlook, Active Directory, Sharepoint                 |    1    |  0  |        1         |    0    |     1      |     0
40 | SAP, Sharepoint, Voicemail                            |    0    |  1  |        0         |    0    |     1      |     1
50 | SAP, Desktop, Active Directory, Sharepoint, Voicemail |    0    |  1  |        1         |    1    |     1      |     1