Data Mining in Bioinformatics
Day 4: Text Mining

Karsten Borgwardt

February 21 to March 4, 2011

Machine Learning & Computational Biology Research Group
MPIs Tübingen

What is text mining?

Definition
Text mining is the use of automated methods for exploiting the enormous amount of knowledge available in the (biomedical) literature.

Motivation
Most knowledge is stored in terms of texts, both in industry and in academia.
This alone makes text mining an integral part of knowledge discovery!
Furthermore, to make text machine-readable, one has to solve several recognition (mining) tasks on text.

What is text mining?

Common tasks
Information retrieval: Find documents that are relevant to a user, or to a query, in a collection of documents
  Document ranking: rank all documents in the collection
  Document selection: classify documents into relevant and irrelevant
Information filtering: Search newly created documents for information that is relevant to a user
Document classification: Assign a document to a category that describes its content
Keyword co-occurrence: Find groups of keywords that co-occur in many documents

Evaluating text mining

Precision and Recall
Let the set of documents that are relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}.
The precision is the percentage of retrieved documents that are relevant to the query:

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \tag{1}$$

The recall is the percentage of relevant documents that were retrieved by the query:

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \tag{2}$$
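The two definitions can be checked on a toy query. The following is a minimal sketch, assuming documents are identified by ids; the function name `precision_recall` and the ids are hypothetical, and the values are returned as fractions in [0, 1] rather than percentages.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Guard against empty sets (a convention chosen here, not from the slides).
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids, for illustration only.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d7"}
precision, recall = precision_recall(relevant, retrieved)
print(precision, recall)
```

Two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.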

Text representation

Tokenization
Process of identifying keywords in a document
Not all words in a text are relevant
Text mining ignores stop words
Stop words form the stop list
Stop lists are context-dependent
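A minimal sketch of tokenization with a stop list follows. The stop list, the regular expression, and the example sentence are illustrative assumptions; a real stop list would be chosen for the application domain.

```python
import re

# Hypothetical, deliberately tiny stop list; real stop lists are context-dependent.
STOP_WORDS = {"the", "of", "is", "a", "in", "and", "to", "by"}

def tokenize(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into word tokens and drop stop words."""
    # A token is a run of letters/digits, optionally joined by hyphens.
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())
    return [tok for tok in tokens if tok not in stop_words]

tokens = tokenize("The gene is expressed in the liver and the kidney.")
print(tokens)
```

Only the four content words of the example sentence survive the stop list.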

Text representation

Vector space model
Given #d documents and #t terms
Model each document as a vector v in a #t-dimensional space

Weighted term-frequency matrix
Matrix TF of size #d × #t
Entries measure the association of term and document
If a term t does not occur in a document d, then TF(d, t) = 0
If a term t does occur in a document d, then TF(d, t) > 0

Text representation

If term t occurs in document d, common choices for TF(d, t) are (T denotes the set of all terms):
  Binary: $TF(d, t) = 1$
  Raw frequency: $TF(d, t) = \mathrm{freq}(d, t)$, the frequency of t in d
  Relative frequency: $TF(d, t) = \frac{\mathrm{freq}(d, t)}{\sum_{t' \in T} \mathrm{freq}(d, t')}$
  Log-scaled frequency: $TF(d, t) = 1 + \log(1 + \log(\mathrm{freq}(d, t)))$
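The four weighting schemes can be compared side by side. This is a sketch under the assumption that a document is given as a list of tokens; the function name `term_frequency` and the scheme labels are hypothetical, and the natural logarithm is assumed for the log-scaled variant.

```python
import math
from collections import Counter

def term_frequency(tokens, term, scheme="raw"):
    """TF(d, t) for a tokenized document under one of four weighting schemes."""
    counts = Counter(tokens)
    freq = counts[term]
    if freq == 0:
        return 0.0  # a term that does not occur always gets weight 0
    if scheme == "binary":
        return 1.0
    if scheme == "raw":
        return float(freq)
    if scheme == "relative":
        return freq / sum(counts.values())
    if scheme == "log":
        return 1.0 + math.log(1.0 + math.log(freq))
    raise ValueError(f"unknown scheme: {scheme}")

doc = ["gene", "protein", "gene", "cell"]  # hypothetical tokenized document
for scheme in ("binary", "raw", "relative", "log"):
    print(scheme, term_frequency(doc, "gene", scheme))
```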

Text representation

Inverse document frequency
Represents the scaling factor, or importance, of a term
A term that appears in many documents is scaled down

$$IDF(t) = \log \frac{1 + |d|}{|d_t|} \tag{3}$$

where |d| is the number of all documents, and |d_t| is the number of documents containing term t

TF-IDF measure
Product of term frequency and inverse document frequency:

$$\text{TF-IDF}(d, t) = TF(d, t) \cdot IDF(t) \tag{4}$$
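As a small worked example of equations (3) and (4), the sketch below uses raw counts as TF and the natural logarithm; both are assumptions, since the slides leave the TF variant and the log base open. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf(term, documents):
    """IDF(t) = log((1 + |d|) / |d_t|), following equation (3)."""
    n_containing = sum(1 for doc in documents if term in doc)
    if n_containing == 0:
        raise ValueError("term does not occur in any document")
    return math.log((1 + len(documents)) / n_containing)

def tf_idf(term, doc, documents):
    """TF-IDF(d, t) = TF(d, t) * IDF(t), with raw counts as TF (an assumption)."""
    return Counter(doc)[term] * idf(term, documents)

# Hypothetical corpus of three tokenized documents.
documents = [
    ["gene", "protein", "gene"],
    ["protein", "cell"],
    ["protein", "pathway", "cell"],
]
print(tf_idf("gene", documents[0], documents))     # rare term, high weight
print(tf_idf("protein", documents[0], documents))  # term in every document, low weight
```

The term "protein" occurs in every document, so its weight is scaled down relative to "gene", which occurs in only one.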

Measuring similarity

Cosine measure
Let $v_1$ and $v_2$ be two document vectors.
The cosine similarity is defined as

$$\mathrm{sim}(v_1, v_2) = \frac{v_1^\top v_2}{|v_1|\,|v_2|} \tag{5}$$

Kernels
Depending on how we represent a document, there are many kernels available for measuring the similarity of these representations
  Vectorial representation: vector kernels such as the linear, polynomial, or Gaussian RBF kernel
  One long string: string kernels that count common k-mers in two strings (more on that later in the course)
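The cosine measure from equation (5) can be sketched for plain Python lists as follows. Returning 0 for an all-zero vector is a convention chosen here, not something the slides specify.

```python
import math

def cosine_similarity(v1, v2):
    """sim(v1, v2) = v1^T v2 / (|v1| |v2|) for two equally long vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm1 * norm2)

# Hypothetical term-frequency vectors over three terms.
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # same direction
print(cosine_similarity([1, 0, 0], [0, 1, 0]))  # no shared terms
```

Because the measure normalises by vector length, a document and a copy of it with doubled term counts are treated as identical in direction.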

Keyword co-occurrence

Problem
Find sets of keywords that often co-occur
Common problem in biomedical literature: find associations between genes, proteins or other entities using co-occurrence search
Keyword co-occurrence search is an instance of a more general problem in data mining, called association rule mining.

Association rules

Definitions
Let $I = \{I_1, I_2, \ldots, I_m\}$ be a set of items (keywords)
Let D be the database of transactions T (collection of documents)
A transaction $T \in D$ is a set of items: $T \subseteq I$ (a document is a set of keywords)
Let A be a set of items: $A \subseteq T$. An association rule is an implication of the form

$$A \subseteq T \Rightarrow B \subseteq T, \tag{6}$$

where $A, B \subseteq I$ and $A \cap B = \emptyset$

Association rules

Support and Confidence
The rule $A \Rightarrow B$ holds in the transaction set D with support s, where s is the percentage of transactions in D that contain $A \cup B$:

$$\text{support}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D\}|} \tag{7}$$

The rule $A \Rightarrow B$ has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B:

$$\text{confidence}(A \Rightarrow B) = \frac{|\{T \in D \mid A \subseteq T \wedge B \subseteq T\}|}{|\{T \in D \mid A \subseteq T\}|} \tag{8}$$
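Support and confidence can be computed directly from these definitions. The following is a sketch assuming each document is represented as a set of keywords; the keyword names are hypothetical, and the values are returned as fractions in [0, 1].

```python
def support(a, b, transactions):
    """Fraction of transactions that contain every item of both A and B."""
    ab = set(a) | set(b)
    return sum(1 for t in transactions if ab <= t) / len(transactions)

def confidence(a, b, transactions):
    """Among the transactions containing A, the fraction that also contain B."""
    a = set(a)
    ab = a | set(b)
    n_a = sum(1 for t in transactions if a <= t)
    n_ab = sum(1 for t in transactions if ab <= t)
    return n_ab / n_a if n_a else 0.0  # 0 if A never occurs (assumed convention)

# Hypothetical documents, each reduced to its set of keywords.
transactions = [
    {"geneA", "geneB", "cancer"},
    {"geneA", "geneB"},
    {"geneA", "cell"},
    {"geneB", "cell"},
]
print(support({"geneA"}, {"geneB"}, transactions))
print(confidence({"geneA"}, {"geneB"}, transactions))
```

Here two of the four documents contain both keywords, and two of the three documents containing "geneA" also contain "geneB".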

Association rules

Strong rules
Rules that satisfy both a minimum support threshold (minsup) and a minimum confidence threshold (minconf) are called strong association rules, and these are the ones we are after!

Finding strong rules
1. Search for all frequent itemsets (sets of items that occur in at least minsup % of all transactions)
2. Generate strong association rules from the frequent itemsets

Association rules

Apriori algorithm
Makes use of the Apriori property: If an itemset A is frequent, then any subset B of A (B ⊆ A) is frequent as well. If B is infrequent, then any superset A of B (A ⊇ B) is infrequent as well.

Steps
1. Determine frequent items = k-itemsets with k = 1
2. Join all pairs of frequent k-itemsets that differ in at most 1 item = candidates $C_{k+1}$ for being frequent (k+1)-itemsets
3. Check the frequency of these candidates $C_{k+1}$: the frequent ones form the frequent (k+1)-itemsets (trick: immediately discard any candidate that contains an infrequent k-itemset)
4. Repeat from Step 2 until no more candidates are frequent
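The four steps can be sketched in a few lines. This is an illustrative, unoptimised version: the join step is implemented as "any two frequent k-itemsets whose union has exactly k+1 items", minsup is given as a fraction rather than a percentage, and the function name `apriori` and the keywords are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent_among(candidates):
        # Count each candidate and keep those reaching the support threshold.
        result = {}
        for cand in candidates:
            sup = sum(1 for t in transactions if cand <= t) / n
            if sup >= minsup:
                result[cand] = sup
        return result

    # Step 1: frequent 1-itemsets.
    current = frequent_among({frozenset([item]) for t in transactions for item in t})
    frequent = dict(current)
    k = 1
    while current:
        # Step 2: join pairs of frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == k + 1}
        # Trick from Step 3: drop candidates that contain an infrequent k-itemset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current for s in combinations(c, k))}
        # Step 3: keep the candidates that are actually frequent.
        current = frequent_among(candidates)
        frequent.update(current)
        k += 1  # Step 4: repeat until no candidate is frequent
    return frequent

# Hypothetical keyword sets of five documents.
docs = [{"p53", "mdm2", "apoptosis"}, {"p53", "mdm2"}, {"p53", "apoptosis"},
        {"mdm2", "apoptosis"}, {"p53", "mdm2", "apoptosis"}]
frequent = apriori(docs, minsup=0.6)
for itemset, sup in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

With minsup = 0.6, all single keywords and all pairs are frequent, while the triple occurs in only two of five documents and is rejected.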

Transduction

Known test set
Classification on text databases often means that we know all the data we will work with before training
Hence the test set is known a priori
This setting is called 'transductive'
Can we define classifiers that exploit the known test set? Yes!

Transductive SVM (Joachims, ICML 1999)
Trains SVM on both training and test set
Uses test data to maximise margin

Inductive vs. transductive

Classification
Task: predict label y from features x

Classic inductive setting
Strategy: Learn classifier on (labelled) training data
Goal: Classifier shall generalise to unseen data from same distribution

Transductive setting
Strategy: Learn classifier on (labelled) training data AND a given (unlabelled) test dataset
Goal: Predict class labels for this particular dataset

Why transduction?

Really necessary?
Classic approach works: train on training dataset, test on test dataset
That is what we usually do in practice, for instance, in cross-validation.
We usually ignore the fact that these settings are transductive.

The benefits of transductive classification
Inductive setting: infinitely many potential classifiers
Transductive setting: finite number of equivalence classes of classifiers
f and f′ are in the same equivalence class ⇔ f and f′ classify the points from the training and test dataset identically

Why transduction?

Idea of Transductive SVMs
Risk on test data ≤ risk on training data + confidence interval (depends on number of equivalence classes)
Theorem by Vapnik (1998): The larger the margin, the lower the number of equivalence classes that contain a classifier with this margin
Find the hyperplane that separates the classes in the training data AND in the test data with maximum margin.

Transduction on text

[Slide content (figure) not preserved in this transcript]

Transductive SVM

Linearly separable case

$$
\begin{aligned}
\min_{w,\,b,\,y^*} \quad & \frac{1}{2}\|w\|^2 \\
\text{s.t.} \quad & \forall_{i=1}^{n}:\; y_i\,[w^\top x_i + b] \ge 1 \\
& \forall_{j=1}^{k}:\; y_j^*\,[w^\top x_j^* + b] \ge 1
\end{aligned}
$$

Transductive SVM

Non-linearly separable case

$$
\begin{aligned}
\min_{w,\,b,\,y^*,\,\xi,\,\xi^*} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + C^* \sum_{j=1}^{k} \xi_j^* \\
\text{s.t.} \quad & \forall_{i=1}^{n}:\; y_i\,[w^\top x_i + b] \ge 1 - \xi_i \\
& \forall_{j=1}^{k}:\; y_j^*\,[w^\top x_j^* + b] \ge 1 - \xi_j^* \\
& \forall_{i=1}^{n}:\; \xi_i \ge 0 \\
& \forall_{j=1}^{k}:\; \xi_j^* \ge 0
\end{aligned}
$$

Transductive SVM

Optimisation
How to solve this optimisation problem (OP)?
Not so nice: combination of integer and convex OP
Joachims' approach: find approximate solution by iterative application of inductive SVM
  Train inductive SVM on training data, predict on test data, assign labels to test data
  Retrain on all data, with special slack weights for test data ($C^*_-$, $C^*_+$)
  Outer loop: repeat and slowly increase ($C^*_-$, $C^*_+$)
  Inner loop: within each repetition, repeatedly switch pairs of 'misclassified' data points
Local search with approximate solution to OP
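The outer and inner loop can be sketched as follows. This is a rough illustration under several assumptions, not Joachims' implementation: the inductive SVM is replaced by a simple subgradient-descent stand-in (`train_svm`), test labels are initialised from the sign of the decision value, the initial test weights (1e-5) and the doubling schedule are assumed, and the switching criterion (two oppositely labelled test points, both with positive slack, whose slacks sum to more than 2) follows one reading of [1]. All function names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_svm(points, labels, costs, epochs=300, lr=0.01):
    """Stand-in inductive SVM: full-batch subgradient descent on
    0.5 * ||w||^2 + sum_i costs[i] * max(0, 1 - y_i * (w . x_i + b))."""
    w, b = [0.0] * len(points[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = list(w), 0.0        # gradient of the regulariser
        for x, y, c in zip(points, labels, costs):
            if y * (dot(w, x) + b) < 1:      # margin violated: hinge term active
                grad_w = [g - c * y * xd for g, xd in zip(grad_w, x)]
                grad_b -= c * y
        w = [wd - lr * g for wd, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def tsvm(train_x, train_y, test_x, c=1.0, c_star=1.0, max_switches=100):
    # Train on the labelled data only, then assign labels to the test data.
    w, b = train_svm(train_x, train_y, [c] * len(train_x))
    test_y = [1 if dot(w, x) + b > 0 else -1 for x in test_x]
    n_plus, k = test_y.count(1), len(test_x)
    ratio = n_plus / (k - n_plus) if 0 < n_plus < k else 1.0
    c_minus, c_plus = 1e-5, 1e-5 * ratio     # small initial test weights (assumed)
    while True:                              # outer loop: increase (C*-, C*+)
        switches = 0
        while True:                          # inner loop: switch label pairs
            costs = [c] * len(train_x) + [c_plus if y > 0 else c_minus for y in test_y]
            w, b = train_svm(train_x + test_x, train_y + test_y, costs)
            slack = [max(0.0, 1 - y * (dot(w, x) + b)) for x, y in zip(test_x, test_y)]
            pair = next(((m, l) for m in range(k) for l in range(k)
                         if test_y[m] > 0 > test_y[l]
                         and slack[m] > 0 and slack[l] > 0
                         and slack[m] + slack[l] > 2), None)
            if pair is None or switches >= max_switches:
                break
            m, l = pair
            test_y[m], test_y[l] = -1, 1     # switch the two labels, then retrain
            switches += 1
        if c_minus >= c_star and c_plus >= c_star:
            break
        c_minus, c_plus = min(2 * c_minus, c_star), min(2 * c_plus, c_star)
    return w, b, test_y

# Hypothetical 2-D toy data: two labelled points, four unlabelled test points.
train_x = [(2.0, 0.0), (-2.0, 0.0)]
train_y = [1, -1]
test_x = [(3.0, 1.0), (2.5, -1.0), (-3.0, 0.5), (-2.0, -1.0)]
w, b, test_labels = tsvm(train_x, train_y, test_x)
print(test_labels)
```

On this well-separated toy data no labels are switched; the loops only matter when the initial inductive labelling places test points inside the margin or on the wrong side. The `max_switches` cap is a safeguard added here because the stand-in solver is only approximate.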

Inductive SVM for TSVM

Variant of inductive SVM

$$
\begin{aligned}
\min_{w,\,b,\,y^*,\,\xi,\,\xi^*} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i + C^*_- \sum_{j:\, y_j^* = -1} \xi_j^* + C^*_+ \sum_{j:\, y_j^* = 1} \xi_j^* \\
\text{s.t.} \quad & \forall_{i=1}^{n}:\; y_i\,[w^\top x_i + b] \ge 1 - \xi_i \\
& \forall_{j=1}^{k}:\; y_j^*\,[w^\top x_j^* + b] \ge 1 - \xi_j^*
\end{aligned}
$$

Three different penalty costs
$C$ for points from the training dataset
$C^*_-$ for points in the test dataset currently in class −1
$C^*_+$ for points in the test dataset currently in class +1

Experiments

[Figure: Average P/R-breakeven point on the Reuters dataset for different training set sizes and a test size of 3,299]

[Figure: Average P/R-breakeven point on the Reuters dataset for 17 training documents and varying test set size for the TSVM]

[Figure: Average P/R-breakeven point on the WebKB category 'course' for different training set sizes]

[Figure: Average P/R-breakeven point on the WebKB category 'project' for different training set sizes]

Summary

Results
Transductive version of SVM
Maximizes margin on training and test data
Implementation uses variant of classic inductive SVM
Solution is approximate and fast
Works well on text, in particular on small training samples and large test sets

References and further reading

[1] T. Joachims. Transductive Inference for Text Classification using Support Vector Machines. ICML, 1999: 200-209.

[2] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Elsevier, Morgan Kaufmann Publishers, 2006.

The end

See you tomorrow! Next topic: Graph Mining