
Page 1: WORKING WITH TEXT DATA - PART I

WORKING WITH TEXT DATA - PART I

Practical Lectures by Chris Emmery (MSc)

Page 2: WORKING WITH TEXT DATA - PART I

TODAY'S LECTURE

Representing text as vectors.
Binary vectors for Decision Tree classification.
Using Vector Spaces and weightings.
Document classification using k-NN.

Page 3: WORKING WITH TEXT DATA - PART I

HOW IS THIS DIFFERENT THAN BEFORE?

Numbers are numbers. Their scales and distributions might be different; the information leaves little to interpretation.

Language is complex:

Representing language is complex.
Mathematically interpreting language is complex.
Inferring knowledge from language is complex.
Understanding language is complex.

Page 4: WORKING WITH TEXT DATA - PART I

NOISY LANGUAGE

Just netflixed pixels, best time ever lol -1/5

Page 5: WORKING WITH TEXT DATA - PART I

LANGUAGE AS A STRING

title,director,year,score,budget,gross,plot

"Dunkirk","Christopher Nolan",2017,8.4,100000000,183836652,"Allied soldiers

"Interstellar","Christopher Nolan",2014,8.6,165000000,187991439,"A team of e

"Inception","Christopher Nolan",2010,8.8,160000000,292568851,"A thief, who s

"The Prestige","Christopher Nolan",2006,8.5,40000000,53082743,"After a tragi

"Memento","Christopher Nolan",2000,8.5,9000000,25530884,"A man juggles searc

Page 6: WORKING WITH TEXT DATA - PART I

TEXT TO VECTORS

Page 7: WORKING WITH TEXT DATA - PART I

CONVERTING TO NUMBERS

d = the cat sat on the mat → d⃗ = ⟨?⟩

Page 8: WORKING WITH TEXT DATA - PART I

WORDS AS FEATURES

d = the cat sat on the mat →

      cat  mat  on  sat  the
d⃗ = [  1    1    1    1    1 ]

Bag-of-Words Representation

Page 9: WORKING WITH TEXT DATA - PART I

DOCUMENTS AS INSTANCES

d0 = the cat sat on the mat
d1 = my cat sat on my cat

      cat  mat  my  on  sat  the
d0 = [ 1    1    0   1    1    1 ]
d1 = [ 1    0    1   1    1    0 ]

Page 10: WORKING WITH TEXT DATA - PART I

DOCUMENTS ∗ TERMS

V = [ cat mat my on sat the ]

X = [ 1 1 0 1 1 1 ]
    [ 1 0 1 1 1 0 ]
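As a sketch, the vocabulary V and the binary documents ∗ terms matrix X above could be built in plain Python along these lines (docs, V, and X simply mirror the slide's notation):

docs = ["the cat sat on the mat",
        "my cat sat on my cat"]

# V: all unique terms across the documents, in sorted order
V = sorted({word for doc in docs for word in doc.split()})

# X: one row per document, 1 if the term occurs in it, 0 otherwise
X = [[1 if term in doc.split() else 0 for term in V] for doc in docs]

print(V)  # ['cat', 'mat', 'my', 'on', 'sat', 'the']
print(X)  # [[1, 1, 0, 1, 1, 1], [1, 0, 1, 1, 1, 0]]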

Page 11: WORKING WITH TEXT DATA - PART I

DOCUMENT SIMILARITY

Wikipedia articles:

data  language  learning  mining  text  vision    y
  1       0         1        0      0      1      CV
  1       1         1        0      1      0      NLP
  1       0         1        1      1      0      TM

CV = Computer Vision, NLP = Natural Language Processing, TM = Text Mining

Page 12: WORKING WITH TEXT DATA - PART I

DOCUMENT SIMILARITY - JACCARD COEFFICIENT

d0 = ⟨1, 0, 1, 0, 0, 1⟩
d1 = ⟨1, 1, 1, 0, 1, 0⟩
d2 = ⟨1, 0, 1, 1, 1, 0⟩

J(d0, d1) = 2/5 = 0.4
J(d0, d2) = 2/5 = 0.4
J(d1, d2) = 3/5 = 0.6

J(A, B) = |A ∩ B| / |A ∪ B|
        = words in A and B (intersection) / words in A or B (union)
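A minimal sketch of the Jaccard coefficient over binary vectors, reproducing the numbers above (the helper name jaccard is just for illustration):

def jaccard(a, b):
    """Jaccard coefficient between two binary vectors of equal length."""
    intersection = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # |A ∩ B|
    union = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)          # |A ∪ B|
    return intersection / union

d0 = [1, 0, 1, 0, 0, 1]
d1 = [1, 1, 1, 0, 1, 0]
d2 = [1, 0, 1, 1, 1, 0]

print(jaccard(d0, d1), jaccard(d0, d2), jaccard(d1, d2))  # 0.4 0.4 0.6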

Page 13: WORKING WITH TEXT DATA - PART I

DECISION TREES (ID3)

Page 14: WORKING WITH TEXT DATA - PART I

CLASSIFICATION RULES

data  language  learning  mining  text  vision    y
  1       0         1        0      0      1      CV
  1       1         1        0      1      0      NLP
  1       0         1        1      1      0      TM

if 'vision' in d:
    label = 'CV'
else:
    if 'language' in d:
        label = 'NLP'
    else:
        label = 'TM'

Page 15: WORKING WITH TEXT DATA - PART I

INFERRING RULES (DECISIONS) BY INFORMATION GAIN

Page 16: WORKING WITH TEXT DATA - PART I

INFERRING RULES (DECISIONS) BY INFORMATION GAIN

Page 17: WORKING WITH TEXT DATA - PART I

INFERRING RULES (DECISIONS) BY INFORMATION GAIN

           spam  ham  total
free = 0     1    3     4
free = 1     4    1     5

E(free, Y) = Σ_c P(c) ⋅ E(c)

E(free=0, Y) = −(1/4 ⋅ log2(1/4)) − (3/4 ⋅ log2(3/4)) = 0.811
E(free=1, Y) = −(4/5 ⋅ log2(4/5)) − (1/5 ⋅ log2(1/5)) = 0.722

E(free, Y) = 4/9 ⋅ 0.811 + 5/9 ⋅ 0.722 ≈ 0.762
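The entropy arithmetic above can be checked with a short Python sketch (entropy here is a hypothetical helper, not something from the slides):

import math

def entropy(counts):
    """Entropy of a class distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

e_free0 = entropy([1, 3])  # spam/ham counts when free = 0 -> ~0.811
e_free1 = entropy([4, 1])  # spam/ham counts when free = 1 -> ~0.722

# weight each branch by the fraction of instances it covers: 4/9 and 5/9
e_split = (4 / 9) * e_free0 + (5 / 9) * e_free1
print(round(e_free0, 3), round(e_free1, 3), round(e_split, 3))  # 0.811 0.722 0.762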

Page 18: WORKING WITH TEXT DATA - PART I

ID3 ALGORITHM

The feature (word) with the highest Information Gain will be split on first.
Instances are divided over both sides; the calculations are repeated on the leftovers (recursion).
If all leftover instances belong to one class, make a decision.
Create more and more rules until some stopping criterion is met.

More info: here, here, and here.
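In practice you rarely implement ID3 by hand. scikit-learn's DecisionTreeClassifier implements an optimised CART-style tree rather than ID3 itself, but with criterion='entropy' it also chooses splits by information gain; a rough sketch on the toy Wikipedia data from the earlier slide (the short strings stand in for the full articles):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

docs = ["data learning vision",          # CV
        "data language learning text",   # NLP
        "data learning mining text"]     # TM
y = ["CV", "NLP", "TM"]

# binary bag-of-words features, as in the classification-rules slide
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

tree = DecisionTreeClassifier(criterion="entropy")  # entropy -> information gain
tree.fit(X, y)
print(tree.predict(X))  # ['CV' 'NLP' 'TM'] on the training data itself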

Page 19: WORKING WITH TEXT DATA - PART I

WORDS IN VECTOR SPACES

Page 20: WORKING WITH TEXT DATA - PART I

BINARY VS. FREQUENCY

(+) Binary is a very compact representation (in terms of memory).
(+) Algorithms like Decision Trees have a very straightforward and compact structure.
(−) Binary says very little about the weight of each word (feature).
(−) We can't use more advanced algorithms that work with Vector Spaces.

Page 21: WORKING WITH TEXT DATA - PART I

TERM FREQUENCIES - SOME NOTATION

Let D = {d1, d2, …, dN} be a set of documents, and T = {t1, t2, …, tM} (previously V) a set of index terms for D.

Each document di ∈ D can be represented as a frequency vector:

d⃗i = ⟨tf(t1, di), …, tf(tM, di)⟩

where tf(t, d) denotes the frequency of term tj ∈ T for document di.

Thus, Σ_{j=1}^{M} d⃗j would be the word length of some document d.

Page 22: WORKING WITH TEXT DATA - PART I

TERM FREQUENCIES

d0 = the cat sat on the mat
d1 = my cat sat on my cat

T = [ cat mat my on sat the ]

X = [ 1 1 0 1 1 2 ]
    [ 2 0 2 1 1 0 ]
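The same matrix (now with counts instead of binary values) could be obtained with scikit-learn's CountVectorizer; a minimal sketch:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat",
        "my cat sat on my cat"]

# without binary=True the cells hold raw term frequencies tf(t, d)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat' 'mat' 'my' 'on' 'sat' 'the']
print(X.toarray())
# [[1 1 0 1 1 2]
#  [2 0 2 1 1 0]]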

Page 23: WORKING WITH TEXT DATA - PART I

TERM FREQUENCIES?

tf = 10 is less important than tf = 100, but is it also 10 times less important? Information Theory to the rescue!

d0 = 'natural-language-processing.wiki'
d1 = 'information-retrieval.wiki'
d2 = 'artificial-intelligence.wiki'
d3 = 'machine-learning.wiki'
d4 = 'text-mining.wiki'
d5 = 'computer-vision.wiki'

t = [ learning ]

X_t = [ 27, 2, 46, 134, 6, 10 ]

log(X_t + 1) = [ 3.33, 1.10, 3.85, 4.91, 1.95, 2.40 ]

Notice the +1 smoothing in log(tf(t, d) + 1) to avoid log(0) = −inf.
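A quick numpy check of the log-scaling above, using the frequencies of 'learning' shown on the slide (natural log, which matches the 3.33 … 2.40 values):

import numpy as np

tf = np.array([27, 2, 46, 134, 6, 10])  # tf('learning', d) for d0 .. d5

# +1 smoothing so that tf = 0 maps to 0 instead of -inf
weights = np.log(tf + 1)
print(weights.round(2))  # [3.33 1.1  3.85 4.91 1.95 2.4 ]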

Page 24: WORKING WITH TEXT DATA - PART I

SOME REMAINING PROBLEMS

The longer a document, the higher the probability a term will occur often, and will thus have more weight.
Rare terms should actually be informative, especially if they occur amongst few documents.

If d1 and d2 both have cross-validation in their vectors, and all the other documents do not → strong similarity.

The latter: Document Frequency.

Page 25: WORKING WITH TEXT DATA - PART I

(INVERSE) DOCUMENT FREQUENCY

idf_t = log_b(N / df_t)

t = [ naive ]

X_t = [ 1, 0, 2, 3, 0, 0 ]

df_t = 3    idf_t = log_b(6/3) = 0.30
Page 26: WORKING WITH TEXT DATA - PART I

PUTTING IT TOGETHER: WEIGHTING

tf ∗ idf weights per document d:

d   learning  text  language  intelligence
0   5  0.32  1  10  0
1   2  21  0.0  6  0
2   0  3  0  1  0.33

w_{t,d} = log(tf(t, d) + 1) ⋅ log_b(N / df_t)
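Putting tf and idf together as in the formula above; a minimal sketch with a made-up tf matrix (the exact numbers on the slide are not fully recoverable here), again assuming base-10 logs for the idf part:

import numpy as np

# hypothetical documents * terms tf matrix (rows: documents, columns: terms)
tf = np.array([[5, 0, 1, 10],
               [2, 21, 0, 6],
               [0, 3, 0, 1]])

N = tf.shape[0]                    # number of documents
df = np.count_nonzero(tf, axis=0)  # document frequency per term

# w_{t,d} = log(tf(t, d) + 1) * log_b(N / df_t)
w = np.log(tf + 1) * np.log10(N / df)
print(w.round(2))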

Page 27: WORKING WITH TEXT DATA - PART I
Page 28: WORKING WITH TEXT DATA - PART I

NORMALIZING VECTOR REPRESENTATIONS

We fixed the global information per instance (document ∗ term).
Despite tf ∗ idf, we still don't account for the length of documents (i.e. the total amount of words).
Why is this an issue?

Page 29: WORKING WITH TEXT DATA - PART I

k-NEAREST NEIGHBOURS

Page 30: WORKING WITH TEXT DATA - PART I

EUCLIDEAN DISTANCE

d(x⃗, y⃗) = √( Σ_{i=1}^{n} (x⃗i − y⃗i)² )

Documents with many words are far away.
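The Euclidean distance in a couple of lines of numpy (the tf vectors of d0 and d1 from earlier are reused as an example):

import numpy as np

def euclidean(x, y):
    """Euclidean distance between two equally sized vectors."""
    return np.sqrt(np.sum((x - y) ** 2))

x = np.array([1, 1, 0, 1, 1, 2])  # tf vector of d0
y = np.array([2, 0, 2, 1, 1, 0])  # tf vector of d1
print(round(euclidean(x, y), 2))  # 3.16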

Page 31: WORKING WITH TEXT DATA - PART I

ℓ2 NORMALIZATION

||x⃗||2 = √( Σ_i x_i² )

Divide all feature values by the norm.
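ℓ2 normalization as a sketch in numpy:

import numpy as np

x = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 2.0])

l2 = np.sqrt(np.sum(x ** 2))  # the l2 norm, same as np.linalg.norm(x)
x_normed = x / l2             # divide all feature values by the norm

print(round(l2, 2))                        # 2.83
print(round(np.linalg.norm(x_normed), 5))  # 1.0 - unit length after normalization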

Page 32: WORKING WITH TEXT DATA - PART I

COSINE SIMILARITY

a⃗ ∙ b⃗ = Σ_{i=1}^{n} a⃗i b⃗i = a⃗1 b⃗1 + a⃗2 b⃗2 + … + a⃗n b⃗n

Under the ℓ2 norm only (otherwise normalize the vectors first)!
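Cosine similarity as the dot product of ℓ2-normalized vectors; a minimal numpy sketch:

import numpy as np

def cosine_similarity(a, b):
    """Dot product of the l2-normalized vectors = cosine of the angle between them."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.dot(a, b))

a = np.array([1.0, 1.0, 0.0, 1.0, 1.0, 2.0])
b = np.array([2.0, 0.0, 2.0, 1.0, 1.0, 0.0])
print(round(cosine_similarity(a, b), 2))  # 0.45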

Page 33: WORKING WITH TEXT DATA - PART I

USING SIMILARITY IN k-NN

Store the complete training matrix X_train in memory.
Calculate the cosine / euclidean metric between a given x⃗_test and all x⃗_train ∈ X_train.
Choose the k vectors from X_train with the highest similarity to x⃗_test.
Look up the labels for these k vectors, take the majority label → this is the classification.
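A sketch of these steps in numpy, with cosine similarity as the metric (knn_predict is a hypothetical helper, not from the slides):

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Majority label among the k training vectors most cosine-similar to x_test."""
    # l2-normalize so that the dot product equals cosine similarity
    X_normed = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
    x_normed = x_test / np.linalg.norm(x_test)
    similarities = X_normed @ x_normed
    top_k = np.argsort(similarities)[-k:]  # indices of the k most similar vectors
    labels, counts = np.unique([y_train[i] for i in top_k], return_counts=True)
    return labels[np.argmax(counts)]       # majority label

X_train = np.array([[1, 0, 1, 0, 0, 1],   # CV
                    [1, 1, 1, 0, 1, 0],   # NLP
                    [1, 0, 1, 1, 1, 0]],  # TM
                   dtype=float)
y_train = ["CV", "NLP", "TM"]

x_test = np.array([1, 1, 0, 0, 1, 0], dtype=float)
print(knn_predict(X_train, y_train, x_test, k=1))  # NLP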

Page 34: WORKING WITH TEXT DATA - PART I

AUGMENTATIONS TO k-NN

Use different metrics.
Weight labels by:

Frequency (majority preferred).
Inverse frequency (rarer preferred).
Distance (closer instances count heavier).

Weightings avoid ties!
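scikit-learn's KNeighborsClassifier supports this directly; weights='distance' makes closer instances count more heavily, which also avoids ties (a sketch on the same toy data):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[1, 0, 1, 0, 0, 1],
                    [1, 1, 1, 0, 1, 0],
                    [1, 0, 1, 1, 1, 0]])
y_train = ["CV", "NLP", "TM"]

# cosine distance as the metric, distance-weighted voting instead of plain majority
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine", weights="distance")
knn.fit(X_train, y_train)
print(knn.predict([[1, 1, 0, 0, 1, 0]]))  # ['NLP']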