![Page 1: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/1.jpg)
Text Mining Overview
Piotr Gawrysiak gawrysia@ii.pw.edu.pl
Warsaw University of Technology
Data Mining Group
22 November 2001
![Page 2: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/2.jpg)
Topics
1. Natural Language Processing
2. Text Mining vs. Data Mining
3. The toolbox
• Language processing methods
• Single document processing
• Document corpora processing
4. Document categorization – a closer look
5. Applications
• Classic
• Profiled document delivery
• Related areas
• Web Content Mining & Web Farming
WUTDMGNOV 2001
![Page 3: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/3.jpg)
Natural Language Processing
• Natural language – test for Artificial Intelligence
• Alan Turing
• NLP and NLU
• Linguistics – exploring mysteries of a language
• William Jones
• Comparative linguistics - Jakob Grimm, Rasmus Rask
• Noam Chomsky
• I-Language and E-Language
• poverty of stimulus
• Statistical approaches – Markov and Shannon
Natural language processing (NLP) – anything that deals with text content
Natural language understanding (NLU) – semantics and logic
![Page 4: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/4.jpg)
Information explosion
[Figure: number of books published weekly and number of articles published monthly, 1970–2000, logarithmic scale from 1 to 100000]
• Increasing popularity of the Internet as a publishing medium
• Electronic media’s minimal duplication costs
Primitive information retrieval and data management tools
![Page 5: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/5.jpg)
Data Mining
Data Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from large databases. – Piatetsky-Shapiro
• Association rule discovery
• Sequential pattern discovery
• Categorization
• Clustering
• Statistics (mostly regression)
• Visualization
![Page 6: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/6.jpg)
Knowledge pyramid
[Figure: knowledge pyramid with the levels Signals, Data, Information, Knowledge and Wisdom; axes: resources occupied and semantic level; the Data Mining area is marked on the pyramid]
![Page 7: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/7.jpg)
Text Mining – a definition
Text Mining =
Data Mining (applied to text data) +
basic linguistics
Text Mining is understood as a process of automatically extracting meaningful, useful, previously unknown and ultimately comprehensible information from textual document repositories.
![Page 8: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/8.jpg)
Text Mining tools
Language tools
• Linguistic analysis
• Thesauri, dictionaries, grammar analysers etc.
• Machine translation
Single document tools
• Automatic feature extraction
• Automatic summarization
Multiple document tools
• Document categorization
• Document clustering
• Information retrieval
• Visualization methods
![Page 9: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/9.jpg)
Language analysis
• Syntactic analysers construction
• Grammatical sentence decomposition
• Part-of-speech tagging
• Word sense disambiguation
This is not that simple. Consider, for example:
This is a delicious butter - noun
You should butter your toast - verb
Rule-based systems or self-learning classification systems (using variable memory Markov models, VMM, and hidden Markov models, HMM)
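The rule-based option can be illustrated with a minimal sketch. The cue-word lists and the function name `tag_ambiguous` are assumptions made only for this example; a real tagger would use many more rules or learn them from a tagged corpus.

```python
# Hypothetical cue lists (assumptions for illustration only).
VERB_CUES = {"should", "to", "can", "will", "must"}
NOUN_CUES = {"a", "the", "this", "some", "delicious"}

def tag_ambiguous(sentence, word="butter"):
    """Tag each occurrence of `word` as noun or verb using the preceding token."""
    tokens = sentence.lower().split()
    tags = []
    for i, tok in enumerate(tokens):
        if tok != word:
            continue
        prev = tokens[i - 1] if i > 0 else ""
        if prev in VERB_CUES:
            tags.append("verb")      # e.g. a modal verb precedes the word
        elif prev in NOUN_CUES:
            tags.append("noun")      # e.g. a determiner or adjective precedes it
        else:
            tags.append("unknown")   # no rule fires
    return tags

print(tag_ambiguous("This is a delicious butter"))    # ['noun']
print(tag_ambiguous("You should butter your toast"))  # ['verb']
```

A single-token context is the simplest possible rule; statistical taggers replace the hand-written cue lists with probabilities estimated from data.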
![Page 10: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/10.jpg)
Thesaurus construction
[Figure: example thesaurus fragment linking the terms Telephone, Cell phone, Telecommunications, Fax machine, Data transmission network, Electronic mail, and Post and telecom with relations labelled AD, BT and RT]
Thesaurus (semantic network) stores information about relationships between terms
• Ascriptor - Descriptor relations
• „Broader term” – „Narrower term” relations
• „Related term” relations
The U.S.S Nashville arrived in Colon harbour with 42 marines
With the warship in Colon harbour, the Colombian troops withdrew
Construction can be manual (but this is a laborious process) or automatic.
![Page 11: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/11.jpg)
Machine translation
Problems
• Different vocabularies
• Different grammars and inflection rules
• Even different character sets
Levels of translation (source: Polish, target: English)
• Word level: W łóżku jest szybka → In bed is window-pane
• Syntactic level: W łóżku jest szybka → She is a window-pane in bed
• Semantic level: W łóżku jest szybka → She is quick in bed
• Knowledge representation: W łóżku jest szybka → formal knowledge representation language → She is quick in bed
![Page 12: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/12.jpg)
Książka okazała się adjective, The book turned out to be adjective
Fully automatic approach
Based on learning word usage patterns from large corpora of translated documents (bitext)
Problems
• Still quite few bitexts exist
• Sentences must be aligned prior to learning
• Keyword matching
• Sentence length based alignment
• Parameterisation is necessary
![Page 13: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/13.jpg)
Feature extraction
Not all words are equally important
• Technical multiword terminology
• Abbreviations
• Relations
• Names
• Numbers
Discovering important terms
• Finding lexical affinities
• Gap variance measurement
• Dictionary-based methods
• Grammar based heuristics
Examples of term variants
• Data bases, Databases
• Microsoft, Micro$oft
• MineIT
• Knowledge discovery in databases, Knowledge discovery in large databases, Knowledge discovery in big databases
![Page 14: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/14.jpg)
Document summarization
Summaries
• Abstracts
• Extracts
• Indicative summaries
• Informative summaries
Summary creation methods:
• statistical analysis of sentence and word frequency + dictionary analysis (cue words such as „abstract”, „conclusion”)
• text representation methods – grammatical analysis of sentences
• document structure analysis (question-answer patterns, formatting, vocabulary shifts etc.)
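The statistical method can be sketched as a simple extract builder. This is only an illustration under stated assumptions: the cue dictionary `CUE_WORDS`, the bonus of 1.0 and the scoring by average word frequency are choices made for this example, not a method given on the slide.

```python
import re
from collections import Counter

CUE_WORDS = {"abstract", "conclusion", "summary"}  # hypothetical cue dictionary

def extract_summary(text, n_sentences=1):
    """Return the n highest scoring sentences, kept in document order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(sentence):
        toks = re.findall(r"[a-z]+", sentence.lower())
        if not toks:
            return 0.0
        base = sum(freq[t] for t in toks) / len(toks)   # average word frequency
        bonus = 1.0 if any(t in CUE_WORDS for t in toks) else 0.0
        return base + bonus

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]

doc = ("Text mining finds patterns in text. The weather was nice. "
       "In conclusion, text mining helps.")
print(extract_summary(doc, 1))
```

The result is an extract (sentences copied from the document), not an abstract; producing an abstract would need text generation.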
![Page 15: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/15.jpg)
Document categorization & clustering
Clustering – dividing a set of documents into groups
Categorization – grouping based on a predefined category scheme
Typical categorization scenario
Step 1 : Create training hierarchy
Step 2 : Perform training
Step 3 : Actual classification
[Figure: categorization scenario with a repository, training classes (Class 1, Class 2), class fingerprints, and categorization of an unknown document]
![Page 16: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/16.jpg)
Categorization/clustering system
[Diagram: system stages – Documents, Representation conversion, Representation processing, Deriving metrics, Classic DM algorithm]
Clustering – k-means, agglomerative,...
Categorization – kNN, DT, Bayes,...
![Page 17: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/17.jpg)
Information retrieval
Two types of search methods
• exact match – in most cases uses some simple Boolean query specification language
• fuzzy – uses statistical methods to estimate relevance of the document
1999 data - Scooter (AltaVista) : 1.5GB RAM, 30GB disk, 4x533 MHz Alpha, 1GB/s I/O (crawler) - about 1 month needed to recrawl
Modern IR tools seem to be very effective...
2000 data - only 40-50% of the Web is indexed at all
![Page 18: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/18.jpg)
IR – exact match
Most popular method – inverted files
[Figure: inverted file with an alphabetical index of entries a, b, c, d, ..., z]
• Very fast
• Boolean queries very easy to process
• Very simple
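A minimal sketch of an inverted file and Boolean queries over it. The function names and the toy documents are assumptions for illustration; a production index would also handle tokenization, compression and storage on disk.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query_and(index, *terms):
    """Documents containing all of the terms."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def query_or(index, *terms):
    """Documents containing at least one of the terms."""
    return set().union(*(index.get(t, set()) for t in terms))

def query_not(index, all_ids, term):
    """Documents that do not contain the term."""
    return set(all_ids) - index.get(term, set())

docs = {1: "text mining overview", 2: "data mining tools", 3: "text retrieval tools"}
index = build_inverted_file(docs)
print(query_and(index, "text", "mining"))    # {1}
print(query_or(index, "data", "retrieval"))  # {2, 3}
print(query_not(index, docs, "mining"))      # {3}
```

Each Boolean operator maps directly to a set operation on posting sets, which is why such queries are easy to process.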
![Page 19: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/19.jpg)
IR – fuzzy search
$$sim(Q, D) = \cos(Q, D) = \frac{\sum_{i=1}^{k} q_i d_i}{\sqrt{\sum_{i=1}^{k} q_i^2}\,\sqrt{\sum_{i=1}^{k} d_i^2}}$$
Query can be a set of keywords, a document, or even a set of documents – also represented as a vector
Documents are represented as vectors over word (feature) space
[Diagram: Repository and initial query → IR → output → selection → output]
It’s possible to perform it iteratively – relevance feedback
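A minimal sketch of fuzzy search with cosine similarity between a query vector and document vectors. Raw word counts are used as vector components, which is an assumption for this example; any weighting scheme could be substituted.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a text as a vector of word counts over the word space."""
    return Counter(text.lower().split())

def cosine(q, d):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(q[t] * d[t] for t in q)   # Counter returns 0 for missing words
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def rank(query, docs):
    """Return (similarity, doc_id) pairs, most similar first."""
    qv = to_vector(query)
    scored = [(cosine(qv, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)

docs = {1: "text mining overview", 2: "data mining tools", 3: "text retrieval tools"}
for sim, doc_id in rank("text mining", docs):
    print(doc_id, round(sim, 3))
```

For relevance feedback, the documents selected by the user can be concatenated into a new query text and passed to `rank` again.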
![Page 20: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/20.jpg)
Document visualization
Peak represents many strongly related documents
Water represents assorted documents, creating semantic noise
Island represents several documents sharing similar subject, and separated from others - hence creating a group of interest
![Page 21: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/21.jpg)
Document visualization
![Page 22: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/22.jpg)
Document categorization
A closer look
![Page 23: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/23.jpg)
Measuring quality
Binary categorization scenario is analogous to document retrieval
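[Figure: document database DB containing two overlapping sets, dr and ds]
dr – relevant documents
ds – documents labelled as relevant
DB – document database
$$PR = \frac{|dr \cap ds|}{|ds|} \qquad R = \frac{|dr \cap ds|}{|dr|}$$
$$A = \frac{|dr \cap ds| + |DB \setminus (dr \cup ds)|}{|DB|} \qquad FO = \frac{|ds \setminus dr|}{|DB \setminus dr|}$$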
![Page 24: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/24.jpg)
Metrics
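$$PR(f,g) = \frac{a}{a+b}; \quad 0 \le PR(f,g) \le 1$$
$$R(f,g) = \frac{a}{a+c}; \quad 0 \le R(f,g) \le 1$$
$$A(f,g) = \frac{a+d}{a+b+c+d}; \quad 0 \le A(f,g) \le 1$$
$$FO(f,g) = \frac{b}{b+d}; \quad 0 \le FO(f,g) \le 1$$
$$F_\alpha = \frac{1}{\alpha \frac{1}{PR} + (1-\alpha) \frac{1}{R}}$$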
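A minimal sketch of these measures computed from sets. It assumes that a, b, c, d are the cells of the usual contingency table (a: relevant and labelled as relevant, b: labelled but not relevant, c: relevant but not labelled, d: neither); this reading is inferred from the formulas, and the parameter name `alpha` is likewise an assumption.

```python
def contingency(dr, ds, db):
    """Return (a, b, c, d) for relevant set dr, labelled set ds and database db."""
    a = len(dr & ds)          # relevant and labelled as relevant
    b = len(ds - dr)          # labelled as relevant, but not relevant
    c = len(dr - ds)          # relevant, but not labelled
    d = len(db - (dr | ds))   # neither relevant nor labelled
    return a, b, c, d

def metrics(a, b, c, d, alpha=0.5):
    """Precision, recall, accuracy, fallout and the combined F measure."""
    pr = a / (a + b) if a + b else 0.0
    r = a / (a + c) if a + c else 0.0
    acc = (a + d) / (a + b + c + d)
    fo = b / (b + d) if b + d else 0.0
    f = 1 / (alpha / pr + (1 - alpha) / r) if pr and r else 0.0
    return {"PR": pr, "R": r, "A": acc, "FO": fo, "F": f}

db = set(range(10))
dr = {0, 1, 2, 3}
ds = {2, 3, 4}
print(metrics(*contingency(dr, ds, db)))
```

With alpha = 0.5 the F measure reduces to the harmonic mean of precision and recall.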
![Page 25: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/25.jpg)
Multiple class scenario
M={M1, M2,...,Ml}
PR={PR1, PR2, ..., PRl}
Macro-averaging
$$PR^{ma}(f,g) = \frac{1}{l} \sum_{i=1}^{l} PR_i$$
Micro-averaging
Metrics are computed from a single matrix obtained by summing all $M_k$
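A minimal sketch contrasting the two averaging schemes for precision. It assumes each class has a contingency table given as a tuple (a, b, c, d); the example numbers are invented for illustration.

```python
def precision(a, b):
    """Precision from true positives a and false positives b."""
    return a / (a + b) if a + b else 0.0

def macro_precision(tables):
    """Average of per-class precision values; every class counts equally."""
    return sum(precision(a, b) for a, b, _c, _d in tables) / len(tables)

def micro_precision(tables):
    """Precision of the summed table; every decision counts equally."""
    total_a = sum(t[0] for t in tables)
    total_b = sum(t[1] for t in tables)
    return precision(total_a, total_b)

# One large, well-classified class and one small, poorly classified class.
tables = [(90, 10, 0, 0), (1, 9, 0, 90)]
print(macro_precision(tables))  # 0.5
print(micro_precision(tables))  # 91/110, about 0.827
```

Macro-averaging is dominated by small classes as much as by large ones, while micro-averaging mostly reflects performance on the large classes.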
![Page 26: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/26.jpg)
Categorization example
![Page 27: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/27.jpg)
Document representations
• unigram representations (bag-of-words)
• binary
• multivariate
• n-gram representations
• -gram representation
• positional representation
![Page 28: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/28.jpg)
Bigram example
Twas brillig, and the slithy toves
Did gyre and gimble in the wabe
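A minimal sketch of building a bigram representation for the two lines above. The tokenization (lowercasing, keeping only letters and apostrophes) is an assumption made for this example.

```python
import re
from collections import Counter

def bigrams(text):
    """Count pairs of adjacent words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(zip(words, words[1:]))

text = "Twas brillig, and the slithy toves Did gyre and gimble in the wabe"
counts = bigrams(text)
for pair, n in counts.items():
    print(pair, n)
```

The 13 words yield 12 bigrams; in this short fragment every bigram occurs once, although the words "and" and "the" each occur twice.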
![Page 29: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/29.jpg)
Probabilistic interpretation
$$R(G(R(D))) = R(D)$$
Operations:
• R(D) – creating representation R from document D
• G(R) – generating document D based on representation R
unigram
said has further that of a upon an the a see joined heavy cut alice on once you is and open the edition t of a to brought he it she she she kinds I came this away look declare four re and not vain the muttered in at was cried and her keep with I to gave I voice of at arm if smokes her tell she cry they finished some next kitten each can imitate only sit like nights you additional she software or courses for rule she is only to think damaged s blaze nice the shut prisoner no
bigram
Consider your white queen shook his head and rang through my punishments. She ought to me and alice said that distance said nothing. Just then he would you seem very hopeful so dark. There it begins with one on how many candlesticks in a white quilt and all alive before an upright on somehow kitty. Dear friend and without the room in a thing that a king and butter.
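A minimal sketch of the two operations for a bigram model: `represent` plays the role of R(D) and `generate` the role of G(R). The function names, the seed parameter and the toy text are assumptions for illustration.

```python
import random
from collections import defaultdict

def represent(words):
    """R(D): for each word, store the list of words that followed it."""
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length, seed=0):
    """G(R): random walk over the bigram model, starting from `start`."""
    rng = random.Random(seed)
    out = [start]
    while len(out) < length and model.get(out[-1]):
        # Successors repeat in the list, so frequent pairs are drawn more often.
        out.append(rng.choice(model[out[-1]]))
    return out

words = "the cat sat on the mat and the cat ran".split()
model = represent(words)
print(" ".join(generate(model, "the", 8)))
```

Every adjacent pair in the generated text occurs in the source, which is why bigram output looks locally plausible while unigram output does not.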
![Page 30: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/30.jpg)
Positional representation
[Figure: position in the text (0–35000) plotted against occurrence number (0–60) for the words "Any" and "Dumpty"]
![Page 31: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/31.jpg)
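Creating positional representation
$$f_{w_i}(k) = \sum_{j=k-r}^{k+r} g(v_j), \qquad g(v_j) = \begin{cases} 1, & \text{if } v_j = w_i \\ 0, & \text{otherwise} \end{cases}$$
$$\sum_{k=1}^{n} f_{w_i}(k) = 1 \quad \text{(after normalization)}$$
[Figure: word occurrences inside a window of width 2r around position k; in the example f(k)=2 (before norm.)]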
![Page 32: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/32.jpg)
Examples
[Figure: density functions f_any and f_dumpty for the words "any" and "dumpty", each plotted for r=500 and r=5000]
![Page 33: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/33.jpg)
Processing representations
[Figure: word frequency (logarithmic scale, 1–10000) plotted against word ID (0–3500)]
Word Frequency
The 1664
And 940
To 789
A 788
It 683
You 666
I 658
She 543
Of 538
said 473
Zipf’s law
There is no information about penguins in this document
Stopwords?
information penguins document
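A minimal sketch of stopword removal for the sentence above. The stopword list is a tiny hypothetical one chosen to reproduce the slide's example; real lists contain a few hundred frequent words.

```python
# Hypothetical stopword list (assumption for illustration only).
STOPWORDS = {"there", "is", "no", "about", "in", "this"}

def remove_stopwords(text):
    """Keep only the words that are not on the stopword list."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

print(remove_stopwords("There is no information about penguins in this document"))
# ['information', 'penguins', 'document']
```

The example also shows the risk: removing "no" reverses the meaning of the sentence, so the reduced representation suggests that the document is about penguins.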
![Page 34: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/34.jpg)
• Expanding
• Trimming
• Scaling functions
• Attribute selection
• Remapping attribute space
Expanding and trimming
![Page 35: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/35.jpg)
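Representation processing
Expanding
Laplace
$$P_{lap}(v_i \mid w_{k-n+1}, \ldots, w_{k-1}) = \frac{M_{x,y} + 1}{\sum_{j} M_{x,j} + s}$$
Lidstone
$$P_{lid}(v_i \mid w_{k-n+1}, \ldots, w_{k-1}) = \frac{M_{x,y} + \lambda}{\sum_{j} M_{x,j} + \lambda s}$$
Scaling
TF/IDF
term frequency $tf_{ij}$, document frequency $df_i$, N – all documents in system
$$lln(w_i, d_j) = (1 + \log(tf_{ij})) \log\frac{N}{df_i}$$
Attribute present in all documents
$$lln(w_i, d_j) = (1 + \log(tf_{ij})) \log\frac{N}{N} = (1 + \log(tf_{ij})) \cdot 0 = 0$$
Attribute present in one document
$$lln(w_i, d_j) = (1 + \log(tf_{ij})) \log\frac{N}{1} = (1 + \log(tf_{ij})) \log(N)$$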
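A minimal sketch of both operations: Lidstone smoothing (Laplace is the special case with lambda equal to 1) and logarithmic TF/IDF scaling. The function and parameter names are assumptions for illustration, and natural logarithms are assumed because the base is not stated.

```python
import math

def lidstone(count, row_total, n_outcomes, lam=1.0):
    """Smoothed probability; lam=1.0 gives Laplace smoothing."""
    return (count + lam) / (row_total + lam * n_outcomes)

def lln_weight(tf, df, n_docs):
    """Weight (1 + log tf) * log(N / df); zero when the term is absent."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / df)

# Unseen events receive a small non-zero probability.
print(lidstone(0, 10, 5))            # 1/15
print(lidstone(3, 10, 5, lam=0.5))   # 0.28

# A term present in every document carries no weight.
print(lln_weight(5, 10, 10))         # 0.0
# A term present in a single document gets the largest IDF factor, log N.
print(lln_weight(1, 1, 100))
```

Smoothing expands the representation by giving mass to unseen attributes, while scaling shrinks the influence of attributes that occur almost everywhere.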
![Page 36: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/36.jpg)
Attribute selection
Example – Information Gain
$$IG(w_i) = -\sum_{j=1}^{l} P(k_j) \log P(k_j) + P(w_i) \sum_{j=1}^{l} P(k_j \mid w_i) \log P(k_j \mid w_i) + P(\bar{w}_i) \sum_{j=1}^{l} P(k_j \mid \bar{w}_i) \log P(k_j \mid \bar{w}_i)$$
P(wi) – probability of encountering attribute wi in a randomly selected document
P(kj) – probability that a randomly selected document belongs to class kj
P(kj|wi) – probability that a document selected from those containing wi belongs to class kj
$P(\bar{w}_i)$ and $P(k_j \mid \bar{w}_i)$ – the same probabilities for documents that do not contain wi
Statistical tests can also be applied to check whether a feature – class correlation exists
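A minimal sketch of information gain estimated from a labelled document set. Probabilities are estimated as relative frequencies and base-2 logarithms are used; both are assumptions for this example, as are the toy documents.

```python
import math

def plogp(p):
    """p * log2(p), with the usual convention that 0 * log 0 = 0."""
    return p * math.log2(p) if p > 0 else 0.0

def information_gain(docs, term):
    """docs is a list of (set_of_terms, class_label) pairs."""
    n = len(docs)
    classes = {label for _terms, label in docs}
    all_labels = [label for _terms, label in docs]
    with_term = [label for terms, label in docs if term in terms]
    without_term = [label for terms, label in docs if term not in terms]

    # Class entropy: -sum_j P(k_j) log P(k_j)
    ig = -sum(plogp(all_labels.count(k) / n) for k in classes)
    # Conditional parts for documents with and without the term.
    for subset in (with_term, without_term):
        if subset:
            p_subset = len(subset) / n
            ig += p_subset * sum(plogp(subset.count(k) / len(subset)) for k in classes)
    return ig

docs = [({"goal", "match"}, "sport"), ({"goal"}, "sport"),
        ({"vote"}, "politics"), ({"vote", "match"}, "politics")]
print(information_gain(docs, "goal"))   # 1.0, the term separates the classes
print(information_gain(docs, "match"))  # 0.0, the term says nothing about the class
```

Attributes can then be ranked by this value and only the top ones kept.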
![Page 37: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/37.jpg)
Attribute space remapping
• Attribute clustering
• Attribute – class clustering
• Semantic clustering
• Clustering according to density function similarity
• Representation matrix processing (example - SVD)
![Page 38: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/38.jpg)
Applications
• Classic
• Mail analysis and mail routing
• Event tracking
• Internet related
• Web Content Mining and Web Farming
• Focused crawling and assisted browsing
![Page 39: Text Mining Overview Piotr Gawrysiak gawrysia@ii.pw.edu.pl Warsaw University of Technology Data Mining Group 22 November 2001](https://reader035.vdocument.in/reader035/viewer/2022062516/56649e3b5503460f94b2d216/html5/thumbnails/39.jpg)
Thank you