Web-based Information Architectures
Jian Zhang
Today’s Topics
• Term Weighting Scheme
• Vector Space Model & GVSM
• Evaluation of IR
• Rocchio Feedback
• Web Spider Algorithm
• Text Mining: Named Entity Identification
• Data Mining
• Text Categorization (kNN)
Term Weighting Scheme
• TW = TF * IDF
– TF part = f1(tf(term, doc))
– IDF part = f2(idf(term)) = f2(N/df(term))
– E.g., f1(tf) = normalized_tf = tf/max_tf; f2(idf) = log2(idf)
– E.g., f1(tf) = tf; f2(idf) = 1
NOTE the definition of DF: df(term) is the document frequency, i.e. the number of documents in the collection that contain the term (N is the total number of documents).
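A minimal sketch of this scheme in Python, using the normalized-TF and log2-IDF variant from the first example above (the function name and toy corpus are illustrative, not from the lecture):

    import math
    from collections import Counter

    def tfidf_vector(doc_terms, corpus, n_docs):
        # TW = TF * IDF with f1(tf) = tf/max_tf and f2(idf) = log2(N/df)
        tf = Counter(doc_terms)
        max_tf = max(tf.values())
        # df(term): number of documents in the collection containing the term
        def df(term):
            return sum(1 for d in corpus if term in d)
        return {t: (tf[t] / max_tf) * math.log2(n_docs / df(t)) for t in tf}

    corpus = [{"rain", "wind", "storm"}, {"game", "score", "win"}, {"rain", "win"}]
    print(tfidf_vector(["rain", "rain", "storm"], corpus, len(corpus)))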
Document & Query Representation
• Bag of words, Vector Space Model (VSM)
• Word normalization
– Stopword removal
– Stemming
• Proximity phrases
• Each element of the vector is the term weight of that term w.r.t. the document/query.
Similarity Measure
• Dot Product:
u · v = Σi ui·vi,  i = 1, …, n
where u = [u1, u2, …, un] and v = [v1, v2, …, vn]
Similarity Measure
• Cosine Similarity:
cos(u, v) = (u · v) / (|u| |v|)
where |u| = sqrt(Σi ui²) and |v| = sqrt(Σi vi²), i = 1, …, n
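Both measures are short enough to write out directly; a sketch in Python (the vector values are illustrative):

    import math

    def dot(u, v):
        # u . v = sum over i of ui * vi
        return sum(ui * vi for ui, vi in zip(u, v))

    def cosine(u, v):
        # cos(u, v) = (u . v) / (|u| |v|)
        return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

    u = [1.0, 2.0, 0.0]
    v = [2.0, 1.0, 1.0]
    print(dot(u, v))      # 4.0
    print(cosine(u, v))   # 4 / (sqrt(5) * sqrt(6)) ~= 0.73

Cosine divides out the vector lengths, which is exactly the "normalized" property noted under Information Retrieval below.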
Information Retrieval
• Basic assumption: relevance is signaled by words shared between the query and the document
• Similarity measures
– Dot product
– Cosine similarity (normalized)
Evaluation
• Recall = a/(a+c)
• Precision = a/(a+b)
• F1 = 2.0*recall*precision / (recall+precision)
where a = relevant & retrieved, b = non-relevant & retrieved, c = relevant & not retrieved
• Accuracy – bad for IR: almost all documents are non-relevant, so a system that retrieves nothing can still score near 100%
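In code, with a, b, c as the contingency-table counts defined above (a hedged sketch; the example numbers are made up):

    def ir_metrics(a, b, c):
        # a = relevant & retrieved, b = non-relevant & retrieved,
        # c = relevant & not retrieved
        recall = a / (a + c)
        precision = a / (a + b)
        f1 = 2.0 * recall * precision / (recall + precision)
        return recall, precision, f1

    print(ir_metrics(a=30, b=20, c=10))  # recall 0.75, precision 0.6, F1 ~0.67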
Refinement of VSM
• Query expansion
• Relevance feedback
– Rocchio formula: …
– Alpha, beta, gamma and their meanings (see the standard formulation below)
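The formula itself is elided on the slide; the standard textbook Rocchio update (treat it as a reference formulation rather than the slide's exact version) is, in LaTeX notation:

    Q' = \alpha Q
       + \beta  \frac{1}{|D_r|} \sum_{d \in D_r} d
       - \gamma \frac{1}{|D_n|} \sum_{d \in D_n} d

Here Q is the original query vector, D_r the documents the user judged relevant, and D_n those judged non-relevant; alpha, beta, and gamma weight how much trust is placed in the original query, the positive feedback, and the negative feedback, respectively.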
Generalized Vector Space Model
• Given a collection of training data, represent each term as an n-dimensional vector (one weight per training document):

     D1   D2   …   Dj   …   Dn
T1   w11  w12  …   w1j  …   w1n
T2   w21  w22  …   w2j  …   w2n
…    …    …    …   …    …   …
Ti   wi1  wi2  …   wij  …   win
…    …    …    …   …    …   …
Tm   wm1  wm2  …   wmj  …   wmn
GVSM (2)
• Define similarity between terms ti and tj:
Sim(ti, tj) = cos(ti, tj)
• Similarity between query and document is based on term-term similarity:
– For each query term qi, find the term tD in the document D that is most similar to qi. This value, viD, can be considered the similarity between the single-word query qi and the document D.
– Sum up the similarities between each query term and the document D. This sum is the similarity between the query and the document D.
GVSM (3)
Sim(Q, D) = Σi [Maxj (sim(qi, dj))]
or, normalizing for document & query length:
Simnorm(Q, D) = Σi [Maxj (sim(qi, dj))] / (|Q| |D|)
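A sketch of this matching step in Python, assuming each term's vector (its row in the matrix above) is available in a dict; all names are illustrative:

    import math

    def cos(u, v):
        num = sum(a * b for a, b in zip(u, v))
        den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return num / den if den else 0.0

    def gvsm_sim(query_terms, doc_terms, term_vectors):
        # Sim(Q, D) = sum over query terms qi of the best
        # term-term cosine against any document term dj
        return sum(
            max(cos(term_vectors[q], term_vectors[d]) for d in doc_terms)
            for q in query_terms
        )

Here term_vectors[t] is term t's row of weights (wt1, …, wtn) across the n training documents.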
Maximal Marginal Relevance
• Redundancy reduction
• Selecting results that are more novel
• Formula:
MMR(Q, C, R) = Argmax over di in C of [λ·S(Q, di) − (1−λ)·Max over dj in R of S(di, dj)]
where C is the candidate set, R the set already selected, and λ trades off relevance against novelty
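A greedy selection step for this formula in Python (a sketch: sim stands in for the measure S and is passed in as a function):

    def mmr_select(query, candidates, selected, sim, lam=0.7):
        # Balance relevance to the query (weight lambda) against
        # similarity to anything already selected (weight 1 - lambda)
        def score(d):
            redundancy = max((sim(d, s) for s in selected), default=0.0)
            return lam * sim(query, d) - (1.0 - lam) * redundancy
        return max(candidates, key=score)

Calling this repeatedly, moving each pick from the candidate set C into the selected set R, produces exactly the kind of summary built step by step in the example below.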
MMR Example (Summarization)
[Diagram: a full text of sentences S1–S6 is summarized against a query; the resulting summary contains S3, S1, and S4.]
MMR Example (Summarization): Select first sentence (λ = 0.7)
[Diagram: query-sentence similarities Sim(Q, S) = Q · S / (|Q||S|) for S1–S6 are 0.4, 0.3, 0.6, 0.2, 0.2, 0.3; S3 scores highest and is selected first.]
MMR Example (Summarization): Select second sentence
[Diagram: with S3 already in the summary, the remaining sentences are re-scored by MMR (scores shown include 0.15, 0.1, 0.2, 0.5, 0.5); S1 is selected next.]
MMR Example (Summarization): Select third sentence
[Diagram: with S3 and S1 in the summary, the remaining sentences are re-scored (scores shown include 0.2, 0.1, 0.4, 0.6); S4 is selected, giving the summary {S3, S1, S4}.]
Text Categorization
Task
• You want to classify a document into some categories automatically. For example, the categories might be "weather" and "sport".
• To do that, you can use the kNN algorithm.
• To use kNN, you need a collection of documents, each of which has been labeled with categories by a human.
Text Categorization
Procedure
• Using VSM, represent each document in the training data.
• Using VSM, represent the document to be categorized (the new document).
• Use cosine similarity (or some other measure, but cosine works well here because it normalizes away differences in document length) to find the top k documents (the k nearest neighbors) in the training data that are most similar to the new document.
• Decide from the k nearest neighbors what the categories of the new document are (a sketch follows below).
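A sketch of this procedure in Python, assuming documents are already VSM vectors stored as dicts of term weights (all names here are illustrative):

    import math
    from collections import Counter

    def cosine(u, v):
        num = sum(u[t] * v[t] for t in set(u) & set(v))
        den = (math.sqrt(sum(w * w for w in u.values()))
               * math.sqrt(sum(w * w for w in v.values())))
        return num / den if den else 0.0

    def knn_categorize(new_doc, training_data, k=3):
        # training_data: list of (vector, category) pairs labeled by a human.
        # Take the k training documents most similar to the new one
        # and return the majority category among them.
        neighbors = sorted(training_data,
                           key=lambda pair: cosine(new_doc, pair[0]),
                           reverse=True)[:k]
        votes = Counter(category for _, category in neighbors)
        return votes.most_common(1)[0][0]

Other decision rules are possible in the last step, e.g. weighting each neighbor's vote by its similarity instead of counting votes equally.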
Web Spider
• The web graph at any instant of time contains k connected subgraphs (components), so a crawl from one seed cannot reach every page
• The spider algorithm given in class is a depth-first search through a web subgraph
• Avoid re-spidering the same page
• Completeness is not guaranteed. A partial solution is to pick seed URLs that are as diverse as possible.
Web Spider
PROCEDURE SPIDER4(G, {SEEDS})
    Initialize COLLECTION <big file of URL-page pairs>
    Initialize VISITED <big hash-table>
    For every ROOT in SEEDS
        Initialize STACK <stack data structure>
        Let STACK := push(ROOT, STACK)
        While STACK is not empty,
            Do URLcurr := pop(STACK)
            Until URLcurr is not in VISITED
            insert-hash(URLcurr, VISITED)
            PAGE := look-up(URLcurr)
            STORE(<URLcurr, PAGE>, COLLECTION)
            For every URLi in PAGE,
                push(URLi, STACK)
    Return COLLECTION
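A rough Python translation of SPIDER4 (a sketch, not the lecture's code: fetching and link extraction are deliberately naive, and like the pseudocode it has no depth or page limit, so add one before pointing it at the live web):

    import re
    from urllib.request import urlopen

    def spider4(seeds):
        collection = {}   # the "big file" of URL-page pairs
        visited = set()   # the "big hash-table"
        for root in seeds:
            stack = [root]
            while stack:
                url = stack.pop()        # LIFO pop makes this depth-first
                if url in visited:
                    continue             # avoid re-spidering the same page
                visited.add(url)
                try:
                    page = urlopen(url).read().decode("utf-8", errors="replace")
                except OSError:
                    continue             # unreachable page; skip it
                collection[url] = page
                # naive link extraction; a real spider would parse the HTML
                stack.extend(re.findall(r'href="(http[^"]+)"', page))
        return collection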
Text Mining
Components of Text Mining
• Categorization by topic or genre
• Fact extraction from text
• Data Mining from DBs or extracted facts
Fact extraction from text
• Named Entity Identification
– FSA/FST, HMM
• Role-Situated Named Entities
– Apply context information
• Information Extraction
– Template matching
Named Entity Identification
Definition of a Finite State Acceptor (FSA)
• With an input source (e.g. a string of words)
• Outputs "YES" or "NO"
Definition of a Finite State Transducer (FST)
• An FSA with variable binding
• Outputs "NO" or "YES" + variable bindings
• Variable bindings encode the recognized entity,
e.g. "YES <firstname Hideto> <lastname Suzuki>"
Named Entity Identification
Example. Identify numbers:
1, 2.0, -3.22, +3e2, 4e-5
D = {0,1,2,3,4,5,6,7,8,9}
[State diagram: from Start, an optional sign (+/−) leads into a digit loop over D; a "." leads to a fractional digit loop; an "e" followed by an optional sign leads to an exponent digit loop. Accepting states are those ending on a digit.]
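The same machine written out in Python, with made-up state names (an acceptor in the slide's sense: it only answers "YES" or "NO"):

    def accepts_number(s):
        D = set("0123456789")
        state = "start"
        for ch in s:
            if state == "start":
                state = "int" if ch in D else ("sign" if ch in "+-" else "reject")
            elif state == "sign":
                state = "int" if ch in D else "reject"
            elif state == "int":                  # digits before '.' or 'e'
                if ch in D: state = "int"
                elif ch == ".": state = "dot"
                elif ch == "e": state = "e"
                else: state = "reject"
            elif state == "dot":
                state = "frac" if ch in D else "reject"
            elif state == "frac":                 # digits after '.'
                if ch in D: state = "frac"
                elif ch == "e": state = "e"
                else: state = "reject"
            elif state == "e":
                state = "exp" if ch in D else ("esign" if ch in "+-" else "reject")
            elif state == "esign":
                state = "exp" if ch in D else "reject"
            elif state == "exp":                  # digits of the exponent
                state = "exp" if ch in D else "reject"
            else:
                return "NO"
        return "YES" if state in {"int", "frac", "exp"} else "NO"

    for s in ["1", "2.0", "-3.22", "+3e2", "4e-5", "3.", "e5"]:
        print(s, accepts_number(s))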
Data Mining
• Learning by caching
– What/when to cache
– When to use/invalidate/update the cache
• Learning from Examples (a.k.a. "supervised" learning)
– Labeled examples for training
– Learn the mapping from examples to labels
– E.g.: Naive Bayes, Decision Trees, ...
– Text Categorization (using kNN or other means) is a learning-from-examples task
Data Mining
• "Speedup" Learning
– Tuning search heuristics from experience
– Inducing explicit control knowledge
– Analogical learning (generalized instances)
• Optimization "policy" learning
– Predicting continuous objective function
– E.g. Regression, Reinforcement, ...
• New Pattern Discovery (a.k.a. "unsupervised" learning)
– Finding meaningful correlations in data
– E.g. association rules, clustering, ...
Generalize vs. Specialize
• Generalize:
– First, each record in your database is a RULE
– Then, generalize (how? when to stop?)
• Specialize:
– First, give a very general rule (almost useless)
– Then, specialize (how? when to stop?)
Methods for Supervised DM
Classifiers
• Linear Separators (regression)
• Naive Bayes (NB)
• Decision Trees (DTs)
• k-Nearest Neighbor (kNN)
• Decision rule induction
• Support Vector Machines (SVMs)
• Neural Networks (NNs)
...