Q & A Based Internet Information Acquisition
DESCRIPTION
Q & A Based Internet Information Acquisition. Xiaoyan Zhu, Tsinghua University, [email protected]. Tsinghua University, founded in 1911: 13 schools and 54 departments, with 7,080 faculty members and 25,173 students. National Lab of Information Science and Technology: 1,000 people.
TRANSCRIPT
Q & A Based
Internet Information Acquisition
Xiaoyan Zhu, Tsinghua University
23/4/21 1
Tsinghua University, founded in 1911
13 Schools and 54 Departments
With 7,080 faculty members and 25,173 students
National Lab of Information Science and Technology
1,000 people
Department of CS
Faculty members: 120
Students: 661 undergraduate, 545 master's, 298 PhD
Joint Research Center: Tsinghua–Waterloo Research Center for Internet Information Acquisition
• Title of the project
  – Breaking the barriers to accessing the internet
• Mission
  – Enable internet information to be widely and conveniently accessed by the disadvantaged: people who do not read English, people who do not have the internet, and visually impaired people.
• Goals
  – To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people.
  – To enable 580 million Chinese cell phone users to access the internet, even if they have no internet connection.
  – To enable 16 million visually impaired people in China, and many more in Canada and the world, to access the internet conveniently.
• Supported by an International Research Chair Initiative project from the International Development Research Centre, Canada (1 million Canadian dollars)
http://ciia.cs.tsinghua.edu.cn/Project_WebSite/index.jsp
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
From http://en.disitu.com/2008/03/10/how-many-website-in-the-world/
There are more than 20 billion web pages in the world, increasing at a rate of about 200,000 a day.
General search engines
More than 20 billion + 5 billion (70?) web pages indexed.
Almost 4,000,000,000 user queries per day.
Almost 1,000,000,000 MB of user-generated data every hour.
Almost 17,600,000 daily visitors.
Vertical search engines
• Travel search
• Shopping (or product) search
• Employment search
• ……
15,000,000 registered users.
1,000,000 daily visitors.
20,000,000 services provided every year.
Community Q&A Systems
Q&A Systems
Input: natural language questions
Return: sophisticated and tidy answers!
QUANTA – Q&A based information acquisition
Q&A Knowledge Power
Traditional search: Requirement → Keywords → URL → Web pages → Information (refine the keywords and repeat)
QUANTA: Requirement → Natural language query → QUANTA → Information
To make information acquisition more intelligent, convenient, and effective. To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people. To enable 580 million Chinese cell phone users to access the internet, even if they have no internet connection.
THU QUANTA
http://166.111.138.87/quanta/index.jsp
QUANTA’s result
• General engines
  – Keyword search; more powerful for shorter queries.
• Vertical engines
  – Question answering; more useful for detailed questions in a specific domain.
• Community Q&A
  – Question answering; more popular for various complex questions.
• Q&A systems
  – Complement to web keyword search.
  – Enhance existing cQA and search services.
  – Leverage existing knowledge in the question and answer forms and their authors.
Get answers automatically from the internet at anytime, anywhere, for everyone!
Summary
Problems
• General search engines: keyword search
  – Powerful for short queries, but not for long queries and questions
  – Loses information; returns irrelevant results
• Vertical search engines: question answering
  – Powerful for specific questions, but domain limited
• cQA systems: question answering
  – Is the "best answer" really the best?
  – Too many redundant answers for the same question.
  – The answer is not available in real time for a new question.
Problems (cont.)
Q&A systems:
• QUANTA: limited by the complexity of the question
• Wolfram Alpha: database based
• Powerset: knowledge based
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Research topics ongoing (from theoretical study to application)
• Information similarity
• Information extraction
• Sentiment analysis, opinion mining
• Information summarization
• Question analysis and classification
• Candidate generation, ranking, evaluation
• Recommendation for related content
Our work
• Information theory based information similarity measures
  – Conditional information distance: distance between named entities under a specific environment
  – Min distance measure: distance between two objects for the partial matching problem
  – Information distance among many objects: comprehensive and typical information selection, e.g. review mining and multi-document summarization
• Information extraction
  – Relationship extraction
  – Redundancy removal
• Summarization
  – Based on information distance
  – Based on graph centrality
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Information Distance
• Kolmogorov complexity
• Dmax(x,y), dmax(x,y)
• Dmax(x,y|c), dmax(x,y|c)
• Dmin(x,y), dmin(x,y)
• Dmax(x1,x2,…)
Information Distance
• Original theories:
  – Information distance: Dmax(x, y)
  – Normalized information distance: dmax(x, y)
• Proposed by our group:
  – Conditional information distance (CID/NCID): Dmax(x, y|c), dmax(x, y|c)
  – Min distance: Dmin(x, y), dmin(x, y)
  – Information distance among many objects (IDMO): Dmax(x1, …, xn)
Motivation of NCID (Normalized Conditional Information Distance)
• Information distance (Bennett, Gács, Li, Vitányi, and Zurek, 1998):

  d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}

This is the normalized ID, named NID, where K(x) is the Kolmogorov complexity of x, and K(x|y) is that of x conditioned on y.

NID is an absolute, universal, and application-independent distance measure between any two sequences. It has been applied to evolution tree construction, language classification, music classification, plagiarism detection, and data mining for images and time sequences such as heart-rhythm data.

• However, when people try to use it with Google, it sometimes becomes difficult and meaningless. For example, "fan" vs. "CPU" and "fan" vs. "star": their NIDs are 0.60 and 0.58, respectively, which are almost the same and tell us nothing.
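K is not computable, so in practice NID is approximated through a compressor. A minimal sketch (our own illustration, not from the talk; the helper names `C` and `ncd` are ours) that substitutes zlib-compressed length for K:

```python
import zlib

def C(s: str) -> int:
    # Approximate the (uncomputable) Kolmogorov complexity K(s)
    # by the length of the zlib-compressed string.
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance: a computable stand-in for
    # d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)},
    # using C(xy) - min{C(x), C(y)} to approximate the numerator.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```

Similar strings compress well together, giving a small distance; unrelated strings share no regularities, so the distance approaches 1.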
Experiment results of NCID

x    y     condition     NID
fan  CPU   –             0.6076
fan  star  –             0.5832
fan  CPU   temperature   0.3527
fan  star  temperature   0.6916
fan  CPU   Hollywood     0.8258
fan  star  Hollywood     0.6598

Terms closest to x = fish:

x     Condition: ISA   No condition
fish  salmon           salmon
fish  carp             tail
fish  shark            shark
fish  whale            whale
fish  cuttlefish       carp
fish  fin              cuttlefish
fish  tail             fin

Regular expression
Approximation of NCID

  d_c(x,y) = NCID(x,y) = [max{log f(x,c), log f(y,c)} − log f(x,y,c)] / [log f(c) − min{log f(x,c), log f(y,c)}]

where f(x) is the number of elements in which x occurs, and f(x,c) is the number of elements in which x and c both occur; f(y,c) and f(x,y,c) are similarly defined.

From the definition, 0 ≤ f(x,y,c) ≤ f(x,c), f(y,c) ≤ f(c).
When f(x,y,c) = 0:
  if f(x,c) · f(y,c) > 0, then d_c(x,y) = ∞ (infinite);
  otherwise, d_c(x,y) is undefined.
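The approximation needs only four co-occurrence counts (e.g. web page hit counts). A direct transcription, including the zero-count cases; the function name is ours, and the counts in the usage below are hypothetical:

```python
import math

def ncid(f_xc: int, f_yc: int, f_xyc: int, f_c: int) -> float:
    # d_c(x,y) = (max{log f(x,c), log f(y,c)} - log f(x,y,c))
    #          / (log f(c) - min{log f(x,c), log f(y,c)})
    if f_xyc == 0:
        if f_xc * f_yc > 0:
            return math.inf          # x and y never co-occur under c
        return float("nan")          # undefined, as on the slide
    num = max(math.log(f_xc), math.log(f_yc)) - math.log(f_xyc)
    den = math.log(f_c) - min(math.log(f_xc), math.log(f_yc))
    return num / den
```

For example, `ncid(1000, 2000, 500, 1_000_000)` models two terms that co-occur often under the condition and therefore come out close (a small distance).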
Motivation of min Distance

Partial matching problem
  – Example: What city is Lake Washington by? (Seattle, Bellevue, Kirkland)
• Seattle, the correct and most popular answer, carries much more private information than the other answers, relative to the information it shares with "Lake Washington".
  – Problem of the max distance
• The information shared between the question and the answer is "diluted" by the private information of the answer, which makes the system select an unpopular candidate, Bellevue, which has "dense" shared information.
• Can we remove the irrelevant information within a coherent theory and give the most popular city, Seattle, a chance?
min Distance measure
• Define Dmin and dmin:

  Dmin(x,y) = min{K(x|y), K(y|x)}
  dmin(x,y) = min{K(x|y), K(y|x)} / min{K(x), K(y)}

  dmin(x,y|c) = [K(x,y,c) − max{K(x,c), K(y,c)}] / [min{K(x,c), K(y,c)} − K(c)]

• Properties of the min distance
  • Universal
  • No partial matching problem
  • Nonnegative and symmetric
  • Does not satisfy the triangle inequality

Publications: ACM KDD-07, Bioinformatics, JCST
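Under the same compression approximation used for NID, the min distance can be sketched as follows (our own illustration; with K(x|y) ≈ C(xy) − C(y), the numerator min{K(x|y), K(y|x)} becomes C(xy) − max{C(x), C(y)}). Note how a long answer's private information no longer dilutes the score:

```python
import zlib

def C(s: str) -> int:
    # Compressed length as a computable proxy for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def dmin(x: str, y: str) -> float:
    # d_min(x,y) = min{K(x|y), K(y|x)} / min{K(x), K(y)}.
    # With K(x|y) ~ C(xy) - C(y), the numerator becomes
    # C(xy) - max{C(x), C(y)}; clamp at 0 to absorb compressor overhead.
    cx, cy = C(x), C(y)
    return max(0, C(x + y) - max(cx, cy)) / min(cx, cy)
```

A long, information-rich answer that still contains the question's content stays close to the question, while an unrelated string of similar length does not.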
min Distance's problem

d(human, horse) > d(human, Centaur) + d(Centaur, horse)

An information distance must reflect what "we think" to be similar, and what "we think" to be similar apparently does not really satisfy the triangle inequality. The min distance is reasonable and has been successful in applications.
Motivation of IDMO
• In many data mining applications, we are more interested in mining information from many, not just two, information-carrying entities, for example:
  – What is the public opinion on the United States presidential election, from the blogs?
  – What do the customers say about a product, from the reviews?
  – Which article, among many, covers the news most comprehensively? Or is most typical of one particular news item?
IDMO measure
• Define Dmax(x1, …, xn):

  Dmax(x1,...,xn) = Em(x1,...,xn) = min{ |p| : U(xi, p, j) = xj, for all i, j }

• Conditional Dmax(x1, …, xn):

  Dmax(x1,...,xn|c) = Em(x1,...,xn|c) = min{ |p| : U(xi, p, j | c) = xj, for all i, j }

  min_i K(x1 x2 … xn | xi, c) ≤ Em(x1, …, xn | c) ≤ min_i Σ_{k≠i} Dmax(xi, xk | c)

• Most representative object
  – Left-hand side: the most "comprehensive" object, which contains the most information about all of the others
  – Right-hand side: the most "typical" object, which is similar to all of the others
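The right-hand side of the bound suggests a simple procedure for picking the most "typical" object: minimize the summed pairwise distance to all the others. A sketch using the compression-based distance as a computable stand-in (helper names and the toy reviews in the usage are ours):

```python
import zlib

def C(s: str) -> int:
    # Compressed length as a computable proxy for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance as a stand-in for d_max.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def most_typical(docs: list) -> int:
    # Right-hand side of the E_m bound: pick the object whose summed
    # pairwise distance to all the others is smallest.
    return min(range(len(docs)),
               key=lambda i: sum(ncd(docs[i], docs[k])
                                 for k in range(len(docs)) if k != i))
```

Given two near-duplicate product reviews and one off-topic review, the procedure picks one of the near-duplicates as the typical one.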
Document Summarization by IDMO

Multi-document summarization
  – Generate the most "typical" summary according to the right-hand side of Em
  – Ranked top 1 under the overall responsiveness measure on both TAC 2008 (33 research groups, 58 submissions) and TAC 2009 (26 research groups, 52 submissions)

Publications: ICDM-09
Review Mining by IDMO

Comprehensive and typical review selection
  – Select the most "comprehensive" and the most "typical" reviews
  – We have studied the relationship between a review's sentiment rating and its textual content and developed a rating estimation system
  – The system based on our theory is very helpful for customers

Publications: ACM CIKM-08, WI-09
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Information Extraction
• Relationship extraction
  – Supervised learning
  – Unsupervised learning
Interaction extraction
• Statistical algorithms
  – Low precision, high recall
• Pattern matching algorithms
  – Manual pattern generation
    • High precision, low recall
    • Bad at generalization
  – Automatic pattern generation
    • Good balance between precision and recall
    • Good at generalization
Pattern generation and optimization
• Pattern generation: extract patterns automatically.
  – Dynamic alignment / dynamic programming algorithm
• Pattern optimization: reduce and merge patterns to increase generalization power, and hence the recall and precision rates.
  – Supervised machine learning algorithm
    • Approach based on the MDL principle
    • Data labeling is required
  – Semi-supervised machine learning algorithm
    • Approach with a ranking function and a heuristic evaluation algorithm
    • Relatively little data labeling is required
Pattern Set
• The best pattern set should:
  – produce the least error in the output and the least redundancy in the pattern set;
  – match the maximum number of sentences with at least one pattern.
• The problem is how to strike the balance: a trade-off between the complexity of the model (the pattern set) and the fitness of the model to the data (shown by the performance of the system).
Pattern optimization (MDL based)
• What is the Minimum Description Length principle?
  – Proposed by Rissanen in 1978 as a tool to solve the trade-off problem between generalization and accuracy.
  – The MDL principle can be applied without the analytical form of the risk function.
  – The MDL principle states that given some data D, the best model (or theory) M_MDL in the set M of all models consistent with the data is the one that minimizes the sum of the length in bits of the description of the model and the length in bits of the description of the data with the aid of the model:

    M_MDL = argmin_{M ∈ M} [ l(M) + l(D|M) ]

  where l(M) and l(D|M) denote, respectively, the description length of the model M and that of the data D using the model M.
Implementation of the MDL principle
• The MDL principle can be viewed from the point of view of Kolmogorov complexity (Li and Vitányi, 1997):

    M_MDL = argmin_{M ∈ M} [ K(M) + K(D|M) ]

  where K(·) is the Kolmogorov complexity, which is also not computable. The MDL principle looks for an optimal balance between the regularities (in the model) and the randomness remaining in the data, that is, a trade-off between the complexity of the model and the fitness of the model to the data.
• The problem is how to make the principle computable.
MDL based algorithm
• Examples of patterns (tag sequences):
  – {PTN VBZ IN PTN: *; binds, associates; to, with; *}
  – {PTN VBZ PTN: *; binds, associates, activate; *}
• Pattern set P = {p1, p2, …, pn}, and pattern pi = mi1 mi2 … miJi
• Finally, the optimal pattern set P* is obtained as follows:

    P* = argmin_P [ K(P) + log2 d(I, I*) ]

    K(P) = Σ_{i=1..n} Σ_j |mij|,   where |mij| = 1 if mij = PTN, and γ / c(mij) otherwise

    d(I, I*) = Σ_{i=1..n} d(Ii, Ii*),   where d(Ii, Ii*) = 1 if Ii ≠ Ii*, and 0 if Ii = Ii*

  where γ is a constant, and c(mij) is the number of words involved in the jth component of pattern pi. I* is the expected interaction set while I is the set extracted by pattern set P, so d(I, I*) is the number of differences between I and I*.
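The optimization can be approximated greedily: repeatedly delete the pattern whose removal does not increase the total description length. A toy sketch of that loop, in which a unit cost per pattern component stands in for the |mij| weighting and an `errors` callback stands in for d(I, I*) (both simplifications are ours, not the paper's):

```python
import math

def description_length(patterns, errors):
    # L(P) = K(P) + log2 d(I, I*): model cost plus data-fit cost.
    # K(P) is simplified to one unit per pattern component; the
    # errors(patterns) callback plays the role of d(I, I*).
    e = errors(patterns)
    return sum(len(p) for p in patterns) + (math.log2(e) if e > 0 else 0.0)

def mdl_prune(patterns, errors):
    # Greedily delete a pattern whenever deletion does not increase L(P).
    current = list(patterns)
    improved = True
    while improved and len(current) > 1:
        improved = False
        base = description_length(current, errors)
        for p in current:
            trial = [q for q in current if q is not p]
            if description_length(trial, errors) <= base:
                current = trial
                improved = True
                break
    return current
```

A pattern whose deletion causes a large jump in extraction errors survives, while redundant patterns are pruned away; this mirrors the precision/recall behavior shown in the experiment plots.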
Experiment results (1)
[Figure: the MDL value (top, 700–1200) and the precision and recall of the system (bottom, 0.2–0.9) plotted against the number of deleted patterns, from 1 to 190.]
• The x-coordinate is the number of deleted patterns;
• the y-coordinates are the MDL value, and the precision and recall of the system, respectively.
Experiment results (2)
Training set
Algorithm  Pattern Number  Precision  Recall  F-Score
Original   192             64.6%      57.2%   60.68%
ERM        134             82.7%      55.7%   66.57%
MDL        14              80.7%      61.1%   69.55%

Test set
Algorithm  Pattern Number  Precision  Recall  F-Score
Original   192             63.5%      57.3%   60.24%
ERM        134             77.9%      53.3%   63.29%
MDL        14              79.8%      59.5%   68.17%

Publications: Bioinformatics 2004, 2005
Pattern optimization (semi-supervised)
• Why a semi-supervised algorithm is proposed
  – There are many kinds of relationships among proteins, genes, and diseases.
  – Data annotation is too expensive.
• The key points are the ranking function and the evaluation algorithm
Semi-supervised Algorithms
• Novel ranking function:

    HD(p) = (log2 [(p.positive + 0.5) / (p.negative + 0.5)] − β) · ln(p.positive + p.negative + 1)

  – where p.positive indicates the number of correct instances matched by the pattern p and p.negative denotes the number of false instances. The parameter β is a threshold that controls p/n. If p/n > 2^β, HD is an increasing function of (p + n), which means that if several patterns have the same p/n exceeding 2^β, a pattern with larger (p + n) has a higher rank and is more likely to be kept. If p/n < 2^β, the first term is negative, which means that a pattern with larger (p + n) will have a lower rank. Thus different ranking strategies are used for different p/n.
• Heuristic Evaluation Algorithm (HEA)
  – Reduces redundancy among patterns with an optimization function
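The ranking function is easy to compute from a pattern's match statistics. A direct transcription (the default β = 2.0 is an arbitrary choice of ours for illustration, not a value from the talk):

```python
import math

def hd(p_positive: int, p_negative: int, beta: float = 2.0) -> float:
    # HD(p) = (log2((p.positive + 0.5) / (p.negative + 0.5)) - beta)
    #         * ln(p.positive + p.negative + 1)
    ratio = (p_positive + 0.5) / (p_negative + 0.5)
    return (math.log2(ratio) - beta) * math.log(p_positive + p_negative + 1)
```

Above the threshold, more matches raise the rank; below it, more matches lower the rank, just as the slide describes.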
Experiment results

[Figure: F1 score (0.24–0.48) versus the number of patterns (from 2480 down to 80) for the Ripper, Riloff, HD, and HEA methods.]
Method          Patterns  Precision  Recall  F1 score  Impr. of F1
Baseline (raw)  2480      19.0%      46.5%   27.0%     –
Ripper          1626      41.0%      40.8%   40.9%     +49.3%
Riloff          88        40.8%      44.5%   42.6%     +55.5%
HD              92        52.5%      38.8%   44.5%     +62.4%
HEA             72        43.5%      45.9%   45.5%     +66.1%

Published in Bioinformatics, Medical Informatics, and APBC'07
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Background
• Too much information
  – A concise & coherent summary is needed
• Generative summarization
  – Language generation (re-phrases meanings)
  – Very difficult to express semantics
• Extractive summarization
  – Extracts key sentences
Previous Studies
• Statistical approaches (Nomoto, 2001; …)
• Linguistic techniques (Nakao, 2000; …)
• Graph-based methods
  – LexRank (Erkan & Radev, 2004)
  – TextRank (Mihalcea & Tarau, 2004)
  – Query-specific document summarizer (Varadarajan & Hristidis, 2006)
  – Many more …
System I: Information Distance Based
• Problem reformulation: given cluster A with m documents A1, A2, …, Am, the update summarization task for cluster B = {B1, B2, …, Bn} should solve:

  Min{ Dmax(S, B1B2…Bn | A1A2…Am) },  |S| ≤ Θ

  where S = {s1, s2, …, sk} is a set in which each si is a sentence selected for the summary
Problem Reformulation
• K(AB) = K(A∪B), K(A|B) = K(A\B)
• Dmax(S, B1B2…Bn | A1A2…Am) = K((B1B2…Bn \ A1A2…Am) \ S1S2…Sk)
• Min{ Dmax(S, B1B2…Bn | A1A2…Am) } = Max{ K(S1S2…Sk) }
Approximation
• How to compute K(S1S2…Sk)?
• Assumption: each important term carries one unit of information; then K(S) = |S| (the cardinality), where S = {t1, t2, t3, …, tn} and the tk are important terms
• Important terms
  – Non stop-words
  – Named entities (person, org., loc., date, …)
  – Terms with high document frequency
Approximation (cont.)
• Select one representative sentence s for each document D by: argmin_s{ Dmax(s, T), s ∈ D }
  T: the union of the topic title and narrative
• Remove redundant representative sentences:
  – 8 continuous common words
  – 60% common words
Generate Summary
• With the representative sentences, select a subset that maximizes K(…):
  – Compute all combinations of sentences within the length limit
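The approximation K(S) = |S| and the subset search can be sketched end-to-end. Everything below (the term sets, whitespace tokenization, and a character-length budget) is a toy simplification of ours, not the system's actual preprocessing:

```python
from itertools import combinations

def info_content(sentences, important_terms):
    # K(S1...Sk) under the slide's assumption: each important term
    # carries one unit of information, counted once.
    covered = set()
    for s in sentences:
        covered |= set(s.lower().split()) & important_terms
    return len(covered)

def best_summary(sentences, important_terms, length_limit):
    # Enumerate all sentence subsets within the length limit and keep
    # the one covering the most important terms (max K(S1...Sk)).
    best, best_score = [], -1
    for r in range(1, len(sentences) + 1):
        for combo in combinations(sentences, r):
            if sum(len(s) for s in combo) > length_limit:
                continue
            score = info_content(combo, important_terms)
            if score > best_score:
                best, best_score = list(combo), score
    return best
```

Exhaustive enumeration is feasible here only because one representative sentence per document keeps the candidate set small.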
Publication: IEEE ICDM-09
Results
TAC 2008, Update Summarization
Evaluation Method                          Best   Our Result  Rank
Average Modified Score                     0.336  0.309       5/58
Macroaverage Modified Score with 3 models  0.331  0.304       5/58
Average Linguistic Quality                 3.333  2.958       3/58
Average Overall Responsiveness             2.667  2.667       1/58

TAC 2009, Summarization
                        Traditional cluster      Update cluster
Evaluation Method       Results  Rank            Results  Rank
AVG Modified Score      0.311    9               0.296    4/52
MacroAVG Modified       0.316    9               0.292    4/52
AVG Linguistic Quality  5.682    3               5.886    1/52
AVG Overall Resp.       4.955    2               5.023    1/52
System II: graph model based
• It may be a new solution for text representation
  – compared with Bag-of-Words
• Iterating on the graph can propagate very distant dependence
• Key points: define the nodes, the edges, and the computation
Graph-based Update Summarization
① Select the most salient terms
② Build the term-sentence matrix W
③ Use the LSI sentence-sentence similarity matrix SIM
④ Construct a graph based on SIM
⑤ Compute the graph centrality (power iteration algorithm)
⑥ Select the top 15 sentences with the highest centrality
⑦ Compute all combinations within the length limit
⑧ Score each summary as a whole, and keep the best
⑨ Re-order the sentences within the summary
Graph centrality
• Centrality measure: which is the most important node in a graph?
  – Degree centrality
  – Eigenvector centrality
• Suppose the centrality of sentence s is Cs; then

    Cs = Σ_{r ∈ D, r ≠ s} Sim(s, r) · Cr

  – Important connections make the node itself more important
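The power-iteration centrality computation can be sketched directly on a small similarity matrix (the matrix values in the usage below are made up; plain nested lists are used instead of a matrix library):

```python
def centrality(sim, iters=100):
    # Power iteration: C converges to the principal eigenvector of the
    # similarity matrix, so sentences strongly connected to other
    # central sentences get high scores.
    n = len(sim)
    c = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(sim[r][s] * c[r] for r in range(n) if r != s)
               for s in range(n)]
        norm = sum(new) or 1.0          # normalize to keep values bounded
        c = [v / norm for v in new]
    return c
```

The sentence with the strongest similarity links ends up with the highest centrality, which is what step ⑥'s top-15 selection relies on.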
Tailored to Update Summarization
• Problem: given cluster A, summarize cluster B
• Sentences in cluster B that merely repeat cluster A should be penalized:

    Cs = Σ_{r ∈ B, r ≠ s} Sim(s, r) · Cr − Σ_{r ∈ A, r ≠ s} Sim(s, r) · Cr

• Matrix form:

    SIM = [ SIM_AA  SIM_AB ]        SIM' = [  SIM_AA  SIM_AB ]
          [ SIM_BA  SIM_BB ]               [ −SIM_BA  SIM_BB ]
How to score a term and a summary?
• Score a term
  – By the position of a word (headline, first sentence)
  – With manually tuned parameters:

    Score(w) = tf(w)^0.4 · Fd(df(w))
    Power(w) = 0.15 + 0.45 · (Score(w) / maxscore)^2

• Score a summary
  – By the term frequency of each word and the centrality of each sentence:

    SumScore(S) = Σ_{w ∈ S} count(w)^Power(w)
Update Summarization Results (TAC 2008)

Evaluation Method                          Best Result  Our Result  Rank
Average Modified Score                     0.336        0.304       7/58
Macroaverage Modified Score with 3 models  0.331        0.299       7/58
Average Linguistic Quality                 3.333        3.073       2/58
Average Overall Responsiveness             2.667        2.667       1/58
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Challenges
• Is there a universal semantic information similarity measure for all types of information?
• How to organize and represent knowledge in order to reuse it.
• How to combine rules and statistical methods to get sophisticated and tidy answers with comprehensive information.
• How to integrate all kinds of information resources to achieve a domain-independent QA system.
• ……
  – Goal: provide a scalable question answering service
Acknowledgement
• Prof. Ming Li, Dr. Minlie Huang, Dr. Yu Hao
• Xian Zhang, Chong Long, Feng Jin, Zhicheng Zheng, Fan Bu
• Jianshu Sun, Yang Tang
• Shouyuan Chen, Yuanming Yu
Q&A
Thanks !