Q & A Based Internet Information Acquisition
DESCRIPTION
Q & A Based Internet Information Acquisition. Xiaoyan Zhu, Tsinghua University, [email protected]. Tsinghua University, founded in 1911: 13 schools and 54 departments, with 7,080 faculty members and 25,173 students. National Lab of Information Science and Technology: 1,000 people.
TRANSCRIPT
Q & A Based
Internet Information Acquisition
Xiaoyan Zhu, Tsinghua University
23/4/21 1
Tsinghua University, founded in 1911
13 Schools and 54 Departments
With 7,080 faculty members and 25,173 students
National Lab of Information Science and Technology
1,000 people
Department of CS
Faculty members: 120
Students: 661 undergraduate, 545 master's, 298 PhD
Joint Research Center: Tsinghua–Waterloo Research Center for Internet Information Acquisition
• Title of the project
  – Breaking the barriers to accessing the internet
• Mission
  – Enable internet information to be widely and conveniently accessed by the disadvantaged: people who do not read English, people who do not have the internet, and visually impaired people.
• Goals
  – To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people.
  – To enable 580 million Chinese cell phone users to access the internet, even if they have no internet connection.
  – To enable 16 million visually impaired people in China, and many more in Canada and the world, to access the internet conveniently.
• Supported by an International Research Chair Initiative project from the International Development Research Centre, Canada (1 million Canadian dollars)
http://ciia.cs.tsinghua.edu.cn/Project_WebSite/index.jsp
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
From http://en.disitu.com/2008/03/10/how-many-website-in-the-world/
There are more than 20 billion web pages in the world, increasing at a rate of about 200,000 a day.
General search engines
More than 20 billion + 5 billion (70?) web pages indexed.
Almost 4,000,000,000 user queries per day.
Almost 1,000,000,000 MB of user-generated data every hour.
Almost 17,600,000 daily visitors.
Vertical search engines
• Travel search
• Shopping (or product) search
• Employment search
• ……
15,000,000 registered users.
1,000,000 daily visitors.
20,000,000 services provided every year.
Community Q&A Systems
Q&A Systems
Input: natural language questions
Return: sophisticated and tidy answers!
QUANTA – Q&A based information acquisition
Q&A Knowledge Power
Traditional search: Requirement → Keywords → URL → Web pages → Information (refine the keywords and repeat)
QUANTA: Requirement → Natural language query → QUANTA → Information
To make information acquisition more intelligent, convenient, and effective. To minimize language barriers and make over 19.2 billion English internet pages accessible to 1.5 billion Chinese people. To enable 580 million Chinese cell phone users to access the internet, even if they have no internet connection.
THU QUANTA
http://166.111.138.87/quanta/index.jsp
QUANTA’s result
• General engines
  – Keyword search; more powerful for shorter queries.
• Vertical engines
  – Question answering; more useful for detailed questions in a specific domain.
• Community Q&A
  – Question answering; more popular for various complex questions.
• Q&A systems
  – Complement to web keyword search.
  – Enhance existing cQA and search services.
  – Leverage existing knowledge in the question and answer forms and their authors.
Get answers automatically from the internet at anytime, anywhere, for everyone!
Summary
Problems
• General search engines: keyword search
  – Powerful for short queries, but not for long queries and questions
  – Loses information; returns irrelevant results
• Vertical search engines: question answering
  – Powerful for specific questions, but domain limited
• cQA systems: question answering
  – Is the "best answer" really the best?
  – Too many redundant answers for the same question.
  – The answer is not available in real time for a new question.
Problems (cont.)
Q&A systems:
• QUANTA: limited by the complexity of the question
• Wolfram Alpha: database based
• Powerset: knowledge based
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Research topics ongoing (from theoretical study to application)
• Information similarity
• Information extraction
• Sentiment analysis, opinion mining
• Information summarization
• Question analysis and classification
• Candidate generation, ranking, evaluation
• Recommendation for related content
Our work
• Information theory based information similarity measures
  – Conditional information distance: distance between named entities under a specific environment
  – Min distance measure: distance between two objects for the partial matching problem
  – Information distance among many objects: comprehensive and typical information selection, e.g. review mining and multi-document summarization
• Information extraction
  – Relationship extraction
  – Redundancy removal
• Summarization
  – Based on information distance
  – Based on graph centrality
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Information Distance
• Kolmogorov complexity
• Dmax(x,y), dmax(x,y)
• Dmax(x,y|c), dmax(x,y|c)
• Dmin(x,y), dmin(x,y)
• Dmax(x1,x2,…)
Information Distance
• Original theories:
  – Information distance: Dmax(x, y)
  – Normalized information distance: dmax(x, y)
• Proposed by our group:
  – Conditional information distance (CID/NCID): Dmax(x, y|c), dmax(x, y|c)
  – Min distance: Dmin(x, y), dmin(x, y)
  – Information distance among many objects (IDMO): Dmax(x1, …, xn)
Motivation of NCID (Normalized Conditional Information Distance)
• Information distance (Bennett, Gács, Li, Vitányi, and Zurek, 1998):

  d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)}

This is the normalized ID, named NID, where K(x) is the Kolmogorov complexity of x, and K(x|y) is that of x conditioned on y.

NID is an absolute, universal, and application-independent distance measure between any two sequences. It has been applied to evolution tree construction, language classification, music classification, plagiarism detection, and data mining for images and time sequences such as heart-rhythm data.

• However, when people try to use it with Google, it sometimes becomes difficult and meaningless. For example, "fan" vs. "CPU" and "fan" vs. "star": their NIDs are 0.60 and 0.58, respectively, which are almost the same and tell us nothing.
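K is not computable, so in practice NID is approximated through a compressor. A minimal sketch (our own illustration, not from the talk; the helper names `C` and `ncd` are ours) that substitutes zlib-compressed length for K:

```python
import zlib

def C(s: str) -> int:
    # Approximate the (uncomputable) Kolmogorov complexity K(s)
    # by the length of the zlib-compressed string.
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized Compression Distance: a computable stand-in for
    # d(x, y) = max{K(x|y), K(y|x)} / max{K(x), K(y)},
    # using C(xy) - min{C(x), C(y)} to approximate the numerator.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)
```

Similar strings compress well together, giving a small distance; unrelated strings share no regularities, so the distance approaches 1.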
Experiment results of NCID

x    y     condition     NID
fan  CPU   –             0.6076
fan  star  –             0.5832
fan  CPU   temperature   0.3527
fan  star  temperature   0.6916
fan  CPU   Hollywood     0.8258
fan  star  Hollywood     0.6598

Terms closest to x = fish:

x     Condition: ISA   No condition
fish  salmon           salmon
fish  carp             tail
fish  shark            shark
fish  whale            whale
fish  cuttlefish       carp
fish  fin              cuttlefish
fish  tail             fin

Regular expression
Approximation of NCID

  d_c(x,y) = NCID(x,y) = [max{log f(x,c), log f(y,c)} − log f(x,y,c)] / [log f(c) − min{log f(x,c), log f(y,c)}]

where f(x) is the number of elements in which x occurs, and f(x,c) is the number of elements in which x and c both occur; f(y,c) and f(x,y,c) are similarly defined.

From the definition, 0 ≤ f(x,y,c) ≤ f(x,c), f(y,c) ≤ f(c).
When f(x,y,c) = 0:
  if f(x,c) · f(y,c) > 0, then d_c(x,y) = ∞ (infinite);
  otherwise, d_c(x,y) is undefined.
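The approximation needs only four co-occurrence counts (e.g. web page hit counts). A direct transcription, including the zero-count cases; the function name is ours, and the counts in the usage below are hypothetical:

```python
import math

def ncid(f_xc: int, f_yc: int, f_xyc: int, f_c: int) -> float:
    # d_c(x,y) = (max{log f(x,c), log f(y,c)} - log f(x,y,c))
    #          / (log f(c) - min{log f(x,c), log f(y,c)})
    if f_xyc == 0:
        if f_xc * f_yc > 0:
            return math.inf          # x and y never co-occur under c
        return float("nan")          # undefined, as on the slide
    num = max(math.log(f_xc), math.log(f_yc)) - math.log(f_xyc)
    den = math.log(f_c) - min(math.log(f_xc), math.log(f_yc))
    return num / den
```

For example, `ncid(1000, 2000, 500, 1_000_000)` models two terms that co-occur often under the condition and therefore come out close (a small distance).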
Motivation of min Distance

Partial matching problem
  – Example: What city is Lake Washington by? (Seattle, Bellevue, Kirkland)
• Seattle, the correct and most popular answer, carries much more private information than the other answers, relative to the information it shares with "Lake Washington".
  – Problem of the max distance
• The information shared between the question and the answer is "diluted" by the private information of the answer, which makes the system select an unpopular candidate, Bellevue, which has "dense" shared information.
• Can we remove the irrelevant information within a coherent theory and give the most popular city, Seattle, a chance?
min Distance measure
• Define Dmin and dmin:

  Dmin(x,y) = min{K(x|y), K(y|x)}
  dmin(x,y) = min{K(x|y), K(y|x)} / min{K(x), K(y)}

  dmin(x,y|c) = [K(x,y,c) − max{K(x,c), K(y,c)}] / [min{K(x,c), K(y,c)} − K(c)]

• Properties of the min distance
  • Universal
  • No partial matching problem
  • Nonnegative and symmetric
  • Does not satisfy the triangle inequality

Publications: ACM KDD-07, Bioinformatics, JCST
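Under the same compression approximation used for NID, the min distance can be sketched as follows (our own illustration; with K(x|y) ≈ C(xy) − C(y), the numerator min{K(x|y), K(y|x)} becomes C(xy) − max{C(x), C(y)}). Note how a long answer's private information no longer dilutes the score:

```python
import zlib

def C(s: str) -> int:
    # Compressed length as a computable proxy for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def dmin(x: str, y: str) -> float:
    # d_min(x,y) = min{K(x|y), K(y|x)} / min{K(x), K(y)}.
    # With K(x|y) ~ C(xy) - C(y), the numerator becomes
    # C(xy) - max{C(x), C(y)}; clamp at 0 to absorb compressor overhead.
    cx, cy = C(x), C(y)
    return max(0, C(x + y) - max(cx, cy)) / min(cx, cy)
```

A long, information-rich answer that still contains the question's content stays close to the question, while an unrelated string of similar length does not.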
min Distance's problem

d(human, horse) > d(human, Centaur) + d(Centaur, horse)

An information distance must reflect what "we think" to be similar, and what "we think" to be similar apparently does not really satisfy the triangle inequality. The min distance is reasonable and has been successful in applications.
Motivation of IDMO
• In many data mining applications, we are more interested in mining information from many, not just two, information-carrying entities, for example:
  – What is the public opinion on the United States presidential election, from the blogs?
  – What do the customers say about a product, from the reviews?
  – Which article, among many, covers the news most comprehensively? Or is most typical of one particular news item?
IDMO measure
• Define Dmax(x1, …, xn):

  Dmax(x1,...,xn) = Em(x1,...,xn) = min{ |p| : U(xi, p, j) = xj, for all i, j }

• Conditional Dmax(x1, …, xn):

  Dmax(x1,...,xn|c) = Em(x1,...,xn|c) = min{ |p| : U(xi, p, j | c) = xj, for all i, j }

  min_i K(x1 x2 … xn | xi, c) ≤ Em(x1, …, xn | c) ≤ min_i Σ_{k≠i} Dmax(xi, xk | c)

• Most representative object
  – Left-hand side: the most "comprehensive" object, which contains the most information about all of the others
  – Right-hand side: the most "typical" object, which is similar to all of the others
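The right-hand side of the bound suggests a simple procedure for picking the most "typical" object: minimize the summed pairwise distance to all the others. A sketch using the compression-based distance as a computable stand-in (helper names and the toy reviews in the usage are ours):

```python
import zlib

def C(s: str) -> int:
    # Compressed length as a computable proxy for K(s).
    return len(zlib.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # Normalized compression distance as a stand-in for d_max.
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def most_typical(docs: list) -> int:
    # Right-hand side of the E_m bound: pick the object whose summed
    # pairwise distance to all the others is smallest.
    return min(range(len(docs)),
               key=lambda i: sum(ncd(docs[i], docs[k])
                                 for k in range(len(docs)) if k != i))
```

Given two near-duplicate product reviews and one off-topic review, the procedure picks one of the near-duplicates as the typical one.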
Document Summarization by IDMO

Multi-document summarization
  – Generate the most "typical" summary according to the right-hand side of Em
  – Ranked top 1 under the overall responsiveness measure on both TAC 2008 (33 research groups, 58 submissions) and TAC 2009 (26 research groups, 52 submissions)

Publications: ICDM-09
Review Mining by IDMO

Comprehensive and typical review selection
  – Select the most "comprehensive" and the most "typical" reviews
  – We have studied the relationship between a review's sentiment rating and its textual content and developed a rating estimation system
  – The system based on our theory is very helpful for customers

Publications: ACM CIKM-08, WI-09
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Information Extraction
• Relationship extraction
  – Supervised learning
  – Unsupervised learning
Interaction extraction
• Statistical algorithms
  – Low precision, high recall
• Pattern matching algorithms
  – Manual pattern generation
    • High precision, low recall
    • Bad at generalization
  – Automatic pattern generation
    • Good balance between precision and recall
    • Good at generalization
Pattern generation and optimization
• Pattern generation: extract patterns automatically.
  – Dynamic alignment / dynamic programming algorithm
• Pattern optimization: reduce and merge patterns to increase generalization power, and hence the recall and precision rates.
  – Supervised machine learning algorithm
    • Approach based on the MDL principle
    • Data labeling is required
  – Semi-supervised machine learning algorithm
    • Approach with a ranking function and a heuristic evaluation algorithm
    • Relatively little data labeling is required
Pattern Set
• The best pattern set should:
  – produce the least error in the output and the least redundancy in the pattern set;
  – match the maximum number of sentences with at least one pattern.
• The problem is how to strike the balance: a trade-off between the complexity of the model (the pattern set) and the fitness of the model to the data (shown by the performance of the system).
Pattern optimization (MDL based)
• What is the Minimum Description Length principle?
  – Proposed by Rissanen in 1978 as a tool to solve the trade-off problem between generalization and accuracy.
  – The MDL principle can be applied without the analytical form of the risk function.
  – The MDL principle states that given some data D, the best model (or theory) M_MDL in the set M of all models consistent with the data is the one that minimizes the sum of the length in bits of the description of the model and the length in bits of the description of the data with the aid of the model:

    M_MDL = argmin_{M ∈ M} [ l(M) + l(D|M) ]

  where l(M) and l(D|M) denote, respectively, the description length of the model M and that of the data D using the model M.
Implementation of the MDL principle
• The MDL principle can be viewed from the point of view of Kolmogorov complexity (Li and Vitányi, 1997):

    M_MDL = argmin_{M ∈ M} [ K(M) + K(D|M) ]

  where K(·) is the Kolmogorov complexity, which is also not computable. The MDL principle looks for an optimal balance between the regularities (in the model) and the randomness remaining in the data, that is, a trade-off between the complexity of the model and the fitness of the model to the data.
• The problem is how to make the principle computable.
MDL based algorithm
• Examples of patterns (tag sequences):
  – {PTN VBZ IN PTN: *; binds, associates; to, with; *}
  – {PTN VBZ PTN: *; binds, associates, activate; *}
• Pattern set P = {p1, p2, …, pn}, and pattern pi = mi1 mi2 … miJi
• Finally, the optimal pattern set P* is obtained as follows:

    P* = argmin_P [ K(P) + log2 d(I, I*) ]

    K(P) = Σ_{i=1..n} Σ_j |mij|,   where |mij| = 1 if mij = PTN, and γ / c(mij) otherwise

    d(I, I*) = Σ_{i=1..n} d(Ii, Ii*),   where d(Ii, Ii*) = 1 if Ii ≠ Ii*, and 0 if Ii = Ii*

  where γ is a constant, and c(mij) is the number of words involved in the jth component of pattern pi. I* is the expected interaction set while I is the set extracted by pattern set P, so d(I, I*) is the number of differences between I and I*.
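The optimization can be approximated greedily: repeatedly delete the pattern whose removal does not increase the total description length. A toy sketch of that loop, in which a unit cost per pattern component stands in for the |mij| weighting and an `errors` callback stands in for d(I, I*) (both simplifications are ours, not the paper's):

```python
import math

def description_length(patterns, errors):
    # L(P) = K(P) + log2 d(I, I*): model cost plus data-fit cost.
    # K(P) is simplified to one unit per pattern component; the
    # errors(patterns) callback plays the role of d(I, I*).
    e = errors(patterns)
    return sum(len(p) for p in patterns) + (math.log2(e) if e > 0 else 0.0)

def mdl_prune(patterns, errors):
    # Greedily delete a pattern whenever deletion does not increase L(P).
    current = list(patterns)
    improved = True
    while improved and len(current) > 1:
        improved = False
        base = description_length(current, errors)
        for p in current:
            trial = [q for q in current if q is not p]
            if description_length(trial, errors) <= base:
                current = trial
                improved = True
                break
    return current
```

A pattern whose deletion causes a large jump in extraction errors survives, while redundant patterns are pruned away; this mirrors the precision/recall behavior shown in the experiment plots.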
Experiment results (1)
[Figure: the MDL value (top, 700–1200) and the precision and recall of the system (bottom, 0.2–0.9) plotted against the number of deleted patterns, from 1 to 190.]
• The x-coordinate is the number of deleted patterns;
• the y-coordinates are the MDL value, and the precision and recall of the system, respectively.
Experiment results (2)
Training set
Algorithm  Pattern Number  Precision  Recall  F-Score
Original   192             64.6%      57.2%   60.68%
ERM        134             82.7%      55.7%   66.57%
MDL        14              80.7%      61.1%   69.55%

Test set
Algorithm  Pattern Number  Precision  Recall  F-Score
Original   192             63.5%      57.3%   60.24%
ERM        134             77.9%      53.3%   63.29%
MDL        14              79.8%      59.5%   68.17%

Publications: Bioinformatics 2004, 2005
Pattern optimization (semi-supervised)
• Why a semi-supervised algorithm is proposed
  – There are many kinds of relationships among proteins, genes, and diseases.
  – Data annotation is too expensive.
• The key points are the ranking function and the evaluation algorithm
Semi-supervised Algorithms
• Novel ranking function:

    HD(p) = (log2 [(p.positive + 0.5) / (p.negative + 0.5)] − β) · ln(p.positive + p.negative + 1)

  – where p.positive indicates the number of correct instances matched by the pattern p and p.negative denotes the number of false instances. The parameter β is a threshold that controls p/n. If p/n > 2^β, HD is an increasing function of (p + n), which means that if several patterns have the same p/n exceeding 2^β, a pattern with larger (p + n) has a higher rank and is more likely to be kept. If p/n < 2^β, the first term is negative, which means that a pattern with larger (p + n) will have a lower rank. Thus different ranking strategies are used for different p/n.
• Heuristic Evaluation Algorithm (HEA)
  – Reduces redundancy among patterns with an optimization function
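The ranking function is easy to compute from a pattern's match statistics. A direct transcription (the default β = 2.0 is an arbitrary choice of ours for illustration, not a value from the talk):

```python
import math

def hd(p_positive: int, p_negative: int, beta: float = 2.0) -> float:
    # HD(p) = (log2((p.positive + 0.5) / (p.negative + 0.5)) - beta)
    #         * ln(p.positive + p.negative + 1)
    ratio = (p_positive + 0.5) / (p_negative + 0.5)
    return (math.log2(ratio) - beta) * math.log(p_positive + p_negative + 1)
```

Above the threshold, more matches raise the rank; below it, more matches lower the rank, just as the slide describes.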
Experiment results

[Figure: F1 score (0.24–0.48) versus the number of patterns (from 2480 down to 80) for the Ripper, Riloff, HD, and HEA methods.]
Method          Patterns  Precision  Recall  F1 score  Impr. of F1
Baseline (raw)  2480      19.0%      46.5%   27.0%     –
Ripper          1626      41.0%      40.8%   40.9%     +49.3%
Riloff          88        40.8%      44.5%   42.6%     +55.5%
HD              92        52.5%      38.8%   44.5%     +62.4%
HEA             72        43.5%      45.9%   45.5%     +66.1%

Published in Bioinformatics, Medical Informatics, and APBC'07
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Background
• Too much information
  – A concise & coherent summary is needed
• Generative summarization
  – Language generation (re-phrases meanings)
  – Very difficult to express semantics
• Extractive summarization
  – Extracts key sentences
Previous Studies
• Statistical approaches (Nomoto, 2001; …)
• Linguistic techniques (Nakao, 2000; …)
• Graph-based methods
  – LexRank (Erkan & Radev, 2004)
  – TextRank (Mihalcea & Tarau, 2004)
  – Query-specific document summarizer (Varadarajan & Hristidis, 2006)
  – Many more …
System I: Information Distance Based
• Problem reformulation: given cluster A with m documents A1, A2, …, Am, the update summarization task for cluster B = {B1, B2, …, Bn} should solve:

  Min{ Dmax(S, B1B2…Bn | A1A2…Am) },  |S| ≤ Θ

  where S = {s1, s2, …, sk} is a set in which each si is a sentence selected for the summary
Problem Reformulation
• K(AB) = K(A∪B), K(A|B) = K(A\B)
• Dmax(S, B1B2…Bn | A1A2…Am) = K((B1B2…Bn \ A1A2…Am) \ S1S2…Sk)
• Min{ Dmax(S, B1B2…Bn | A1A2…Am) } = Max{ K(S1S2…Sk) }
Approximation
• How to compute K(S1S2…Sk)?
• Assumption: each important term carries one unit of information; then K(S) = |S| (the cardinality), where S = {t1, t2, t3, …, tn} and the tk are important terms
• Important terms
  – Non stop-words
  – Named entities (person, org., loc., date, …)
  – Terms with high document frequency
Approximation (cont.)
• Select one representative sentence s for each document D by: argmin_s{ Dmax(s, T), s ∈ D }
  T: the union of the topic title and narrative
• Remove redundant representative sentences:
  – 8 continuous common words
  – 60% common words
Generate Summary
• With the representative sentences, select a subset that maximizes K(…):
  – Compute all combinations of sentences within the length limit
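The approximation K(S) = |S| and the subset search can be sketched end-to-end. Everything below (the term sets, whitespace tokenization, and a character-length budget) is a toy simplification of ours, not the system's actual preprocessing:

```python
from itertools import combinations

def info_content(sentences, important_terms):
    # K(S1...Sk) under the slide's assumption: each important term
    # carries one unit of information, counted once.
    covered = set()
    for s in sentences:
        covered |= set(s.lower().split()) & important_terms
    return len(covered)

def best_summary(sentences, important_terms, length_limit):
    # Enumerate all sentence subsets within the length limit and keep
    # the one covering the most important terms (max K(S1...Sk)).
    best, best_score = [], -1
    for r in range(1, len(sentences) + 1):
        for combo in combinations(sentences, r):
            if sum(len(s) for s in combo) > length_limit:
                continue
            score = info_content(combo, important_terms)
            if score > best_score:
                best, best_score = list(combo), score
    return best
```

Exhaustive enumeration is feasible here only because one representative sentence per document keeps the candidate set small.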
Publication: IEEE ICDM-09
Results
TAC 2008, Update Summarization
Evaluation Method                          Best   Our Result  Rank
Average Modified Score                     0.336  0.309       5/58
Macroaverage Modified Score with 3 models  0.331  0.304       5/58
Average Linguistic Quality                 3.333  2.958       3/58
Average Overall Responsiveness             2.667  2.667       1/58

TAC 2009, Summarization
                        Traditional cluster      Update cluster
Evaluation Method       Results  Rank            Results  Rank
AVG Modified Score      0.311    9               0.296    4/52
MacroAVG Modified       0.316    9               0.292    4/52
AVG Linguistic Quality  5.682    3               5.886    1/52
AVG Overall Resp.       4.955    2               5.023    1/52
System II: graph model based
• It may be a new solution for text representation
  – compared with Bag-of-Words
• Iterating on the graph can propagate very distant dependence
• Key points: define the nodes, the edges, and the computation
Graph-based Update Summarization
① Select the most salient terms
② Build the term-sentence matrix W
③ Use the LSI sentence-sentence similarity matrix SIM
④ Construct a graph based on SIM
⑤ Compute the graph centrality (power iteration algorithm)
⑥ Select the top 15 sentences with the highest centrality
⑦ Compute all combinations within the length limit
⑧ Score each summary as a whole, and keep the best
⑨ Re-order the sentences within the summary
Graph centrality
• Centrality measure: which is the most important node in a graph?
  – Degree centrality
  – Eigenvector centrality
• Suppose the centrality of sentence s is Cs; then

    Cs = Σ_{r ∈ D, r ≠ s} Sim(s, r) · Cr

  – Important connections make the node itself more important
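The power-iteration centrality computation can be sketched directly on a small similarity matrix (the matrix values in the usage below are made up; plain nested lists are used instead of a matrix library):

```python
def centrality(sim, iters=100):
    # Power iteration: C converges to the principal eigenvector of the
    # similarity matrix, so sentences strongly connected to other
    # central sentences get high scores.
    n = len(sim)
    c = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(sim[r][s] * c[r] for r in range(n) if r != s)
               for s in range(n)]
        norm = sum(new) or 1.0          # normalize to keep values bounded
        c = [v / norm for v in new]
    return c
```

The sentence with the strongest similarity links ends up with the highest centrality, which is what step ⑥'s top-15 selection relies on.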
Tailored to Update Summarization
• Problem: given cluster A, summarize cluster B
• Sentences in cluster B that merely repeat cluster A should be penalized:

    Cs = Σ_{r ∈ B, r ≠ s} Sim(s, r) · Cr − Σ_{r ∈ A, r ≠ s} Sim(s, r) · Cr

• Matrix form:

    SIM = [ SIM_AA  SIM_AB ]        SIM' = [  SIM_AA  SIM_AB ]
          [ SIM_BA  SIM_BB ]               [ −SIM_BA  SIM_BB ]
How to score a term and a summary?
• Score a term
  – By the position of a word (headline, first sentence)
  – With manually tuned parameters:

    Score(w) = tf(w)^0.4 · Fd(df(w))
    Power(w) = 0.15 + 0.45 · (Score(w) / maxscore)^2

• Score a summary
  – By the term frequency of each word and the centrality of each sentence:

    SumScore(S) = Σ_{w ∈ S} count(w)^Power(w)
Update Summarization Results (TAC 2008)

Evaluation Method                          Best Result  Our Result  Rank
Average Modified Score                     0.336        0.304       7/58
Macroaverage Modified Score with 3 models  0.331        0.299       7/58
Average Linguistic Quality                 3.333        3.073       2/58
Average Overall Responsiveness             2.667        2.667       1/58
TALK OUTLINE
• How we get information from the internet
• Research topics
  – Information similarity measure
  – Information extraction
  – Information summarization
• Future work
Challenges
• Is there a universal semantic information similarity measure for all types of information?
• How to organize and represent knowledge in order to reuse it.
• How to combine rules and statistical methods to get sophisticated and tidy answers with comprehensive information.
• How to integrate all kinds of information resources to achieve a domain-independent QA system.
• ……
  – Goal: provide a scalable question answering service
Acknowledgement
• Prof. Ming Li, Dr. Minlie Huang, Dr. Yu Hao
• Xian Zhang, Chong Long, Feng Jin, Zhicheng Zheng, Fan Bu
• Jianshu Sun, Yang Tang
• Shouyuan Chen, Yuanming Yu
Q&A
Thanks !