Social Tag Computation and its Application in Web Community Q&A. Maosong Sun, Department of Computer Science & Technology, Tsinghua University. 9th AEARU Web Technology and Computer Science Workshop, Kyoto University, January 17-18, 2011.


Social Tag Computation and its Application in Web Community Q&A

Maosong Sun

Department of Computer Science & Technology

Tsinghua University

9th AEARU Web Technology and Computer Science Workshop, Kyoto University, January 17-18, 2011

What are social tags?

Huge amounts of user-annotated data: social tag annotation

Widely applied in user interest mining, trend discovery and search improvement

Granularity of features for recommendation?

Word-based? Topic-based? Cluster-based? ...

Ambiguity and noise

Relations between tags

Newly occurring tags

Can old tags cover new tags in time?

Problems in Social Tag Computation

According to an Interfax report on the 13th, smoke from Russia's forest fires has drifted into Kostanay Region in northern Kazakhstan.

Zhumabayev, head of the press office of the Kostanay Region emergency situations department, said that day that the smoke came from forest fires in Russia's Chelyabinsk and Kurgan regions.

Zhumabayev said that this summer's abnormally hot, windless weather in the region lets the smoke linger easily; "the optimistic estimate is that the smoke will clear in a week."

According to the report, the sudden smoke has made forest fire prevention in Kostanay Region more difficult. Because visibility has dropped sharply, the region's fire watchtowers cannot work effectively and forest rangers cannot detect fires in time. According to medical staff, the smoke will also harm the health of local people and livestock.

Tags: Russia (俄罗斯), fire (火灾), Taobao star accessories (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)

Annotations highlighted on these tags across successive slides: adequate tag, noise, ambiguity, hierarchical relation, new tag

An Example in Chinese

Outline

1. Social Tag Recommendation

2. Finding the Structure of Social Tags by Subsumption Relations

3. Keyphrase Extraction from Text

4. Tag Supplement with Extracted Keyphrases

5. Application of Social Tags in Google Community Q&A

1. Social Tag Recommendation

Word weighting

TF*IDF

TF*ITF (Inverse Tag Frequency)

Word-based Tag Recommendation

word weight × word/tag dependency = tag weight

Word/Tag weighting

Co-occurrence

Pointwise mutual information (PMI)

Chi-square
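The product "word weight × word/tag dependency = tag weight" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system from the talk: it uses TF*IDF (with a smoothed IDF) as the word weight and positive PMI as the word/tag dependency, and the names `train` and `recommend` are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (words, tags) pairs. Collects document-level counts."""
    n = len(docs)
    df = Counter()                 # number of documents containing each word
    tag_df = Counter()             # number of documents carrying each tag
    co = defaultdict(Counter)      # co[word][tag]: documents with both
    for words, tags in docs:
        for t in set(tags):
            tag_df[t] += 1
        for w in set(words):
            df[w] += 1
            for t in set(tags):
                co[w][t] += 1
    return n, df, tag_df, co

def recommend(words, model, top_k=3):
    """Score each tag as sum over words of (word weight * word/tag dependency)."""
    n, df, tag_df, co = model
    scores = Counter()
    for w, tf in Counter(words).items():
        if w not in df:
            continue               # unseen word: no evidence for any tag
        word_weight = tf * math.log(1 + n / df[w])   # assumed smoothed TF*IDF
        for t, c in co[w].items():
            pmi = math.log(c * n / (df[w] * tag_df[t]))
            if pmi > 0:            # keep only positive dependency
                scores[t] += word_weight * pmi
    return [t for t, _ in scores.most_common(top_k)]

docs = [
    (["forest", "fire", "smoke"], ["fire", "disaster"]),
    (["fire", "rescue"], ["fire"]),
    (["stock", "market"], ["finance"]),
    (["market", "price"], ["finance"]),
]
model = train(docs)
print(recommend(["fire", "smoke"], model, top_k=2))
```

Swapping PMI for raw co-occurrence or chi-square only changes the dependency term inside the inner loop.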

Data Sets

Three datasets with different properties

BIBTEX: scholarly-article-related

Short text, keyphrase-intensive tags (model, RNA)

BLOG: Chinese blog collection

Long text, variety of tags (Russian, Disaster)

DOUBAN: book sharing

Moderate-length text, classification-intensive tags

Figures: Experiment, word-based (BIBTEX, BLOG, DOUBAN)

Use latent topics to explain tags

(The news article from "An Example in Chinese", on smoke from the Russian forest fires reaching Kostanay Region.)

Topic words: Russia (俄国), Russia (俄罗斯), Moscow (莫斯科), ...

Topic words: disaster (灾难), fire situation (火情), smoke (烟雾), disaster victims (灾民), firefighting (消防), ...

Tags: Russia (俄罗斯), fire (火灾), Taobao star accessories (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)

Topic-based Tag Recommendation

Add tags to LDA

Use latent topics for tag explanation and disambiguation

Figure: LDA and Tag-LDA
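One way to read "use latent topics to explain tags" is that a tag's score for a document is the document's topic mixture weighted by how strongly each topic emits that tag. The sketch below assumes a topic model has already been fitted; the function name `score_tags` and all probabilities are made up for illustration and are not taken from the talk.

```python
def score_tags(doc_topics, topic_tags):
    """Rank tags by sum_k P(topic k | document) * P(tag | topic k).

    doc_topics: list of P(topic k | document), one entry per topic.
    topic_tags: list of dicts, topic_tags[k][tag] = P(tag | topic k).
    """
    scores = {}
    for k, p_topic in enumerate(doc_topics):
        for tag, p_tag in topic_tags[k].items():
            scores[tag] = scores.get(tag, 0.0) + p_topic * p_tag
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical two-topic model: a "Russia" topic and a "disaster" topic.
doc_topics = [0.7, 0.3]
topic_tags = [
    {"Russia": 0.6, "Moscow": 0.4},
    {"disaster": 0.5, "fire": 0.5},
]
for tag, score in score_tags(doc_topics, topic_tags):
    print(f"{tag}: {score:.2f}")
```

Because a tag is scored through topics rather than through individual words, it can be recommended even when none of its usual words appear in the document.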

Figures: Experiment, topic-based (BIBTEX, BLOG, DOUBAN)

Degree of confidence of tags

Comparison of Word-based and Topic-based Tag Recommendation

Word-based:

(1) More adequate for keyphrase-intensive recommendation. May not be very strong for long documents.

(2) Simple and fast

Topic-based:

(1) More adequate for classification-intensive recommendation. Good for long documents.

(2) Complex and slow

Tag Allocation Model (TAM)

Word-based, LDA-like model

(The news article from "An Example in Chinese", on smoke from the Russian forest fires reaching Kostanay Region.)

Tags: Russia (俄罗斯), fire (火灾), Taobao star accessories (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)

Design a “garbage bin” for noise

(The news article from "An Example in Chinese", on smoke from the Russian forest fires reaching Kostanay Region.)

NOISE

Tags: Russia (俄罗斯), fire (火灾), Taobao star accessories (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)

Figures: Experiment, TAM (BIBTEX, BLOG, DOUBAN)

Noisy Tags and Non-noisy Tags

2. Finding the Structure of Social Tags by Subsumption Relations

The Hierarchy of Tags

The Subsumption Relation

nlp-conference subsumes coling2010

text-mining subsumes coling2010

Tag A subsumes tag B means that wherever tag B is used, we can use tag A too.

How to Find Subsumption Relations?

How likely is tag A to be used when tag B is used? P(A|B)

Sort all tag pairs by P(A|B) and take the top ones.

The problem turns into a correct estimation of P(A|B).

TAG-TAG: Naïve Approach

Just count co-occurrences (with smoothing)
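A minimal sketch of the counting approach, estimating how likely tag A is used given that tag B is used. The add-alpha smoothing and the name `tag_tag_scores` are assumptions made for illustration; the slides only say "with smoothing".

```python
from collections import Counter
from itertools import permutations

def tag_tag_scores(tagged_docs, alpha=1.0):
    """Estimate P(A | B) for every co-occurring ordered tag pair (A, B).

    tagged_docs: list of tag lists, one per document.
    alpha: assumed add-alpha smoothing constant.
    """
    tag_count = Counter()     # documents carrying each tag
    pair_count = Counter()    # documents carrying both tags of an ordered pair
    for tags in tagged_docs:
        tags = set(tags)
        tag_count.update(tags)
        pair_count.update(permutations(tags, 2))
    vocab = len(tag_count)
    scores = {
        (a, b): (c + alpha) / (tag_count[b] + alpha * vocab)
        for (a, b), c in pair_count.items()
    }
    # Highest P(A | B) first: candidate "A subsumes B" relations.
    return sorted(scores.items(), key=lambda kv: -kv[1])

docs = [
    ["linux", "ubuntu"],
    ["linux", "ubuntu"],
    ["linux", "ubuntu"],
    ["linux", "debian"],
    ["linux"],
]
(a, b), p = tag_tag_scores(docs)[0]
print(f"{a} subsumes {b}? P({a}|{b}) = {p:.3f}")
```

The asymmetry carries the signal: P(linux|ubuntu) is higher than P(ubuntu|linux), which suggests that linux is the broader tag.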

The Omitted-tag Problem

Since B implies A, and people are lazy...

linux subsumes ubuntu?

TAG-WORD: Leverage the Content

Use the words in the documents as "bridges".

linux subsumes ubuntu?

the distro has a very efficient package management system called APT, which also handles kernels.
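The bridging idea can be sketched as follows. The slides do not give the formula, so the decomposition P(A|B) ≈ Σ_w P(A|w) · P(w|B) is an assumption, as are the function name `tag_word_score` and the toy documents.

```python
from collections import Counter

def tag_word_score(docs, a, b):
    """Estimate P(a | b) through bridge words: sum_w P(a | w) * P(w | b).

    docs: list of (words, tags) pairs.
    """
    words_under_b = Counter()         # word counts in documents tagged b
    docs_with_word = Counter()        # documents containing each word
    docs_with_word_and_a = Counter()  # ... that also carry tag a
    for words, tags in docs:
        if b in tags:
            words_under_b.update(words)
        for w in set(words):
            docs_with_word[w] += 1
            if a in tags:
                docs_with_word_and_a[w] += 1
    total = sum(words_under_b.values())
    if total == 0:
        return 0.0
    return sum(
        (count / total) * (docs_with_word_and_a[w] / docs_with_word[w])
        for w, count in words_under_b.items()
    )

docs = [
    (["apt", "kernel", "package"], ["ubuntu"]),   # "linux" tag omitted here
    (["kernel", "driver"], ["linux"]),
    (["kernel", "package"], ["linux"]),
    (["recipe", "cake"], ["cooking"]),
]
print(tag_word_score(docs, "linux", "ubuntu"))
print(tag_word_score(docs, "cooking", "ubuntu"))
```

In this toy data "linux" and "ubuntu" never co-occur as tags, so direct counting sees no relation, while words such as "kernel" and "package" still connect the two.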

Problem in TAG-WORD

Bad "bridge words" outnumber good ones.

linux subsumes ubuntu?

the distro has a very efficient package management system called APT, which also handles kernels.

The TAG-REASON Approach

What if we identify the words that really help?

The Tag Allocation Model (TAM)

Three Ways to Estimate P(A|B)

TAG-TAG

Direct counting.

TAG-WORD

Use all content words as bridges.

TAG-REASON

Use relevant content words as bridges.

Will show the comparison later...

DAG and Layered-DAG

Raw relation graph contains redundancy.

Layered-DAG: all paths between two tags have equal lengths.

DAG and Layered-DAG

Start with a graph G* that contains only the highest-weighted relation.

Each time, add the highest-weighted remaining relation that contains at least one tag in the current graph.

If G* is still a layered DAG, keep the relation.

Otherwise, remove the newly added relation.
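The greedy procedure above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: an edge (a, b) is taken to mean "a subsumes b", the layered-DAG check follows the definition "all paths between two tags have equal lengths", and the names `is_layered_dag` and `build_layered_dag` and the weights are hypothetical.

```python
from collections import defaultdict

def is_layered_dag(edges):
    """True if the graph is acyclic and all paths between any two nodes have equal length."""
    succ = defaultdict(set)
    nodes = set()
    for u, v in edges:
        succ[u].add(v)
        nodes.update((u, v))
    n = len(nodes)
    for start in nodes:
        lengths = {start: {0}}          # node -> set of path lengths from start
        frontier = {start}
        for depth in range(1, n + 1):
            frontier = {v for u in frontier for v in succ[u]}
            if not frontier:
                break
            for v in frontier:
                lengths.setdefault(v, set()).add(depth)
        else:
            return False                # a walk of n edges exists, so there is a cycle
        if any(len(ls) > 1 for ls in lengths.values()):
            return False                # two paths of different lengths
    return True

def build_layered_dag(weighted_relations):
    """weighted_relations: list of (a, b, weight). Returns the kept edges (a, b)."""
    remaining = sorted(weighted_relations, key=lambda r: -r[2])
    a, b, _ = remaining.pop(0)
    graph, tags = [(a, b)], {a, b}
    progress = True
    while progress:
        progress = False
        for i, (a, b, _) in enumerate(remaining):
            if a in tags or b in tags:          # touches the current graph
                remaining.pop(i)
                if is_layered_dag(graph + [(a, b)]):
                    graph.append((a, b))
                    tags.update((a, b))
                progress = True
                break
    return graph

relations = [
    ("linux", "ubuntu", 0.9),
    ("os", "linux", 0.8),
    ("os", "ubuntu", 0.7),      # redundant shortcut: rejected
    ("ubuntu", "linux", 0.6),   # would create a cycle: rejected
    ("nlp", "coling2010", 0.5), # never touches the graph: not added
]
print(build_layered_dag(relations))
```

Rejecting the shortcut from "os" to "ubuntu" is what removes the redundancy of the raw relation graph.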

Experiments

Two datasets

BLOG (blog posts) and DOUBAN (book collections)

~10,000 documents each

Ground truth

Pooling

Compared methods

TAG-TAG, TAG-WORD and TAG-REASON

A hierarchical clustering method by Heymann et al.

Precision vs. Coverage

With Layered-DAG Construction

Sampled Results


3. Keyphrase Extraction from Text

3.1. Keyphrase Extraction via Topic Decomposition

3.2. Local and Global Lexical Relations in Keyphrase Extraction

Motivation

The state of the art of keyphrase extraction

Basic methods

• Supervised

Learning algorithms for keyphrase extraction (Turney, 2000)

• Unsupervised

TFIDF

TextRank: Bringing order into texts (Mihalcea and Tarau, 2004)

Example: Arafat Says U.S. Threatening to Kill PLO Officials

Yasser Arafat on Tuesday accused the United States of threatening to kill PLO officials if Palestinian guerrillas attack American targets. The United States denied the accusation. The State Department said in Washington that it had received reports the PLO might target Americans because of alleged U.S. involvement in the assassination of Khalil Wazir, the PLO's second in command. Wazir was slain April 16 during a raid on his house near Tunis, Tunisia. Israeli officials who spoke on condition they not be identified said an Israeli squad carried out the assassination. There have been accusations by the PLO that the United States knew about and approved plans for slaying Wazir. Arafat, the Palestine Liberation Organization leader, claimed the threat to kill PLO officials was made in a U.S. government document the PLO obtained from an Arab government. He refused to identify the government. In Washington, Assistant Secretary of State Richard Murphy denied Arafat's accusation that the United States threatened PLO officials. … Arafat said the document "reveals the U.S. administration is planning, in full cooperation with the Israelis, to conduct a crusade of terrorist attacks and then to blame the PLO for them." "These attacks will then be used to justify the assassination of PLO leaders."

Basic Idea

Two assumptions

• Relevance

Good keyphrases should be relevant to the major topics of the given document.

• Coverage

An appropriate set of keyphrases should also have a good coverage of a document's major topics.

Building Topic Interpreters

Method: Latent Dirichlet Allocation (LDA)

Dataset: Wikipedia snapshot from March 2008

Figure: An example of a probabilistic topic model

Topic-Decomposed PageRank

Figure: Topical PageRank (TPR) for Keyphrase Extraction.

Calculate Ranking Scores by TPR

Extract Keyphrases Using Ranking Scores
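A sketch of how topic-decomposed ranking scores might be computed. The transcript does not include the update formula, so this assumes the usual biased PageRank form, run once per topic with a topic-specific preference vector and then mixed by the document's topic distribution. The function names, the damping value and the toy graph are all hypothetical.

```python
def topical_pagerank(graph, preference, damping=0.85, iters=50):
    """Biased PageRank over a word graph for one topic.

    graph: word -> {neighbor: edge weight}; every word needs at least one neighbor.
    preference: word -> nonnegative preference of this topic for the word.
    """
    words = list(graph)
    total = sum(preference.get(w, 0.0) for w in words)
    if total > 0:
        pref = {w: preference.get(w, 0.0) / total for w in words}
    else:
        pref = {w: 1.0 / len(words) for w in words}   # fall back to uniform jumps
    out = {w: sum(graph[w].values()) for w in words}
    rank = {w: 1.0 / len(words) for w in words}
    for _ in range(iters):
        rank = {
            w: damping * sum(rank[u] * graph[u][w] / out[u] for u in words if w in graph[u])
               + (1 - damping) * pref[w]
            for w in words
        }
    return rank

def tpr_scores(graph, topic_prefs, doc_topics):
    """Mix per-topic ranks by P(topic | document)."""
    scores = {w: 0.0 for w in graph}
    for z, p_topic in enumerate(doc_topics):
        ranks = topical_pagerank(graph, topic_prefs[z])
        for w, r in ranks.items():
            scores[w] += p_topic * r
    return scores

# Toy word graph (a chain) with two hypothetical topics.
graph = {
    "arafat": {"plo": 1.0},
    "plo": {"arafat": 1.0, "attack": 1.0},
    "attack": {"plo": 1.0, "israel": 1.0},
    "israel": {"attack": 1.0},
}
topic_prefs = [{"arafat": 1.0}, {"israel": 1.0}]
scores = tpr_scores(graph, topic_prefs, doc_topics=[0.9, 0.1])
print(sorted(scores, key=scores.get, reverse=True))
```

Keyphrase candidates could then be scored by summing the scores of their words, which is one simple way to "extract keyphrases using ranking scores".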

Examples

(a) Topic on “Terrorism” (b) Topic on “Israel”

(c) Topic on “U.S.” (d) TPR Result

Experiments

Influences of Parameters

The Number of Topics K

Different Preference Values

Comparing with Baseline Methods

3.2. Local and Global Lexical Relations in Keyphrase Extraction

Another Factor in Keyphrase Extraction

Lexical relation in a document (Local, TextRank)

Lexical relation in a document set (Global)

Combination of Local and Global Lexical Relations in TextRank

The balancing factor between global and local lexical relations in the algorithm
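One simple way a balancing factor could enter the word graph is a linear interpolation of edge weights before running TextRank. The slides do not state the exact combination, so the interpolation, the name `combine_relations` and the weights below are assumptions.

```python
def combine_relations(local, global_, balance=0.5):
    """Interpolate edge weights: balance * local + (1 - balance) * global.

    local: {(word1, word2): weight} from co-occurrence within the document.
    global_: {(word1, word2): weight} from co-occurrence across a document set.
    Both are assumed to be normalized to a comparable range.
    """
    pairs = set(local) | set(global_)
    return {
        pair: balance * local.get(pair, 0.0) + (1 - balance) * global_.get(pair, 0.0)
        for pair in pairs
    }

local = {("forest", "fire"): 1.0}
global_ = {("forest", "fire"): 0.5, ("fire", "smoke"): 1.0}
print(combine_relations(local, global_, balance=0.8))
```

With `balance=1.0` this reduces to purely local edge weights, and with `balance=0.0` only the document-set relations remain.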

Experimental result on the Hulth news dataset

4. Tag Supplement with Extracted Keyphrases

Emergence of New Tags

New tags occur continually

They are more likely to be new nouns for new events

Text is an important source for new tags

Combination of Recommendation and Keyphrase Extraction in Tag Generation

5. Application of Social Tags in Google Community Q&A

Background

Confucius is a Google community Q&A service.

Query: What are must-see attractions at Yellowstone

Background

Launched: Arabic, Thailand, Russia, China

Coming: more South-East Asia countries, Africa

21+ countries, as a serious competitor or the dominant one

Related Systems and Works

Overview of Confucius

• Trigger a discussion/question session during search (SI)

• Provide labels to a post (semi-automatically) (QL)

• Given a post, find similar posts (automatically) (QR)

• Route questions to users

• Evaluate user credentials in a topic-sensitive way (UR)

• Evaluate quality of a post, relevance and originality (AQ)

• Provide most relevant, high-quality content for Search to index

• Fight spam

• Natural language question answering (automatically) (NL)

Search Integration

Direct users from search to Confucius.

When the query is a question, or

When all results are of low quality

Ask "does china debt card work in singapore?" in Confucius

Search Integration

Improvements: improve the quantity of the questions.

Improvements: improve the quality of the questions.

More subjective

More objective

Question Labeling

Recommend labels to questions

Help organize the question flow

Leverage the LDA model

Figure: Label suggestion using LDA algorithm. Diagram elements: Label; w(q1), w(q2), ...; Label Topics; New Question; Topics; Recommendations.

11/4/2009 ACM CIKM Keynote
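The label-suggestion flow can be sketched by comparing topic vectors. This assumes an LDA model has already produced a topic distribution for each label (from the words of the questions carrying it) and for the new question. The cosine-similarity matching, the names `cosine` and `recommend_labels`, and the numbers are assumptions, not details from the talk.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def recommend_labels(question_topics, label_topics, top_k=3):
    """Return the labels whose topic vectors are closest to the question's."""
    ranked = sorted(
        label_topics,
        key=lambda label: cosine(question_topics, label_topics[label]),
        reverse=True,
    )
    return ranked[:top_k]

# Hypothetical topic distributions over three latent topics.
label_topics = {
    "travel": [0.8, 0.1, 0.1],
    "finance": [0.1, 0.8, 0.1],
    "health": [0.1, 0.1, 0.8],
}
new_question = [0.7, 0.2, 0.1]
print(recommend_labels(new_question, label_topics, top_k=2))
```

Since the suggestions are only a ranked list, the asker can still accept or reject them, which fits the semi-automatic labeling described in the overview.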

Question Labeling

Users find it useful

Conclusion

Social tag computation is a systematic work:

1. Social tag recommendation

2. Structure finding of social tags

3. Keyphrase extraction from text

4. New tag generation using keyphrases extracted from text

5. Proper application of social tags and Q&A