Social Tag Computation and its Application in Web Community Q&A
Maosong Sun
Department of Computer Science & Technology
Tsinghua University
9th AEARU Web Technology and Computer Science Workshop, Kyoto University, January 17-18, 2011
Huge amounts of user-annotated data: social tag annotation
Widely applied in user interest mining, trend discovery and search improvement
Granularity of features for recommendation?
Word-based? Topic-based? Cluster-based? ...
Ambiguity and noise
Relations between tags
Newly emerging tags
Can old tags cover new tags in time?
Problems in Social Tag Computation
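According to an Interfax report on the 13th, smoke from Russia's forest fires has drifted into Kostanay Region in northern Kazakhstan.
Zhumabayev, head of the press office of the Kostanay Region emergency situations department, said that day that the smoke came from forest fires in Russia's Chelyabinsk and Kurgan regions.
Zhumabayev said that this summer's unusually hot, windless weather in the region makes it easy for the smoke to linger, and that "the optimistic estimate is that the smoke will clear in about a week."
According to the report, the sudden smoke has made forest fire prevention in Kostanay Region more difficult. Because visibility has dropped sharply, the region's fire lookout towers can hardly work effectively, and forest rangers cannot spot fires in time. According to medical staff, the smoke will also harm the health of local people and livestock.
Tags: Russia (俄罗斯), fire (火灾), Taobao star jewelry (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)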
An Example in Chinese
Adequate tag Adequate tag
Noise
Ambiguity
Hierarchical relation
New tag
Outline
1. Social Tag Recommendation
2. Finding the Structure of Social Tags by Subsumption Relations
3. Keyphrase Extraction from Text
4. Tag Supplement with Extracted Keyphrases
5. Application of Social Tags in Google Community Q&A
1. Social Tag Recommendation
Word weighting
TF*IDF
TF*ITF (Inverse Tag Frequency)
Word-based Tag Recommendation
word weight × word/tag dependency = tag weight
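The scoring rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the method from the talk: the word weight is taken to be TF*IDF with a smoothed logarithmic IDF, and the word/tag dependency is assumed to be the fraction of training documents containing a word that also carry the tag. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_word_tag_model(docs):
    """docs: list of (words, tags). Returns (idf, dependency).

    dependency[w][t] is the share of documents containing word w
    that are annotated with tag t (an assumed estimator).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    co_occur = defaultdict(Counter)
    for words, tags in docs:
        for w in set(words):
            doc_freq[w] += 1
            for t in set(tags):
                co_occur[w][t] += 1
    idf = {w: math.log(n_docs / df) + 1.0 for w, df in doc_freq.items()}
    dependency = {
        w: {t: c / doc_freq[w] for t, c in tag_counts.items()}
        for w, tag_counts in co_occur.items()
    }
    return idf, dependency

def recommend(words, idf, dependency, top_k=3):
    """tag weight = sum over words of (word weight * word/tag dependency)."""
    tf = Counter(words)
    scores = Counter()
    for w, freq in tf.items():
        if w not in idf:
            continue  # unseen word: no evidence for any tag
        word_weight = (freq / len(words)) * idf[w]
        for tag, dep in dependency.get(w, {}).items():
            scores[tag] += word_weight * dep
    return [tag for tag, _ in scores.most_common(top_k)]

# Toy training data (invented for illustration).
docs = [
    (["forest", "fire", "smoke", "russia"], ["fire", "russia"]),
    (["fire", "rescue", "smoke"], ["fire", "disaster"]),
    (["book", "novel", "review"], ["reading"]),
]
idf, dependency = train_word_tag_model(docs)
print(recommend(["smoke", "fire", "forest"], idf, dependency))
```

Swapping the IDF table for an inverse tag frequency table would give a TF*ITF variant of the same sketch.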
Data Set
Three datasets with different properties
BIBTEX: scholarly articles
Short text, keyphrase-intensive tags (model, RNA)
BLOGS: Chinese blog collection
Long text, wide variety of tags (Russian, Disaster)
DOUBAN: book sharing
Moderate-length text, classification-intensive tags
Use latent topics to explain tags
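Russia (俄国, 俄罗斯), Moscow, ...
disaster, fire situation, smoke, disaster victims, firefighting, ...
Tags: Russia (俄罗斯), fire (火灾), Taobao star jewelry (淘宝星饰品), Kostanay (科斯塔奈), favorites (收藏), disaster (灾难), smoke (烟)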
Topic-based Tag Recommendation
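One common way to let latent topics explain tags is to score each tag by summing, over topics, the probability of the topic given the document times the probability of the tag given the topic. The sketch below assumes that formulation and assumes both distributions are already available from a topic model such as LDA; the function name and all probabilities are invented for illustration.

```python
def topic_based_scores(doc_topics, topic_tags):
    """Score tags as sum_z P(z | doc) * P(tag | z).

    doc_topics: {topic: P(topic | doc)}
    topic_tags: {topic: {tag: P(tag | topic)}}
    Returns (tag, score) pairs sorted by descending score.
    """
    scores = {}
    for topic, p_topic in doc_topics.items():
        for tag, p_tag in topic_tags.get(topic, {}).items():
            scores[tag] = scores.get(tag, 0.0) + p_topic * p_tag
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical distributions for the forest-fire news example.
doc_topics = {"russia": 0.4, "disaster": 0.6}
topic_tags = {
    "russia": {"Russia": 0.7, "Moscow": 0.3},
    "disaster": {"fire": 0.5, "disaster": 0.3, "smoke": 0.2},
}
for tag, score in topic_based_scores(doc_topics, topic_tags)[:3]:
    print(tag, round(score, 2))
```

Because tags are reached through topics rather than through individual words, a tag can be recommended even when none of the words it usually co-occurs with appear in the document.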
Comparison of Word-based and Topic-based Tag Recommendation
Word-based:
(1) More adequate for keyphrase-intensive recommendation. May not be very strong for long documents.
(2) Simple and fast
Topic-based:
(1) More adequate for classification-intensive recommendation. Good for long documents.
(2) Complex and slow
Tag Allocation Model (TAM)
Word-based, LDA-like model
Design a “garbage bin” for noise
2. Finding the Structure of Social Tags by Subsumption Relations
The Subsumption Relation
nlp-conference subsumes coling2010
text-mining subsumes coling2010
Tag A subsumes tag B means that wherever tag B is used, tag A can be used too.
How to Find Subsumption Relations?
How likely is tag A to be used when tag B is used?
Sort all tag pairs by p(A | B), take the top ones.
The problem turns into correctly estimating p(A | B).
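The simplest estimate of how likely tag A is given tag B comes from direct co-occurrence counting: the number of documents carrying both tags divided by the number carrying B. The sketch below assumes that estimator. The function name, the `min_count` filter, and the toy annotations (including the `debian` tag) are hypothetical.

```python
from collections import Counter
from itertools import permutations

def subsumption_candidates(tagged_docs, top_k=5, min_count=1):
    """Rank ordered pairs (A, B) by p(A | B) = count(A and B) / count(B).

    A high score suggests that tag A subsumes tag B.
    tagged_docs: iterable of tag collections, one per document.
    """
    tag_count = Counter()
    pair_count = Counter()
    for tags in tagged_docs:
        unique = set(tags)
        tag_count.update(unique)
        # Count every ordered pair (A, B) of distinct tags on the document.
        pair_count.update(permutations(sorted(unique), 2))
    scored = [
        ((a, b), both / tag_count[b])
        for (a, b), both in pair_count.items()
        if tag_count[b] >= min_count
    ]
    # Highest probability first; ties broken alphabetically for stability.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]

docs = [{"linux", "ubuntu"}, {"linux", "ubuntu"}, {"linux", "debian"}, {"linux"}]
for (a, b), p in subsumption_candidates(docs):
    print(f"{a} subsumes {b}? p = {p:.2f}")
```

Rare tags inflate this estimate (a tag used once always gives a probability of 1 for its co-tags), which is one reason to raise `min_count` on real data.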
TAG-WORD: Leverage the Content
Use the words in the documents as "bridges".
linux subsumes ubuntu?
The distro has a very efficient package management system called APT, which also handles kernels.
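One way to use content words as bridges is to sum, over words, the probability of the word in documents tagged B times the probability of tag A given the word. The sketch below assumes that decomposition and assumes simple token-count estimators for both factors; neither is confirmed by the slides. The function name and toy documents are hypothetical.

```python
from collections import Counter

def tag_word_estimate(docs, tag_a, tag_b):
    """Estimate p(A | B) as sum_w p(A | w) * p(w | B).

    docs: list of (words, tags).
    p(w | B): share of word tokens in B-tagged documents that are w.
    p(A | w): share of w's occurrences that fall in A-tagged documents.
    """
    word_total = Counter()    # all occurrences of each word
    words_with_a = Counter()  # occurrences inside documents tagged A
    words_with_b = Counter()  # occurrences inside documents tagged B
    for words, tags in docs:
        word_total.update(words)
        if tag_a in tags:
            words_with_a.update(words)
        if tag_b in tags:
            words_with_b.update(words)
    total_b = sum(words_with_b.values())
    if total_b == 0:
        return 0.0
    return sum(
        (words_with_a[w] / word_total[w]) * (count / total_b)
        for w, count in words_with_b.items()
    )

docs = [
    (["apt", "kernel"], {"linux", "ubuntu"}),
    (["kernel", "driver"], {"linux"}),
    (["recipe", "soup"], {"cooking"}),
]
print(tag_word_estimate(docs, "linux", "ubuntu"))  # p(linux | ubuntu)
print(tag_word_estimate(docs, "ubuntu", "linux"))  # p(ubuntu | linux)
```

Every word in the B-tagged documents contributes here, whether or not it has anything to do with the relation between the two tags.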
Problem in TAG-WORD
Bad "bridge words" outnumber good ones.
The TAG-REASON Approach
What if we identify the words that really help?
The Tag Allocation Model (TAM)
Three Ways to Estimate p(A | B)
TAG-TAG
Direct counting.
TAG-WORD
Use all content words as bridges.
TAG-REASON
Use relevant content words as bridges.
Will show the comparison later...
DAG and Layered-DAG
Raw relation graph contains redundancy.
Layered-DAG: all paths between two tags have equal lengths.
DAG and Layered-DAG
Start with a graph G* that contains only the highest-weighted relation.
Each time, add to G* the highest-weighted remaining relation that contains at least one tag in the current graph.
If G* is still a layered DAG, keep the relation.
Otherwise, remove the newly added relation.
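The greedy construction above can be sketched as follows. The layered-DAG test applies the definition given earlier literally (acyclic, and all directed paths between two tags have equal length) by brute force, which is adequate only for small graphs. The function names, the example relations and their weights are hypothetical.

```python
from collections import defaultdict

def is_layered_dag(edges):
    """True if the graph is acyclic and all directed paths between
    any two tags have the same length."""
    succ = defaultdict(set)
    nodes = set()
    for parent, child in edges:
        succ[parent].add(child)
        nodes.update((parent, child))
    indegree = {n: 0 for n in nodes}
    for parent in list(succ):
        for child in succ[parent]:
            indegree[child] += 1
    # Kahn's algorithm: a full topological order exists only if acyclic.
    ready = [n for n in nodes if indegree[n] == 0]
    order = []
    while ready:
        n = ready.pop()
        order.append(n)
        for child in succ[n]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        return False
    # From every start tag, collect the lengths of all paths to each tag.
    for start in nodes:
        lengths = defaultdict(set)
        lengths[start].add(0)
        for n in order:
            for child in succ[n]:
                lengths[child].update(d + 1 for d in lengths[n])
        if any(len(ds) > 1 for ds in lengths.values()):
            return False
    return True

def build_layered_dag(weighted_relations):
    """weighted_relations: list of ((parent, child), weight)."""
    remaining = sorted(weighted_relations, key=lambda r: r[1], reverse=True)
    if not remaining:
        return []
    graph = [remaining.pop(0)[0]]  # start from the highest-weighted relation
    tags = set(graph[0])
    while True:
        # Highest-weighted remaining relation touching the current graph.
        index = next((i for i, (rel, _) in enumerate(remaining)
                      if rel[0] in tags or rel[1] in tags), None)
        if index is None:
            return graph
        relation, _ = remaining.pop(index)
        if is_layered_dag(graph + [relation]):
            graph.append(relation)  # keep it
            tags.update(relation)
        # otherwise the relation is simply discarded

relations = [
    (("linux", "ubuntu"), 0.9),
    (("os", "linux"), 0.8),
    (("os", "ubuntu"), 0.7),   # would create paths of length 1 and 2
    (("ubuntu", "linux"), 0.6),  # would create a cycle
    (("music", "jazz"), 0.5),  # never touches the current graph
]
print(build_layered_dag(relations))
```

In this toy run the shortcut `os -> ubuntu` is rejected because `os -> linux -> ubuntu` already exists, which is how the layering constraint removes redundant relations.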
Experiments
Two datasets
BLOG (blog posts) and DOUBAN (book collections)
~10,000 documents each.
Ground truth
Pooling
Compared methods
TAG-TAG, TAG-WORD and TAG-REASON
A hierarchical clustering method by Heymann et al.
3. Keyphrase Extraction from Text
3.1. Keyphrase Extraction via Topic Decomposition
3.2. Local and Global Lexical Relations in Keyphrase Extraction
Motivation
The state of the art of keyphrase extraction
Basic method
• Supervised
Learning algorithms for keyphrase extraction (Turney, 2000)
• Unsupervised
TFIDF
TextRank: Bringing Order into Texts (Rada Mihalcea and Paul Tarau, 2004)
Example: Arafat Says U.S. Threatening to Kill PLO Officials
Yasser Arafat on Tuesday accused the United States of threatening to kill
PLO officials if Palestinian guerrillas attack American targets. The United
States denied the accusation. The State Department said in Washington
that it had received reports the PLO might target Americans because of
alleged U.S. involvement in the assassination of Khalil Wazir, the PLO's
second in command. Wazir was slain April 16 during a raid on his house
near Tunis, Tunisia. Israeli officials who spoke on condition they not be
identified said an Israeli squad carried out the assassination. There have
been accusations by the PLO that the United States knew about and
approved plans for slaying Wazir. Arafat, the Palestine Liberation
Organization leader, claimed the threat to kill PLO officials was made in a
U.S. government document the PLO obtained from an Arab government.
He refused to identify the government. In Washington, Assistant
Secretary of State Richard Murphy denied Arafat's accusation that the
United States threatened PLO officials. … Arafat said the document
"reveals the U.S. administration is planning, in full cooperation with the
Israelis, to conduct a crusade of terrorist attacks and then to blame the
PLO for them. These attacks will then be used to justify the assassination
of PLO leaders."
Basic Idea
Two assumptions
• Relevance
Good keyphrases should be relevant to the major topics of the given document.
• Coverage
An appropriate set of keyphrases should also have good coverage of the document's major topics.
Building Topic Interpreters
Method: Latent Dirichlet Allocation (LDA)
Dataset: Wikipedia snapshot from March 2008
Figure: An example of a probabilistic topic model
3.2. Local and Global Lexical Relations in Keyphrase Extraction
Another Factor in Keyphrase Extraction
Lexical relation in a document (Local, TextRank)
Lexical relation in a document set (Global)
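For the local case, a stripped-down TextRank-style ranking can be sketched as below. It is a simplification, not the published algorithm: there is no part-of-speech filtering and no merging of adjacent words into phrases, edges are unweighted, and the PageRank update is the normalized variant in which scores sum to 1. The function name, parameters and token list are hypothetical.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank on an undirected co-occurrence graph.

    Two distinct words are linked if they appear within `window`
    positions of each other (window=2 means adjacent words).
    """
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    nodes = list(neighbors)
    if not nodes:
        return {}
    score = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        # Each word shares its score equally among its neighbors.
        score = {
            w: (1 - damping) / len(nodes)
            + damping * sum(score[u] / len(neighbors[u]) for u in neighbors[w])
            for w in nodes
        }
    return score

tokens = ["plo", "officials", "plo", "arafat", "plo", "accusation"]
scores = textrank(tokens)
print(max(scores, key=scores.get))  # prints "plo"
```

A global variant would build the same kind of graph from co-occurrences across a whole document set instead of a single document.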
4. Tag Supplement with Extracted Keyphrases
Emergence of New Tags
New tags emerge continually
They are most likely new nouns for new events
Text is an important source for new tags
5. Application of Social Tags in Google Community Q&A
Background
Launched: Arabic, Thailand, Russia, China
Coming: more Southeast Asian countries, Africa
21+ countries, being a serious competitor or the dominant one
Overview of Confucius
Trigger a discussion/question session during search (SI)
Provide labels to a post (semi-automatically) (QL)
Given a post, find similar posts (automatically) (QR)
Route questions to users
Evaluate user credentials in a topic-sensitive way (UR)
Evaluate quality of a post, relevance and originality (AQ)
Provide most relevant, high-quality content for Search to index
Fight spam
Natural language question answering (automatically) (NL)
Search Integration
Direct users from search to Confucius.
When the query is a question, or
When all results are of low quality
Ask "does china debt card work in singapore?" in Confucius
Question Labeling
Recommend labels to questions
Help organize the question flow
Leverage LDA model
Figure: labels (with the words of their questions, w(q1), w(q2), ...) and the new question are each mapped to topics, from which label recommendations are produced.
Conclusion
Social tag computation is a systematic effort:
1. Social tag recommendation
2. Structure finding of social tags
3. Keyphrase extraction from text
4. New tag generation using keyphrases extracted from text
5. Proper application of social tags