Mining Topic-Specific Concepts and Definitions on the Web
Bing Liu, et al.
KDD03
CS591CXZ
Web mining: Lexical relationship mining
Lexical relationship mining
• A lexical relationship is a relationship between words, such as synonym, antonym, hypernym (“dog” is a hypernym of “poodle”), and hyponym (“poodle” is a hyponym of “dog”)
• A lexical relationship is a connection between the meanings of two words in a text which helps the text to hold together. Relevant connections include (rough) synonymy (e.g. woman - person, win - victory) and connections in a field of meaning (e.g. plane - pilot).
Thus, subtopic mining is in this category, but definition mining is not.
Information Extraction
• MUC
  http://www.itl.nist.gov/iaui/894.02/related_projects/muc/
Information Extraction: the extraction or pulling out of pertinent information from large volumes of texts

Items of Information    Percentile Reliability
Entities                90
Attributes              80   (definition falls here)
Facts                   70
Events                  60
Attribute: a property of an entity such as its name, alias, descriptor, or type
Mining Topic-Specific Concepts and Definitions on the Web
• Goal: Systematically learn an unfamiliar topic from the Web
  • Definitions
  • Topic hierarchy
• Input: a term (“data mining”, “Web mining”)
• Tasks
  – Identify sub-topics or salient concepts
    • Like building ontology, but no clear hierarchy
      E.g.: Genetic Algorithm
    • Algorithms
  – Find and organize definition pages
    • Definition question answering
  – Concept disambiguation
Techniques
• A lot of heuristics
  – Simple linguistic patterns
    {concept} {-|:} {definition}
    {concept} {refer(s) to | satisfy(ies)} …
  – Web page tags
    <h1>,…,<h4> <b> <em> <li> …
• Frequent pattern mining
  – A classic data mining technique
Algorithm
WebLearn(T)
• Submit T to a search engine, get relevant pages
• Mine subtopics or salient concepts of T
• Find definition pages
• Output the concepts and definition pages to users
If a user wants to know more about a subtopic T’, do WebLearn(T’)
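The WebLearn loop can be sketched roughly as below. This is only an illustration of the control flow on the slide: the three steps are passed in as callables, and `wants_more`, `max_depth`, and the toy `fake_*` helpers are assumptions made for the sketch, not part of the paper.

```python
from typing import Callable, Dict, List

Page = str  # assumption: a page is represented by its text only


def web_learn(
    topic: str,
    search: Callable[[str], List[Page]],
    mine_concepts: Callable[[str, List[Page]], List[str]],
    find_definition_pages: Callable[[str, List[Page]], List[Page]],
    wants_more: Callable[[str], bool] = lambda subtopic: False,
    max_depth: int = 2,
) -> Dict:
    """Run one WebLearn pass for `topic`, then recurse into chosen subtopics."""
    pages = search(topic)                               # submit T to a search engine
    concepts = mine_concepts(topic, pages)              # subtopics / salient concepts
    definitions = find_definition_pages(topic, pages)   # definition pages
    result = {
        "topic": topic,
        "concepts": concepts,
        "definition_pages": definitions,
        "children": [],
    }
    if max_depth > 1:
        for subtopic in concepts:
            if wants_more(subtopic):                    # the user asks to drill down
                result["children"].append(
                    web_learn(subtopic, search, mine_concepts,
                              find_definition_pages, wants_more, max_depth - 1)
                )
    return result


# Toy stand-ins (hypothetical) so the sketch runs without a search engine.
FAKE_INDEX = {
    "data mining": ["page about clustering", "page about classification"],
    "clustering": ["page about k-means"],
}


def fake_search(topic: str) -> List[Page]:
    return FAKE_INDEX.get(topic, [])


def fake_mine(topic: str, pages: List[Page]) -> List[str]:
    return [page.rsplit(" ", 1)[-1] for page in pages]


def fake_definitions(topic: str, pages: List[Page]) -> List[Page]:
    return pages[:1]


result = web_learn("data mining", fake_search, fake_mine, fake_definitions,
                   wants_more=lambda s: s == "clustering")
```

Here the "user" drills into “clustering” only, so `result["children"]` holds a single nested WebLearn result.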
Mining subtopic/salient concept (1)
Input: a set of top-ranked relevant documents
Steps:
1. Filter out “noisy” documents
• Publication listing pages
  “in proceeding”, “journal”
• Forum discussion pages
  “previous message”, “reply to”
• Pages that do not contain all query terms
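A minimal sketch of the noise filter in step 1, assuming plain substring matching on lowercased page text. The slide only lists the cue phrases; how they are matched (and whether any thresholds are used) is an assumption here, and `is_noisy` is a hypothetical name.

```python
# Cue phrases taken from the slide; matching them as substrings is an assumption.
NOISE_PHRASES = ("in proceeding", "journal", "previous message", "reply to")


def is_noisy(page_text: str, query: str) -> bool:
    """Return True if the page looks like a listing/forum page or misses a query term."""
    text = page_text.lower()
    if any(phrase in text for phrase in NOISE_PHRASES):
        return True
    # Keep only pages that contain every query term.
    return not all(term in text for term in query.lower().split())


pages = [
    "Data mining tutorial: an overview of methods",
    "Reply to: previous message about data mining",
    "Papers on data mining. In Proceedings of KDD",
    "An introduction to text mining",
]
clean_pages = [p for p in pages if not is_noisy(p, "data mining")]
```

A single occurrence of “journal” is enough to drop a page in this sketch, which is probably stricter than a practical filter would be.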
Mining subtopic/salient concept (2)
2. Identify important phrases in each page
• Extract text segments in HTML emphasizing tags
  <h1>,…,<h4> <b> <em> <li> …
• Except those containing:
  • Salutation title (Mr. Dr. Professor)
  • URL or email address
  • “conference”, “journal” …
  • Digits (e.g. KDD2004)
  • Images
  • Too many words (15 words as limit)
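Step 2 could be sketched with the standard-library `html.parser` as below. The tag set and exclusion rules follow the slide; the regular expression, the class name `EmphasisExtractor`, and the assumption that emphasis tags are properly closed are choices made for this sketch.

```python
import re
from html.parser import HTMLParser

EMPHASIS_TAGS = {"h1", "h2", "h3", "h4", "b", "em", "li"}
MAX_WORDS = 15
# Salutations, URLs, email addresses, "conference"/"journal", and any digit.
REJECT = re.compile(
    r"\b(mr|dr|professor)\b\.?|https?://|www\.|\S+@\S+|conference|journal|\d",
    re.IGNORECASE,
)


class EmphasisExtractor(HTMLParser):
    """Collect the text of emphasis tags, skipping segments that contain images."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open emphasis tags: [tag, text parts, has_image]
        self.segments = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for frame in self.stack:
                frame[2] = True
        elif tag in EMPHASIS_TAGS:
            self.stack.append([tag, [], False])

    def handle_data(self, data):
        for frame in self.stack:
            frame[1].append(data)

    def handle_endtag(self, tag):
        # Close the innermost open tag with this name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                _, parts, has_image = self.stack.pop(i)
                text = " ".join("".join(parts).split())
                if text and not has_image:
                    self.segments.append(text)
                break


def keep(segment: str) -> bool:
    """Apply the exclusion rules from the slide to one text segment."""
    return not REJECT.search(segment) and len(segment.split()) <= MAX_WORDS


def emphasized_phrases(html: str) -> list:
    parser = EmphasisExtractor()
    parser.feed(html)
    parser.close()
    return [s for s in parser.segments if keep(s)]


sample = (
    "<h1>Data Mining</h1><p>Intro text</p>"
    "<ul><li>Association rules</li><li>Dr. Smith's page</li><li>KDD2004</li></ul>"
    '<b>Clustering <img src="x.png"></b><em>classification</em>'
)
phrases = emphasized_phrases(sample)
```

Real pages often leave `<li>` unclosed; handling that would need extra logic that this sketch omits.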
Mining subtopic/salient concept (3)
3. Mine frequent phrases
• Input: emphasized text segments
• Mine frequent word sets using association rule mining technique
4. Eliminate word sets unlikely to be subtopics
• Heuristic: those that do not appear alone in emphasizing tags in any page
  e.g. “process”
• Remove generic words from result set
  “abstract”, “introduction”, “conclusion”, “research”,…
5. Rank result sets
  According to the number of pages in which they occur
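Step 3 can be illustrated with a small level-wise (Apriori-style) frequent word set miner. This is a sketch, not the miner used in the paper: each emphasized segment is treated as one transaction, and `min_support` and `max_size` are made-up parameters. Steps 4 and 5 (pruning and ranking by page count) are not covered here.

```python
from itertools import combinations


def frequent_word_sets(segments, min_support=2, max_size=3):
    """Return {word set: support} for word sets found in >= min_support segments."""
    transactions = [frozenset(s.lower().split()) for s in segments]
    frequent = {}
    # Level 1 candidates: every single word.
    candidates = {frozenset([w]) for t in transactions for w in t}
    size = 1
    while candidates and size <= max_size:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge pairs of frequent sets that differ in exactly one word.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == size + 1}
        size += 1
    return frequent


segments = [
    "association rule mining",
    "association rule",
    "clustering",
    "association rule discovery",
    "clustering methods",
]
freq = frequent_word_sets(segments, min_support=2)
```

With these toy segments, “association” and “rule” are frequent both alone and as a pair, while “mining”, “discovery”, and “methods” fall below the support threshold.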
Definition Finding
• Definition identification patterns suitable for Web pages
  {concept} {-|:} {definition}
  {concept} {refer(s) to | satisfy(ies)} …
• HTML structuring clues and hyperlinks
• If only one header <h1>, <h2>,… or one big emphasized segment at the beginning => definition page
• Look up definition pages up to the second level of the hyperlinks, and only hyperlinks with anchor text matching the concept
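The two definition patterns listed above could be written as regular expressions roughly as follows. The exact expressions are an interpretation of the slide notation, and `find_definition` is a hypothetical helper; the paper's full pattern list (the “…”) is not reproduced.

```python
import re


def definition_patterns(concept: str):
    """Build regexes for '{concept} {-|:} {definition}' and '{concept} refer(s) to ...'."""
    c = re.escape(concept)
    return [
        # {concept} {-|:} {definition}; whitespace after the separator avoids
        # matching hyphenated words such as "mining-based".
        re.compile(rf"\b{c}\s*[-:]\s+(?P<definition>.+)", re.IGNORECASE),
        # {concept} {refer(s) to | satisfy(ies)} ...
        re.compile(rf"\b{c}\s+(?:refers? to|satisf(?:y|ies))\s+(?P<definition>.+)",
                   re.IGNORECASE),
    ]


def find_definition(concept: str, sentence: str):
    """Return the definition text if the sentence matches a pattern, else None."""
    for pattern in definition_patterns(concept):
        match = pattern.search(sentence)
        if match:
            return match.group("definition").strip()
    return None


d1 = find_definition("data mining",
                     "Data mining: the extraction of patterns from large data sets.")
d2 = find_definition("Web mining",
                     "Web mining refers to the use of data mining techniques on Web data.")
d3 = find_definition("data mining", "We teach a data mining course.")
```

The third sentence mentions the concept but matches neither pattern, so no definition is returned for it.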
Subtopic disambiguation
• By adding context terms
  – usually parent topic or subtopics
  • context terms tend to dominate results
  • cannot work for the first (root) topic
• Heuristics to combat domination of context terms
  – only consider text segments containing the topic or subtopic
  – identify pages with topic hierarchy
    HTML list tag <li>
    The hierarchy should also contain other subtopics of the parent topic
  – shallow linguistic phenomena
    Topic + “approaches” / “techniques” + ( + “e.g” / “such as” / “including” + subtopics )
Then, how does this help disambiguate?
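The shallow linguistic pattern (Topic + “approaches” / “techniques” + cue + subtopics) might be matched as in the sketch below. The regular expression and the way the subtopic list is split are assumptions; `subtopics_from_cue` is a hypothetical name, and the sketch only handles lists that end at a period or parenthesis.

```python
import re


def subtopics_from_cue(topic: str, sentence: str) -> list:
    """Extract subtopics listed after 'topic approaches/techniques such as ...'."""
    pattern = re.compile(
        rf"\b{re.escape(topic)}\s+(?:approaches|techniques)\s*,?\s*\(?\s*"
        rf"(?:e\.g\.?,?|such as|including)\s+(?P<items>[^.()]+)",
        re.IGNORECASE,
    )
    match = pattern.search(sentence)
    if not match:
        return []
    # Split the enumeration on commas and on the words "and" / "or".
    items = re.split(r",|\band\b|\bor\b", match.group("items"))
    return [item.strip() for item in items if item.strip()]


s1 = subtopics_from_cue(
    "classification",
    "Common classification techniques including decision trees, "
    "neural networks, and naive Bayes.",
)
s2 = subtopics_from_cue("clustering", "Clustering approaches (e.g. k-means, DBSCAN)")
s3 = subtopics_from_cue("clustering", "Clustering is useful.")
```

A page where a subtopic appears in such an enumeration next to its sibling subtopics is evidence that the page uses the term in the sense of the parent topic.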
Evaluation
• Use Google to get the initial set of relevant pages
• Result 1: subtopics / salient concepts
  Looks pretty good, terms are closely relevant
  More salient concepts than subtopics
• Result 2: definition discovery comparison
  Precision: WebLearn vs Google vs AskJeeves
• Result 3: disambiguation
  Seems to be useful
Analysis
• Interesting topic
  Potentially usable in practice
• A complete system
• Techniques
  – Avoid NLP, Machine Learning
  – Apply heuristics of shallow text structures
Limitations
• Research topics, not much ambiguity
• Techniques:
  – Heuristics are empirical, by no means flawless or exhaustive, and hard to apply to other domains
How to improve? -- discussion
• Better research:
  – do you think it is a good research topic?
• Better techniques:
  – what techniques would you like to try to solve the problem?
Thank you!