Meena Nagarajan: Ph.D. Dissertation Defense

DESCRIPTION
Understanding User-generated Content on Social Media Platforms

TRANSCRIPT
Understanding User-generated Content on Social Media
Meena Nagarajan
Ph.D. Dissertation Defense
Kno.e.sis Center, College of Engineering and Computer Science
Wright State University
1
Introductions and Thank-you!
2
Social Information Needs
• Facts, Networked Public Conversations, Opinions, Emotions, Preferences..
3
Social Information Needs
• Can we use this information to assess a population’s preference?
• Can we study how these preferences propagate in a network of friends?
• Are such crowd-sourced preferences a good substitute for traditional polling methods?
4
Social Information Processing
• "Who says what, to whom, why, to what extent and with what effect?" [Lasswell]
• Network: social structure emerges from the aggregate of relationships (ties)
• People: poster identities, the active effort of accomplishing interaction
• Content: studying the content of communication
ABOUTNESS of textual user-generated content via the lens of TEXT MINING
5
Aboutness Of Text
• One among several terms used to express certain attributes of a discourse, text or document
• characterizing what a document is about, what its content, subject or topic matter are
• A central component of knowledge organization and information retrieval
• For machine and human consumption
6
Aboutness & Subgoals in IE
• Named entity recognition
• Co-reference, anaphora resolution
• e.g., "International Business Machines" and "IBM"; ‘he’ in a passage refers to the mention of ‘John Smith’
• Terminology, key-phrase, lexical chain extraction
• Relationship and fact extraction
• e.g., ‘person works for organization’
7
Text Mining and Aboutness
• Thesis focus: ‘Aboutness’ understanding via Text Mining
• Gleaning meaningful information from natural language text useful for particular purposes
• Indicators of thematic elements for aboutness
• via NER, Key phrase extraction
8
Aboutness & The Role Of Context
• Extracting thematic elements: interpretation of the individual elements in context
• (a) I can hear bass sounds. (b) They like grilled bass.
• Typical context cues that are employed
• Word Associations, Linguistic Cues, Syntactic, Structural Cues, Knowledge Sources..
9
10
1.2. THESIS CONTRIBUTIONS – ‘ABOUTNESS’ OF INFORMAL TEXT August 10, 2010
User-generated content on Twitter during the 2009 Iran Election
show support for democracy in Iran: add green overlay to your Twitter avatar with 1-click - http://helpiranelection.com/
Twitition: Google Earth to update satellite images of Tehran #Iranelection http://twitition.com/csfeo @patrickaltoft
Set your location to Tehran and your time zone to GMT +3.30. Security forces are hunting for bloggers using location/timezone searches
User comments on music artist pages on MySpace
Your music is really bangin!
You’re a genius! Keep droppin bombs!
u doin it up 4 real. i really love the album.
hey just hittin you up showin love to one of chi-town’s own. MADD LOVE.
Comments on Weblogs about movies and video games
I decided to check out Wanted demo today even though I really did not like the movie
It was THE HANGOVER of the year..lasted forever..so I went to the movies..bad choice picking GI Jane worse now
Excerpt from a blog around the 2009 Health Care Reform debate
Hawaii’s Lessons - NY Times. In Hawaii’s Health System, Lessons for Lawmakers. Since 1974, Hawaii has required all employers to provide relatively generous health care benefits to any employee who works 20 hours a week or more. If health care legislation passes in Congress, the rest of the country may barely catch up. Lawmakers working on a national health care fix have much to learn from the past 35 years in Hawaii, President Obama’s native state. Among the most important lessons is that even small steps to change the system can have lasting effects on health. Another is that, once benefits are entrenched, taking them away becomes almost impossible. There have not been any serious efforts in Hawaii to repeal the law, although cheating by employers may be on the rise. But perhaps the most intriguing lesson from Hawaii has to do with costs. This is a state where regular milk sells for $8 a gallon, gasoline costs $3.60 a gallon and the median price of a home in 2008 was $624,000, the second-highest in the nation.
Figure 1.1: Examples of user-generated content from different social media platforms
• Unmediated interpersonal communication
• Informal English domain
• Context is implicit
• Interactions between like-minded people
• Variations and creativity in expression
• Properties of the medium
One solution rarely fits all social media content
Thesis Contributions
• Compensating for informal, highly variable language and lack of context
• Examining the usefulness of multiple context cues for text mining algorithms
• Context cues: document corpus, syntactic and structural cues, the social medium, and external domain knowledge
• End goal: NER, Key Phrase Extraction
11
Thesis Statements
• We show that for two Aboutness understanding tasks, NER and Key Phrase Extraction
• Multiple contextual cues can supplement and improve the reliability and performance of existing NLP/ML algorithms
• Improvements tend to be robust across domains and data sources
12
13
Thesis Contributions
Task: Aboutness of text
Context cues (arranged by text formality): in content; medium metadata, structural cues; external knowledge sources
NER - Movie Names (Weblogs)
• In content: word associations from large corpora
• Structural cues: Blog URL, Title, Post URL
• External knowledge: Wikipedia Infoboxes
NER - Music Album/Track Names (MySpace Music Forum)
• In content: word associations from large corpora, POS tags, syntactic dependencies
• Structural cues: Page URL
• External knowledge: MusicBrainz, UrbanDictionary
Examples:
"I loved your music Yesterday!"
"It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking ‘GI Jane’ worse now"
4.1. KEY PHRASE EXTRACTION - ‘ABOUTNESS’ OF CONTENT August 10, 2010
document that are descriptive of its contents.
The contributions made in this thesis fall under the second category of extracting key phrases
that are explicitly present in the content and are also indicative of what the document is ‘about’.
The focus of previous approaches to key phrase extraction has been on extracting phrases
that summarize a document, e.g. a news article, a web page, a journal article or a book. In
contrast, the focus of this thesis is not in summarizing a document generated by users on social
media platforms but to extract key phrases that are descriptive of information present in multiple
observations (or documents) made by users about an entity, event or topic of interest.
The primary motivation is to obtain an abstraction of a social phenomenon that makes volumes
of unstructured user-generated content easily consumable by humans and agents alike. As an
example of the goals of our work, Table 4.1 shows key phrases extracted from online discussions
around the 2009 Health Care Reform debate and the 2008 Mumbai terror attack, summarizing
hundreds of user comments to give a sense of what the population cared about on a particular day.
2009 Health Care Reform      | 2008 Mumbai Terror Attack
Health care debate           | Foreign relations perspective
Healthcare staffing problem  | Indian prime minister speech
Obamacare facts              | UK indicating support
Healthcare protestors        | Country of India
Party ratings plummet        | Rejected evidence provided
Public option                | Photographers capture images of Mumbai
Table 4.1: Showing summary key phrases extracted from more than 500 online posts on Twitter around two news-worthy events on a single day.
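As a rough illustration of extracting such summary phrases from many short posts about one event, the sketch below ranks n-grams by how many posts they appear in. The stopword list and example posts are invented, and real extraction in this thesis uses richer thematic, spatial and temporal cues; this is a minimal frequency-based stand-in, not the thesis algorithm.

```python
import re
from collections import Counter

# Rough sketch of summary key phrase extraction over many short posts:
# rank n-grams by the number of posts they occur in.  The stopword list
# and posts below are illustrative placeholders.

STOP = {"the", "a", "an", "of", "to", "is", "on", "and", "in", "for"}

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def key_phrases(posts, n=2, top=3):
    df = Counter()  # document (post) frequency of each n-gram
    for post in posts:
        tokens = [t for t in re.findall(r"[a-z']+", post.lower()) if t not in STOP]
        df.update(set(ngrams(tokens, n)))
    return [p for p, _ in df.most_common(top)]

posts = [
    "the health care debate heats up",
    "public option back in the health care debate",
    "protestors rally over the health care debate",
]
top_phrases = key_phrases(posts, n=2, top=2)
```

Phrases shared across many posts ("health care", "care debate") rise to the top, which is the intuition behind summarizing hundreds of posts with a handful of key phrases.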
Solutions to key phrase extraction have ranged from unsupervised techniques that are based on heuristics to identify phrases to supervised learning approaches that learn from human
14
Thesis Contributions
Task: Aboutness of text
Key Phrase Extraction and Key Phrase Elimination (Twitter; Facebook, MySpace Forums)
• In content: n-grams for thematic cues; word associations from large corpora
• Medium metadata, structural cues: spatial, temporal metadata; Page Title
• External knowledge: seeds from a domain knowledge base
15
Thesis Contributions
Building Social Intelligence Applications: WHO, WHAT, WHEN, WHERE, WHY, HOW
Building on results of NER and Key Phrase Extraction:
1. Application of NER results: BBC Sound Index, with IBM Almaden
2. Application of Key Phrase Extraction: Twitris @ Kno.e.sis
Thesis Significance, Impact
• Focuses on relatively less explored content aspects of expression on social media platforms
• Why text on social media is different from what most text mining applications have focused on
• Combination of top-down and bottom-up analysis for informal text
• Statistical NLP and ML algorithms over large corpora
• Models and rich knowledge bases in a domain
16
TALK OUTLINE - In Detail
ABOUTNESS UNDERSTANDING
• Named Entity Identification in Informal Text
TALK OUTLINE - Overviews
• Topical Key Phrase Extraction from Informal Text
• Applications and Consequences of Understanding Content: Social Intelligence Applications
• BBC SoundIndex, Twitris
17
18
Named Entity Recognition
"I loved your music Yesterday!"
"It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking ‘GI Jane’ worse now"
Thesis Contributions
19
Predominant Focus of Prior Work                              | Thesis Focus
Entity Types: PER, LOC, ORGN, DATE, TIME.. [TREC]            | Entity Types: Cultural Entities
Method: Sequential Labeling                                  | Method: Spot and Disambiguate (pre-supposed knowledge)
Document Types: Scientific Literature, News, Blogs (formal)  | Document Types: Social Media Content, Blogs, MySpace Forums
Features: Word-Level, List-lookup, Document and corpus features | Features: Word-Level, List-lookup, Document and corpus features
Cultural Named Entities
20
• NER focus in my work: Cultural Named Entities
• Names of books, music albums, films, video games, etc.
• The Lord of the Rings, Lips, Crash, Up, Wanted, Today, Twilight, Dark Knight...
• Common words in a language
Characteristics of Cultural Entities
• Varied senses, several poorly documented
• "Merry Christmas" covered by 60+ artists; Star Trek: movies, TV series, media franchise.. and cuisines!!
• Changing contexts with recent events: The Dark Knight as a reference to Obama, health care reform
• Unrealistic expectations: comprehensive sense definitions, enumeration of contexts, labeled corpora for all senses..
21
NER: Relaxing the closed-world sense assumptions
A Spot and Disambiguate Paradigm
23
• NER is generally treated as a sequential prediction problem
• e.g., a NER system achieving a 90.8 F1 score on the CoNLL-2003 NER shared task (PER, LOC, ORG entities) [Ratinov and Roth]
• My approach: a Spot and Disambiguate paradigm
• Spot: a dictionary or list of entities we want to spot
• Disambiguate in context (natural language, domain knowledge cues)
• Binary classification
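The spot-and-disambiguate paradigm can be sketched as below. This is a minimal illustration, not the dissertation's implementation: the entity list and cue set are invented, and a toy cue-overlap rule stands in for the trained binary classifier.

```python
import re

# Illustrative sketch of the Spot and Disambiguate paradigm.
# ENTITY_LIST and MOVIE_CUES are hypothetical; a trained binary
# classifier would replace the toy cue-overlap rule below.

ENTITY_LIST = ["Wanted", "Twilight", "Up"]  # dictionary of entities to spot
MOVIE_CUES = {"movie", "movies", "film", "watch", "watching", "demo"}

def spot(text, entities=ENTITY_LIST):
    """Step 1: spot candidate mentions via dictionary lookup."""
    return [(e, m.start())
            for e in entities
            for m in re.finditer(r"\b%s\b" % re.escape(e), text)]

def disambiguate(text, cues=MOVIE_CUES):
    """Step 2: binary decision -- is the spotted mention used in the
    target (movie) sense?  Cue overlap stands in for the classifier."""
    context = {w.lower().strip(".,!?") for w in text.split()}
    return len(context & cues) > 0

post = "I decided to check out the Wanted demo even though I did not like the movie"
labels = [(e, disambiguate(post)) for e, _ in spot(post)]
```

The two stages mirror the paradigm: spotting presupposes knowledge of which entities we care about, and disambiguation is a per-mention binary decision driven by context cues.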
Thesis Contributions
24
Predominant Focus of Prior Work                              | Thesis Focus
Entity Types: PER, LOC, ORGN, DATE, TIME                     | Entity Types: Cultural Entities
Method: Sequential Labeling                                  | Method: Spot and Disambiguate (pre-supposed knowledge)
Document Types: Scientific Literature, News, Blogs (formal)  | Document Types: Informal Social Media Content, Blogs, MySpace Forums, Twitter, Facebook
Features: Word-Level, List-lookup, Document and corpus features | Features: SENSE-BIASED Word-Level, List-lookup, Document and corpus features
NER Algorithmic Contributions: Supervised, Two Flavors
25
3.2. THESIS FOCUS - CULTURAL NER IN INFORMAL TEXT August 10, 2010
(a) Multiple senses in the same music domain
Bands with a song "Merry Christmas": 60
Songs with "Yesterday" in the title: 3,600
Releases of "American Pie": 195
Artists covering "American Pie": 31

(b) Multiple senses in different domains for the same movie entities
Twilight: Novel, Film, Short story, Albums, Places, Comics, Poem, Time of day
Transformers: Electronic device, Film, Comic book series, Album, Song, Toy line
The Dark Knight: Nickname for comic superhero Batman, Film, Soundtrack, Video game, Themed roller coaster ride
Table 3.3: Challenging Aspects of Cultural Named Entities
3.2.3 Two Approaches to Cultural NER
In this thesis, we present two approaches to Cultural NER, both addressing different challenges in
their identification. Cultural entities display two characteristic challenges related to their sense or
meanings – certain Cultural entities are so commonly used that they tend to have multiple senses
in the same domain. The music industry is a great example of this scenario where popular themes
feature in several track/album titles of different artists. Table 3.3(a) shows examples of such cases
– for example, there are more than 3600 songs with the word ‘Yesterday’ in their title.
Connecting mentions of such entities in free text to their actual real-world references is rather
challenging, especially in light of poor contextual information. If a user post mentioned the song
‘Merry Christmas’, as in, “This new Merry Christmas tune is so good!”; it is non-trivial to disambiguate its reference to one among 60 artists who have covered that song.
On the other hand, there are Cultural entities that span multiple domains. The phrase, ‘The
Hangover’ is a named entity in the film and music domain. Movies that are based on novels
or video games are great examples of such cases of sense ambiguity. Resolving the mention of
‘Wanted’ in Figure 3.2 as a reference to the video game entity (and not the movie reference) is a
“I am watching Pattinson scenes in <movie id=2341>Twilight</movie> for the nth time.” “I spent a romantic evening watching the Twilight by the bay..”
“I love <artist id=357688>Lilyʼs</artist> song <track id=8513722>smile</track>”.
NER - Approach 1
26
Approach 1: Multiple Senses, Multiple Domains
• When a Cultural entity appears in multiple senses across domains in the same corpus
27
3.3. CULTURAL NER – MULTIPLE SENSES ACROSS MULTIPLE DOMAINS August 10, 2010
Title: Peter Cullen Talks Transformers: War for Cybertron
Recently, we heard legendary Transformers voice actor Peter Cullen talk not only about becoming a hero to millions for his portrayal of the heroic Autobot leader, Optimus Prime, but also about being the first person to play the role of video game icon Mario. But today, he focuses more on the recent Transformers video game release, War for Cybertron.
Following are some excerpts from an interview Cullen recently conducted with Techland. On how the Optimus Prime seen in War for Cybertron differs from the versions seen in other branches of the franchise and its multiverse...
Figure 3.1: Showing excerpt of a blog discussing two senses of the entity ‘Transformers’
3.3.1 A Feature Based Approach to Cultural NER
In this work, we propose a new feature that represents the complexity of extracting particular entities. We hypothesize that knowing how hard it is to extract an entity is useful for learning better
entity classifiers. With such a measure, entity extractors become ‘complexity aware’, i.e. they can
respond differently to the extraction complexity associated with different target entities.
Suppose that we have two entities, one deemed easy to extract and the other more complex.
When a classifier knows the extraction complexity of the entity, it may require more evidence (or
apply more complex rules) in identifying the more complex entity compared to the easier target.
Consider concretely a movie recognition system dealing with two movies, say, ‘The Curious Case of Benjamin Button’, a title appearing only in reference to a recent movie, and ‘Wanted’, a segment
with wider senses and intentions. With comparable signals a traditional NER system can only
apply the same inference to both cases whereas a ‘complexity aware’ system has the advantage of
Algorithm Preliminaries
• Problem Space
• Corpus: Weblogs; distribution: unknown
• All senses of a cultural entity: unknown
• Problem Definition
• Input: a target sense (e.g., movies); a list of entities to be extracted
• Goal: disambiguate every entity mention as related to the target sense or not
28
Contribution: Improving NER with a feature-based approach
• Improving classifiers using a novel feature: the "complexity of extraction" in a target sense
• Hypothesis: knowing how hard or easy it is to extract an entity in a particular sense will improve the extraction accuracy of learners
• Making classifiers ‘complexity aware’: ‘The Curious Case of Benjamin Button’ vs. ‘Wanted’
29
Overview
Uncharacterized population (blog corpus), target sense (movies)
List of movies to extract: The Curious Case of Benjamin Button, Twilight, Date Night, Death at a Funeral, The Last Song, Up, Angels and Demons
Sample Population

Entity                               | Complexity of Extraction
The Curious Case of Benjamin Button  | 0.2
Date Night                           | 0.5

Use Complexity of Extraction as a feature in named entity classifiers
NOTE: An entity occurring in fewer varied senses (The Curious Case of Benjamin Button) could still have a high complexity of extraction if the distribution is skewed away from the sense of interest!
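A minimal sketch of making a classifier ‘complexity aware’: the per-entity complexity score is simply appended to whatever base features the extractor already uses. The scores below are the illustrative numbers from the slide, and the base feature values are placeholders, not learned features.

```python
# Sketch: appending the complexity-of-extraction score to an entity's
# base feature vector so a downstream classifier can respond to it.
# The scores and base features are illustrative placeholders.

COMPLEXITY = {
    "The Curious Case of Benjamin Button": 0.2,  # well-supported movie sense
    "Date Night": 0.5,                           # more ambiguous usage
}

def featurize(entity, base_features):
    """Base features (word-level, list-lookup, ...) plus the
    complexity-of-extraction feature; unseen entities default to
    maximal complexity."""
    return base_features + [COMPLEXITY.get(entity, 1.0)]

vec = featurize("Date Night", [1.0, 0.0, 1.0])
```

With comparable base signals, the classifier can now demand more evidence before labeling a high-complexity entity like ‘Wanted’ than an unambiguous one.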
Extraction in a Target Sense
• Complexity of extraction in a sense of interest = how much support the corpus shows toward that sense
• How do we find this?
• Documents that mention the entity in word contexts that are biased toward our sense of interest (language models)
• More documents imply more support, which implies the entity is easy to extract: low complexity of extraction
31
Support via Word Associations
• Co-occurring words alone won’t cut it!
• Prolific discussion and comparison of different senses
• Co-occurrence based language models will give us everything unless we bias them to our sense (movies)
32
Complexity of Extraction
• Goal: complexity of extraction in a target sense
• Subgoal: support in terms of sense-biased contexts in documents that mention the entity
• Step 1: Extract a sense-biased LM
• Step 2: Identify documents that mention the entity in the context of the sense-biased LM
33
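The two steps above can be approximated as in this sketch: given a sense-biased LM (here a hand-picked term set, an assumption standing in for the learned model), count how many entity-mentioning documents use the entity in LM contexts, and invert the support fraction. The documents and overlap threshold are invented for illustration.

```python
# Sketch of the complexity-of-extraction measure: of all documents that
# mention the entity, how many use it in contexts drawn from the
# sense-biased language model?  SENSE_LM and the docs are illustrative.

SENSE_LM = {"movie", "film", "theater", "director", "scene"}  # sense-biased LM terms

def complexity_of_extraction(entity, docs, sense_lm=SENSE_LM, min_overlap=1):
    mentions = [d for d in docs if entity.lower() in d.lower()]
    if not mentions:
        return 1.0  # no evidence at all: treat as maximally complex
    supported = sum(
        1 for d in mentions
        if len({w.strip(".,!") for w in d.lower().split()} & sense_lm) >= min_overlap
    )
    # lots of sense-biased support implies low complexity of extraction
    return 1.0 - supported / len(mentions)

docs = [
    "watched the movie Up at the theater",
    "looking Up at the sky today",
    "Up was my favorite film this year",
    "prices are going Up again",
]
score = complexity_of_extraction("Up", docs)
```

Here only two of the four mentions of ‘Up’ occur in movie-sense contexts, so the entity scores a middling complexity even though it is a single short title.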
Knowledge Features to seed Sense-biased Word Association Gathering
34
• Sense Definition (hints) from Wikipedia Infoboxes
• Working definition: Sense is domain of interest
• Use sense hints to derive contextual support
Lots of support and easy extraction imply a low ‘complexity of extraction’ score!
Measuring ‘complexity of extraction’
Two-step framework (unsupervised)
• Step 1: Propagate sense evidence in the contexts of e; extract a sense-biased language model (LM)
• random walks, distributional similarity approaches
• SPREADING ACTIVATION NETWORKS
35
[Figure: documents D mentioning e; sense hint nodes yield a sense-biased language model]
Overview
• Result: clustered documents in similar senses
• Not just similar words!

                    doc 1          doc 2          ...   doc n
  sense LM term 1   SenseRel(t1)   SenseRel(t1)         SenseRel(t1)
  sense LM term 2   SenseRel(t2)
  ...
  sense LM term m   SenseRel(tm)                        SenseRel(tm)

• Step 2: Clustering documents represented by sense-relatedness vectors
• CHINESE WHISPERS CLUSTERING
Constructing the SAN
J. J. Abrams, Damon Lindelof, Roberto Orci, Alex Kurtzman, Paramount Pictures, Chris Pine, Zachary Quinto, Eric Bana, Zoe Saldana, Karl Urban, John Cho, Anton Yelchin, Simon Pegg, Bruce Greenwood, Leonard Nimoy, Kirk, Spock, Nero, Pavel Chekov, Nyota Uhura, ... (indicative of being a Named Entity)
Star Trek, Startrek
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
Top X keywords (IDF):
• among the context surrounding (but excluding) the entity of interest
• force-include sense-related words
Spock, IMAX, .., Kirk, Karl Urban, James, .., canon, Chris Pine, libidinous
Activation Network
[Figure: co-occurrence subgraph over the context keywords, e.g., edges of weight 1 among Spock, libidinous, imax, Chris Pine, Kirk]
Repeat this procedure for all blogs; we end up with a connected SAN with some sense nodes and other words in the context of the entity.
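The construction described above can be sketched roughly as follows. Keyword selection is simplified (no IDF ranking here), and the node/edge conventions follow the slides: sense nodes get weight 1, other nodes 0.1, and edge weights are co-occurrence counts.

```python
# Minimal sketch of constructing the spreading activation network (SAN):
# per blog, take the keywords around (but excluding) the entity, then add
# co-occurrence edges between keywords appearing in the same blog.
from collections import defaultdict
from itertools import combinations

def build_san(blog_keywords, sense_hints):
    """blog_keywords: one keyword list per blog.
    Returns (node_weights, edge_counts) of the connected SAN."""
    node_w = {}
    edge_c = defaultdict(int)
    for kws in blog_keywords:
        for w in kws:
            # sense-hint nodes seeded at 1.0, all other nodes at 0.1
            node_w.setdefault(w, 1.0 if w in sense_hints else 0.1)
        # each pair of keywords co-occurring in the same blog adds one count
        for a, b in combinations(sorted(set(kws)), 2):
            edge_c[(a, b)] += 1
    return node_w, edge_c

blogs = [["spock", "imax", "libidinous"],
         ["chris pine", "kirk", "spock"]]
nodes, edges = build_san(blogs, sense_hints={"imax"})
```

Edges are keyed by sorted word pairs so that co-occurrence is undirected, matching the slides' use of raw co-occurrence counts as edge weights.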
Node and Edge Semantics
• Pre-adjustment phase
• Node weights: sense nodes: 1; other nodes: 0.1
• Ambiguous sense nodes
• Alternate seeding methods: distributional similarity with unambiguous domain terms (movie, theatre, imax, cinemas)
• Edge weights: co-occurrence counts
38
Propagating Sense Evidences
Constructing the spreading activation network G from words co-occurring with e in D: sense-hint vertices Y, other vertices X
[Figure: SAN over context words of e — Eric Bana, Sulu, Romulan, movie, franchise, Chris Pine, J. J. Abrams, starship, seats — with sense-hint vertices highlighted and non-activated vertices unshaded]
Post propagation of sense evidences: Spreading Activation Theory
Pulse the sense nodes and spread the effect; as many pulses (iterations) as there are sense nodes
39
The final activated portions of the network indicate a word’s relatedness to the sense = the sense-biased LM
At every iteration: a BFS walk starting at a sense node (weight 1), revisiting nodes (not edges), amplifying the weights of visited nodes:
W[j] = W[j] + (W[i] * co-occ[i, j] * α)
Collective spreading is controlled by the damping factor α and co-occurrence thresholds
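A minimal sketch of the pulse procedure, using the update rule W[j] = W[j] + (W[i] * co-occ[i, j] * α) from the slide. The graph encoding (adjacency lists, co-occurrence counts keyed by sorted word pairs) and the once-per-edge traversal detail are assumptions:

```python
# Sketch of one propagation pulse: a BFS walk from a sense node, revisiting
# nodes (not edges), amplifying each reached node's weight by
# W[j] += W[i] * co_occ[i, j] * alpha.
from collections import deque

def pulse(weights, adj, co_occ, source, alpha=0.5):
    """One pulse: spread evidence outward from a sense node via BFS."""
    seen = set()
    q = deque([source])
    visited = {source}
    while q:
        i = q.popleft()
        for j in adj[i]:
            e = tuple(sorted((i, j)))
            if e in seen:
                continue            # traverse each edge once per pulse
            seen.add(e)
            weights[j] += weights[i] * co_occ[e] * alpha
            if j not in visited:    # nodes may be revisited via new edges
                visited.add(j)
                q.append(j)

def propagate(weights, adj, co_occ, sense_nodes, alpha=0.5):
    for s in sense_nodes:           # as many pulses as sense nodes
        pulse(weights, adj, co_occ, s, alpha)
    return weights

adj = {"movie": ["spock"],
       "spock": ["movie", "libidinous"],
       "libidinous": ["spock"]}
co = {("movie", "spock"): 2, ("libidinous", "spock"): 1}
w = {"movie": 1.0, "spock": 0.1, "libidinous": 0.1}
propagate(w, adj, co, sense_nodes=["movie"], alpha=0.5)
```

After the pulse, words closer (and more strongly co-occurring) with the sense node carry higher weights, which is exactly the relatedness signal the sense-biased LM keeps.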
Sense-biased LM
• Entity: Star Trek (movie)
• 20 iterations (pulsed sense nodes)
• 900+ blogs; 35K+ words in the co-occurrence graph; 167 words in the LM
Sense-biased spreading activation already lends one type of clustering (separation of words strongly related to our sense)
40
Documents D Represented in terms of LMe
[Thumbnails of blog documents, e.g., the “10 minutes. That is all it took for JJ Abrams to make a believer out of me...” Star Trek review]
di(LMe) = { w1, LMe(w1) ; ... ; wx, LMe(wx) }
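The representation di(LMe) = {w1, LMe(w1); ...; wx, LMe(wx)} can be sketched as follows; the LM scores here are made up for illustration:

```python
# Sketch: represent a document as a vector over the sense-biased LM.
# Each LM word present in the document gets its sense-relatedness score
# (instead of a TF-IDF score). Scores below are illustrative only.

def doc_vector(doc_tokens, sense_lm):
    """Keep only LM words present in the doc, weighted by sense relatedness."""
    present = set(doc_tokens)
    return {w: score for w, score in sense_lm.items() if w in present}

sense_lm = {"spock": 1.1, "imax": 0.9, "libidinous": 0.65}
tokens = "saw star trek at the imax libidinous spock scene".split()
vec = doc_vector(tokens, sense_lm)
# a document sharing no words with the LM gets an empty vector --
# such documents fall into the "No Representation" bucket
```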
Step 2: Clustering Using the Extracted LM
Clustering documents in D along the same dimensions as the propagation
[Figure: blog documents grouped into clusters by sense relatedness]
Algorithmic Implementations
Cluster scores are as high as the sense-relatedness scores of the terms in the documents in the clusters
41
Vector Space Model
• Typically: (word, TF-IDF score)
• Here: (word, sense-relatedness score)
http://realart.blogspot.com/2009/05/star-trek-balance-of-terror-from.html
http://susanisaacs.blogspot.com/2009/04/quantum-leap-convention.html
http://semioblog.blogspot.com/2009/01/retrofuturo-web.html
http://wilwheaton.net/2006/05/learn_to_swim.php
No Representation
Chinese Whispers
42
*[Biemann 2006] Biemann, C. (2006): Chinese Whispers - an Efficient Graph Clustering Algorithm and its Application to Natural Language Processing Problems. Proceedings of the HLT-NAACL-06 Workshop on TextGraphs, New York, USA.
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
10 minutes. That is all it took for JJ Abrams to make a believer out of me. 10 minutes. Let us set the stage for my viewing of Star Trek. IMAX? Check. Perfect seats? Check..not sit well with me was the libidinous Spock. It changed one of the fundamental aspects of the character for no good reason. Other than that, however, none of the changes to Trek canon particularly bothered me in a "get a life" kind of way.………….the special effects were stunning, and the performances were...wow. Chris Pine IS James T. Kirk. Karl Urban IS Leonard McCoy…Spock
• Randomized graph-clustering algorithm for undirected, weighted graphs
• Nodes are documents; edges represent dot-product similarity between documents
• Feature vector = language model from Step 1
• Partitions nodes, i.e. documents, based on maximum average similarity
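The clustering step above can be sketched as follows. This is a minimal illustration, not the dissertation's actual implementation: the documents, term weights, and the greedy randomized assignment (attach each document to the cluster with the highest average dot-product similarity, or start a new cluster) are all hypothetical stand-ins for the real algorithm and Step-1 language models.

```python
import random

# Toy language-model vectors (term -> weight) standing in for the Step-1
# output; documents and weights are illustrative, not from the real corpus.
docs = {
    "d1": {"pixar": 0.5, "carl": 0.3, "adventure": 0.2},
    "d2": {"pixar": 0.4, "russell": 0.4, "adventure": 0.1},
    "d3": {"crossword": 0.7, "movie": 0.1},
}

def dot(u, v):
    """Dot-product similarity between two sparse term-weight vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def cluster(docs, threshold=0.05, seed=0):
    """Greedy randomized clustering: visit documents in random order and
    attach each to the cluster with the highest average similarity to its
    members, starting a new cluster when no cluster is similar enough."""
    rng = random.Random(seed)
    order = list(docs)
    rng.shuffle(order)
    clusters = []  # each cluster is a list of document ids
    for d in order:
        best, best_sim = None, threshold
        for c in clusters:
            avg = sum(dot(docs[d], docs[m]) for m in c) / len(c)
            if avg > best_sim:
                best, best_sim = c, avg
        if best is None:
            clusters.append([d])
        else:
            best.append(d)
    return clusters
```

Here d1 and d2 share weighted terms ("pixar", "adventure") and end up in one cluster, while d3 has no overlap and remains a singleton.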
High and Low Scoring Clusters
43
++++++++++CLUSTER++++++++++ Cluster 1032, total of 4 members, score = 7.46775436953479E-05
  http://...mish-mashy-hodge-podge-with-no.html Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (5)
  http://perchedraven.blogspot.com/2006/06/reading-habits.html Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (4)
  http:/..9/05/27/new-york-times-crossword-will-shortz-corey-rubin/ Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (1)
  http://....no-doubt-have-heard-by-now.html Keywords: adventure 0.0016552556767791 (1), movie 0.000477036040807386 (1)
++++++++++CLUSTER++++++++++ Cluster 2382, total of 4 members, score = 0.130057194715825
  http://torontomike.com/2009/05/advanced_screening_of_pixars_u.html Keywords: comedy 0.00256627265885554 (1), adventure 0.0016552556767791 (2), carl 0.020754281327519 (1), pete docter 0.0549975353578234 (1), carl fredricksen 0.133630375166943 (1), russell 0.116638105837327 (1), pixar 0.0048327182073532 (2), digital 5.70733953613266E-05 (1), disney 2.05783382714942E-06 (2), fredricksen 1 (1), docter 0.016306935328765 (1)
  http://theplaylist.blogspot.com/2009/05/up-pixars-latest-is-profoundly.html Keywords: russell 0.116638105837327 (2), carl 0.020754281327519 (3), carl fredricksen 0.133630375166943 (1), pete docter 0.0549975353578234 (1), comedy 0.00256627265885554 (1), animation 0.0164047754350987 (1), pixar 0.0048327182073532 (7), film 1.32766200713231E-05 (4), adventure 0.0016552556767791 (1), movies 0.0399341006575118 (2)
Low-scoring cluster (1032): little evidence of relatedness to the target sense
High-scoring cluster (2382)
From Clusters to Support
Clustering documents in D along the same dimensions of propagation
44
Cluster scores are as high as the sense-relatedness scores of the terms in the documents they contain
• A conservative estimate, a heuristic
• Average strength of all clusters (A) ≈ average sense relatedness
• C* = clusters with score >= A
• Support = number of documents in strongly sense-related clusters |C*| / number of documents mentioning the entity |D|
The higher the proportion of documents in strongly sense-related clusters, the lower the entity’s complexity-of-extraction score
‘complexity of extraction’ of e = 1 − |C*| / |D|
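The support and complexity computation described above can be sketched directly from the definitions on this slide. The cluster scores and document counts below are invented for illustration; only the formulas (support = |C*| / |D|, complexity = 1 − support, with C* the clusters scoring at or above the average cluster strength A) come from the slide.

```python
# Hypothetical cluster strengths: cluster id -> (score, member-document count).
# Values are made up; scores play the role of average sense-relatedness.
clusters = {
    "c1": (0.130, 4),    # strongly sense-related cluster
    "c2": (0.00007, 4),  # weak cluster
    "c3": (0.045, 2),
}

def extraction_complexity(clusters, total_docs):
    """Support = documents in clusters scoring at or above the average
    cluster strength A, over all documents mentioning the entity;
    complexity of extraction = 1 - support."""
    avg = sum(score for score, _ in clusters.values()) / len(clusters)
    strong_docs = sum(n for score, n in clusters.values() if score >= avg)
    support = strong_docs / total_docs
    return 1.0 - support
```

With these toy numbers only c1 clears the average strength (A ≈ 0.058), so 4 of 10 documents support the target sense and the complexity is 0.6.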
Validating the Framework
45
• Intuitively, some (cultural) entities are harder to extract than others: ‘Up’ vs. ‘The Time Traveler’s Wife’
• For a list of X movies: selected blogs from the general corpus, obtained sense definitions from Wikipedia, and generated support for the movie sense using the proposed framework
46
“Up” and “Wanted” will arguably be harder to extract than a movie like “Angels and Demons” – clearly indicating that our approach for computing ‘extraction complexity’ is effective. A note on the confidence of these scores is presented in Section 4.2.
Table 1: Entities and their computed extraction complexities. Each row gives the entity and variations used in obtaining D, the possible senses found in Wikipedia disambiguation pages, and the complexity of extraction.
Twilight (novel, film, time of day, albums..) 0.4
Up (film, relative direction, abbreviations, albums..) 0.352
Wanted (film, common verb in English, video game, music..) 0.161
Star Trek, Startrek (film, tv series, video game, media franchise..) 0.114
Transformers (toy line, film, comic series, electronic device..) 0.085
The Hangover (unpleasant feeling, film, band, song.. ) 0.072
The Dark Knight, Dark Knight (Batman’s nickname, film, comic series, soundtrack..) 0.070
Angels and Demons, Angels & Demons (novel, movie, episode on The Blade: the series…) 0.066
4.2 NER Improvements
The underlying hypothesis behind this work is that knowing how hard an entity is to extract will allow classifiers to weigh cues differently when deciding whether a spotted mention is a valid entity in the target sense or not. In this second set of experiments, we measure the usefulness of our feature in assisting a variety of supervised classifiers.
Labeling Data
We randomly selected documents one after another from the pool of documents D collected for all entities E in Experiment 1 and labeled every occurrence of entity e and its valid variations in a document as either a movie entity or not, for a total of 1500 labeled spots. We also observed 100% inter-annotator agreement between the two authors over a random sample of 15% of the labeled spots, indicating that labeling for the movie vs. not-movie sense is not hard in this data. Figure 3 shows statistics for the percentage of true positives found for each entity. The percentage of entity mentions that appear in the movie sense implicitly indicates how much support there is for the target movie sense for the entities. It is interesting, then, that this order closely matches the extraction-complexity ordering of the entities, an indication that the approach we use for extracting our feature is sound. In the process of random labeling, the entity “Angels and Demons” received only 10 labels and was therefore discarded from this experiment.
Classifiers, Features, Experimental Setup: We used 3 different state-of-the-art entity classifiers for learning entity extractors – decision tree classifiers, bagging and boosting (using 6 nodes and stump base learners for boosting) [1, 4, 21]. The goal of using different classification models was to show that our measure is useful with different underlying prediction approaches rather than for the purpose of finding the most suitable classifier. We trained and tested the classifiers on an extensive list of features (Figure 4).
We used two well-known features: word-level features that indicate whether the spotted entity is capitalized, surrounded by quotes, etc., and contextual syntactic features that encode the Part-of-Speech tags of words surrounding the entity. We also used knowledge-derived features that indicate whether words already known to be relevant to the target sense of the entity (the sense definition of e) are found in the document, the surrounding paragraph, the title of the post, or in the post or blog URL. The intuition is that the presence of such words strengthens a mention's case for being valid. We also encoded similar features using the extracted language model LMe to test the usefulness of the new words we extracted as relevant to the target sense of the entity.
In addition to the basic word-level and syntactic features, we also measure the usefulness of our proposed feature against a strong ‘contextual entropy’ baseline. This baseline measures how specific a word is in terms of the contexts it occurs in. A general word will occur in varied contexts, and will have a high context entropy value [14]. High entropy in context distribution is an indication that extracting the entity in any sense might be hard. This baseline is very similar in spirit to our feature, except that our proposed measure identifies how hard it is to extract an entity in a particular target sense. We evaluated classifier accuracies in labeling test spots with and without our ‘complexity of extraction’ feature as a prior. Specifically, we used the following feature combinations:
a. Basic features: word-level, syntactic, knowledge features obtained from the sense definitions S and Sd.
b. Baseline: Basic + ‘contextual entropy’ feature as a prior.
c. Our measure: Basic + knowledge features obtained from the extracted LMe + ‘complexity of extraction’ feature as a prior.
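The ‘contextual entropy’ baseline can be sketched as follows; the `contextual_entropy` helper and its toy contexts are illustrative assumptions, not the cited implementation [14]:

```python
import math
from collections import Counter

def contextual_entropy(contexts):
    """Shannon entropy (in bits) of a word's context distribution.
    A general word occurs in varied contexts and gets a high value;
    a specific word concentrates in few contexts and gets a low one."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A word always seen in the same context is maximally specific:
specific = contextual_entropy(["watch film", "watch film", "watch film"])
# A word spread evenly over four contexts is more general:
general = contextual_entropy(["go up", "up next", "look up", "stand up"])
```

A high value signals that extracting the entity in any sense is likely to be hard, which is how the baseline is used as a prior.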
Figure 5 shows the precision-recall curves using the basic, baseline and our proposed measures for entity classification using the decision tree and boosting classifiers. We verified the stability of these results using 10-fold cross-validation. We see better performance of our measure compared to both the basic setting and the strong ‘contextual entropy’ baseline. Notably, there is overwhelming improvement in entity extraction over traditional extractor settings (basic features). The stability of the suggested improvement is also confirmed across both classifiers.
We see significant improvements using the proposed feature and now turn to confirm that this is indeed a consistent pattern. Here, we show the averaged performance of binary classification over 100 runs, each run using different and random samples of training and test sets (obtained from 50-50 splits).
We measured the F-measure and accuracy of the classifiers using the basic, baseline and our proposed measure features. Accuracy is defined as the number of correct classifications (both true positives and negatives) divided by the total number of classifications. We use accuracy to represent general classification improvement – when we care about classifying both the correct and incorrect cases. The F-measure is the standard harmonic mean of precision and recall of classification results and we use it to represent information retrieval improvement – when we only care about our target sense. We report both of these metrics here for consistency with past literature.
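With hypothetical confusion counts, these two definitions can be checked in a few lines (a sketch, not tied to the reported experiments):

```python
def accuracy(tp, tn, fp, fn):
    # correct classifications (true positives and negatives) over all cases
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn):
    # harmonic mean of precision and recall
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical confusion counts for one classifier run:
acc = accuracy(tp=80, tn=60, fp=20, fn=40)   # 140 / 200 = 0.70
f1 = f_measure(tp=80, fp=20, fn=40)          # precision 0.80, recall 0.667
```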
Figure 3 Labeled Data
Figure 4 Features used in judging NER improvements
Computed Extraction Complexities
Hand-labeled data% true positives
46
“Up” and “Wanted” will arguably be harder to extract than a movie like “Angels and Demons” – clearly indicating that our approach for computing ‘extraction complexity’ is effective. A note on the confidence of these scores is presented in Section 4.2.
Table 1: Entities and their computed extraction complexities (entity and variations used in obtaining D; possible senses found in Wikipedia disambiguation pages)
NER Improvements
• Goal: evaluate existing NE classifiers with and without the ‘complexity of extraction’ feature
• Spot-and-disambiguate paradigm
• Decision tree classifiers, bagging and boosting
• Cultural entity: movie entities in weblogs
• Binary classification: 1500+ multiple-author annotations from a 2130K general blog corpus
47
Feature Set
Basic:
• Boolean word-level features: first letter capitalized, all capitalized, in quotes
• Boolean contextual syntactic features: POS tags of words before and after the spot
• Knowledge features: Infobox sense definitions in the same blog, same paragraph as the entity, title, URL, blog URL
Baseline: Basic + contextual entropy prior
Our Measure: Basic + proposed ‘complexity of extraction’ prior
• Knowledge features: extracted sense-biased LM in the same blog, same paragraph as the entity, title, URL, blog URL
PR Curves (10-fold cross-validation)
basic: word-level, syntactic, sense definitions
baseline: Basic + contextual entropy
our measure: Basic + sense-biased LM + ‘complexity of extraction’
49
Binary Classification (over 100 runs): F-measure, Accuracy
basic: word-level, syntactic, sense definitions
baseline: Basic + contextual entropy
our measure: Basic + sense-biased LM + ‘complexity of extraction’
Basic features: average accuracy 74%; F-measure 69%
Proposed feature: +10% accuracy; +11% F-measure
50
Entity-level improvements: Basic vs. proposed feature sets
Accuracy improvements: The Dark Knight (+26.9%) and The Hangover (+31%)
F-measure improvements: Up (+12.6%), The Dark Knight (+14.9%) and The Hangover (+16.5%)
51
The Polluted LM of Twilight
• “I spent a romantic evening watching the Twilight…”
• “here are photos of the Twilight from the bay..” (photos turned up in the extracted LM: “red carpet photos of the Twilight crew”)
• “I am re-reading Edward Cullen chapters of Twilight” mentions Cullen, a character in the movie.
The extracted sense-biased LM terms (words in bold) were used to derive knowledge features and negatively impacted classifier results
Thesis Statement
• Knowledge features (Wikipedia Infobox/IMDB)
• From statistically significant co-occurrences to a sense-biased LM
• Baseline settings (state-of-the-art for cultural entities): F-score 69% for traditional extractor settings
• 90.8 F-score on the CoNLL-2003 NER shared task (PER, LOC, ORG entities) [Lev Ratinov, Dan Roth]
• Average improvements: +11% (F-score)
• Generic methods / proposed feature
53
APPLICATIONS
54
Several Applications of this Work
55
• Weak indicators for contextual browsing/search; a reduced document set for manual labeling
• Steps 1, 2: clustered sense-biased documents
• IE pipeline: ignore an entity if its extraction complexity is high
• Step 1 (LM generation): unsupervised domain lexicon generation
• restaurants and bars
Related Terms for Topic Classification
Unsupervised generation of associated words in the Restaurants and Bars topic
56
Step 1: Pulse on “Restaurant”
Step 2: Pulse on the top n surfaced terms from Step 1 (review, table, reservation, waiters)
57
restaurant, waiters, tasty, waiter, dish, nutrition, review, cooking, reviews, tibits, vegetarian, chef, sweet, bourdain, waitress, reservations, lunch, dishes, sushi, cuisine, burger, taste, burgers, fries, french, wines, tapas, wineries, wine, café, huang, vietnamese, espresso, anhui, coffee, shops, hotels, cafes, diners, bars, called, hefei, menus, chefs, michelin, dine, establishments, tourist, eateries, chain, meals, culinary, stores, pubs, food, retail, chains, specialty, bakeries, vendors, fuyang, restaurants, entrees, appetizers, salads, menu, assignment, shopper, shoppers, service, delicious, meal, paleo, eating, booths, tables, buffet, shrimp, chopsticks, eat, micah, tierney, dinners, dinner, mkhulu, san, tex, mexican, italian, pizza, brunch, bar, dining, steak, place, seafood, servers, salad, hostess, chinese, sandwich, patrons, bakery, eatery, local, outdoor, diner, mcdonald, greek, fancy, ate, ordering, cheese, business, thai, sandy, dined, hotel, japanese, afternoon, celebrate, birthday, cafe, table, downtown, francisco, good, seating, taco, foods, mex, night, soup, gift, chicken, banquet, anniversary, themed, pizzas, recommend, don, priced, pancakes, burrito, famous, neighborhood, drinks, potato, dessert, sausage, restaurateur, tonight, nearby, german, morton’s, casual, reception, kosher, ranch, favorite, servings, crab, appetizer, steaks, toilets, veggie, grilled, baked, pho, pasta, opened, wonderful, reservation, mussels, quaint, pancake, chinatown, foodies, oasis, swanky, kitchen, enjoyed, patio, work, upscale, friend, plate, cab, corner, coworkers, cooks, valentine, celebrated, arrive, stuffed, owners, discount, bistro, vegan, …
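The two-step “pulse” procedure above can be sketched as follows; the tiny corpus, the document-co-occurrence scoring, and the `pulse_lexicon` helper are all illustrative assumptions, not the system's actual implementation:

```python
from collections import Counter

def cooccurring_terms(corpus, seed, top_n):
    """Rank terms by how many documents they share with `seed`."""
    counts = Counter()
    for doc in corpus:
        words = set(doc.lower().split())
        if seed in words:
            counts.update(words - {seed})
    return [w for w, _ in counts.most_common(top_n)]

def pulse_lexicon(corpus, seed, top_n):
    # Step 1: pulse on the seed term
    lexicon = set(cooccurring_terms(corpus, seed, top_n))
    # Step 2: pulse on the top-n surfaced terms from Step 1
    for term in list(lexicon):
        lexicon.update(cooccurring_terms(corpus, term, top_n))
    return lexicon

corpus = [
    "restaurant waiters menu",
    "restaurant waiters table",
    "restaurant table reservation",
    "waiters table menu tips",
]
lex = pulse_lexicon(corpus, "restaurant", top_n=2)
```

A real run would use a statistical significance test over a large corpus rather than raw document counts.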
NER - Approach 2
58
• When a cultural entity has several senses in the same domain
• Domain: Music; Senses: album, tracks, band, artist name
Approach 2: Multiple senses same Domain
3.4. CULTURAL NER – MULTIPLE SENSES IN THE SAME DOMAIN August 12, 2010
3.4 Cultural NER – Multiple Senses in the Same Domain
Compared to our first contribution in Cultural NER (Section 3.3), which focused on disambiguating entity names that span different domains (e.g., movies vs. video games), the focus of our second contribution, detailed in this chapter, is the identification of Cultural entities that appear in multiple senses even within the same domain.
The occurrence of a person's last name such as ‘Clinton’ in text, even if restricted to documents in the political domain, could refer to Bill, Hillary, or Chelsea Clinton. Figure 3.12 shows another example: the word ‘Celebration’ used as the name of a band, song, album and track title by multiple artists in the music domain.
‘Celebration’ (song), a song by Kool & The Gang, notably covered by Kylie Minogue
‘Celebration’ (Voices With Soul song), the debut single from girl band, Voices With Soul
‘Celebration’, a song by Krokus from Hardware
‘Celebration’ (Simple Minds album), a 1982 album by Simple Minds
‘Celebration’ (Julian Lloyd Webber album), a 2001 album by Julian Lloyd Webber
‘Celebration’ (Madonna album), a 2009 greatest hits album by Madonna
‘Celebration’ (Madonna song), same-titled single by Madonna
‘Celebration’ (band), a Baltimore-based band
‘Celebration’ (Celebration album), a 2006 album by ‘Celebration’
‘Celebration’ (musical), a 1969 musical theater work by Harvey Schmidt and Tom Jones
Figure 3.12: Usages of the word ‘Celebration’ as the name of a band, hit song, album and track title by multiple artists in the music domain.
The goal of the algorithm described in this chapter is the fine-grained identification and disambiguation of such entities (in social media text) that have multiple real-world references within the same domain.
79
Algorithm Preliminaries
60
• Problem Space
• Cultural entity: music albums, tracks
• Smile (Lily Allen), Celebration (Madonna)..
• Corpus: MySpace comments
• Context-poor utterances
• “Happy 25th Lilly, Alfie is funny”
• Goal: semantic annotation of named entities (w.r.t. MusicBrainz)
Using a Knowledge Resource for NER
61
• 60 songs with “Merry Christmas”
• 3600 songs with “Yesterday”
• 195 releases of “American Pie”
• 31 artists covering “American Pie”
“Happy 25th! Loved your song Smile ..”
Semantic Annotation
Using a domain knowledge base is not straightforward
Approach Overview
62
“This new Merry Christmas tune.. SO GOOD!”
Which ‘Merry Christmas’? ‘So Good’ is also a song!
• Scoped relationship graphs
• using context cues from the content, webpage title, URL..
• Reduce potential entity spot size
• Generate candidate entities
• Spot and disambiguate
Scoping via Real-world Restrictions
“I heart your new album Habits”
• Eliminate album releases that are not ‘new’ using metadata in MusicBrainz
• From all of MusicBrainz (281890 artists, 6220519 tracks) to the tracks of one artist
• Restrictions closely follow the distribution of random restrictions, conforming loosely to a Zipf distribution
• Choosing which constraints to implement is simple: pick whatever is easiest first
[Chart: precision of the spotter on a log scale (0.001% to 100%) under different restrictions, from the entire MusicBrainz taxonomy (lowest precision), through restrictions such as artists who released an album in recent years or artists with a given number of albums, up to a spotter trained on only one artist’s songs (highest precision)]
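A minimal sketch of such scoping; the catalog, artist names, and two-year “new release” cutoff below are invented for illustration (the real system applies these restrictions against MusicBrainz metadata):

```python
from datetime import date

# Hypothetical scoped-down catalog: (artist, album, release_date)
catalog = [
    ("Artist A", "Habits", date(2010, 3, 16)),
    ("Artist A", "Older Album", date(2004, 5, 1)),
    ("Artist B", "Habits", date(1999, 1, 1)),
]

def scope(catalog, artist, mentioned_as_new, today=date(2010, 8, 1)):
    """Keep only the commented artist's releases; if the comment says
    'new', also drop releases older than an assumed two-year cutoff."""
    candidates = [r for r in catalog if r[0] == artist]
    if mentioned_as_new:
        candidates = [r for r in candidates if (today - r[2]).days <= 2 * 365]
    return [album for _, album, _ in candidates]

# "I heart your new album Habits" posted on Artist A's page:
result = scope(catalog, "Artist A", mentioned_as_new=True)
```

Each such restriction shrinks the candidate entity list before any spotting is attempted.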
Scoped Entity Lists
64
• User comments are on MySpace artist pages
• Contextual restriction: artist name (e.g., Madonna’s tracks)
• Assumption: no other artist/work is mentioned
• A naive spotter has the advantage of spotting all possible mentions (modulo spelling errors)
• but generates several false positives
“this is bad news, ill miss you MJ”
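A naive dictionary spotter over a scoped entity list might look like this sketch (hypothetical track list; not the system's code):

```python
def naive_spots(comment, scoped_entities):
    """Spot every occurrence of each scoped entity name in the comment,
    case-insensitively; there is no disambiguation, so false positives
    (e.g., the common-noun sense of a track name) are reported too."""
    text = comment.lower()
    spots = []
    for entity in scoped_entities:
        needle = entity.lower()
        start = text.find(needle)
        while start != -1:
            spots.append((entity, start))
            start = text.find(needle, start + 1)
    return spots

madonna_tracks = ["Celebration", "4 Minutes"]  # scoped entity list (illustrative)
spots = naive_spots("Loved Celebration! what a celebration it was", madonna_tracks)
```

Both occurrences are spotted even though only the first is a music mention, which is exactly the false-positive problem the disambiguation stage addresses.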
Non-Music Mentions
• Challenge 1: Several senses in the same domain
• Scoping relationship graphs narrows possible senses
• Challenge 2: Non-music mentions
• Got your new album Smile. Loved it!
• Keep your SMILE on!
valid mention?
65
Hand-labeling: Fairly Subjective
• 1800+ spots in MySpace user comments from artist pages
• “Keep your SMILE on!”: good spot, bad spot, inconclusive?
• 4-way annotator agreement
• Madonna: 90% agreement
• Rihanna: 84% agreement
• Lily Allen: 53% agreement
66
Supervised Learners
67
Table 6. Features used by the SVM learner
Syntactic features (Notation-S):
+POS tag of s (s.POS)
POS tag of one token before s (s.POSb)
POS tag of one token after s (s.POSa)
Typed dependency between s and a sentiment word (s.POS-TDsent*)
Typed dependency between s and a domain-specific term (s.POS-TDdom*)
Boolean typed dependency between s and sentiment (s.B-TDsent*)
Boolean typed dependency between s and a domain-specific term (s.B-TDdom*)
Word-level features (Notation-W):
+Capitalization of spot s (s.allCaps)
+Capitalization of first letter of s (s.firstCaps)
+s in quotes (s.inQuotes)
Domain-specific features (Notation-D):
Sentiment expression in the same sentence as s (s.Ssent)
Sentiment expression elsewhere in the comment (s.Csent)
Domain-related term in the same sentence as s (s.Sdom)
Domain-related term elsewhere in the comment (s.Cdom)
+ marks basic features; the others are advanced features.
* These features apply only to one-word-long spots.
Table 7. Typed Dependencies Example
Valid spot: Got your new album Smile. Simply loved it!
Encoding: nsubj(loved-8, Smile-5), implying that Smile is the nominal subject of the expression loved.
Invalid spot: Keep your smile on. You’ll do great!
Encoding: no typed dependency between smile and great.
Typed Dependencies: We also captured the typed dependency paths (grammatical relations) via the s.POS-TDsent and s.POS-TDdom features. These were obtained between a spot and co-occurring sentiment and domain-specific words by the Stanford parser [12] (see the example in Table 7). We also encode a boolean value indicating whether a relation was found at all, using the s.B-TDsent and s.B-TDdom features. This allows us to accommodate parse errors given the informal and often non-grammatical English in this corpus.
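Encoding the boolean typed-dependency features from a parse might look like this sketch; the dependency triples below are hypothetical parser output rather than an actual Stanford parser call:

```python
def encode_td_features(spot, deps, sentiment_words, domain_words):
    """deps: (relation, governor, dependent) triples from a dependency
    parse. Returns the boolean typed-dependency features for `spot`."""
    related = set()
    for _rel, gov, dep in deps:
        if spot == gov:
            related.add(dep)
        elif spot == dep:
            related.add(gov)
    return {
        "B-TDsent": bool(related & sentiment_words),
        "B-TDdom": bool(related & domain_words),
    }

# "Got your new album Smile. Simply loved it!" (hypothetical parse output)
deps = [("nsubj", "loved", "Smile"), ("amod", "album", "new")]
feats = encode_td_features("Smile", deps, {"loved"}, {"album"})
```

Here the spot Smile is grammatically related to the sentiment word loved but not to the domain word album, so only the sentiment feature fires.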
5.2 Data and Experiments
Our training and test data sets were obtained from the hand-tagged data (see Table 3). Positive and negative training examples were all spots that all four annotators had confirmed as valid or invalid respectively, for a total of 571 positive and 864 negative examples. Of these, we used 550 positive and 550 negative examples for training. The remaining spots were used for test purposes.
Our positive and negative test sets comprised all spots that three annotators had confirmed as valid or invalid spots, i.e., had a 75% agreement. We also included spots where 50% of the annotators had agreement on the validity of the
Generic syntactic, spot-level, domain, knowledge features
[“Multimodal Social Intelligence in a Realtime Dashboard System”, VLDB Journal, Special Issue on "Data Management and Mining on Social Networks and Social Media", 2010]
1. Sentiment expressions: slang sentiment gazetteer using Urban Dictionary
2. Domain-specific terms: music, album, concert..
Efficacy of Features
PR tradeoffs: choosing feature combinations depends on the end application’s requirements
• Recall-intensive: 90-35 (identified 90% of valid spots, eliminated 35% of invalid spots)
• 78-50
• Precision-intensive: 42-91
Thesis statement: feature combinations were the most stable and best performing
Gazetteer-matched domain words and sentiment expressions proved to be useful
Dictionary Spotter + NLP
69
Step 1: Spot with a naive spotter, knowledge base restricted to an artist’s tracks
• Madonna’s track spots: 23% precision
Step 2: Disambiguate using NL features (SVM classifier)
• Madonna’s track spots: ~60% precision
• 42-91: “All features” setting
[Chart: classifier accuracy on valid/invalid splits (precision and recall), showing precision for Lily Allen, Rihanna and Madonna and recall over all three, from the naive spotter baseline across feature combinations]
Thesis Statements
70
• Highlights issues with using domain knowledge for an IE task
• Two-stage approach: chaining NL learners over the results of domain-model-based spotters
• Improves accuracy by up to a further 50%
• allows the more time-intensive NLP analytics to run on less than the full set of input data
APPLICATIONS
71
BBC SoundIndex (IBM Almaden)Pulse of the Online Music Populace
http://www.almaden.ibm.com/cs/projects/iis/sound/
Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth: “Multimodal Social Intelligence in a Real-Time Dashboard System”, to appear in a special issue of the VLDB Journal on “Data Management and Mining for Social Networks and Social Media”, 2010
Domain metadata, Artist/Track
unstructured, structured metadata
ETL
UIMA Analytics Environment
Album/Track identification [ISWC09]
Sentiment Identification
Spam and off-topic comments
“U R $o Bad!”, “Thriller is my most fav MJ album”;“this is BAD news, miss u MJ”
ETL
Thriller/NNP is/VBZ my/PRP$ most/RBS fav/JJ MJ/NN album/NN
Extracted concepts into explorable data structures
ETL
What are 18 year olds in London listening to?
Crowd-sourced preferences
ETL
Several Insights..
74
[Charts: spam vs. non-spam and negative vs. positive sentiment splits; negative comments < 4% of the total]
Predictive Power of Data
21
38% of total comments were spam
61% of total comments had positive sentiments
4% of total comments had negative sentiments
35% of total comments had no identifiable sentiments
Table 7 Annotation Statistics
As described in Section 8, the structured metadata
(artist name, timestamp, etc.) and annotation results
(spam/non-spam, sentiment, etc.) were loaded in the
hypercube.
The data represented by each cell of the cube is the number of comments for a given artist. The dimensionality of the cube is dependent on what variables we are examining in our experiments. Timestamp, age and gender of the poster, geography, and other factors are all dimensions in the hypercube, in addition to the measures derived from the annotators (spam, non-spam, number of positive sentiments, etc.).
For the purposes of creating a top-N list, all dimensions except for artist name are collapsed. The cube is then sliced along the spam axis (to project only non-spam comments) and the comment counts are projected onto the artist name axis. Since the percentage of negative comments was very small (4%), the top-N list was prepared by sorting artists on the number of non-spam comments they had received, independent of the sentiment scoring.
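In the simplest case, the slice-and-project step described above reduces to filtering out spam and counting comments per artist. A minimal sketch with hypothetical records (artist names borrowed from Table 8; the real system operates on a multidimensional cube, not a flat list):

```python
from collections import Counter

# Hypothetical annotated comments: structured metadata plus annotator
# results, as loaded into the hypercube.
comments = [
    {"artist": "Soulja Boy", "spam": False, "sentiment": "pos"},
    {"artist": "Soulja Boy", "spam": False, "sentiment": "none"},
    {"artist": "T.I.", "spam": False, "sentiment": "pos"},
    {"artist": "T.I.", "spam": True, "sentiment": "pos"},  # dropped by the spam slice
    {"artist": "Rihanna", "spam": False, "sentiment": "neg"},
]

def top_n(records, n):
    """Collapse all dimensions except artist, slice along the spam axis,
    and project comment counts onto the artist axis. Sentiment is ignored,
    as in the paper (only ~4% of comments were negative)."""
    counts = Counter(r["artist"] for r in records if not r["spam"])
    return [artist for artist, _ in counts.most_common(n)]

ranking = top_n(comments, 3)  # Soulja Boy ranks first with 2 non-spam comments
```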
In Table 8 we show the top 10 most popular Billboard artists and the list generated by our analysis of MySpace for the week of the survey. While some top artists appear on both lists (e.g., Soulja Boy, Timbaland, 50 Cent, and Pink), there are important differences. In some cases, our MySpace Analysis list clearly identified rising artists before they reached the Top-10 list on Billboard (e.g., Fall Out Boy and Alicia Keys both climbed to #1 on Billboard.com shortly after we produced these lists). Overall, we can observe that the Billboard.com list contains more artists with a long history and large body of work (e.g., Kanye West, Fergie, Nickelback), whereas our MySpace Analysis list is more likely to identify "up and coming" artists. This is consistent with our expectations, particularly in light of the aforementioned industry reports which indicate that teenagers are the biggest music influencers (Mediamark, 2004).
11.1.2 The Word on the Street
Using the above lists, we performed a casual preference poll of 74 people in the target demographic. We conducted a survey among students of an after-school program (Group 1), Wright State (Group 2), and Carnegie Mellon (Group 3). Of the three different groups, Group 1 was comprised of respondents between ages 8 and 15, while Groups 2 and 3 were primarily comprised of college students in the 17-22 age group. Table 9 shows statistics pertaining to the three survey groups.
Table 8 Billboard's Top Artists vs. our generated list
Billboard.com  | MySpace Analysis
Soulja Boy     | T.I.
Kanye West     | Soulja Boy
Timbaland      | Fall Out Boy
Fergie         | Rihanna
J. Holiday     | Keyshia Cole
50 Cent        | Avril Lavigne
Keyshia Cole   | Timbaland
Nickelback     | Pink
Pink           | 50 Cent
Colbie Caillat | Alicia Keys
Table 9 Survey Group Statistics
Groups and Age Range | Male respondents | Female respondents
Group 1 (8-15)       | 8                | 9
Group 2 (17-22)      | 21               | 26
Group 3 (17-22)      | 7                | 3
The survey was conducted as follows: the 74 respondents were asked to study the two lists shown in Table 8. One was generated by Billboard and the other through the crawl of MySpace. They were then asked the following question: "Which list more accurately reflects the artists that were more popular last week?" Their response, along with their age, gender and the reason for preferring a list, was recorded.
The sources used to prepare the lists were not shown to the respondents, so they would not be influenced by the popularity of MySpace or Billboard. In addition, we periodically switched the lists while conducting the study to avoid any bias based on which list was presented first.
11.1.3 Results
The raw results of our study immediately suggest the validity of the system, as can be seen in Table 10. The MySpace-data-generated list is preferred more than 2 to 1 over the Billboard list by our 74 test subjects, and the preference is consistently in favor of our list across all three survey groups.
More exactly, 68.9 ± 5.4% of subjects prefer the SI-derived list to the Billboard list. Looking specifically at Group 1, the youngest survey group whose ages range from 8-15, we can see that our list is even more successful. Even with a smaller sample group (resulting in
User study indicated a 2:1, and up to 7:1 (younger age groups), preference for the MySpace list
Billboard's Top 50 Singles chart during the week of Sept 22-28 '07 vs. MySpace popularity charts
Challenging traditional polling methods!
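As a sanity check on the "68.9 ± 5.4%" figure quoted above: assuming it is a binomial proportion with a one-standard-error margin, a hypothetical split of 51 of the 74 respondents preferring the MySpace-derived list reproduces both numbers, as well as the roughly 2:1 preference.

```python
import math

k, n = 51, 74                    # hypothetical split consistent with the reported figure
p = k / n                        # sample proportion, ~0.689
se = math.sqrt(p * (1 - p) / n)  # one standard error, ~0.054
print(f"{100 * p:.1f} +/- {100 * se:.1f}%")     # -> 68.9 +/- 5.4%
print(f"preference ratio {k / (n - k):.1f}:1")  # -> preference ratio 2.2:1
```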
INTERMISSION? Up Next: Overview of Key Phrase Extraction,
Applications, Conclusions
Key Phrase Extraction
• Different from Information Extraction
• Extracting vs. Assigning Key Phrases
• Thesis Focus: Key Phrase Extraction
• Prior work focus: extracting phrases that summarize a document -- a news article, a web page, a journal article, a book..
• Thesis focus: summarize multiple documents (UGC) around same event/topic of interest
Key Phrase Extraction - Aboutness Understanding
• Prominent discussions (key phrases) around the 2009 Health Care Reform debate and 2008 Mumbai Terror Attack on one day
Key Phrase Extraction - Aboutness Understanding
4.1. KEY PHRASE EXTRACTION - ‘ABOUTNESS’ OF CONTENT August 10, 2010
document that are descriptive of its contents.
The contributions made in this thesis fall under the second category of extracting key phrases
that are explicitly present in the content and are also indicative of what the document is ‘about’.
The focus of previous approaches to key phrase extraction has been on extracting phrases that summarize a document, e.g. a news article, a web page, a journal article or a book. In contrast, the focus of this thesis is not in summarizing a document generated by users on social media platforms but to extract key phrases that are descriptive of information present in multiple observations (or documents) made by users about an entity, event or topic of interest.
The primary motivation is to obtain an abstraction of a social phenomenon that makes volumes
of unstructured user-generated content easily consumable by humans and agents alike. As an
example of the goals of our work, Table 4.1 shows key phrases extracted from online discussions
around the 2009 Health Care Reform debate and the 2008 Mumbai terror attack, summarizing
hundreds of user comments to give a sense of what the population cared about on a particular day.
2009 Health Care Reform     | 2008 Mumbai Terror Attack
Health care debate          | Foreign relations perspective
Healthcare staffing problem | Indian prime minister speech
Obamacare facts             | UK indicating support
Healthcare protestors       | Country of India
Party ratings plummet       | Rejected evidence provided
Public option               | Photographers capture images of Mumbai
Table 4.1: Showing summary key phrases extracted from more than 500 online posts on Twitter around two news-worthy events on a single day.
Solutions to key phrase extraction have ranged from unsupervised techniques based on heuristics for identifying phrases to supervised learning approaches that learn from human
105
Key Phrase Extraction on Social Media Content
• Thesis Focus: Summarizing Social Perceptions via key phrase extraction
• Preserving/Isolating the social behind the social data
• Accounting for redundancy, variability, off-topic content
80
“Met up with mom for lunch, she looks lovely as ever, good genes .. Thanks Nike, I love my new Gladiators ..smooth as a feather. I burnt all the calories of Italian joy in one run.. if you are
looking for good Italian food on Main, Buca is the place to go.”
Social and Cultural Logic in UGC
• Thematic components
• similar messages convey similar ideas
• Space, time metadata• role of community and geography in communication
• Poster attributes• age, gender, socio-economic status reflect similar
perceptions
81
Feature Space (in prior work and in thesis)
• Thesis Focus: n-grams, spatio-temporal metadata (social components)
• Syntactic Cues: In quotes, italics, bold; in document headers; phrases collocated with acronyms
• Document and Structural Cues: Two word phrases, appearing in the beginning of a document, frequency, presence in multiple similar documents etc.
• Linguistic Cues: Stemmed form of a phrase, phrases that are simple and compound nouns in sentences etc.
82
Key Phrase Extraction Overview
User-generated Content: textual component tc, temporal parameter tt, spatial parameter tg
Spatio-Temporal Clusters: δs Event Spatial Bias, δt Event Temporal Bias
Key Phrase Generation: n-gram generation
n-gram Weighting: Thematic, Temporal and Spatial Scores
Off-topic Key Phrase Elimination
“President Obama in trying to regain control of the health-care debate will likely shift his pitch in September”
1-grams: President, Obama, in, trying, to, regain, ...
2-grams: “President Obama”, “Obama in”, “in trying”, “trying to”...
3-grams: “President Obama in”, “Obama in trying”; “in trying to”...
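The n-gram generation step on the example sentence above can be sketched as:

```python
def ngrams(text, max_n=3):
    """Enumerate all 1-grams through max_n-grams of a post: the candidate
    key phrases fed to the weighting stage."""
    tokens = text.split()
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    ]

candidates = ngrams("President Obama in trying to regain control")
# includes "President", "President Obama", "President Obama in", ...
```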
Candidate descriptors: “President”, “President Obama”, “President Obama in”
• A descriptor is an n-gram weighted by:
• Thematic Importance: TFIDF, stop words, noun phrases
• redundancy: statistically discriminatory in nature
• variability: contextually important
• Spatial Importance (local vs. global popularity)
• Temporal Importance (always popular vs. currently trending)
Candidate descriptors: “President”, “President Obama”, “President Obama in”
Higher-order n-grams picked over lower-order n-grams (if same scores)
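One way to read the weighting scheme above as code. The combination below (a TF-IDF thematic score boosted by a local-vs-global ratio and a trending-vs-always ratio, with a higher-order tie-break) is an illustrative guess at the shape of the scoring, not the thesis formula; all counts and names are hypothetical.

```python
import math

def descriptor_score(tf, df, n_docs, local_freq, global_freq, recent_freq, total_freq):
    """Weight an n-gram by thematic (TF-IDF), spatial (local vs. global
    popularity) and temporal (currently trending vs. always popular)
    importance."""
    thematic = tf * math.log((1 + n_docs) / (1 + df))
    spatial = local_freq / (1 + global_freq)    # high when locally popular
    temporal = recent_freq / (1 + total_freq)   # high when trending now
    return thematic * (1 + spatial) * (1 + temporal)

def pick_descriptor(candidates):
    """On equal scores, prefer the higher-order n-gram
    ('President Obama' over 'President')."""
    return max(candidates, key=lambda c: (c["score"], c["order"]))

best = pick_descriptor([
    {"phrase": "President", "order": 1, "score": 3.2},
    {"phrase": "President Obama", "order": 2, "score": 3.2},
])
# best is the 2-gram "President Obama"
```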
Eliminating Off-topic Content [WISE2009]
• Frequency based heuristics will not eliminate off-topic content that is ALSO POPULAR
• “Yeah i know this a bit off topic but the other electronics forum is dead right now. im looking for a good camcorder, somethin not to large that can record in full HD only ones so far that ive seen are sonys”
• “Canon HV20. Great little cameras under $1000.”
87
Approach Overview
• Assume one or more seed words (from a domain knowledge base): C1 = ['camcorder']
• Extracted key words/phrases: C2 = ['electronics forum', 'hd', 'camcorder', 'somethin', 'ive', 'canon', 'little camera', 'canon hv20', 'cameras', 'offtopic']
• Gradually expand C1 by adding phrases from C2 that are strongly associated with C1
• Mutual Information based algorithm [WISE2009]
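A hedged sketch of the seed-expansion loop above. The thesis uses a mutual-information-based algorithm [WISE2009]; the pointwise-MI scoring, the threshold, and all counts below are illustrative assumptions, not the published formulation.

```python
import math

def pmi(co, fx, fy, n):
    """Pointwise mutual information of two phrases from co-occurrence
    counts over n posts; -inf when they never co-occur."""
    if co == 0:
        return float("-inf")
    return math.log((co * n) / (fx * fy))

def expand_seeds(seeds, candidates, cooc, freq, n, threshold=0.0):
    """Grow the topical set C1 by repeatedly adding candidate phrases from
    C2 strongly associated with any phrase already in C1; whatever is
    never added is treated as off-topic."""
    topical = set(seeds)
    grew = True
    while grew:
        grew = False
        for c in candidates:
            if c in topical:
                continue
            score = max(
                pmi(cooc.get((s, c), cooc.get((c, s), 0)), freq[s], freq[c], n)
                for s in topical
            )
            if score > threshold:
                topical.add(c)
                grew = True
    return topical

# Hypothetical counts over n=100 forum posts
freq = {"camcorder": 10, "canon hv20": 6, "cameras": 8, "recipe": 7}
cooc = {("camcorder", "canon hv20"): 5, ("canon hv20", "cameras"): 4}
topical = expand_seeds(["camcorder"], ["canon hv20", "cameras", "recipe"], cooc, freq, 100)
# 'recipe' never associates with the growing seed set, so it is left out
```

Note that "cameras" enters only transitively, through "canon hv20", which is the point of growing the set gradually rather than scoring everything against the original seed alone.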
Key Phrases & Aboutness - Evaluations
• Are the key phrases we extracted topical and good indicators of what the content is about?
• If so, they should act as effective index/search phrases and return relevant content
• Evaluation Application: Targeted Content Delivery
89
Targeted Content Delivery - Evaluations
• 12K posts from MySpace and Facebook Electronics forums
• Baseline phrases: Yahoo Term Extractor
• Our method phrases: Key phrase extraction, elimination
• Targeted Content from Google AdSense
90
Targeted Content for Extracted Key Phrases
91
A. Showing Advertisements generated for phrases identified by the Yahoo Term Extractor (YTE)
B. Showing Advertisements generated for topical phrases extracted by our algorithm
User Studies and Results
92
[User study results (garbled in source): users picked ads relevant to the post, with at least 80% inter-evaluator agreement; across the 50 posts, ads generated for the topical key phrases were picked as relevant far more often than those for the baseline (YTE) phrases]
Extracted, topical phrases yield 2X more relevant content
Thesis Statement
• TFIDF + social contextual cues yield more useful phrases that preserve social perceptions
• Corpus co-occurrence statistics + seeds from a domain knowledge base eliminate off-topic phrases effectively
93
APPLICATIONS
94
http://twitris.knoesis.org/
Twitris (with Kno.e.sis): online pulse around news-worthy events
(Mumbai Terror Attack '08, Health Care Debate 2010)
Meenakshi Nagarajan, Karthik Gomadam, Amit Sheth, Ajith Ranabahu, Raghava Mutharaju and Ashutosh Jadhav, "Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data: Challenges and Experiences," Tenth International Conference on Web Information Systems Engineering, Oct 5-7, 2009: 539-553
Chatter around news-worthy events
Hundreds of tweets, Facebook posts and blogs about a single event; multiple narratives, strong opinions, breaking news..
Preserving Social Perceptions
The Health Care Reform Debate
Zooming in on Florida
Summaries of Citizen Reports
Zooming in on Washington
Summaries of Citizen Reports
RT @WestWingReport: Obama reminds the faith-based groups "we're neglecting 2 live up 2 the call" of being R
brother's keeper on #healthcare
Providing Context
twitris: socially influenced browsing
Ashu, Raghava, Wenbo, Pramod, Vinh, Karthik, Meena, Amit, and Ajith
Kno.e.sis Center, Wright State University
Opinion on Iran Election from the US talks about oil, economies, blogging
Opinion on Iran Election from Iran talks about theocracy, oppression, demonstration
Spatial perspective
Capture changing perceptions and issues of interest every day; e.g., "legalize illegal immigrants" surfaced in the healthcare context on September 18.
Temporal perspective
Capture changing perceptions and issues of interest every day; "Nobel is no more the news for Obama!" captured October 12.
Find resources related to social perceptions
News and Wikipedia articles to put extracted descriptors in context
Twitris aggregates social perceptions from Twitter using a spatio-temporal-thematic approach. Twitris captures what was said, when it was said and where it was said. Fetch resources from the Web to explore perceptions further. Browse the Web for issues that matter to people, using people's perceptions as the fulcrum.
What does twitris do?
✓ Exploit spatio-temporal semantics for thematic aggregation
✓ Analyze the anatomy of a tweet: "RT @m33na come back and check new events on twitris #twitris" (RT: a retweet, or repost of a tweet; # hashtags: user-generated metadata; @: references to other users)
✓ Data from diverse sources (Twitter, news services, Wikipedia, and other Web resources)
✓ End user application
A few statistics from Twitris (unit: tweets)
Healthcare ( Aug 19 - Oct 20) : 721 K (US Only)
Obama (Oct 8 - 20): 312 K (US Only)
H1N1 (Oct 5 - 20) : 232 K (US Only)
Iran Election (June 5 - Oct 20) : 2.8 m (Worldwide)
Twitris UI: Concept Cloud, news and related articles
Google News widget and DBpedia widget, each driven by the context plus the selected term
Twitris architecture:
Data Collection: per-event crawlers (event-1 ... event-n) feeding Author Location Lookup and Geocode Lookup services
Data Processing: TFIDF-based descriptor extraction; spatio-temporal-thematic descriptor extraction; extracting storylines around descriptors via Twitter Search
Data Dumpers write through shared memory into the Twitris DB
Parallel crawling to scale. A data processing pipeline to streamline Twitter, geocode services and data analytics, and to handle heterogeneity. Live resource aggregation. Near real time: processing lags by up to a day. Spatio-temporally weighted text analytics.
twitris internals in less than 140 characters
Culled out user observations correlated well with mainstream media (news, blogs)
The fourth estate perspective
Caveats and Future work
1. Handle Twitter constructs such as hashtags, retweets, mentions and replies better
2. Different viz widgets, such as time series to show changing perceptions from a place for an event, and demographic-based visualizations
3. Sentiment analysis
4. Robust computing approaches (Cloud, Hadoop)
5. FB Connect for sharing and personalization
Check us out at: http://twitris.knoesis.org
Follow us @7w17r15
Become a FB Fan and share Twitris with everyone
Twitris: a Tetris-like approach to Twitter to gather aggregated social signals
SOYLENT GREEN and the HEALTH CARE REFORM
Information right where you need it!
CONCLUSIONS
103
Contributions and Summary
104
• Thesis motivation
• Understand characteristics of user-generated textual content on social media.
• Thesis demonstrated that
• UGC is different; variability and lack of context affect the performance of text mining tasks
• Example: 69% F-measure for NER on UGC
Summary
• Described frameworks and implementations
• How contextual knowledge from multiple sources can be integrated to supplement traditional NLP/ML algorithms
• Showed effectiveness of these frameworks for NER and key phrase extraction on UGC
• e.g., +11% average F-score improvement for NER in weblogs
• Building Social Intelligence Applications
Other Contributions
106
UGC Understanding Tasks
Facets: HOW, WHY, WHAT
Named Entity Recognition [ISWC2009]
Key Phrase Extraction [WISE2009, WI2009]
Semantic Document Classification [WWW2007]
Intent Mining [WI2009]
Network Effects [ICWSM2010]
Gendered Language Usage in Online Self-Expression [ICWSM2009]
Domain Models: Disambiguating entities in merging ontologies; applications in conflict-of-interest detection [WWW2006, TWEB2008]
Future Work
• The long-term outlook
• research online social user interactions
• build and design systems to understand and impact how society produces, consumes and shares data
• The near-term goals
• transformative and robust ways of coding, analyzing and interpreting user observations
107
Future Work
• Big Data challenges & availability of domain models
• Computational social science
• 'why' are we seeing what we are seeing
• people-content-network interactions
• building tools that close that loop; slicing and dicing of data, correlation with other media..
108
THANK-YOU! Are there any questions?
109