user-generated content on social media
DESCRIPTION
Keynote talk at Social Data on the Web workshop, ISWC 2009TRANSCRIPT
User-Generated Content on Social Media
Challenges, Opportunities
Meena Nagarajan, KNO.E.SIS, Wright [email protected], http://knoesis.wright.edu/researchers/meena/
1
1Tuesday, October 27, 2009
The Shift.. in the rules of the game
• Online Media: Packaged Goods Media to a Conversational Media
• Variety of networked interactions, many in nearreal-time
• Information economy: from dearth of signals to plenty much!
2 http://gregverdino.typepad.com/greg_verdinos_blog/images/2007/07/09/web2_logos.jpg
2Tuesday, October 27, 2009
Social Media Investigations• Network: Social structure
emerges from the aggregate of relationships (ties)
• People: poster identities, the active effort of accomplishing interaction
• Content : studying the content of communication."Who says what, to whom, why, to what extent and with what effect?" [Laswell]
!" !"
3
3Tuesday, October 27, 2009
Effects of Networked Publics
• Certain social phenomenon admittedly more complex
• begs for a people-content-network confluence
• Micro-level variations of Content-people-network on macro-level features
• “How do the topic of discussion, emotional charge of a conversation, poster characteristics & network connections affect ....?”
4
4Tuesday, October 27, 2009
People-Content-Network - Possibilities
• Emerging social order in online conversations
• How are the people-content-network dynamics shaping online conversations?
• Can we understand the Influentials theory, information diffusion properties in networks (etc.) while taking people and content into account?
5
5Tuesday, October 27, 2009
Stand on the shoulder of micro-giants
• The point is that we need a strong grasp on the micro-level variables of the content, people and network dimensions to begin explaining what they are doing to any social phenomenon...
• My focus is on the micro-level variables in the content dimension.
6
6Tuesday, October 27, 2009
Mapping User-generated Content to Context
7
7Tuesday, October 27, 2009
Dimensions Of Analysis
• What are the Named Entities and topics that people are making references to?
• How are they interpreting any situation in local contexts and supporting them in their variable observations?
WHAT
8
• Named Entity Identification and Disambiguation
• Cultural Named Entities
• Music artist, track named entities (IBM) [ISWC09a, VLDB09], Movie named entities (MSR) [WWW2010]
• Summaries of user perceptions behind real-time events from Twitter
8Tuesday, October 27, 2009
Dimensions Of Analysis
• What are the Named Entities and topics that people are making references to?
• How are they interpreting any situation in local contexts and supporting them in their variable observations?
WHAT
9
www.evri.com
http://memetracker.org/
9Tuesday, October 27, 2009
Dimensions Of Analysis
• What are the diverse intentions that produce the diverse content on social media?
• Why we share by looking at what we predominantly do with the medium. Value derived, repurposing..
• Emotion, sentiment expressions..
WHY
10
10Tuesday, October 27, 2009
Dimensions Of Analysis
• Mapping User Intentions
• Information Seeking, Sharing, Transactional intents [WI09]
• What is the intention landscape of social media
• where is the monetization potential
WHY
11
11Tuesday, October 27, 2009
Dimensions Of Analysis
• What do word usages tell us about an active population, or about the medium?
• Dynamics of a conversation - snubs, flaming words, coordination.. or lack thereof!
HOW
12
• Self-presentation in Online Dating Profiles (with Prof. Marti Hearst, UC Berkeley) [ICWSM09]
12Tuesday, October 27, 2009
Dimensions Of Analysis
• What do word usages tell us about an active population?
• Self-presentation
• Dynamics of a conversation - snubs, flaming words, coordination.. or lack thereof!
HOW
13http://wordwatchers.wordpress.com/2008/10/06/language-in-speeches-and-interviews-summary-comparisons/
13Tuesday, October 27, 2009
The Social Media Content Landscape..
14
14Tuesday, October 27, 2009
!"#$%$#&'()*+,'((*-./&0)*
1'.$'2(3*4/"5.$2&6/"*7'.83*-./&0)*
9"$:/.,*7'.83*-./&0)*
;353./83"3/&)**1'.$'2(3*4/"5.$2&6/"**
7'.83*-./&0)*
!"#$%%&&&'()*+,'*-.%#!-/-0%123,45--/%676879:8;8%0)<40%-%==
Population, Medium DiversitySome mediationRate of exchange (asynchronous, synchronous)Many-to-many reachShared ContextsSlangs, abbreviations, grammar, spelling, media-specific vocabulary Interpersonal interactions
15
15Tuesday, October 27, 2009
Variety & Formality
TYPE OF DATA FORMALITY
Nat Broadcast Reportage 62.2
Informational writing 61
Academic Social Science 60.6
Writing 58
Professional Letters 57.5
Non Acad Social Science 56.9
Broadcasts 55
Blog corpus 53.3
Scripted Speech 53
Email Corpus 50.8
TYPE OF DATA FORMALITY
Prepared speeches 50
Personal Letters 49.7
Imaginative writing 47
Fiction Prose 46.3
Interviews 46
Unscripted Speeches 44.4
Spontaneous speech 44
Conversations 38
Phone Conversations 36* Heylighen, F. & Dewaele, J. Variation in the contextuality of language: An empirical measure Foundations of Science, 2002, 293-340Weblogs, Genres and Individual Differences: How bloggers write for who they write for; Scott Nowson
Formality Score = (noun frequency + adjective freq. + preposition freq. + article freq. – pronoun freq. – verb freq. – adverb freq. – interjection freq. + 100)/2 *
16
16Tuesday, October 27, 2009
Variety & Formality
TYPE OF DATA FORMALITY
Nat Broadcast Reportage 62.2
Informational writing 61
Academic Social Science 60.6
Writing 58
Professional Letters 57.5
Non Acad Social Science 56.9
Broadcasts 55
Blog corpus 53.3
Scripted Speech 53
Email Corpus 50.8
TYPE OF DATA FORMALITY
Prepared speeches 50
Personal Letters 49.7
Imaginative writing 47
Fiction Prose 46.3
Interviews 46
Unscripted Speeches 44.4
Spontaneous speech 44
Conversations 38
Phone Conversations 36* Heylighen, F. & Dewaele, J. Variation in the contextuality of language: An empirical measure Foundations of Science, 2002, 293-340Weblogs, Genres and Individual Differences: How bloggers write for who they write for; Scott Nowson
Formality Score = (noun frequency + adjective freq. + preposition freq. + article freq. – pronoun freq. – verb freq. – adverb freq. – interjection freq. + 100)/2 *
16
TYPE OF DATA FORMALITY
Broadcasts 55
Blog corpus 53.3
Scripted Speech 53
Email Corpus 50.8Critic Music reviews from www.metacritic.com/music/ 50.13
Yahoo Personals AboutMe 50.10
MySpace About Me 50.07
MySpace - comments on Artist Pages 50.06
Prepared speeches 50
Personal Letters 49.7
Twitter 49.46
Facebook posts 48.20
Imaginative writing 47
Fiction Prose 46.3
16Tuesday, October 27, 2009
Making up for lack of context..
• Supplement what the data is showing you with what you already know..
• Statistical NLP + Contextual Knowledge
• Ontologies, Taxonomies, Dictionaries, social medium, shared spatio-temporal contexts..
17
17Tuesday, October 27, 2009
Representative Efforts
18
WHAT WHY HOW
18Tuesday, October 27, 2009
Cultural NER
19
WHAT
19Tuesday, October 27, 2009
Cultural NER
It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking “GI Jane” worse now
19
WHAT
19Tuesday, October 27, 2009
Cultural NER
It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking “GI Jane” worse now
19
WHAT
LOVED UR MUSIC YESTERDAY!
19Tuesday, October 27, 2009
Cultural NER
I decided to check out the Wanted demo today even though I really did not like the movie
minus Mrs Jolie a.k.a Fox of course! It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking “GI Jane” worse now
19
WHAT
LOVED UR MUSIC YESTERDAY!
19Tuesday, October 27, 2009
Cultural NER
Obama the Dark Knight of socialism.. the man is not as impressive as Ledger yea
I decided to check out the Wanted demo today even though I really did not like the movie
minus Mrs Jolie a.k.a Fox of course! It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking “GI Jane” worse now
19
WHAT
LOVED UR MUSIC YESTERDAY!
19Tuesday, October 27, 2009
Intuitions..• Spotting and Sense Identification
• Open vs. Closed world • unlike person, location, named entities, contexts and
senses change fairly rapidly
• We assume an open-world wrt senses• No comprehensive sense knowledge base
• Reduce it to a spotting and binary sense classification problem
20
It was THE HANGOVER of the year..lasted forever.. so I went to the movies..bad choice picking “GI Jane” worse now
20Tuesday, October 27, 2009
Two flavors..• Artist and tracks spotting in MySpace
music forums • using the MusicBrainz Taxonomy
• with Daniel Gruhl, Jan Pieper, Christine Robson, IBM Almaden, Amit Sheth, Knoesis [ISWC09a]
• on Thursday Oct 29, Session: Discovering Semantics
• Movie names from Weblogs • with Amir Padovitz, Social Streams MSR, [WWW2010]
21
21Tuesday, October 27, 2009
Cultural NER in Weblogs• Goal: Supplement classifiers with
information that will help them disambiguate the reference of a term better!
• A Complexity of Extraction measure associated with an entity in target sense in a corpus
• with all cues equal, systems that are ‘complexity aware’ will treat cues differently
22
22Tuesday, October 27, 2009
Measure of Extraction Complexity
• Feature extraction: Graph-based spreading activation and clustering
• entity sense definition from Wikipedia + evidence a corpus presents for the target sense of the entity
• Ranked list speaks for itself• More varied senses and contexts,
implies higher extraction complexity23
Extracted Complexity (general weblogs)Time Travellerʼs WifeAngels and Demons..The Hangover..Star Trek..WantedUpTwilight...
23Tuesday, October 27, 2009
Feature as a Prior
24
Decision Tree and Boosting Classifiers
X axis: precisionY axis: recall1500+ hand-labeled data points
Blue: basic featuresRed: with Entropy baselineGreen: with our Complexity of Extraction feature
24Tuesday, October 27, 2009
As a Prior in Binary Classification
25
Average F-measure over 1000 decision tree, boosting models
Average Accuracy over 1000 decision tree, boosting models
1500+ hand-labeled data pointsBlue: basic features
Red: with Entropy baselineGreen: with our Complexity of
Extraction feature25Tuesday, October 27, 2009
To chew on..
• The concept of ‘Extraction Complexity’ as an additional prior is very promising
• applies to general NER
26
26Tuesday, October 27, 2009
User Intention Mapping
• Unlike Web search intent, entity alone is in-sufficient to characterize intent here..
• Three broad intentions: information seeking, sharing, transactional, combinations thereof.
• ‘i am thinking of getting X’ (transactional)‘i like my new X’ (information sharing)‘what do you think about X’ (information seeking)
27
WHY
27Tuesday, October 27, 2009
Action Patterns• Resorted to ‘action patterns’ surrounding named
entities• “where can i find a psp cam..”
• A minimally supervised bootstrapping algorithm
• 10 seed action patterns, learn new ones from unannotated corpus, relying on a empirical and semantic similarity with seed patterns
• semantic similarity from communicative functions of words Linguistic Inquiry Word Count (www.LIWC.net)
28
28Tuesday, October 27, 2009
Information Seeking, Transactional
• Patterns learned using 8000 uncategorized posts on MySpace forums
• Intent recognition recall using pre-classified user posts from Facebook Marketplace (to buy): 81%
29
Sample learned patternsdoes anyone know howknow where i canwas wondering if someoneIm not sure howsomeone tell me how
29Tuesday, October 27, 2009
Impact on Online Advertising?
• Generate ads from user profile (interests, hobbies) or from posts with monetizable intents?
30
30Tuesday, October 27, 2009
Targeted Content Delivery Platform
• Of all the ads generated using profile (hobbies, interests) information, 7% received attention
• Ads generated using authored, monetizable posts, 59% received attentionMore at [WI09], Beyond Search and Internet Economics Workshop, MSR, Redmond, WA
31
WhatWhy
http://research.microsoft.com/en-us/um/redmond/about/collaboration/awards/beyondsearchawards.aspx
31Tuesday, October 27, 2009
Self-Presentation• On Online-dating profiles [ICWSM09] (with Prof.
Marti Hearst, UCB)
• quantifying usages of words from linguistic, personal and psychological categories in LIWC
• Exploratory Factor Analysis to identify systematic co-occurrence patterns among LIWC variables
• grouping user profiles on the basis of their shared multi-dimensional features to compare and contrast self-presentation
32
HOW
32Tuesday, October 27, 2009
Imitate to Impress !?• More similarities than differences
• Men displaying a higher usage of tentative words (maybe, perhaps..)
• typically attributed to feminine discourse
• Many similarities in word combinations and words used!
• Perhaps, self-expression tends towards attempting homophily in online dating..
33
33Tuesday, October 27, 2009
Science, Fun and Profit
34
34Tuesday, October 27, 2009
BBC SoundIndex, IBM
35
“A pioneering project to tap into the online buzz surrounding artists and songs, by leveraging several popular online sources”
De-spam, slang transliterations, entity identification, voting theory to combine multi-modal online data sources [ICSC08a,VLDB09]
http://www.almaden.ibm.com/cs/projects/iis/sound/
WhatWhy
When, Where, Who
35Tuesday, October 27, 2009
Twitris: Kno.e.sis
36
Real-time user perceptions as the fulcrum for browsing the Web [ISWC09b] What
When, Where
36Tuesday, October 27, 2009
37
Iran elections: Discussions in the US and Iran on the same day
The mystery of Soylent Green: information where you can use it
37Tuesday, October 27, 2009
Twitris: Twitter through space,time,theme
Come see & play with Twitris @ the International Semantic Web Challenge
at ISWC ’09
http://twitris.knoesis.org
38
twitris socially influenced browsingAshu, Raghava, Wenbo, Pramod. Vinh, Karthik, Meena, Amit, and Ajith
kno.e.sis center, Wright State University
Opinion on Iran
Electio
n from th
e
US talks
about O
il
economies
,
blogging
Opinion on Iran
Electio
n from Ira
n
talks
about
theocra
cy,
oppressio
n,
demonstr
ation
Spatial perspective
Capture changing perceptions, issues of interest every day; legalize illegal immigrants in the healthcare context on September 18.
Temporal perspective
Capture changing perceptions, issues of interest every day; Nobel is no more the news for Obama! captured October 12.
Find resources related to social perceptions
News and Wikipedia articles to put extracted descriptors in context
Twitris aggregates social perceptions from Twitter using a spatio, temporal and thematic approach. Twitris captures what was said, when it was saidand where it was said. Fetch resources from the Web to explore perceptions further. Browse the Web for issues that matter to people, using
people's perceptions as the fulcrum.
What does twitris do?
✓ Exploit spatio, temporal semantics for thematic aggregation
✓ Analyze the anatomy of a tweet "RT @m33na come back and checkl new events on twitris #twitris" RT: Retweet or a repost of a tweet; # (hashtags) user generated meta; @- refer to
other users
✓Data from diverse sources (Twitter, news services, Wikipedia, and other Web resources)
✓ End user application
Little statistics from Tiwtris (unit: tweets)
Healthcare ( Aug 19 - Oct 20) : 721 K (US Only)
Obama (Oct 8 - 20): 312 K (US Only)
H1N1 (Oct 5 - 20) : 232 K (US Only)
Iran Election (June 5 - Oct 20) : 2.8 m (Worldwide)
`
Twitris
Concept Cloud, News and related
articles
Google News widget
DBpedia widget
Context + Selected
Term
Context + Selected
Term
Twitris DB
Data Collection
event-1 crawler....
event-kcrawler
.
.
.
.event-ncrawler
Author Location Lookup
.
.
.Author Location
Lookup..
Author Location Lookup
Geocode Lookup....
Geocode Lookup....
Geocode Lookup
Data ProcessingTFIDF based
descriptor extraction
Spatio, Temporal, Thematic descriptor extraction
Extracting storylines around
descriptorsTwitter Search
Shared
Memory
Data Dumper....
Data Dumper....
Data Dumper
Shared
Memory
Shared
Memory
Parallel crawling to scaleData processing pipeline to streamline Twitter, geocode services, data analytics, to handle heterogeneityLive resource aggregationNear real time: Processing upto a day before Spatio-temporally weighted text analytics
twitris internals in less than
140 characters
Integrate user observations with news on a particular day; Correlate citizen journalism with the fourth estate; On September 18, Obama was talking about Illegal immigrants in the context of health care;
The fourth estate perspective
Cavetas and Future work
1. Handle Twitter constructs such as hashtags, retweets, mentions and replies better2. Different viz widgets such as time series to show changing perceptions from a place for an event and demographic based visualizations.3. Sentiment analysis 4. Robust computing approaches (Cloud, Hadoop)5. FB Connect for sharing and personalization
Check us out at: http://twitris/.knoesis.org
Follow us @7w17r15
Become a FB Fan and share Twtitris with everyone
A tetris like approach to twitter to gather aggregated social signals is defined as
38Tuesday, October 27, 2009
Thank You!
Google, Bing, Yahoo: Meena Nagarajan
http://knoesis.wright.edu/researchers/meena
39
39Tuesday, October 27, 2009
References[WISE09] Meenakshi Nagarajan, Karthik Gomadam, Amit Sheth, Ajith Ranabahu, Raghava Mutharaju and Ashutosh Jadhav, Spatio-Temporal-Thematic Analysis of Citizen-Sensor Data - Challenges and Experiences, Tenth International Conference on Web Information Systems Engineering, Oct 5-7, 2009.[ISWC09a] Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, Context and Domain Knowledge Enhanced Entity Spotting in Informal Text, The 8th International Semantic Web Conference, 2009.[ISWC09b] Twitris, Submission to the International Semantic Web Challenge, collocated with the International Semantic Web Conference 2009.[VLDB09] Daniel Gruhl, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Amit Sheth, Multimodal Social Intelligence in a Realtime Dashboard System, Pending Review, VLDB Journal, Special Issue on "Data Management and Mining on Social Networks and Social Media", 2009.[WWW2010] Meenakshi Nagarajan, Amir Padovitz, A Measure of Extraction Complexity: a Novel Prior for Improving Recognition of Cultural Entities, Manuscript in Preparation, for The Nineteenth International World Wide Web Conference, 2010.[ICSC08a] Alfredo Alba, Varun Bhagwan, Julia Grace, Daniel Gruhl, Kevin Haas, Meenakshi Nagarajan, Jan Pieper, Christine Robson, Nachiketa Sahoo. Applications of Voting Theory to Information Mashups, Second IEEE International Conference on Semantic Computing, ICSC 2008.[ICSC08b] Meenakshi Nagarajan, Cartic Ramakrishnan, Amit Sheth, “Text Analytics for Semantic Computing - the good, the bad and the ugly”, Second IEEE International Conference on Semantic Computing Santa Clara, CA, USA, 2008.[WI09] Meenakshi Nagarajan, Kamal Baid, Amit P. Sheth, and Shaojun Wang, Monetizing User Activity on Social Networks - Challenges and Experiences, 2009 IEEE/WIC/ACM International Conference on Web Intelligence, Sep 15-18 2009.[ICWSM09] Meenakshi Nagarajan, Marti A. Hearst. An Examination of Language Use in Online Dating Personals, 3rd Int'l AAAI Conference on Weblogs and Social Media, ICWSM 2009[IC09] Amit Sheth, Meenakshi Nagarajan. Semantics-Empowered Social Computing IEEE Internet Computing 13(1), 2009.
40
http://knoesis.wright.edu/researchers/meena/pubs.php
40Tuesday, October 27, 2009