Content Intelligence at Scale
Being Smart About Large Collections of User Generated Content
Lessons drawn from case studies using Temnos and DISQUS platforms
Tim Musgrove, Founder, Temnos
Smart Data Conference, San Jose, CA, August 19, 2015
Our Mission
o 75% market share for commenting platforms
o 350 million active monthly users
o 6 billion monthly page views
Temnos History
• Team spun out of the AI group at CNET to form TextDigger
• TextDigger was acquired by Federated Media (FM)
• While at FM, the team refined successful solutions for analyzing millions of premium publisher pages
• Temnos was formed as a spin-out of FM, in 2013, to offer similar solutions to third parties
Questions Addressed
1. Can you programmatically capture "the wisdom of the crowd" (not just what's trending in the crowd)?
2. Is UGC a biased sample because it is self-selecting?
3. Is there more to the bottom line of user evaluations than something like "3.5 stars average rating"?
4. Can UGC be used to predict anything?
5. How often is there substantive information gain in user comments on general news articles?
6. What tools do I need to intelligently manage "firehose-grade" UGC?
Can you programmatically capture insights from the crowd?
You know you want to.
Is it possible?
Mostly yes. We find you can programmatically set all the right metadata in front of you, so that you are just one step from human insight.
Can you programmatically capture insights from the crowd?
1. Programmatically find differences or changes among categories of media, categories of users, clusters of topics, families of named entities, dimensions of attitude/sentiment.
2. Programmatically highlight the most influential or paradigmatic cases of those differences.
3. Now have a smart human read those exemplars.
Insight will emerge.
Examples of this procedure: Coverage Gap Analysis
Often, the media topic balance contrasts starkly with that of the corresponding user comments.
• Recent GOP-candidate race coverage gave roughly triple the coverage to immigration, and likewise to gay marriage, as to the economy and job creation. But among comments, the ratio was almost exactly the opposite.
• Insight: Professional media over-reacted to Trump.
[Image: "Immigration!!" vs. "JOBS!!"]
Another example of coverage gap analysis
• Media coverage of last year’s X-Men movie gave far more coverage to Jennifer Lawrence’s role in the movie than to Peter Dinklage’s, but audience engagement levels were the reverse of that.
• Insight: Lawrence’s “star power” mesmerized the entertainment media writers and they missed the overlap between MCU and GoT crowds.
Uncovering differences in vocabulary: users vs. media
Word or phrase | In body text (# URLs) | In comments (# URLs) | Percent difference
muslim         | 151 | 240 |  +59%
terrorist      |  91 | 176 |  +93%
death          | 124 | 213 |  +72%
hell           | 102 | 269 | +164%
soldier        | 324 | 523 |  +61%
Oscar          | 293 |  84 |  -71%
Sienna Miller  | 170 |  28 |  -84%
box office     | 382 |  72 |  -81%
critic         | 445 | 167 |  -62%
history        | 385 | 211 |  -45%
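The percent-difference column can be reproduced with simple arithmetic. A minimal sketch, assuming the figure is the signed percent change from body-text frequency to comment frequency (the formula is inferred from the table, not stated on the slide):

```python
def percent_difference(body_urls: int, comment_urls: int) -> int:
    """Signed percent change from body-text URL count to comment URL count."""
    return round(100 * (comment_urls - body_urls) / body_urls)

# "hell" appears in the body text of 102 URLs but in the comments of 269.
print(percent_difference(102, 269))  # 164, matching the +164% row
print(percent_difference(293, 84))   # -71, matching the "Oscar" row
```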
How to do this
1. Identify topics (with semantics, not keywords) in both the articles and their comments
2. Come up with related topic sets; we can call them A, B, and C
3. Answer the question: how often do journalists attach B to A, compared to how often they attach C to A?
4. Then answer the same question for commenters.
5. Keep going, and odds are you'll find big differences.
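Steps 3 and 4 above can be sketched as an attachment-ratio comparison, under the assumption that each article or comment thread has already been reduced to a set of topic labels (step 1). The representation and toy data are illustrative, not Temnos' actual pipeline:

```python
def attachment_ratio(docs, a, b, c):
    """Count how often topic B vs. topic C co-occurs with topic A.

    Each doc is modeled as a set of topic labels (an assumed representation;
    real topic tagging is semantic, per step 1).
    """
    ab = sum(1 for d in docs if a in d and b in d)
    ac = sum(1 for d in docs if a in d and c in d)
    return ab, ac

# Toy data: journalists attach "immigration" to the race more often;
# commenters attach "economy" more often.
articles = [{"race", "immigration"}, {"race", "immigration"}, {"race", "economy"}]
comments = [{"race", "economy"}, {"race", "economy"}, {"race", "immigration"}]

print(attachment_ratio(articles, "race", "immigration", "economy"))  # (2, 1)
print(attachment_ratio(comments, "race", "immigration", "economy"))  # (1, 2)
```

Repeating this over many (A, B, C) triples, and flagging the triples where the article ratio and the comment ratio diverge most, is step 5.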
The self-selecting nature of UGC is a blessing or a curse, depending on how refined your filters are
• Yes, it's a biased sample. But wait…
o It's no more biased than the alternative sampling methods (e.g. Gallup)
• The bias is a blessing in disguise
o Most commenters visit more, read more, share more, and influence more than non-commenters
o Some commenters are abusive, start flame wars, have an agenda, etc.
o If you can filter out the latter group, you have a strongly engaged, knowledgeable set of people with predictive capability
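A minimal sketch of that filtering idea, using hypothetical per-commenter metadata fields (`monthly_visits`, `flags_received`) and arbitrary thresholds; the slide does not specify the actual signals Temnos uses:

```python
def is_engaged(user: dict) -> bool:
    """Keep frequent visitors who are rarely flagged; drop drive-bys and abusers.

    Field names and thresholds are assumptions for illustration.
    """
    return user["monthly_visits"] >= 10 and user["flags_received"] <= 2

commenters = [
    {"name": "a", "monthly_visits": 40, "flags_received": 0},
    {"name": "b", "monthly_visits": 25, "flags_received": 9},  # flame-war account
    {"name": "c", "monthly_visits": 2,  "flags_received": 0},  # drive-by visitor
]
engaged = [u["name"] for u in commenters if is_engaged(u)]
print(engaged)  # ['a']
```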
Sure, I’ll take this unsolicited call on my home land-line and give you my next twenty minutes!
In evaluative commenting, what users say is their bottom line often isn't
Commenters' conclusions are often at odds with their observations:
• "Hillary is smart and she raises good points and she's a strong leader and she's better than Obama and it would be good to finally have a woman Prez, but I hate her because of her snide attitude."
In evaluative commenting, what users say is their bottom line often isn't
Commenters' conclusions are often at odds with their observations:
• "I love the MacBook Pro but it's over-priced for the features and it gets hot sometimes, and also there's no good tech support unless you are lucky enough to live close to an Apple Store, but it sure looks cool." (5-star rating)
What users wish would happen can be a better predictor than their actual predictions
Leading up to the Oscars, more people predicted Boyhood would win Best Picture. But more of them wished that Birdman would win.
They got their wish, not their prediction.
This is why, when you want prediction, you really have to capture desire and sentiment.
Don't forget geography!
• User interest varies widely by region. Consider NY vs. LA discussion of movies during the last Oscar season.
Don’t forget gender!
What users wish would happen can be a better predictor than their actual predictions
• After some of Trump’s inflammatory comments, users predicted he would sink in the polls, but meanwhile the net sentiment balance on Trump was rising, not falling.
Polls followed users’ sentiment, not their predictions.
[Charts: USER PREDICTION vs. USER SENTIMENT on Trump, each plotting volume and sentiment over time]
It's a myth that user commentary on articles is either just bandwagon or flame war; many comments add relevance
• About 50% of comment threads that introduce new topics are introducing relevant ones.*
• This is especially true when the original author wrote a thin article, i.e. one that failed to explore all the relevant topic space.
*Applies to substantive threads, i.e. three or more comments. Relevance of topic measured both intensionally (distance in a semantic network) and extensionally (corpus co-occurrence).
How to manage UGC when it's Big Data
• It's a huge space, and you need it to be!
• Most often, stakeholders want a needles-in-haystacks analysis. And they can't describe the needles, or how many there are.
• Therefore you need it to be a Big Data project (recall the DISQUS numbers).
How to manage UGC when it’s Big Data
• So you need candidate detection before you start your real analysis:
1. Initial detection should emphasize recall more than precision, e.g. rudimentary NEE (named-entity extraction)
2. Then you verify the candidates with slightly deeper analysis, e.g. smarter NEE
3. Finally, you do the actual analysis, e.g. deeper semantics
[Funnel diagram: All the UGC → Candidates → Validated]
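The three-stage funnel might be sketched like this. The detector and validator are deliberately crude stand-ins (a capitalization heuristic and a stub gazetteer), chosen only to show the recall-first, precision-later ordering; the function names are hypothetical:

```python
def cheap_candidate_detector(comment: str) -> bool:
    """Stage 1: high recall, low precision — any capitalized token might be an entity."""
    return any(tok[:1].isupper() for tok in comment.split())

def stricter_validator(comment: str) -> bool:
    """Stage 2: slightly deeper pass — require a hit in a (stub) gazetteer."""
    gazetteer = {"Trump", "Oscars", "MacBook"}
    return any(tok.strip(".,!") in gazetteer for tok in comment.split())

stream = ["nice article!", "Trump will sink in the polls", "I Agree totally"]

candidates = [c for c in stream if cheap_candidate_detector(c)]
validated = [c for c in candidates if stricter_validator(c)]
# Stage 3 (deeper semantics) would run only on `validated`.
print(len(stream), len(candidates), len(validated))  # 3 2 1 — the funnel narrows
```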
How to manage UGC when it's Fast Data
• It changes constantly, and you want it to!
o If things didn't change much, the analysis wouldn't be as valuable
• The only way to keep up with the pace is:
1. Compare things on the metadata level, not the data level
2. Devise a "momentum" metric that factors in volume, intensity, and rate-of-change
3. Set up an alert system so you can see when there's an action-worthy change in the weighted momentum
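A toy version of the momentum-and-alert idea in steps 2 and 3. The weights, the rate-of-change formula, and the alert threshold are all assumptions for illustration, not Temnos' actual metric:

```python
def momentum(volume: float, intensity: float, prev_volume: float,
             w_vol: float = 0.4, w_int: float = 0.3, w_rate: float = 0.3) -> float:
    """Weighted blend of volume, intensity, and period-over-period rate of change."""
    rate_of_change = (volume - prev_volume) / max(prev_volume, 1.0)
    return w_vol * volume + w_int * intensity + w_rate * rate_of_change

def action_worthy(score: float, threshold: float = 50.0) -> bool:
    """Step 3: alert when the weighted momentum crosses a threshold."""
    return score >= threshold

# Comment volume tripled this period while sentiment intensity held steady.
score = momentum(volume=150, intensity=0.8, prev_volume=50)
print(round(score, 2), action_worthy(score))  # 60.84 True
```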
The components that anyone needs for "distilling the essence" of massive user-generated content
1. Extraction of both the article (or catalog record) and the comments, for contrastive analysis.
o Retrieval of the full text of both, keeping them paired appropriately
2. The metadata around the comments.
o User ID, timestamp, geo-data, gender, up/down votes, reply status
3. A way to tease out the topics, themes, etc. in which you are interested.
o Topics, sub-topics, related topics, metatopics: these are foundational
o Supplemental characteristics, like reading level, sentiment, experiential dimension, etc. These add depth, texture, nuance.
4. Time-series analysis.
o If you must throw data away, keep the metadata forever. Compare it day-to-day, week-to-week.
Final Words
Many thanks to:
For a link to these slides, go to: http://temnos.com/press
For more info on Temnos or to request the white paper (showing many more examples, statistics and results), contact Tim Musgrove: [email protected]