Content Intelligence at Scale
Being Smart About Large Collections of User Generated Content
Lessons drawn from case studies using Temnos and DISQUS platforms
Tim Musgrove, Founder, Temnos
Smart Data Conference, San Jose, CA, August 19, 2015
Our Mission
o 75% market share for commenting platforms
o 350 million active monthly users
o 6 billion monthly page views
Temnos History
• Team spun out of the AI group at CNET to form TextDigger
• TextDigger was acquired by Federated Media (FM)
• While at FM, the team refined successful solutions for analyzing millions of premium publisher pages
• Temnos was formed as a spin-out of FM, in 2013, to offer similar solutions to third parties
Questions Addressed
1. Can you programmatically capture "the wisdom of the crowd" (not just what's trending in the crowd)?
2. Is UGC a biased sample because it is self-selecting?
3. Is there more to the bottom line of user evaluations than something like "3.5 stars average rating"?
4. Can UGC be used to predict anything?
5. How often is there substantive information gain in user comments on general news articles?
6. What tools do I need to intelligently manage "firehose-grade" UGC?
Can you programmatically capture insights from the crowd?
You know you want to.
Is it possible?
Mostly yes. We find you can programmatically set all the right metadata in front of you, so that you are just one step from human insight.
Can you programmatically capture insights from the crowd?
1. Programmatically find differences or changes among categories of media, categories of users, clusters of topics, families of named entities, dimensions of attitude/sentiment.
2. Programmatically highlight the most influential or paradigmatic cases of those differences.
3. Now have a smart human read those exemplars.
Insight will emerge.
Examples of this procedure: Coverage Gap Analysis
Often, the media topic balance contrasts starkly with that of the corresponding user comments.
• Recent GOP-candidate race coverage gave roughly triple the coverage to immigration, and likewise to gay marriage, as to the economy and job creation. But among comments, the ratio was almost exactly the opposite.
• Insight: Professional media over-reacted to Trump.
[Image: "Immigration!!" vs. "JOBS!!"]
Another example of coverage gap analysis
• Media coverage of last year’s X-Men movie gave far more coverage to Jennifer Lawrence’s role in the movie than to Peter Dinklage’s, but audience engagement levels were the reverse of that.
• Insight: Lawrence’s “star power” mesmerized the entertainment media writers and they missed the overlap between MCU and GoT crowds.
Uncovering differences in vocabulary: users vs. media
Word or phrase | In body text (# URLs) | In comments (# URLs) | Percent difference
muslim         | 151 | 240 |  +59%
terrorist      |  91 | 176 |  +93%
death          | 124 | 213 |  +72%
hell           | 102 | 269 | +164%
soldier        | 324 | 523 |  +61%
Oscar          | 293 |  84 |  -71%
Sienna Miller  | 170 |  28 |  -84%
box office     | 382 |  72 |  -81%
critic         | 445 | 167 |  -62%
history        | 385 | 211 |  -45%
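The percent-difference column can be reproduced with simple arithmetic. A minimal sketch, assuming the figure is the signed percent change from body-text frequency to comment frequency (the formula is inferred from the table, not stated on the slide):

```python
def percent_difference(body_urls: int, comment_urls: int) -> int:
    """Signed percent change from body-text URL count to comment URL count."""
    return round(100 * (comment_urls - body_urls) / body_urls)

# "hell" appears in the body text of 102 URLs but in the comments of 269.
print(percent_difference(102, 269))  # 164, matching the +164% row
print(percent_difference(293, 84))   # -71, matching the "Oscar" row
```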
How to do this
1. Identify topics (with semantics, not keywords) in both the articles and their comments
2. Come up with related topic sets; we can call them A, B, and C
3. Answer the question: how often do journalists attach B to A, compared to how often they attach C to A?
4. Then answer the same question for commenters.
5. Keep going, and odds are you'll find big differences.
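Steps 3 and 4 above can be sketched as an attachment-ratio comparison, under the assumption that each article or comment thread has already been reduced to a set of topic labels (step 1). The representation and toy data are illustrative, not Temnos' actual pipeline:

```python
def attachment_ratio(docs, a, b, c):
    """Count how often topic B vs. topic C co-occurs with topic A.

    Each doc is modeled as a set of topic labels (an assumed representation;
    real topic tagging is semantic, per step 1).
    """
    ab = sum(1 for d in docs if a in d and b in d)
    ac = sum(1 for d in docs if a in d and c in d)
    return ab, ac

# Toy data: journalists attach "immigration" to the race more often;
# commenters attach "economy" more often.
articles = [{"race", "immigration"}, {"race", "immigration"}, {"race", "economy"}]
comments = [{"race", "economy"}, {"race", "economy"}, {"race", "immigration"}]

print(attachment_ratio(articles, "race", "immigration", "economy"))  # (2, 1)
print(attachment_ratio(comments, "race", "immigration", "economy"))  # (1, 2)
```

Repeating this over many (A, B, C) triples, and flagging the triples where the article ratio and the comment ratio diverge most, is step 5.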
The self-selecting nature of UGC is a blessing or a curse, depending on how refined your filters are
• Yes, it's a biased sample. But wait…
o It's no more biased than the alternative sampling methods (e.g. Gallup)
• The bias is a blessing in disguise
o Most commenters visit more, read more, share more, and influence more than non-commenters
o Some commenters are abusive, start flame wars, have an agenda, etc.
o If you can filter out the latter group, you have a strongly engaged, knowledgeable set of people with predictive capability
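A minimal sketch of that filtering idea, using hypothetical per-commenter metadata fields (`monthly_visits`, `flags_received`) and arbitrary thresholds; the slide does not specify the actual signals Temnos uses:

```python
def is_engaged(user: dict) -> bool:
    """Keep frequent visitors who are rarely flagged; drop drive-bys and abusers.

    Field names and thresholds are assumptions for illustration.
    """
    return user["monthly_visits"] >= 10 and user["flags_received"] <= 2

commenters = [
    {"name": "a", "monthly_visits": 40, "flags_received": 0},
    {"name": "b", "monthly_visits": 25, "flags_received": 9},  # flame-war account
    {"name": "c", "monthly_visits": 2,  "flags_received": 0},  # drive-by visitor
]
engaged = [u["name"] for u in commenters if is_engaged(u)]
print(engaged)  # ['a']
```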
Sure, I’ll take this unsolicited call on my home land-line and give you my next twenty minutes!
In evaluative commenting, what users say is their bottom line often isn't
Commenters' conclusions are often at odds with their observations:
• "Hillary is smart and she raises good points and she's a strong leader and she's better than Obama and it would be good to finally have a woman Prez, but I hate her because of her snide attitude."
In evaluative commenting, what users say is their bottom line often isn't
Commenters' conclusions are often at odds with their observations:
• "I love the MacBook Pro but it's over-priced for the features and it gets hot sometimes, and also there's no good tech support unless you are lucky enough to live close to an Apple Store, but it sure looks cool." (5-star rating)
What users wish would happen can be a better predictor than their actual predictions
Leading up to the Oscars, more people predicted Boyhood would win Best Picture. But more of them wished that Birdman would win.
They got their wish, not their prediction.
This is why, when you want prediction, you really have to capture desire and sentiment.
Don't forget geography!
• User interest varies widely by region. Consider NY vs. LA discussion of movies during the last Oscar season.
Don’t forget gender!
What users wish would happen can be a better predictor than their actual predictions
• After some of Trump’s inflammatory comments, users predicted he would sink in the polls, but meanwhile the net sentiment balance on Trump was rising, not falling.
Polls followed users’ sentiment, not their predictions.
[Charts: USER PREDICTION vs. USER SENTIMENT on Trump, each plotting volume and sentiment over time]
It's a myth that user commentary on articles is either just bandwagon or flame war; many comments add relevance
• About 50% of comment threads that introduce new topics are introducing relevant ones.*
• This is especially true when the original author wrote a thin article, i.e. one that failed to explore all the relevant topic space.
*Applies to substantive threads, i.e. three or more comments. Relevance of topic measured both intensionally (distance in a semantic network) and extensionally (corpus co-occurrence).
How to manage UGC when it's Big Data
• It's a huge space, and you need it to be!
• Most often, stakeholders want a needles-in-haystacks analysis. And they can't describe the needles, or how many there are.
• Therefore you need it to be a Big Data project (recall the DISQUS numbers).
How to manage UGC when it’s Big Data
• So you need candidate detection before you start your real analysis:
1. Initial detection should emphasize recall more than precision, e.g. rudimentary NEE (named-entity extraction)
2. Then you verify the candidates with slightly deeper analysis, e.g. smarter NEE
3. Finally, you do the actual analysis, e.g. deeper semantics
[Funnel diagram: All the UGC → Candidates → Validated]
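The three-stage funnel might be sketched like this. The detector and validator are deliberately crude stand-ins (a capitalization heuristic and a stub gazetteer), chosen only to show the recall-first, precision-later ordering; the function names are hypothetical:

```python
def cheap_candidate_detector(comment: str) -> bool:
    """Stage 1: high recall, low precision — any capitalized token might be an entity."""
    return any(tok[:1].isupper() for tok in comment.split())

def stricter_validator(comment: str) -> bool:
    """Stage 2: slightly deeper pass — require a hit in a (stub) gazetteer."""
    gazetteer = {"Trump", "Oscars", "MacBook"}
    return any(tok.strip(".,!") in gazetteer for tok in comment.split())

stream = ["nice article!", "Trump will sink in the polls", "I Agree totally"]

candidates = [c for c in stream if cheap_candidate_detector(c)]
validated = [c for c in candidates if stricter_validator(c)]
# Stage 3 (deeper semantics) would run only on `validated`.
print(len(stream), len(candidates), len(validated))  # 3 2 1 — the funnel narrows
```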
How to manage UGC when it's Fast Data
• It changes constantly, and you want it to!
o If things didn't change much, the analysis wouldn't be as valuable
• The only way to keep up with the pace is:
1. Compare things on the metadata level, not the data level
2. Devise a "momentum" metric that factors in volume, intensity, and rate-of-change
3. Set up an alert system so you can see when there's an action-worthy change in the weighted momentum
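A toy version of the momentum-and-alert idea in steps 2 and 3. The weights, the rate-of-change formula, and the alert threshold are all assumptions for illustration, not Temnos' actual metric:

```python
def momentum(volume: float, intensity: float, prev_volume: float,
             w_vol: float = 0.4, w_int: float = 0.3, w_rate: float = 0.3) -> float:
    """Weighted blend of volume, intensity, and period-over-period rate of change."""
    rate_of_change = (volume - prev_volume) / max(prev_volume, 1.0)
    return w_vol * volume + w_int * intensity + w_rate * rate_of_change

def action_worthy(score: float, threshold: float = 50.0) -> bool:
    """Step 3: alert when the weighted momentum crosses a threshold."""
    return score >= threshold

# Comment volume tripled this period while sentiment intensity held steady.
score = momentum(volume=150, intensity=0.8, prev_volume=50)
print(round(score, 2), action_worthy(score))  # 60.84 True
```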
The components that anyone needs for "distilling the essence" of massive user-generated content
1. Extraction of both the article (or catalog record) and the comments, for contrastive analysis.
o Retrieval of the full text of both, keeping them paired appropriately
2. The metadata around the comments.
o User ID, timestamp, geo-data, gender, up/down votes, reply status
3. A way to tease out the topics, themes, etc. in which you are interested.
o Topics, sub-topics, related topics, metatopics: these are foundational
o Supplemental characteristics, like reading level, sentiment, experiential dimension, etc. These add depth, texture, nuance.
4. Time-series analysis.
o If you must throw data away, keep the metadata forever. Compare it day-to-day, week-to-week.
Final Words
Many thanks to:
For a link to these slides, go to: http://temnos.com/press
For more info on Temnos or to request the white paper (showing many more examples, statistics and results), contact Tim Musgrove: [email protected]