Tim Estes - Generating Dynamic Social Networks from Large Scale Unstructured Data
DESCRIPTION
Tim Estes, CEO of Digital Reasoning, delivered this presentation at the Strata Conference (Feb 2011). It discusses how large scale blog data can be mined to yield social networks of influencers, connections, discussion topics, etc.
TRANSCRIPT
Generating Dynamic Social Networks from Large Scale Unstructured Data
Enterprise Software to Make Sense of Really Junky Data
Tim Estes - CEO, Digital Reasoning
What We’ll Discuss
• What is a social network?
- The web of relationships between entities that influences actions
• Why does it matter?
- To reference Aesop: “You are known by the company you keep.”
• What’s required to build one algorithmically?
- What’s similar, what’s the same, what’s connected
[Figure: clouds of related terms, with the labels “Nashville” and “Bush”. Terms shown, in extraction order:
president bush, president george w, administration, bush administration, george, george w, george bush, brown, american, clinton, house, gov, white, the administration, president-elect, barack obama, barack;
tenn, the predators, predators, oakland, milwaukee, st louis, carolina, a season, baltimore, kentucky;
miley cyrus, pussycat dolls, bob dylan, nine inch nails, rock star, the timberwolves, sean preston, lanarkshire, ticket prices, nme, britney spears, the album, x factor, my friends, mtv, madonna, lady gaga, singer, a student]
What’s similar?
We use patented algorithms for deducing related terms from the data…
White House
Justin Timberlake
Britney Spears
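The slide does not describe the patented algorithms, so the following is only a minimal sketch of one generic way to deduce related terms from raw text: build co-occurrence vectors and rank terms by cosine similarity. It is a swapped-in, textbook technique, not Digital Reasoning's method. The function names, the window size, and the three toy sentences are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def cooccurrence_vectors(docs, window=2):
    """Map each token to a Counter of the tokens seen within `window` positions."""
    vectors = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def related_terms(term, vectors, top_n=3):
    """Rank other terms by how similar their contexts are to those of `term`."""
    target = vectors.get(term, Counter())
    scores = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != term]
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Toy corpus (hypothetical sentences, not the blog data from the talk).
docs = [
    "president bush spoke today",
    "president obama spoke today",
    "nashville predators won tonight",
]
vectors = cooccurrence_vectors(docs)
print(related_terms("bush", vectors))
```

In this toy corpus "bush" and "obama" appear in identical contexts, so "obama" ranks first, while the hockey terms share no context with "bush" and score zero.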
What’s the same?
Concept resolution:
Roll up similar things into groups of the same (again, algorithmically)
Example: Tony Blair
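As a rough illustration of "rolling up" variants into one concept, the sketch below groups mentions with a union-find structure whenever the words of one mention are contained in the words of another. This is a deliberately naive rule chosen for brevity, not the algorithm behind the slide. The mention strings other than "Tony Blair" are made-up examples, and `resolve_concepts` is a hypothetical name.

```python
def resolve_concepts(mentions):
    """Group mentions whose token sets are in a subset relation (naive rule)."""
    parent = {m: m for m in mentions}

    def find(x):
        # Follow parent links to the group representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    tokens = {m: set(m.lower().replace(".", "").split()) for m in mentions}
    items = list(mentions)
    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if tokens[a] <= tokens[b] or tokens[b] <= tokens[a]:
                union(a, b)

    groups = {}
    for m in items:
        groups.setdefault(find(m), set()).add(m)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical surface forms; only "Tony Blair" comes from the slide.
mentions = ["Tony Blair", "Blair", "Prime Minister Tony Blair",
            "Mr. Blair", "Gordon Brown", "Brown"]
groups = resolve_concepts(mentions)
for group in groups:
    print(group)
```

A subset rule like this over-merges as soon as two different people share a surname, which is one reason real concept resolution also has to weigh the surrounding context.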
What’s connected?
Link analysis:
Show who and what are connected (again, you guessed it, algorithmically)
Terrorist Leader Connections
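One simple way to picture link analysis is a weighted graph in which two entities are linked each time they are mentioned in the same document. The sketch below assumes entity extraction has already happened and uses placeholder entity names; `build_links` and `connections` are hypothetical helpers, not part of any product described in the talk.

```python
from collections import Counter
from itertools import combinations


def build_links(docs_entities):
    """Count how many documents mention each unordered pair of entities."""
    links = Counter()
    for entities in docs_entities:
        for a, b in combinations(sorted(set(entities)), 2):
            links[(a, b)] += 1
    return links


def connections(entity, links):
    """List the entities linked to `entity`, strongest link first."""
    found = []
    for (a, b), weight in links.items():
        if entity == a:
            found.append((b, weight))
        elif entity == b:
            found.append((a, weight))
    return sorted(found, key=lambda pair: (-pair[1], pair[0]))


# Placeholder entities per document (assumed output of an extraction step).
docs_entities = [["A", "B", "C"], ["A", "B"], ["C", "D"]]
links = build_links(docs_entities)
print(connections("A", links))  # [('B', 2), ('C', 1)]
```

Raw co-mention counts are only a starting point; weighting by document length or term rarity would be natural refinements.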
Let’s Put an Idea to the Test...
With powerful analytics, can you remove some or most of the need for a priori structure in designing and understanding social networks or other quasi-ontological schemas?
YES
and
Can you also do it with messy unstructured data?
YES
But first...
Why do we (Digital Reasoning) care?
Because it’s what we do for a living. We make sense of the senseless.
Our customers have critical needs
- Digital Reasoning works primarily in the Defense and Intelligence
Community making sense of noisy, unstructured data and turning it
into usable entity-centric systems supporting mission critical
intelligence.
The data is big and bad
- Little structure in content, topics all over the place, and totally different
ontologies/schemas across the community.
The times we live in create urgencies
- We care because the better and faster we are at making sense of this
kind of data, the safer our country is.
Why did we take a data-centric, deployed software model?
Unique Environments
- Given who our customers are... we can’t host their data. No one can.
The solution had to be a pure deployed software model.
Meaning in Hard to Reach Places
- The data is basically a bunch of pieces that don’t want to be connected.
People that don’t want to be found.
Result?
- Imagine trying to turn that kind of data in that type of architecture from a
bunch of loose communication into a social network that has patterns of
life, weightings of influence, and projections of probable future actions...
Here’s what it looks like in an architecture…
Now let’s show what can be learned with a little application of
Entity-Oriented Analytics to a bunch of web data.
Test Case
Web Blog+Wikipedia data (collected by Fetch)
- 6M blog URLs collected over 1+ year
- 16M unique blog messages
- No unifying theme, topic, or author
- Tricky to get “good” big data from the open web. We ended up using 0.5% of the
original source: 1TB became 4GB.
No a priori structure, sparse metadata, nearly all meaning emerges
from analysis
Let’s see what we can find out...
Examining connections related to “Carl Icahn”
The data shows
connections to and from
Carl Icahn by:
• people
• periodicals
• topics
• companies
On closer examination
the data tells us:
Carl Icahn “is backing” a
startup company that “would build” products
related to Barack Obama
Let’s examine what connections we find to “Egypt”
Egypt is identified as a
location, as an organization
(country) and as an
unassigned entity with all
related connections
On closer examination we see
interesting connections in the
blogs for Egypt, Cairo, Issues and the phrase “powder keg”.
If we drill down into the actual
blog entry we see the context of
the connections
How about connections to “Steve Jobs”?
The entities and connections in
the blog data are vast – which
is not surprising.
The large number of authors
and topics reflects the popularity
of Steve Jobs as a blog subject
Authors
Topics
One connection is interesting:
“Steve Jobs” to “Walt Mossberg” to “Kindle”
Synthesys shows the reason for connection as “pricing”
Clicking on this word we see the
context of the connection
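A chain such as "Steve Jobs" to "Walt Mossberg" to "Kindle" can be pictured as a path through a graph whose edges carry a reason label. The sketch below finds such a path with a breadth-first search. It is not how Synthesys works internally, which the slides do not describe. Attaching the "pricing" label to the Mossberg-Kindle edge is an assumption made only for illustration, as are the function and variable names.

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search; returns the shortest node path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, {})):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Undirected edges with an optional reason label.
# Which edge carries "pricing" is an assumption, not stated on the slide.
edges = [
    ("Steve Jobs", "Walt Mossberg", None),
    ("Walt Mossberg", "Kindle", "pricing"),
]
graph = {}
for a, b, reason in edges:
    graph.setdefault(a, {})[b] = reason
    graph.setdefault(b, {})[a] = reason

path = find_path(graph, "Steve Jobs", "Kindle")
reasons = [graph[a][b] for a, b in zip(path, path[1:])]
print(path)
print(reasons)
```

Keeping the reason on the edge is what makes the drill-down possible: from a label, a viewer can jump back to the source text that produced it.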
Demo Platform
Synthesys Platform Beta
elastic
user-driven
entity-oriented-analytics on demand
Observations
New innovations will be algorithmic and focused on turning hard-
to-use data into dynamic, evolving knowledge that can automate
machine execution
Architectures/solutions will have to accommodate customers that don’t want to move their data to a Public Cloud
It is a true statement... “If you can connect the dots, you can
connect the people”
So why should You care?
Because there is a lot of data that doesn’t belong on a shared grid,
such as Top Secret data, sensitive corporate data, and personal
data.
Because people may want to own (Personal Computing model)
vs. rent (Mainframe model) analytics
Because you may not want to convert your data to fit the model of
the hosted solution or map to their ontology to get the answers
you need.
To learn more…
See us at:
- Strata Science Fair (Wed evening 6:45PM)
- Digital Reasoning Booth #305
- www.digitalreasoning.com
Questions?
Automated Understanding, Trusted Decisions, True Intelligence