Social Data Analytics Using IBM Big Data Technologies
DESCRIPTION
Distilling Insights from Social Media Using Big Data Technologies
Have you ever wondered what your customers are saying about you in social media, and the impact it might be having on your business? This session focuses on how BigInsights and Big Data technologies can be used to glean useful, actionable insights from social media data. You'll see how data can be ingested and prepped, and how text analytics can be run on social data in real time. Using Hadoop, we'll show how you can store and analyze a large volume of historical social media data and reference data. This talk and demo provide an introduction to text analytics and how it is used within the IBM Big Data platform for a social media solution.
TRANSCRIPT
© 2012 IBM Corporation October 21, 2013
Social Data Analytics using IBM Big Data technologies
Vijay Bommireddipalli
Development Manager, Social Data Accelerator
IBM Big Data
© 2011 IBM Corporation 2
Please note
IBM’s statements regarding its plans, directions, and intent are subject to change or
withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general
product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment,
promise, or legal obligation to deliver any material, code or functionality. Information
about potential future products may not be incorporated into any contract. The
development, release, and timing of any future features or functionality described for
our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM
benchmarks in a controlled environment. The actual throughput or performance that
any user will experience will vary depending upon many factors, including
considerations such as the amount of multiprogramming in the user’s job stream, the
I/O configuration, the storage configuration, and the workload processed. Therefore,
no assurance can be given that an individual user will achieve results similar to those
stated here.
Before we begin …
Tag! You’re it! - Micro-segmentation
Maybe our politicians should take a playbook out of the rivalry between duke/unc and take it to the courts http://ity.com/wfUsir
I'm at Mickey's Irish Pub Downtown (206 3rd St, Court Ave, Raleigh) w/ 2 others http://4sq.com/gbsaYR
@silliesylvia good!!! U shouldnt! Think about the important stuff, like ur 43rd birthday ;)
btw happy birthday Sylvia ;)
Location
Intent to consume
@silliesylvia I <3 your leather leggings!! Its so katniss!!
Age
Personal Attributes
• Sylvia Campbell, Female, In a Relationship
• 32 years old, birthday on 7/17
• Lives near Raleigh, NC
• College graduate; Income of 80-120k
Buzz/Sentiment
• Retweets BF’s comments
• Interest in BBC shows: Downton Abbey, Sherlock, Fringe, (P&P?)
• Sherlock Holmes, Robert Downey, Jr.
• Hunger Games, Katniss/J. Lawrence
Interests/Behavior
• Watch movies, tv shows
• Romance plots, “hero types”, strong women
• Uses iPad 3, Redbox, Hulu
• Shopping, interest in sales/deals
• Duke/UNC basketball
@silliesylvia $10 dollars says matthew & mary get married next season :) #downtownabbey
Behavior
Interest
@bamagirl can’t wait to watch sherlock with you! Oh, robert downey jr, I still love you but bbc is so amazing
OMG OMG. just dropped my new ipad3 crappola!!!
Interest
Consumption
Prediction
dear redbox please have kings speech for my new tv colin firth movie marathon
360 degree profile
Intent to consume
Consumption
Social Data Analytics - Using social media as a rich source of information
Name: Jane Doe, Cava Address: Tampa, Fl Twitter: @maryguida Blog Topic: politics Hobbies: running, yoga, … Relationships: Tony C (brother)…
Challenges:
• Scale: 1000s of sites, 100s of millions of users
• Complex matching decisions
• Partial, noisy and incomplete profile attributes
• Only 3% of consumers have sufficient attribute information in their profiles.
Name: Jane Doe Id: jaydee Address: Home of the Buccaneers Interests: running, yoga, football…
Name: jane Address: Tampa, FL Relationships: Tony C (brother)., …
Entity Integration
Name: Jane Doe Address: Tampa, FL Twitter: jaydee Blog Topic: food Hobbies: running, yoga, … Relationships: Tony C (brother)…
Name: J Doe Blog Topic: food
All names are fictitious
Social Data Analytics - Comprehensive Entity Extraction and Integration
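A minimal sketch of the entity integration idea on this slide, not the Social Data Accelerator's actual algorithm. It assumes a hypothetical matching rule: two partial profiles belong to the same person when their full names are identical or when they share a handle (the `id` or `twitter` field). The field names and the `integrate` helper are illustrative assumptions.

```python
from typing import Dict, List


def normalize(value: str) -> str:
    """Lowercase and strip a leading '@' so 'jaydee' and '@JayDee' compare equal."""
    return value.strip().lstrip("@").lower()


def same_entity(a: Dict[str, str], b: Dict[str, str]) -> bool:
    """Hypothetical matching rule: identical full name, or a shared handle/id."""
    names_match = (
        "name" in a and "name" in b and normalize(a["name"]) == normalize(b["name"])
    )
    handles_a = {normalize(a[k]) for k in ("id", "twitter") if k in a}
    handles_b = {normalize(b[k]) for k in ("id", "twitter") if k in b}
    return names_match or bool(handles_a & handles_b)


def integrate(profiles: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Greedily fold each partial profile into the first merged profile it matches."""
    merged: List[Dict[str, str]] = []
    for profile in profiles:
        for target in merged:
            if same_entity(target, profile):
                # Keep existing values; only fill attributes that are still missing.
                for key, value in profile.items():
                    target.setdefault(key, value)
                break
        else:
            # No match found: start a new integrated profile.
            merged.append(dict(profile))
    return merged


partial_profiles = [
    {"name": "Jane Doe", "id": "jaydee", "interests": "running, yoga, football"},
    {"twitter": "@jaydee", "blog_topic": "food"},
    {"name": "J Doe", "blog_topic": "food"},
]
result = integrate(partial_profiles)
```

With this toy rule the first two records merge through the shared handle, while "J Doe" stays separate because nothing links it exactly. That leftover record is the kind of partial, noisy case the slide lists under complex matching decisions.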
Consumer Intelligence
Personal Attributes
• Identifiers: name, address, age, gender, occupation…
• Interests: sports, pets, cuisine…
• Life Cycle Status: marital, parental
Relationships
• Personal relationships: family, friends and roommates…
• Business relationships: co-workers and work/interest network…
Product Interests
• Personal preferences of products
• Product purchase history
Social Media based 360-degree Consumer Profiles
Life Events
• Life-changing events: relocation, having a baby, getting married, getting divorced, buying a house…
Monetizable intent to buy products
Life Events
Location announcements
Intent to buy a house
I'm thinking about buying a home in Buckingham Estates per a recommendation. Anyone have advice on that area? #atx #austinrealestate #austin
Looks like we'll be moving to New Orleans sooner than I thought.
College: Off to Stanford for my MBA! Bbye chicago!
I'm at Starbucks Parque Tezontle http://4sq.com/fYReSj
I need a new digital camera for my food pictures, any recommendations around 300?
What should I buy?? A mini laptop with Windows 7 OR a Apple MacBook!??!
Timely Insights
• Intent to buy various products
• Current Location
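The example posts on this slide can be tagged with a few hand-written rules. The sketch below is only an illustration under stated assumptions: the rule names and regular expressions are hypothetical, and a production extractor would need far broader coverage than three patterns.

```python
import re
from typing import List

# Hypothetical rules: (label, compiled pattern). Labels and patterns are
# illustrative assumptions, not the extractors shipped with the product.
RULES = [
    ("intent_to_buy",
     re.compile(r"\b(?:i need a new|what should i buy|thinking about buying)\b", re.I)),
    # Case-insensitive trigger phrase followed by a capitalized place name.
    ("relocation", re.compile(r"\b(?i:moving to|off to)\s+[A-Z]\w+")),
    ("location", re.compile(r"\bI'm at\s+\w+")),
]


def tag(text: str) -> List[str]:
    """Return the label of every rule whose pattern occurs in the post."""
    return [label for label, pattern in RULES if pattern.search(text)]


posts = [
    "I need a new digital camera for my food pictures, any recommendations around 300?",
    "Looks like we'll be moving to New Orleans sooner than I thought.",
    "I'm at Starbucks Parque Tezontle http://4sq.com/fYReSj",
]
labels = [tag(post) for post in posts]
```

Each post here triggers exactly one rule; real posts often trigger several or none, which is why rule coverage and precision have to be tuned against sample data.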
Social Data Analytics - Profile construction
Big Data Platform and Accelerators - Summary
• Software components that accelerate development and/or implementation of specific solutions or use cases on top of the Big Data platform
• Provide business logic, data processing, and UI/visualization, tailored for a given use case
• Bundled with Big Data platform components – InfoSphere BigInsights and InfoSphere Streams
Key Benefits
• Time to value
• Leverage best practices around implementation of a given use case.
Cloud | Mobile | Security
[Diagram: IBM Big Data Platform]
Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
IBM Big Data Platform: Systems Management, Applications & Development, Visualization & Discovery, Accelerators, Information Integration & Governance, Hadoop System, Stream Computing, Data Warehouse, Contextual Search
Social Media Analytics Architecture
Data Ingest and Prep
Extract Buzz, Intent, Sentiment
Entity Analytics: Profile Resolution
Real time analytics. Pre-defined views and charts
Dashboard
Stream Computing and Analytics
BigInsights System and Analytics
Online flow: Data-in-motion analysis
Offline flow: Data-at-rest analysis
Pre-defined Workbooks and Dashboards
Social Media Data
Extract Buzz, Intent, Sentiment and Consumer Profiles
Entity Analytics and Integration
Comprehensive Social Media Customer Profiles
Social Media
Optional: Indexed Search
Index using Push API
Data Explorer
Ad hoc access
SDA 1.2
Social Media Sources Supported
– Gnip, Boardreader
– Tweets, Boards, Blogs
Analyze streaming data as well as data at rest
– Streams for processing of streaming data
– BigInsights/Hadoop for input, output and configuration data
Key Micro-segmentation Attributes (out-of-box)
– Personal Info: Gender, Location, Parental status, Marital status, Employment
– Interests: Movie interest, Comic book fan, Product interest, Current customer of, Products owned
– ** Attributes can be added in (requires some development effort)
Entity resolution across the different social media sources
SDA 1.2
Outputs/Measures (out-of-box)
– Buzz
– Sentiment
– Intent to buy/start service
– Intent to attend/see
Example use cases
– Retail: Lead generation, Brand management
– Financial: Lead generation and Brand management
– Media & Entertainment: Brand management
– Generic
Visualization using BigSheets
Extendable/Customizable Solution
SDA - Acting on the insights
Metrics-based understanding of feedback in social media
– And more importantly, feedback from whom!
Comprehensive (social media) profiles with micro-segmentation information
Campaign execution can be done in Social Media
Entity resolution across the different social media sources
External (social media) to Internal (CRM) linkage **coming
SDA Outputs
Pre-defined Workbooks
Dashboards
Granular outputs for further slicing and dicing by Data Scientists
SDA Conceptual Flow
BigInsights & Streams Text Analytics
High Performance rule based Information Extraction Engine
Highly scalable solution available for at-rest and in-motion analytics
Pre-built extractors, and toolkit to build custom Extractors
• Rich Extractor library supports multiple languages
• Declarative Information Extraction (IE) system based on an algebraic framework
Sophisticated tooling to help build, test, and refine rules
Developed at IBM Research since 2004
Embedded in several IBM products
• BigInsights, Streams
• Lotus Notes
• Cognos Consumer Insights
What is TA | How is TA Deployed & used | Dev. tools | Why BigInsights TA
Applications of Text analytics
Broad range of applications in many industries
• CRM Analytics
– Voice of customer
– Product and Services gap analysis
– Customer churn
• Social Media Analytics
– Purchase intent
– Customer churn prediction
– Reputational Risk
• Digital Piracy
– Illegal broadcast of streaming and video content
• Log Analytics
– Failure analysis and root cause identification
– Availability assurance
• Regulatory Compliance
– Data Redaction
– Identify and protect sensitive information
Performance Comparison (with ANNIE open source **)
[Chart: throughput (KB/sec, 0 to 700) versus average document size (KB, 0 to 100), comparing SystemT with ANNIE (open source entity tagger)]
Task: Named Entity Recognition
Dataset: Different document collections from the Enron corpus, obtained by randomly sampling 1000 documents for each size
>10x faster
< 60% memory
** http://dl.acm.org/citation.cfm?id=1858681.1858695
Performance comparison with GATE 5
Text Analytics Development Flow
Annotation Query Language - AQL
• Rule based language with familiar SQL-like syntax
• Specify annotator semantics declaratively
Text Analytics Optimizer
• Choose an efficient execution plan
• Compiled Operator Graph
Text Analytics Runtime
• Highly scalable, embeddable Java runtime
Sample Input Documents
Extracted Information
Development Tooling
Declarative language for extractor logic
Optimization and deployment to scalable runtime
Extractor
JAQL Function Wrapper
SystemT Runtime
Input Adapter
Output Adapter
{
label: “http://www.ibm ...”,
text: “<html>\n<head> …”
}
{
label: “http://www.ibm ...”,
text: “<html>\n<head> …”,
Person:
[
{ firstName: [10, 15],
lastName: [16, 25] },
…
{ firstName: [1042, 1045],
lastName: [1046, 1050] }
],
Hyperlink:
[
{ anchorText: [25, 33] },
…
{ anchorText: [990, 997] }
],
H1: …
}
Input Record
Output Record
Document encoded as JSON record.
Invoking Text Analytics within BigInsights
Jaql runtime coordinates a multi-stage map-reduce flow.
AQL
SystemT Optimizer
Compiled Plan
Annotations added as additional attributes to JSON record.
Dictionaries
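The input/output record shape on this slide can be illustrated in a few lines of Python. This is a sketch under assumptions, not the Jaql wrapper or the SystemT runtime: the `annotate` function, the toy first-name dictionary, the regular expressions, and the example URL are all hypothetical, and spans are assumed to be half-open `[begin, end)` character offsets into `text`.

```python
import json
import re
from typing import Dict

# Assumed toy dictionary of first names; stands in for the "Dictionaries" input.
FIRST_NAMES = {"jane", "tony"}
# Two consecutive capitalized words, treated as a candidate first/last name pair.
NAME_PAIR = re.compile(r"\b([A-Z][a-z]+) ([A-Z][a-z]+)\b")
# Anchor text between <a ...> and </a>.
HYPERLINK = re.compile(r"<a\b[^>]*>(.*?)</a>", re.S)


def annotate(record: Dict) -> Dict:
    """Return a copy of the record with annotations added as extra attributes."""
    text = record["text"]
    persons = [
        {"firstName": [m.start(1), m.end(1)], "lastName": [m.start(2), m.end(2)]}
        for m in NAME_PAIR.finditer(text)
        if m.group(1).lower() in FIRST_NAMES  # dictionary check on the first name
    ]
    links = [{"anchorText": [m.start(1), m.end(1)]} for m in HYPERLINK.finditer(text)]
    return {**record, "Person": persons, "Hyperlink": links}


doc = {
    "label": "http://example.com/post",  # hypothetical label
    "text": '<p>Jane Doe wrote <a href="/x">a post</a></p>',
}
out = annotate(doc)
print(json.dumps(out, indent=2))
```

Storing offsets rather than copied strings keeps the output record compact and lets downstream steps recover the exact source text by slicing, for example `doc["text"][begin:end]`.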
Additional Advantages of IBM Text Analytics
Quality: Drives effectiveness of entire application
• Enables high accuracy and coverage
Performance: Dominant cost is CPU
• Process large documents and large numbers of documents with high throughput
Explainability
• Determine the cause of errors and fix them without affecting the remaining correct results
Reusability: easily adaptable for a different domain
• The development platform must enable layers of abstractions to be built and easily reused in a different domain
Expressivity
• Rule language with a rich set of operators available to enable complex extraction tasks
BigInsights Text Analytics Development
AQL editor with content assist
Click to drill down and see the rules that triggered inclusion of results
Explain and search through the results
Understanding the lineage of results
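Lineage can be pictured as each result carrying the name of the rule that produced it. The following is a rough sketch of that idea only; the `Result` class, the two rule names, and the `explain` helper are hypothetical and do not reflect how the BigInsights tooling stores provenance.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    text: str
    rule: str   # name of the rule that produced this result
    begin: int
    end: int


# Hypothetical rules for phone-like numbers; names are illustrative only.
RULES = {
    "dashed_number": re.compile(r"\b\d{3}-\d{4}\b"),
    "any_7_digits": re.compile(r"\b\d{7}\b"),
}


def extract(text: str) -> List[Result]:
    """Run every rule and record which rule produced each match."""
    results = []
    for name, pattern in RULES.items():
        for m in pattern.finditer(text):
            results.append(Result(m.group(), name, m.start(), m.end()))
    return results


def explain(results: List[Result], value: str) -> List[str]:
    """Return the names of the rules that triggered inclusion of `value`."""
    return [r.rule for r in results if r.text == value]


results = extract("Call 555-0100 about order 1234567.")
culprits = explain(results, "1234567")
```

Here the order number is a false positive, and `explain` points at `any_7_digits` as the rule to refine, leaving the correct `dashed_number` result untouched.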
IBM Text Analytics for Big Data
High Performance Information Extraction Engine
Analysis can be applied to data at-rest and in-motion
• Build extractor once and use with BigInsights or Streams
Parallel execution scales to Big Data volumes
• Linearly scalable to extremely high volumes
Highly customizable to a variety of domains and languages
• Pre-built extractors available out of the box
Sophisticated tooling enables ease of development and refinement of results
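The "build extractor once" point can be sketched by keeping the extractor free of any knowledge about how posts arrive. The code below is an assumption-laden illustration in plain Python: `extract_hashtags`, `analyze_at_rest`, and `analyze_in_motion` are hypothetical names, a list stands in for data stored in Hadoop, and a generator stands in for a live stream.

```python
import re
from typing import Dict, Iterable, Iterator, List

HASHTAG = re.compile(r"#(\w+)")


def extract_hashtags(post: str) -> List[str]:
    """The extractor: written once, unaware of whether input is stored or live."""
    return [tag.lower() for tag in HASHTAG.findall(post)]


def analyze_at_rest(posts: List[str]) -> Dict[str, int]:
    """Offline use: count hashtags over a stored collection of posts."""
    counts: Dict[str, int] = {}
    for post in posts:
        for tag in extract_hashtags(post):
            counts[tag] = counts.get(tag, 0) + 1
    return counts


def analyze_in_motion(stream: Iterable[str]) -> Iterator[List[str]]:
    """Online use: emit the extractor's output for each post as it arrives."""
    for post in stream:
        yield extract_hashtags(post)


historical = ["Anyone have advice on that area? #atx #austin", "Great food in #Austin"]
batch_counts = analyze_at_rest(historical)
live_results = list(analyze_in_motion(iter(["just landed #atx"])))
```

Because both paths call the same function, a fix to the extractor applies to historical and real-time analysis alike.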
Thank you