ibm swg day 2012 jhb big data (white)
DESCRIPTION
Big Data Presentation from IBM Software Day
TRANSCRIPT
© 2011 IBM Corporation IBM Confidential
Big Data
Simon Jeggo
24 May 2012
Agenda
What is Big Data
Some Big Data Use Cases
IBM’s Big Data Platform
What is Big Data
The Big Data Challenge – a Term Defined
“Big Data is a term applied to data sets that are large, complex and dynamic (or a combination thereof) and for which there is a requirement to capture, manage and process the data set in its entirety, such that it is not possible to process the data using traditional software tools and analytic techniques within tolerable time frames.”
New technologies that bring cost-effective approaches to explore, understand and predict better business outcomes:
• MPP databases
• Streams
• In-database analytics
• Apache Hadoop
• Cloud computing platforms
• Archival storage systems
Why something different?
• Data x Computation > typical warehouse
• Schema flexibility
• Programming flexibility
We are engaged with over 50 clients, working with them to apply big data techniques to a class of problems, e.g. text analytics, log analysis, customer insights and fraud detection.
We have a set of unique value-adds (JAQL, GPFS, SystemT and others to come), and we can make Big Data sit within our clients' complex IT environments:
• Integrate
• Secure
• Automate
In 2005 there were 1.3 billion RFID tags in circulation…
…by the end of 2011, this was about 30 billion and growing even faster
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…
1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!
350B Transactions/Year
Meter Reads every 15 min.
3.65B – meter reads/day
120M – meter reads/month
In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account with the phrase “Off to work.”
Since the photo was taken with his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.
By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
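To make the metadata point concrete: EXIF stores GPS latitude and longitude as degrees, minutes and seconds plus a hemisphere reference letter (N/S, E/W). The sketch below only shows the conversion to the signed decimal degrees a map would use. The function name and the sample values are illustrative assumptions, not data from any real photo, and reading the tags out of an image file is not shown.

```python
from fractions import Fraction


def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = Fraction(degrees) + Fraction(minutes) / 60 + Fraction(seconds) / 3600
    # South latitudes and west longitudes are negative by convention.
    if ref in ("S", "W"):
        value = -value
    return float(value)


# Hypothetical tag values chosen for illustration only.
latitude = dms_to_decimal(37, 30, 0, "N")
longitude = dms_to_decimal(122, 15, 0, "W")
print(latitude, longitude)
```

Any application that receives the original image file can perform this conversion, which is why an unstripped photo can pinpoint where it was taken.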
The Social Layer in an Instrumented Interconnected World
• 12+ TBs of tweet data every day
• 25+ TBs of log data every day
• ? TBs of data every day
• 2+ billion people on the Web by end 2011
• 30 billion RFID tags today (1.3B in 2005)
• 4.6 billion camera phones worldwide
• 100s of millions of GPS-enabled devices sold annually
• 76 million smart meters in 2009… 200M by 2014
Twitter Tweets per Second Record Breakers of 2011
Social-media analytics can be applied to anything from healthcare to predicting votes
Challenges
– Volume
– Velocity
– Variety
– Language processing: Twitter sentences are often not well formed and use slang
Extract Intent, Life Events, Micro Segmentation Attributes
[Figure: sample social-media posts from four users (Jo Jobs, Tina Mu, Tom Sit, Chloe), annotated with the labels Name, Birthday, Family; Monetizable Intent; Relocation; Location; Wishful Thinking; Not Relevant - Noise; SPAMbots]
Watson’s advanced analytic capabilities can sort through the equivalent of 200 MILLION pages of data to uncover an answer in 3 SECONDS.
4 trillion 8GB iPods
1.8 ZB
1 ZB = 1T GB
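The unit relationships on this slide can be checked with a few lines of integer arithmetic. This is a sketch assuming decimal (SI) units, where 1 GB is 10^9 bytes and 1 ZB is 10^21 bytes. Under that assumption 1.8 ZB fills roughly 225 billion 8 GB devices, while 4 trillion such devices would hold about 32 ZB, so the iPod figure presumably refers to a larger total than the 1.8 ZB shown.

```python
BYTES_PER_GB = 10**9    # decimal gigabyte (assumption: SI units, not 2**30)
BYTES_PER_ZB = 10**21   # decimal zettabyte

# "1 ZB = 1T GB": how many gigabytes fit in one zettabyte.
gb_per_zb = BYTES_PER_ZB // BYTES_PER_GB

ipod_bytes = 8 * BYTES_PER_GB


def ipods_needed(total_bytes):
    """Number of full 8 GB devices needed to hold total_bytes."""
    return total_bytes // ipod_bytes


# 1.8 ZB expressed exactly in bytes, then in 8 GB devices.
ipods_for_1_8_zb = ipods_needed(18 * BYTES_PER_ZB // 10)

# Capacity of 4 trillion 8 GB devices, in zettabytes.
zb_for_4_trillion = 4 * 10**12 * ipod_bytes / BYTES_PER_ZB

print(gb_per_zb, ipods_for_1_8_zb, zb_for_4_trillion)
```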
Big Data Use Cases
Cisco turns to IBM big data for intelligent infrastructure management
• Optimize building energy consumption with centralized monitoring
• Automate preventive and corrective maintenance
Capabilities Utilized:
• Streaming Analytics
• Hadoop System
• Business Intelligence
Applications:
• Log Analytics
• Energy Bill Forecasting
• Energy consumption optimization
• Detection of anomalous usage
• Presence-aware energy mgt.
• Policy enforcement
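As an illustration of what "detection of anomalous usage" can mean in its simplest form, the sketch below flags readings that sit far from the median, using the median absolute deviation (a common robust outlier test). It is not the method used in the Cisco deployment, which the slide does not describe; the function name, threshold and sample readings are assumptions.

```python
from statistics import median


def flag_anomalies(readings, threshold=3.5):
    """Return indexes of readings whose modified z-score exceeds the threshold."""
    center = median(readings)
    # Median absolute deviation: robust to the very outliers we are looking for.
    mad = median([abs(r - center) for r in readings])
    if mad == 0:
        return []  # all readings (nearly) identical, nothing to flag
    return [
        i
        for i, r in enumerate(readings)
        if 0.6745 * abs(r - center) / mad > threshold
    ]


# Hypothetical hourly energy readings (kWh) with one obvious spike.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
anomalies = flag_anomalies(readings)
print(anomalies)
```

A median-based test is used here instead of mean and standard deviation because a single large spike inflates the standard deviation and can hide itself.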
Applications for Big Data Analytics
• Homeland Security
• Finance
• Smarter Healthcare
• Multi-channel sales
• Telecom
• Manufacturing
• Traffic Control
• Trading Analytics
• Fraud and Risk
• Log Analysis
• Search Quality
• Retail: Churn, NBO
Retail Industry
Issues for the Retail Industry
• Deliver value to empowered customers
  – Move from market analysis to understanding individuals
  – Take charge of growing volume, velocity and variety of data
• Foster lasting connections
  – Focus on relationships, not just transactions
  – Invest in expanding the corporate brand
• Capture value, measure results
  – Developing complete understanding of the point of sale
  – Build new skills and solutions
Use Case: Social Media Analytics
What is our next best offer? (Structured/Unstructured data)
Problem
As consumers continue to adopt social media technologies, businesses must be able to track customer sentiment and brand perception, finding new opportunities and avoiding business problems caused by negative perceptions
Solution
• Social Media Analytics
  – What consumers and the industry are saying
• Optimizing Internal Operations
  – Better utilization of tools for web analytics
  – Decreased latency for analysis
• Predictive Analytics
  – Promotion targeting for offers
  – Prospect harvesting
  – POS analytics, predictive and discovery
• Competitive Intelligence
  – Unlock information across the web
Warehouse Off-load Use Case: Transactional Analytics
Problem
• Retailers have massive amounts of transaction data that offer a wealth of information about customer purchasing behavior in stores
• This data isn't being used effectively because of its volume, the cost to store it, and the barriers to analyzing massive data
Solution
• Store POS transactions in BigInsights, reducing the cost compared with traditional data warehousing
• BigInsights enables ad-hoc query for historical reporting, trend analysis, and analyst needs
• Data mining feeds for store and customer segmentation, market basket analysis, promotion targeting and other analytics-based solutions
• Historical POS made available for analysis of new product introductions, new store openings, and other disruptive business events
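Market basket analysis, mentioned in the solution above, starts from a simple question: which items appear together in the same transaction? The sketch below counts item pairs across a handful of made-up baskets. It is a single-machine illustration of the idea only; the basket data and function name are assumptions, and nothing here reflects how BigInsights implements it.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so ("bread", "milk") and ("milk", "bread") are one key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical POS transactions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "eggs"],
]
counts = pair_counts(baskets)
top_pair, top_count = counts.most_common(1)[0]
support = top_count / len(baskets)  # share of baskets containing the pair
print(top_pair, top_count, support)
```

On a Hadoop-style platform the same counting is typically split into a map step that emits pairs per transaction and a reduce step that sums them.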
FSS - Customer Correspondence Analytics
Problem
• Current approaches restrict predictive analytics to structured data, limiting insight and losing the “state” of the customer
• Human-based review of correspondence is limited to small-scale sampling
• Results of sampling are too dependent on the skills of the reviewer and cannot learn from information sets outside that reviewer's knowledge
• Detecting and acting on rapidly changing customer sentiment, and understanding why a service touch is occurring from the customer's point of view
• The need to take cost out of service touch points while improving effectiveness and intimacy
Solution
• Use of un-instrumented or under-instrumented information sources to identify and head off issues
  – Extends risk modeling to underutilized sources such as email, chat, social media, call center, and CSR interactions and notes
• Move from small-scale sampling to 100% coverage using BigInsights and cross-correlation of information sources
  – Natural language analytics combined with machine learning to identify opportunities and issues that are not apparent in small sample sizes and human awareness
• Use of sophisticated natural language analytics to develop a predictive understanding of customer actions based on customer state
  – Topic and sentiment extraction from email, chat, social media, call center, and CSR interactions and notes to predict call reasons and next best action
FSS - Risk Platforms and Analytics
Problem
• Real-time analytics and the need to meet SLA windows are outstripping existing infrastructure capabilities
• Burst-oriented trading close volumes and the resulting position analytics are expanding faster than traditional technologies can cost-effectively handle
• Standard policies of flushing the data after hours or days are not meeting risk modeling needs
• Web, unstructured and machine-generated data does not fit existing relational analytics tools
• SQL is not the natural tool to manipulate untapped information sources that can improve the dimension of risk modeling
• The changing nature of risk requires flexibility in sizing, speed and methods that existing SQL-based platforms cannot easily provide
Solution
• Predict, identify and triage risk anomalies in real time
  – Use of SystemT and SystemML analytics engines to identify problems based on historical data and then push those models to Streams
• Use of BigInsights to ingest and analyze hundreds of TB an hour to meet SLA requirements for high-volume and complex trading operations
• Use of un-instrumented or under-instrumented information sources to identify and head off issues
  – Extends risk modeling to underutilized sources such as email, chat, social media, call center, and CSR interactions and notes
FSS - Social Media Analytics
Problem
• Social media is an important source of information, but it requires new approaches to collecting, storing, understanding and utilizing the value to be found
  – Fuzzy and messy data are the norm
  – Little if any of the information is easily structured
• Reconciling external and internal sources
  – Identifying individuals among the fog of external data is not easily done but is often necessary
  – Linking to known individuals requires entity analytics concepts and capabilities
Solution
• Ability to acquire, parse, analyze, link and persist external information sources to a variety of analytics platforms
  – Use of SystemT and SystemML analytics engines to digest and make sense of external sources
• Sophisticated text/language analytics to allow powerful and accurate understanding of the external sources
  – Entity resolution capabilities to match external sources to known customers and groups
  – Graphical interfaces to quickly explore data sets, test hypotheses, create production jobs and synthesize data from multiple disparate internal and external sources
  – Ability to push normalized data to Netezza for analytics with existing methods and tools
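To give a feel for what entity resolution involves, the sketch below matches a name seen in an external source against a list of known customers using string similarity from Python's standard library. It is a deliberately simple stand-in, not IBM's entity analytics: the normalization rules, the 0.8 threshold and the customer names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop simple punctuation and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").replace(",", " ").split())


def best_match(external_name, known_customers, min_ratio=0.8):
    """Return the most similar known customer, or None if nothing is close enough."""
    target = normalize(external_name)
    best_name, best_ratio = None, 0.0
    for customer in known_customers:
        ratio = SequenceMatcher(None, target, normalize(customer)).ratio()
        if ratio > best_ratio:
            best_name, best_ratio = customer, ratio
    return best_name if best_ratio >= min_ratio else None


# Hypothetical customer master list.
known = ["Jonathan Smith", "Joan Smythe", "Maria Garcia"]
print(best_match("Jonathon Smith", known))   # misspelled external mention
print(best_match("Zed Unknown", known))      # no plausible match
```

Real entity resolution also weighs attributes such as location, handles and relationships, since names alone are rarely enough to identify an individual.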
Explosion of data in Telecom
From 500 PB per month in 2011
To 5,000 PB per month in 2016
Explosion of Data for Telecom
• 5 billion mobile phones worldwide – 550K Android phones activated every day
• AT&T Global Network carries 24 petabytes of data PER DAY
• > 2 billion Internet users in 2011
• Twitter processes 7 terabytes of data every day
• Facebook processes 10 terabytes of data every day
• YouTube – massive bits through networks: 48 hours of video uploaded per minute, 3 billion views per day
• Skype – 300 million minutes of video calls per month
[Chart: Traffic and Revenues plotted over Time (axes: Traffic Volume, $/bit) as the business shifts from Voice Dominant to Data Dominant; labels: Network Cost, Profitability Gap (value/GB)]
How to lower network costs ($/GB)?
How to improve data revenue ($/GB)?
Telecoms need to be smarter… smarter networks and smarter business models
All Telecom Enterprises have BIG DATA CHALLENGES
Churn Prediction and Targeted Offers with Social Media Text Analytics
Problem
• Lost revenue and increased customer acquisition cost are directly related to churn
  – Customers churn not only because of pricing, but also because of service level, new tech offerings, service offerings, and customer perception
• Increasing ARPU is a significant challenge
  – Revenue per customer is much harder to increase as competition increases
• Current churn prediction systems are not up to the challenge
  – Too slow and not using social media data
Solution
• Improve churn prediction using social media
  – Analyze social media on its own or with current warehouse/BI analytics to predict churn quicker (real-time) and more accurately
  – BigInsights Text Analytics is the key to finding new analytics, with Streams for real-time alerts
• Discover ARPU opportunities directly from social media
  – New source of customer intent and sentiment will drive new revenue opportunities
  – Real-time feedback to marketing systems or warehouse/BI to place offers quickly
  – Finding ready-to-buy customers
Real Time CDR Analytics and Ingest
Problem
• Gathering CDRs, mediating them into relevant data, and moving them to analytical systems is slow and costly
  – By the time CDR data is mediated and ingested by data warehouses, the ability to respond to problems is significantly reduced
• Systems tend to be old and require extensive application maintenance and hardware
• Real-time billing cannot be achieved: it requires handling billions of CDRs per day and de-duplication against 15 days' worth of CDR data
Solution
• Big Data Streams Telecommunications Mediation and Analytics (TMA) offering
  – Real-time CDR processing
  – Real-time analytics and dashboard
  – Unparalleled price/performance benefits
  – Connectors to Warehouse and BigInsights
• Real-time dashboards include:
  – Dropped calls by high priority customers, location, providers, etc.
  – Terminated calls by location and customer type
  – Revenue monitoring by voice and SMS
• The solution will enable novel Business Intelligence applications
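The de-duplication requirement above (checking each incoming CDR against 15 days of history) can be pictured as a time-windowed set of record keys. The sketch below is a minimal in-memory version, assuming records arrive in non-decreasing timestamp order and that a key such as (caller, callee, start time) identifies a record; the class name and key layout are hypothetical and this is not the TMA implementation.

```python
from collections import OrderedDict

WINDOW_SECONDS = 15 * 24 * 3600  # 15 days, as in the requirement above


class WindowedDeduplicator:
    """Remember record keys for a fixed time window and reject repeats."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._seen = OrderedDict()  # key -> timestamp of first sighting, oldest first

    def is_new(self, key, timestamp):
        # Evict keys older than the window (assumes non-decreasing timestamps).
        cutoff = timestamp - self.window_seconds
        while self._seen:
            oldest_key, oldest_ts = next(iter(self._seen.items()))
            if oldest_ts >= cutoff:
                break
            self._seen.popitem(last=False)
        if key in self._seen:
            return False
        self._seen[key] = timestamp
        return True


dedup = WindowedDeduplicator(WINDOW_SECONDS)
record = ("+27110000001", "+27110000002", 0)  # hypothetical (caller, callee, start)
print(dedup.is_new(record, 0))     # first sighting
print(dedup.is_new(record, 100))   # duplicate inside the window
```

At billions of records per day the key set no longer fits comfortably in one process, which is part of why the slide treats this as a distributed stream-processing problem.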
CDR Analytics with Extended Data
Problem
• Telecom is experiencing an explosion of data from 3G and LTE (4G) network traffic. CDRs are used almost exclusively for billing because storing and analyzing them was too expensive with EDW and BI alone
• Competition is driving the need to focus on:
  – Customer retention
  – Customer profitability
• There is no connection between CDR, Web, and other data, making everything from fraud detection to targeted marketing to ad optimization difficult and expensive
Solution
• BigInsights for cost-effective storage of original data and large-scale text analytics
  – Stores unstructured, untyped data, ingested with no data model
  – Discovery and analytics tools are built into BigInsights
  – Machine learning extensions
  – Integration with Netezza and DB2; JDBC to other databases
• Big Data Streams Telecommunications Mediation and Analytics (TMA) offering
  – Real-time CDR processing can be extended to other data sources – fast and low cost
• Netezza integration opens Big Data solutions to warehouse and BI
  – Deep analytics and model development
  – Can act as a high performance operational data store
Ad Effectiveness Analysis with Social Media
Problem
• Telecom and Media spend large sums of money on advertising. Measuring the effectiveness of the ads is difficult, and almost impossible online without costly services
  – Service providers are slow to respond and expensive
• Current ad analysis is mostly guesswork and intuition, which does not lend itself to timely decisions
• Enterprises are demanding better ROI from ad budgets and proof of effectiveness of each ad campaign
  – To increase effectiveness, enterprises have to react in near-real-time
Solution
• BigInsights used for social media ingest and fast analysis
  – Answers questions like what was the awareness, who did we reach, and what was the reaction to an ad in a few hours vs weeks
  – Allows ad departments to react: modify, localize, and focus
• Streams for real-time ad analysis, extending predictive models for fast reaction
• React very quickly to ad effectiveness
  1. Adjust ad budgets
  2. Tailor ads to geography
  3. Alter messaging
  4. Adjust targeted and direct marketing initiatives
Why IBM for Big Data
The Solution Side
The IBM Big Data Platform
Hadoop
• InfoSphere BigInsights: Hadoop-based low latency analytics for variety and volume
Stream Computing
• InfoSphere Streams: Low latency analytics for streaming data
MPP Data Warehouse
• IBM Netezza High Capacity Appliance: Queryable archive, structured data
• IBM Netezza 1000: BI + ad hoc analytics on structured data
• IBM Smart Analytics System: Operational analytics on structured data
• IBM InfoSphere Warehouse: Large volume structured data analytics
Information Integration
• InfoSphere Information Server: High volume data integration and transformation
A Big Data Platform
Embrace and Extend: from Open Source Hadoop to the IBM Big Data Platform
Not forked, not ported: Hadoop extended with operational excellence and security
Analytics Excellence
• Text Analytics Toolkit
• Machine Learning Toolkit
• Industry Accelerators
• Development Tooling
• Visualization Tooling
• Deployment Tooling (“App Store”)
• $14B in 5 yrs. on Analytics
• +++
At-Rest Operational Excellence
• Harden Hadoop - GPFS
• Surface Area Lock Down
• Policy Driven Retention & Immutability
• Role-Based Security
• Adaptive MapReduce
• Workload Manager
• Fast Splittable CMX Compression
• REST-exposed Administration
• +++
In-Motion Operational Excellence
• Unrivalled…
In-Motion
• Analyze extreme amounts of data in milliseconds
• Uses same analytics as BigInsights
• Data can be analyzed on the way into the enterprise for earlier pattern detection
At-Rest
• Beyond traditional structured data
• BigInsights uses same analytics as Streams
MPP Data Warehouses
• Netezza for in-database MapReduce
Stream Computing: A new paradigm for ultra low latency and high throughput in-motion analytics
• Continuous ingestion
• Continuous queries / analytics on data in motion
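The idea of a continuous query on data in motion can be sketched with a generator that updates a sliding-window average as each value arrives, instead of storing the data first and querying it later. This is plain Python for illustration only, not Streams Processing Language; the function name and window size are assumptions.

```python
from collections import deque


def sliding_average(stream, window_size):
    """Yield the average of the most recent window_size values after each arrival."""
    window = deque(maxlen=window_size)
    total = 0.0
    for value in stream:
        if len(window) == window_size:
            total -= window[0]  # the oldest value is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Any iterable works as the stream, including one that never ends.
averages = list(sliding_average([1, 2, 3, 4, 5], window_size=3))
print(averages)
```

Because only the current window is held in memory, the same loop works on an unbounded feed, which is the essential difference from batch analytics on data at rest.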
Data In Motion
Information used to be aggregated and analyzed every 30-60 minutes and discarded after 72 hours
Analyzing 1,000 pieces of unique medical diagnostic information per second and storing them in a dynamic model
Perspective: 20% drop in mortality of control group in trials (extend approach to daily activities)
– 120 children monitored: 120K messages/sec… billions/day
Data In Motion
Hear what’s going on miles away to optimize perimeter displacements
Perspective: Try to find the word “Zero” in a 1000 MP3 song library in a fraction of a second
– Figure out the difference between the sound of a human whisper and the wind
Data In Motion – Improving What They Already Have
Old Microsoft-based solution not able to keep up with new 3G demands for their real-time xDR analysis business requirements
Streams and Netezza solution proposed
– Time to merge and load data reduced 90%+
– Time to market for new products from 4 hours to minutes
Internal Use Only Reference
How Text Analytics Works
World Cup 2010 Highlights
In the Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas, made the save. Winger Andres Iniesta scored for Spain for the win.
Extracted entities:
• Arjen Robben: Striker, Netherlands
• Iker Casillas: Keeper, Spain
• Andres Iniesta: Winger, Spain
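The extraction shown on this slide (player, position, team) can be imitated on this one passage with a few hand-written patterns. The sketch below uses Python regular expressions and is only a toy: it is not how the IBM Text Analytics Toolkit or SystemT works, the three patterns are assumptions tailored to the sample sentences, and they would not generalize to other text.

```python
import re

# Building blocks; names are assumed to be two capitalized words.
ROLE = r"(?P<role>[Ss]triker|[Kk]eeper|[Ww]inger)"
NAME = r"(?P<name>[A-Z][a-z]+ [A-Z][a-z]+)"
TEAM = r"(?P<team>[A-Z][a-z]+)"

PATTERNS = [
    re.compile(TEAM + r"'s? " + ROLE + r", " + NAME),       # "Netherlands' striker, Arjen Robben"
    re.compile(ROLE + r" for " + TEAM + r", " + NAME),      # "keeper for Spain, Iker Casillas"
    re.compile(ROLE + r" " + NAME + r" scored for " + TEAM),  # "Winger Andres Iniesta scored for Spain"
]


def extract_players(text):
    """Return (name, role, team) tuples found by any of the patterns."""
    found = []
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            found.append((match["name"], match["role"].lower(), match["team"]))
    return found


text = (
    "Early in the second half, Netherlands' striker, Arjen Robben, had a breakaway, "
    "but the keeper for Spain, Iker Casillas made the save. "
    "Winger Andres Iniesta scored for Spain for the win."
)
players = extract_players(text)
print(players)
```

The fact that three sentences already need three different patterns hints at why rule-based text analytics benefits from dedicated tooling.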
IBM Text Analytics Toolkit Lets You…
Build out world-class text analysis applications 50% faster than manual method
Run faster text analysis (10x or more vs. some marketplace alternatives)
Get more precise and correct answers (2x vs. some marketplace alternatives)
What is BigSheets?
Browser-based Big Data analytics tool for business users
Big Data Challenges…
• Business users need a no-programming approach for analyzing Big Data
• It is extremely difficult to find actionable business insights in data from multiple sources with different formats
• Translating untapped data into actionable business insights is a common requirement that requires visualization
How can BigSheets help?
• Spreadsheet-like discovery interface lets business users easily analyze Big Data with ZERO PROGRAMMING
• BUILT-IN “readers” can work with data in several common formats
  – JSON arrays, CSV, TSV, Web crawler output, . . .
• Users can VISUALLY combine and explore various types of data to identify “hidden” insights
Big Data Made Easy for the Little Guy
USC’s Film Forecaster correctly predicted a clamor for “Hangover 2” that resulted in a $100 million opening over Memorial Day weekend
– Looked at 250K-500K tweets and broke down positive and negative messages using a lexicon of 1,700 words
The Film Forecaster sounds like a big undertaking for USC, but it really came down to one communications master's student who learned BigSheets in a day, then pulled in the tweets and analyzed them - Ryan Kim
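Lexicon-based sentiment analysis of the kind described above boils down to counting words from a positive list and a negative list in each message. The sketch below shows the idea with a tiny made-up lexicon and three made-up messages; the real Film Forecaster lexicon of about 1,700 words and its scoring rules are not given in the source, so everything here is an assumption.

```python
import re

# Tiny illustrative lexicon; the real one reportedly had about 1,700 words.
POSITIVE = {"love", "great", "funny", "awesome"}
NEGATIVE = {"hate", "boring", "awful", "bad"}


def score_message(text):
    """Positive-word count minus negative-word count for one message."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def summarize(messages):
    """Tally how many messages are positive, negative or neutral."""
    tally = {"positive": 0, "negative": 0, "neutral": 0}
    for message in messages:
        score = score_message(message)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        tally[label] += 1
    return tally


# Hypothetical tweets.
messages = [
    "Can't wait, the trailer looks awesome and funny!",
    "Sequels are always boring.",
    "Seeing it on Friday.",
]
tally = summarize(messages)
print(tally)
```

The approach is crude (it ignores negation and sarcasm), but it scales trivially to hundreds of thousands of messages, which fits the scale of the analysis described.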
Why IBM for Big Data?
Only IBM is showing data-in-motion and data-at-rest analytics: a bigger more opportunistic view of Big Data
Development and research sit side by side
Virtualization tooling, development, file system, analytics
Not just same company: same org, same people, same leadership
BigInsights being used in IBM products today such as Cognos Consumer Insight
Without a Big Data Platform You Code…
• Multithreading
• Custom SQL and Scripts
• Performance Optimization
• Debug
• Application Management
• Event Handling
• Connectors
• Check Pointing
• Security
• HA
• Accelerators and Toolkits
IBM Big Data Platform
Streams provides development, deployment, runtime, and infrastructure services
“TerraEchos developers can deliver applications 45% faster due to the agility of Streams Processing Language…” – Alex Philip, CEO and President
Over 100 sample applications and toolkits, with industry-focused toolkits offering 300+ functions and operators!
THINK
https://w3-connections.ibm.com/wikis/home?lang=en_US#/wiki/Info%20Mgmt%20Client%20Technical%20Professional%20Resources%20Wiki/page/Understanding%20Big%20Data