Michigan DGS 2015 Presentation - Leveraging Big Data With Meaningful Analytics - Paul Groll
Big Data and Information Complexity
Paul Groll - MS, CISSP, CISSO, CSM
Enterprise Information Architect
State of Michigan
Michigan Digital Government Summit
29 September 2015
Agenda
Objectives
Challenges
Complexity in IT
Complexity is emerging as a Security Risk
Where to find it, how to tame it
Strange New World, Strange New Words
Challenges
Data scientists to CEOs:
“YOU CAN’T HANDLE THE TRUTH”
- VentureBeat.com, 01 Aug 2015
• Growing fear of letting the data speak for itself
• Growing distrust of primary sources when the results fail to match the CXO's "intuition"
$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$
Logistic Growth (Verhulst, Lotka/Volterra, Kolmogorov)
where K = the “carrying capacity” of the environment, r = the intrinsic rate of growth
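As a rough illustration of the logistic growth equation above, the following Python sketch steps dN/dt = rN(K - N)/K forward with a simple forward-Euler update. The function name, starting population, growth rate, carrying capacity, and step size are all illustrative assumptions, not values from the presentation.

```python
def logistic_growth(n0, r, k, steps, dt=0.1):
    """Integrate dN/dt = r*N*(K - N)/K with forward Euler steps of size dt."""
    n = n0
    history = [n]
    for _ in range(steps):
        # Growth is fast while N is small and stalls as N approaches K.
        n += r * n * (k - n) / k * dt
        history.append(n)
    return history


# Hypothetical values: start at N = 10, carrying capacity K = 1000, r = 0.5.
trajectory = logistic_growth(n0=10.0, r=0.5, k=1000.0, steps=400)
print(f"start: {trajectory[0]:.0f}, end: {trajectory[-1]:.0f}")
```

The early steps grow almost exponentially (the "r" regime); later steps flatten as N nears the limit K.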
Diversity and Complexity
• As the ecosystem grows:
– What does the emerging community look like?
– What “information” does this community contain?
– How diverse and complex is it?
– Is this something we can model?
Complexity
• Claude Shannon
– From Petoskey, Michigan (yay)
– “Father of Information Theory”
– Office down the hall from Einstein
– Looked at the “information” in a system
– Developed a model of Information Diversity that works in scores of fields
Complexity
• Shannon’s Diversity Index – Let’s assume we have a community with six “types”:
– A: 6 individuals
– B: 13 individuals
– C: 19 individuals
– D: 129 individuals
– E: 372 individuals
– F: 1187 individuals
How “Diverse” is this system? How much “information” does it contain?
Complexity
Type (i)              A        B        C        D        E        F
Individuals           6       13       19      129      372     1187
p(i)              0.003    0.008    0.011    0.075    0.216    0.688
log(p(i))         -2.46    -2.12    -1.96    -1.13    -0.67    -0.16
p(i)[log(p(i))]  -0.009   -0.015    -0.02    -0.08    -0.14    -0.11
H' = -1 * Σ p(i)[log(p(i))] = 0.39 (N = 1726)
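The calculation on this slide can be reproduced with a short Python sketch. The logarithm base is not stated on the slide; base 10 is assumed here because it matches the slide's log values when applied to the unrounded proportions. The function name `shannon_index` is illustrative.

```python
import math


def shannon_index(counts, base=10):
    """Return H' = -sum(p_i * log(p_i)) over the non-empty types."""
    counts = list(counts)
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:  # empty types contribute nothing to the sum
            p = c / total
            h -= p * math.log(p, base)
    return h


# Community from the slide: six types, A through F.
community = {"A": 6, "B": 13, "C": 19, "D": 129, "E": 372, "F": 1187}
h_prime = shannon_index(community.values())
print(f"N = {sum(community.values())}, H' = {h_prime:.2f}")  # N = 1726, H' = 0.39
```

Switching to natural or base-2 logarithms only rescales H'; the ranking of communities by diversity stays the same.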
Complexity of the “DATA” system
We have “populations”:
Data Sources – Agencies, etc.
Data Formats – Hundreds
Data Fields – Thousands
Database Vendors – Many
Database Versions – Dozens
Data Models – Scores
Data Volume – Petabytes
Permissions – Privacy Limits, Data that need Special Handling
Confidentiality – Varied Encryption Requirements
Complexity of the “DATA” system
Each of these has some limit, K, as size grows:
STAFF!! - Security!! DBAs, Developers, other special skills
“Hard Problem” conversion, mapping (ETL)
Server Capacity, Elasticity
Licensing restrictions
Raw storage
Limited Resources – FLASH Storage, High-Speed Computing
Planned and managed storage, backup, recovery
Specific security demands and practices
An ever-growing number of “one-offs”, “exceptions”
$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$
N at time (t), given Limit K
Complexity of the “DATA” system
“r” Phase
“K” Phase
Source: memrise.com
Complexity models
Haphazard, “Unmanaged” (real world)
CANNOT predict the overall state from one time (t) to the next
Expensive! Highest TCO we can have
Complexity models
Step-wise Consolidation
This is what we’re after! Reducing Complexity Reduces Costs!
Source: Journal of Integer Sequences, Vol. 1 (1998), Article 98.1.5
The N-Squared Complexity Problem
- A mere 8 objects will require 28 separate “interfaces” to fully realize ubiquitous communications
- 100? 4,950 interfaces!
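The interface counts above follow from the number of distinct pairs among n objects, n(n - 1)/2. A minimal Python sketch, with an illustrative function name:

```python
def interfaces_needed(n):
    """Count the point-to-point links needed to connect every pair of n objects."""
    return n * (n - 1) // 2


for n in (8, 100):
    print(f"{n} objects -> {interfaces_needed(n)} interfaces")
# 8 objects -> 28 interfaces
# 100 objects -> 4950 interfaces
```

Because the count grows with the square of n, each consolidation step that removes objects removes a disproportionate number of interfaces.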
Velocity - It’s coming fast
Volume - There’s a lot
Variety - It’s not all the same type, format, or size
Veracity - How much can we trust the data?
The FOUR V’s
• Audit Logs – Record Formats Vary
– Who has looked at Record #38,491?
– When?
– Why?
– Did that violate any laws or policies? Which?
– Repeat 1,000,000,000 times…
Use Case Factory: - Variability -
• Health & Business Information Messaging
– 10k per message (on average…)
– 20 million messages per month
– Must retain for 7 years
– 10k * 20mil * 84mos * 250 systems = ~ 4 petabytes
Use Case Factory: - Volume -
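The volume estimate can be checked with a few lines of Python. This sketch assumes "10k" means 10,000 bytes and uses decimal petabytes (10^15 bytes); the constant names are illustrative.

```python
BYTES_PER_MESSAGE = 10_000        # "10k per message", assumed to be 10,000 bytes
MESSAGES_PER_MONTH = 20_000_000   # 20 million messages per month
RETENTION_MONTHS = 7 * 12         # 7-year retention = 84 months
SYSTEMS = 250                     # number of systems used in the slide's arithmetic

total_bytes = BYTES_PER_MESSAGE * MESSAGES_PER_MONTH * RETENTION_MONTHS * SYSTEMS
petabytes = total_bytes / 1e15    # decimal petabytes
print(f"{petabytes:.1f} PB retained")  # 4.2 PB retained
```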
• Vibration and Alerting Sensors
– Up to hundreds per structure
– Up to 15,000 structures
– Sending in real-time
– 5k messages
– 400 * 15,000 * 5k = ~ 30 Gb stream
Use Case Factory: - Velocity -
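The sensor figure can be sketched the same way. The slide does not say whether "5k" is bytes or bits, nor how often each sensor reports, so this Python sketch only computes the size of one round in which every sensor sends a single message; the constant names and the reading of "5k" as 5,000 units are assumptions.

```python
SENSORS_PER_STRUCTURE = 400   # the slide's working figure for "hundreds"
STRUCTURES = 15_000
MESSAGE_SIZE = 5_000          # "5k" per message; unit (bytes or bits) not stated

sensors = SENSORS_PER_STRUCTURE * STRUCTURES
units_per_round = sensors * MESSAGE_SIZE   # one message from every sensor
print(f"{sensors:,} sensors, {units_per_round / 1e9:.0f} G per round")
```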
• State Police video
– 3 8-hour shifts
– 2 HD cameras per unit (body, car)
– 10 Gb/hour per camera*
– ~ 155 Tb / day = Backhaul Challenge
– ~ 55 Pb / year = Storage Challenge
* Depending on streaming bitrate, resolution – could be 15-25 GB/hour
Use Case Factory: - Combines 3 V’s -
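The video numbers can be cross-checked in Python. The slide does not give the number of police units, so the unit count below is back-calculated from the slide's own daily total and is an inference, not a figure from the presentation. Decimal units (1 T = 1,000 G, 1 P = 1,000 T) and the constant names are also assumptions.

```python
G_PER_HOUR_PER_CAMERA = 10   # from the slide
CAMERAS_PER_UNIT = 2         # body and car
HOURS_PER_DAY = 3 * 8        # three 8-hour shifts
DAILY_TOTAL_T = 155          # "~ 155 Tb / day" from the slide

# Yearly storage implied by the daily total.
yearly_p = DAILY_TOTAL_T * 365 / 1000

# Inferred, not stated in the source: how many units produce that daily total.
daily_g_per_unit = G_PER_HOUR_PER_CAMERA * CAMERAS_PER_UNIT * HOURS_PER_DAY
implied_units = DAILY_TOTAL_T * 1000 / daily_g_per_unit

print(f"{yearly_p:.1f} P per year, roughly {implied_units:.0f} units implied")
```

The yearly result lands close to the slide's "~ 55 Pb / year", which suggests the daily and yearly figures are consistent with each other.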