Michigan DGS 2015 Presentation - Leveraging Big Data With Meaningful Analytics - Paul Groll

Big Data and Information Complexity Paul Groll - MS, CISSP, CISSO, CSM Enterprise Information Architect State of Michigan Michigan Digital Government Summit 29 September 2015

Big Data and Information Complexity

Paul Groll - MS, CISSP, CISSO, CSM

Enterprise Information Architect

State of Michigan

Michigan Digital Government Summit

29 September 2015

Agenda

Objectives

Challenges

Complexity in IT

Complexity is emerging as a Security Risk

Where to find it, how to tame it

Strange New World, Strange New Words

Objectives

Challenges


Data scientists to CEOs:

“YOU CAN’T HANDLE THE TRUTH”

- VentureBeat.com, 01 Aug 2015

• Growing fear of letting the data speak for itself

• Growing distrust of primary sources when the results fail to match the CXO's "intuition"

Source: about.com

Source: http://

Background

Moku O Loe

Source: hawaii.edu

Source: about.com

Source: reefbuilders.com

Complexity

• Questions of interest:

– What factors control growth?

– Is this something we can model?

Complexity

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$

Logistic Growth (Verhulst, Lotka/Volterra, Kolmogorov)

where K = the “carrying capacity” of the environment, r = the intrinsic rate of growth
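The logistic equation above can be explored numerically. The following is a minimal sketch, not the presenter's own model: it integrates dN/dt = rN(K − N)/K with a simple forward-Euler step. The function names and the values chosen for the starting population, r, K, and the step size are all hypothetical, picked only for illustration.

```python
def logistic_step(n, r, k, dt):
    """Advance N by one forward-Euler step of dN/dt = r*N*(K - N)/K."""
    return n + dt * r * n * (k - n) / k


def simulate(n0, r, k, dt, steps):
    """Return the list of N values, starting at n0, after each step."""
    series = [n0]
    for _ in range(steps):
        series.append(logistic_step(series[-1], r, k, dt))
    return series


# Hypothetical parameters: start at 10, intrinsic growth rate 0.5,
# carrying capacity 1000, time step 0.1, 400 steps (t = 0 .. 40).
series = simulate(n0=10.0, r=0.5, k=1000.0, dt=0.1, steps=400)

for t_index in (0, 50, 100, 200, 400):
    print(f"t = {t_index * 0.1:5.1f}  N = {series[t_index]:8.1f}")
```

With a small N the growth is nearly exponential (rate r); as N approaches K, the factor (K − N)/K shrinks toward zero and growth levels off at the carrying capacity.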

Diversity and Complexity

• As the ecosystem grows:

– What does the emerging community look like?

– What “information” does this community contain?

– How diverse, how complex is it?

– Is this something we can model?

Complexity

• Claude Shannon

– From Petoskey, Michigan (yay)

– “Father of Information Theory”

– Office down the hall from Einstein

– Looked at the “information” in a system

– Developed a model of Information Diversity that works in scores of fields

Complexity

• Shannon’s Diversity Index – Let’s assume we have a community with six “types”:

• A 6 individuals

• B 13 individuals

• C 19 individuals

• D 129 individuals

• E 372 individuals

• F 1187 individuals

How “Diverse” is this system? How much “information” does it contain?

Source: Shannon & Weaver, 1948

Complexity

Type (i)            A        B        C        D        E        F
Individuals         6        13       19       129      372      1187
p(i)                0.003    0.008    0.011    0.075    0.216    0.688
log(p(i))           -2.46    -2.12    -1.96    -1.13    -0.67    -0.16
p(i)[log(p(i))]     -0.009   -0.015   -0.02    -0.08    -0.14    -0.11

H' = -1 × Σ p(i)[log(p(i))] = 0.39 (N = 1726), using base-10 logarithms
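The calculation above can be reproduced in a few lines. This is a minimal sketch: the function name `shannon_index` is hypothetical, and the base-10 logarithm is chosen because it is the base that matches the log(p(i)) values shown on the slide.

```python
import math


def shannon_index(counts, base=10):
    """Return H' = -sum(p_i * log(p_i)) for a list of per-type counts."""
    total = sum(counts)
    h = 0.0
    for count in counts:
        if count > 0:  # empty types contribute nothing to the sum
            p = count / total
            h -= p * math.log(p, base)
    return h


# Counts for types A..F from the slide.
counts = [6, 13, 19, 129, 372, 1187]
h = shannon_index(counts)
print(f"N = {sum(counts)}, H' = {h:.2f}")  # prints: N = 1726, H' = 0.39
```

A community with a single type (for example `shannon_index([100])`) gives H' = 0, which corresponds to the "monoculture" case discussed later in the deck.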

Complexity of the “DATA” system

We have “populations”:

Data Sources – Agencies, etc.

Data Formats – Hundreds

Data Fields – Thousands

Database Vendors – Many

Database Versions – Dozens

Data Models – Scores

Data Volume – Petabytes

Permissions – Privacy Limits, Data that need Special Handling

Confidentiality – Varied Encryption Requirements

Complexity of the “DATA” system

Each of these has some limit, K, as size grows:

STAFF!! – Security!!, DBAs, Developers, other special skills

“Hard Problem” conversion, mapping (ETL)

Server Capacity, Elasticity

Licensing restrictions

Raw storage

Limited Resources – FLASH Storage, High-Speed Computing

Planned and managed storage, backup, recovery

Specific security demands and practices

An ever-growing number of “one-offs”, “exceptions”

Complexity of the “DATA” system

$$\frac{dN}{dt} = \frac{rN(K - N)}{K}$$

[Figure: N at time (t), given limit K, with the “r” phase and the “K” phase marked]

Source: memrise.com

Complexity models

Monoculture (the trivial case): 100% A

Complexity models

Haphazard, “Unmanaged” (real world)

CANNOT predict the overall state from one time (t) to the next

Expensive! Highest TCO we can have

Complexity models

Step-wise Consolidation

This is what we’re after!

Reducing Complexity Reduces Costs!

Source: Journal of Integer Sequences, Vol. 1 (1998), Article 98.1.5

The N-Squared Complexity Problem

- A mere 8 objects will require 28 separate “interfaces” to fully realize ubiquitous communications

- 100? 4,950 interfaces!

- N = 240? C(N) = 28,680 interfaces!
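The interface counts quoted above follow from counting every unordered pair of objects, C(N) = N(N − 1)/2, which grows roughly as N²/2. A minimal sketch (the function name `interfaces` is hypothetical):

```python
def interfaces(n):
    """Number of point-to-point interfaces needed to connect n objects pairwise."""
    return n * (n - 1) // 2


for n in (8, 100, 240):
    print(f"N = {n:3d}  ->  C(N) = {interfaces(n):,}")
# prints:
# N =   8  ->  C(N) = 28
# N = 100  ->  C(N) = 4,950
# N = 240  ->  C(N) = 28,680
```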

The FOUR V’s

Velocity - It’s coming fast

Volume - There’s a lot

Variety - It’s not all the same type, format, or size

Veracity - How much can we trust the data?

Use Case Factory: - Variability -

• Audit Logs – Record Formats Vary

– Who has looked at Record #38,491?

– When?

– Why?

– Did that violate any laws or policies? Which?

– Repeat 1,000,000,000 times…

Use Case Factory: - Volume -

• Health & Business Information Messaging

– 10 KB per message (on average…)

– 20 million messages per month

– Must retain for 7 years

– 10 KB × 20 million × 84 months × 250 systems = ~4 petabytes
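The volume estimate can be checked with straightforward arithmetic. This sketch assumes "10k" means 10 kilobytes and uses decimal units (1 KB = 1,000 bytes, 1 PB = 10^15 bytes); the variable names are hypothetical.

```python
BYTES_PER_MESSAGE = 10_000        # "10k" per message, read as 10 KB (assumption)
MESSAGES_PER_MONTH = 20_000_000   # 20 million messages per month
RETENTION_MONTHS = 7 * 12         # 7-year retention = 84 months
SYSTEMS = 250

total_bytes = (
    BYTES_PER_MESSAGE * MESSAGES_PER_MONTH * RETENTION_MONTHS * SYSTEMS
)
total_pb = total_bytes / 1e15     # decimal petabytes

print(f"Retained volume: {total_pb:.1f} PB")  # prints: Retained volume: 4.2 PB
```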

Use Case Factory: - Velocity -

• Vibration and Alerting Sensors

– Up to hundreds per structure

– Up to 15,000 structures

– Sending in real-time

– 5 KB messages

– 400 × 15,000 × 5 KB = ~30 GB stream
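The sensor figure works out as follows. This sketch assumes "5k" means 5 kilobytes (decimal) and that the result is the data produced each time every sensor sends one message; the slide does not state how often messages are sent, so no per-second rate is derived here. Variable names are hypothetical.

```python
SENSORS_PER_STRUCTURE = 400   # the slide's working figure for "up to hundreds"
STRUCTURES = 15_000
BYTES_PER_MESSAGE = 5_000     # "5k" message, read as 5 KB (assumption)

total_sensors = SENSORS_PER_STRUCTURE * STRUCTURES
bytes_per_round = total_sensors * BYTES_PER_MESSAGE  # one message from every sensor
gb_per_round = bytes_per_round / 1e9                 # decimal gigabytes

print(f"{total_sensors:,} sensors, {gb_per_round:.0f} GB per round of messages")
```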

Use Case Factory: - Combines 3 V’s -

• State Police video

– 3 8-hour shifts

– 2 HD cameras per unit (body, car)

– 10 GB/hour per camera*

– ~155 TB / day = Backhaul Challenge

– ~55 PB / year = Storage Challenge

* Depending on streaming bitrate, resolution – could be 15-25 GB/hour
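The video figures can be sanity-checked the same way. The transcript does not state how many units are recording, so this sketch works backward from the ~155 TB/day figure to an implied unit count; that count is an inference, not a number from the presentation. Decimal units are assumed (1 TB = 1,000 GB, 1 PB = 1,000 TB), and the variable names are hypothetical.

```python
HOURS_PER_DAY = 3 * 8          # three 8-hour shifts
CAMERAS_PER_UNIT = 2           # body camera and car camera
GB_PER_CAMERA_HOUR = 10        # slide's base figure; could be 15-25 GB/hour
DAILY_TB_FROM_SLIDE = 155      # "~155 TB / day" as quoted on the slide

per_unit_gb_per_day = HOURS_PER_DAY * CAMERAS_PER_UNIT * GB_PER_CAMERA_HOUR
implied_units = DAILY_TB_FROM_SLIDE * 1_000 / per_unit_gb_per_day  # inferred only
yearly_pb = DAILY_TB_FROM_SLIDE * 365 / 1_000

print(f"Per unit: {per_unit_gb_per_day} GB/day")
print(f"Implied number of units: about {implied_units:.0f}")
print(f"Yearly storage: about {yearly_pb:.1f} PB")
```

The yearly total comes out slightly above the slide's rounded "~55 PB", which is consistent with the daily figure itself being approximate.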

Thanks for listening

Keep in touch

Questions Welcome

PAUL GROLL

GROLLP @ Michigan.gov 517.373.9578