michigan dgs 2015 presentation - leveraging big data with meaningful analytics - kirk ocke
DESCRIPTION
Michigan DGS 2015 Presentation - Leveraging Big Data With Meaningful Analytics by Kirk OckeTRANSCRIPT
Background
PARC | 2
• What is Big Data?– Characterized by: Volume, Velocity, Variety, …– Cutting edge techniques and technologies often required.– Buzz word or real phenomenon? Both!– Big Data requirements become apparent once you understand the problem you’re
solving.
• Just Plain Old Data:– Most uses of data do not currently require big data solutions.– Instead focus on what you want to use your data for? What are your business
objectives? What value are you trying to create?– There is a lot of value locked away inside your plain old data.
The Dream: Data that leads to Actionable Insights
PARC | 3
XIX F4A F4A 1 0 CPR 20 FEMIVA C4A C4A 4 0 CPR 17 FEMN4A F4A F4A 2 0 CPR 31 FEMXIX F4A F4A 3 0 CPR 20 FEMXIX F4A F4A 3 0 CPR 34 FEMN4A F4A F4A 2 0 CPR 29 FEMN4A ONA ONA 2 1 CPR 20 FEMIVA ONA ONA 1 0 CPR 19 FEMN4A F4A F4A 1 0 WNG 18 FEMIVA C4A C4A 5 0 CPR 18 FEMIVA C4A C4A 1 0 NCP 20 FEMN4A ONA ONA 2 0 CPR 21 FEMN4A ONA ONA 1 0 CPR 30 FEMN4A F4A F4A 1 0 CPR 16 FEMAAO F4A F4A 3 0 CPR 31 FEMN4A F4A F4A 1 0 CPR 16 FEMAAO F4A F4A 3 0 CPR 17 FEMN4A ONA ONA 1 0 CPR 36 FEMXIX F4A F4A 2 0 CPR 32 FEMXIX MNA MNA 2 0 CPR 15 FEMAAO F4A C4A 2 0 CPR 19 FEMN4A F4A F4A 2 0 CPR 18 FEMN4A F4A F4A 1 1 CPR 22 FEMAAO F4A F4A 1 1 CPR 16 FEMIVA C4A C4A 1 0 CPR 23 FEMXIX F4A F4A 1 1 CPR 20 FEMN4A F4A F4A 1 0 CPR 14 FEMXIX ONA ONA 3 0 CPR 20 FEMN4A F4A F4A 1 0 CPR 21 FEMIVA C4A C4A 1 0 WNG 21 FEM
Data AnalysisInsights that allow us to take the actions required to accomplish something extraordinary!
Small market teams effectively compete with large market teams.
BaseballPlayerData
Bill James
The Reality: Disparate Data, Vague Goals and Organizational Inertia
PARC | 4
LOGS
SensorsMainframes
VariousDatabases Paper
Restricted Access
Disparate DataWhat do we want to do again?
Get results of course!
Vague Goals
The Results speak for themselves
Everyone will love them.
Organizational Inertia
The Solution:Understand your data, goals and organization
PARC | 5
Understand your data:• Explore the data in order to gain a rich understanding of what’s there.• Transform the data• Never under estimate the effort required here.
Perform the right analysis:• What problem do you want to solve? • Focus on the value to be created.• The right team (yes you need a team) • Use the simplest model to get the job done.
Change the organization is necessary:• Using insights from data often requires people change the way they work:• Consider this when you set your goals.• Often change agents are needed at many levels of the organization.• Organizational structure and culture are important, and they are often completely
overlooked – or are an after thought – when data analysis projects are conceived.
PARC | 6
Ohio DJFS:Who pays their child support
IMS Database
CustomCode
SQL
Raw Data
Census Data
Understand the data:
Augment with public dataMissing Values?Etc…
1.33 TB
SQL
TransformedData
2 GB
6+ Months
Goals: • Describe who is and isn’t paying their child
support. • Develop predictive scoring models to rank
cases at case initiation.• Build models that are interpretable by case
workers.
Team:• Data Scientist and Researchers• Software Engineers, • Subject Matter Experts - including former
IV-D directors
Organization:• State Run - County Administered• Case workers are knowledge workers• Case workers touch real lives.
PARC | 7
Ohio DJFS:Who pays their child support
Percent of Cases
30% pay ≥ 80% ofestablished obligation
52%pay < 80% ofestablished obligation
10%Zeroorders
8%No
Order
Pay 80% or more
Pay Less than 80%
Average Age Oldest Child at Case Initiation 4.7 3.9
Average Age Youngest Child at Case Initiation 3.8 3.3
Average Number Children CP Born Out of Wedlock 1.0 1.6
Average Number Children AP Born Out of Wedlock 1.0 2.0
Percent where CP and AP Formerly Married 46% 19%
Pay 80% or more
Pay Less than 80%
Average Percent of Total Obligation Paid over life of case 95% 38%
PARC | 8
Ohio DJFS:Scoring Model for Cuyahoga County
Largest county in Ohio by population in 2010Surrounds Cleveland
62002 CasesInitiated btw
FY 05 – FY 13
57 featuresData known at case initiation
27 categorical30 numerical
Link to Dashboard
53
53 is the right hand side of this equation: if k � 𝑋𝑋𝑋𝛽𝛽𝑋 ≥ k 𝑙𝑙𝑙𝑙𝑙𝑙 𝑡𝑡
1−𝑡𝑡− 𝛽𝛽0
then �𝑌𝑌 = 1
Taking the suggestion from the last presentation to consider county level analysis we built our models only at the county level.
Ohio DJFS:Cuyahoga County Model Performance
PARC | 9
Prediction Model
Pool of All Cases (N)
15% green 85% red
High-Value Cases (Top 20% of N)
63% Recall
47% Precision(205% increase in predictive
power over random guessing)
= case with an AP that paid ≥ 80% of child support obligation over life of case
= case with an AP that paid < 80% of child support obligation over life of case
Key MessageThe top fifth of the caseload, according to the model,
contains almost 2/3 of the good paying cases.
“You could theoretically touch almost 2/3 of the good paying cases by only working 20% of the
caseload”
PARC | 10
County of Los Angeles, CA:Who’s doing all the printing?The largest county in the U.S. – County of Los Angeles, CA.
Employees in 33 departments were spending millions of dollars printing. No oversight. No data. No acquisition strategy.
Through assessments, the CIO’s office discovered 43,000 printers and copiers.
PARC | 11
County of Los Angeles, CAManaging Printing Across 33 Departments
Department inputs
“The fact that we have numbers has been very effective. It’s rare there are quantitative results for these kinds of projects.”
Rich Sanchez, CIO County of LA, CA
• $9M annual savings• $50M+ total savings • 56% printer decrease -
44,000 to 18,500 • Less administrative
effort & IT support • Cut electrical
consumption 58% = 700 homes annually