The Practice of Data Science - IBM
TRANSCRIPT
© 2016 IBM Corporation
Look At All The Data
Let Data Lead the Way
Leverage Data as it is Captured
Changing the Way We Do Analytics
Basic Process
Stages: INPUT, ANALYSIS, OUTPUT
1. Understand problem and domain
2. Ingest data
3. Explore and understand data
4. Transform: clean
5. Transform: shape
6. Create and build model
7. Evaluate
8. Deliver and deploy model
9. Communicate results
Work Task Example
Given that a person is arrested: who gets released on bond, and how fast?
Understand the Domain
• Analytics requires an understanding of the data and the judicial process
• Need to learn how a judge decides whether or not to allow bond
  − SMEs indicate judicial bond decisions are based on
    • "Threat to community" = qualitative assessment (current charges + past charges + time line)
    • Ties to community
The Hunt for Data -- The Rap Sheet
• Charges
  − Criminal code: thousands of numbers
  − Time/date of arrest
  − Sentenced or not (sometimes) / released or not
• Personal information
  − Dirty and incomplete
• Arresting organization
• Acquiring rap sheet data
  − Access required all sorts of agreements
  − Different jurisdictions, different content and form
  − Task requirement: metadata mapping and integration
    • Consistent crime codes (NCIC)
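The mapping task can be pictured as one lookup table per jurisdiction. This is a minimal sketch, not the project's integration code: the jurisdiction names, the local charge codes, and the helpers `to_ncic` and `integrate` are hypothetical. Only the NCIC codes 101 and 105 are taken from the charge table later in this deck.

```python
# Hypothetical per-jurisdiction lookup tables: local charge code -> NCIC code.
LOCAL_TO_NCIC = {
    "county_a": {"TR-1": "101", "SD-9": "105"},
    "county_b": {"900.1": "101", "900.5": "105"},
}


def to_ncic(jurisdiction, local_code):
    """Return the NCIC code for a local charge code, or None if unmapped."""
    return LOCAL_TO_NCIC.get(jurisdiction, {}).get(local_code)


def integrate(records):
    """Split records into (mapped, unmapped) so gaps in the mapping stay visible."""
    mapped, unmapped = [], []
    for rec in records:
        ncic = to_ncic(rec["jurisdiction"], rec["charge_code"])
        if ncic is None:
            unmapped.append(rec)
        else:
            mapped.append({**rec, "ncic_code": ncic})
    return mapped, unmapped
```

Keeping the unmapped records separate, rather than dropping them, makes it easy to see which jurisdictions still need mapping work.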
Explore and Understand the Data
• Analyze variables
  − Values, max, min, number of variables, coverage or % missing data, distribution shapes, etc.
  − Outliers
  − Anomalous values
  − Example variable: number of days from arrest to adjudication
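A per-variable profile of the kind listed above might look like the following sketch. It rests on assumptions: missing values are represented as `None`, outliers are flagged with the common 1.5 × IQR rule (the deck does not say which rule was used), and negative day counts are treated as anomalous. The function name `profile` is illustrative.

```python
import statistics


def profile(values):
    """Summarize one numeric variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    pct_missing = 100.0 * (len(values) - len(present)) / len(values)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "pct_missing": pct_missing,
        # Assumed outlier rule: outside 1.5 * IQR from the quartiles.
        "outliers": [v for v in present if v < low or v > high],
        # Assumed domain rule: adjudication cannot precede arrest.
        "anomalous": [v for v in present if v < 0],
    }


# Made-up sample: days from arrest to adjudication.
days_to_adjudication = [3, 5, 4, None, 6, 5, 400, -2, 4, 5]
summary = profile(days_to_adjudication)
```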
Data Transformation
• To clean or not to clean
  − Strategic decision
  − Decision criteria
• Identity normalization
  − Alias challenge
  − Alias challenge as it relates to data science
    • Model creation
    • Model score
Data Transformation: Enriching the data by adding context to data
Context: the cumulative history derived from data observations about entities
• Example: safety of firefighters
  − Current environmental temperature
  Or
  − Current environmental temperature and temperature history for that person
  Or
  − Current environmental temperature, temperature history for that person, and how long it will take to exit the building
Data Transformation: Threat
NCIC Charge Code    NCIC Charge Category    NCIC Charge
101                 Sovereignty             Treason
105                 Sovereignty             Sedition
• There are several thousand codes
  − Which codes are considered "threat"?
  − How do codes compare in "threat"?
  − How do you combine codes?
  − How do you factor in the temporal aspect of crime?
Scoring: Threat to Community
• Two main components: scoring of individual charges and crime history
• Scoring of each charge is derived from two parameters
  − A loss-of-memory parameter, which determines how fast the severity of the charge declines over time (this parameter might be zero)
  − A lack-of-forgiveness parameter, which determines what proportion of the original severity level remains forever
• Scoring of crime history
  − Scores of each charge/conviction are accumulated (the model determines how)
Scoring flow:
1. For each crime, look up the scoring parameters and the time the crime was committed, and evaluate the individual crime scores
2. Submit all of those scores into the cumulative history scoring function
3. Result: threat to the community of the offender
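One way to read the two parameters is as an exponential decay toward a permanent floor. The sketch below is an illustration under assumptions, not the deck's actual scoring function: the exponential shape, the plain sum used to accumulate history, and all names (`charge_score`, `threat_to_community`, `params`) are hypothetical.

```python
import math


def charge_score(severity, years_since, memory_loss, forgiveness_floor):
    """Score one charge.

    memory_loss: decay rate per year (0 means the score never fades).
    forgiveness_floor: fraction of the original severity that remains
        forever, between 0 and 1.
    The exponential form is an assumption; the source gives no formula.
    """
    decaying = (1.0 - forgiveness_floor) * math.exp(-memory_loss * years_since)
    return severity * (forgiveness_floor + decaying)


def threat_to_community(charges, params):
    """Accumulate per-charge scores.

    charges: iterable of (charge_code, years_since) pairs.
    params: charge_code -> (severity, memory_loss, forgiveness_floor).
    A plain sum stands in for whatever accumulation the real model uses.
    """
    total = 0.0
    for code, years_since in charges:
        severity, memory_loss, floor = params[code]
        total += charge_score(severity, years_since, memory_loss, floor)
    return total
```

With this form, a charge scores its full severity on the day it occurs, stays at full severity when the loss-of-memory parameter is zero, and otherwise approaches `severity * forgiveness_floor` as time passes.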
Data Transformation
• Target variable: time to release
  − The time-to-release variable is obtained by subtracting the booking time stamp from the release time stamp.
• Counting
  − Total number of a type of crime
  − Total number of a specific threat-to-community grouping
• Distance variables
  − Compare ZIP codes of booking location and arrestee's home to determine if arrestee is "local" to booking locality
• Date stamp variables
  − Did the booking date stamp occur on a weekend?
  − Did the booking date stamp occur on a holiday, during a holiday period, or just before a holiday?
• Time of day
  − Early in shift? Late in shift?
• Net: about 1600 variables created
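A few of these derived variables can be sketched with the standard library. The field names, the ZIP codes, the holiday set, and the "same ZIP means local" rule are assumptions made for illustration; the deck does not say how locality or holiday windows were actually computed.

```python
from datetime import date, datetime


def booking_features(booked, released, booking_zip, home_zip, holidays):
    """Derive a few model inputs from one booking record."""
    return {
        # Target variable: release time stamp minus booking time stamp, in hours.
        "hours_to_release": (released - booked).total_seconds() / 3600.0,
        # datetime.weekday() returns 5 for Saturday and 6 for Sunday.
        "booked_on_weekend": booked.weekday() >= 5,
        "booked_on_holiday": booked.date() in holidays,
        # Crude locality test (assumed): identical ZIP codes.
        "is_local": booking_zip == home_zip,
    }


def count_by_category(charge_categories):
    """Counting variables: total number of charges per category."""
    counts = {}
    for category in charge_categories:
        counts[category] = counts.get(category, 0) + 1
    return counts


# Made-up record: booked on Saturday 2 July 2016, released the next morning.
example = booking_features(
    booked=datetime(2016, 7, 2, 22, 0),
    released=datetime(2016, 7, 3, 10, 0),
    booking_zip="12345",
    home_zip="12345",
    holidays={date(2016, 7, 4)},
)
```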
Modeling Process is Iterative
Predictive Modeling Algorithm: Train Model
Evaluate and Tweak Model
Score and Assess Model
Divide Data Set into 3 Segments
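Dividing the data set into three segments is commonly done as a train / validation / test split. The sketch below assumes that reading, and the 60/20/20 proportions and the name `split_three` are illustrative; the deck only states that there are three segments.

```python
import random


def split_three(records, train_frac=0.6, validate_frac=0.2, seed=0):
    """Shuffle and split records into train, validation, and test segments."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_validate = int(len(shuffled) * validate_frac)
    train = shuffled[:n_train]
    validate = shuffled[n_train:n_train + n_validate]
    test = shuffled[n_train + n_validate:]
    return train, validate, test
```

Under this reading, the model is trained on the first segment, tweaked against the second, and scored once on the third, so that the final assessment uses data the model has not seen.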
Picking a Model
• Target variable characteristics (binary, continuous, etc.) typically dictate model selection
• Model selection
  − Assessment via accuracy and error
  − Different models can select different variables as predictors
Predictive Modeling Environment
Model – Decision Tree
Scoring - How good is the Model?
• Mission dictates model accuracy requirements
• Lots of different measurements of goodness
  − Model confidence
  − Two types of error
    • Number of people who were predicted to be released AND were not
    • Number of people who were not predicted to be released AND were
  − A number of other scoring mechanisms
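The two error types can be counted directly from predicted and actual outcomes. This is a minimal sketch for a binary "released" prediction; the function and key names are illustrative.

```python
def release_errors(predicted, actual):
    """Count the two error types for a binary 'released' prediction.

    predicted, actual: equal-length sequences of booleans (True = released).
    """
    predicted_released_but_held = 0  # predicted release, person was not released
    predicted_held_but_released = 0  # predicted no release, person was released
    correct = 0
    for p, a in zip(predicted, actual):
        if p and not a:
            predicted_released_but_held += 1
        elif a and not p:
            predicted_held_but_released += 1
        else:
            correct += 1
    return {
        "predicted_released_but_held": predicted_released_but_held,
        "predicted_held_but_released": predicted_held_but_released,
        "accuracy": correct / len(predicted),
    }
```

Which of the two errors matters more depends on the mission, which is why they are reported separately rather than folded into accuracy alone.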
Disappointment
• Horrible accuracy and error
• Rethink assumptions
• Aha moment
Deployment
• Models (or rules) get deployed to the mission environment
  − Can deploy more than one model
• Model should exploit new data as it arrives
• Predictive power of models must be monitored over time
  − Develop thresholds that define the limits of allowable model variance; if the model exceeds those limits, it must be recalibrated
  − Need to establish a monitoring mechanism
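A threshold check of this kind can be very small. The sketch below assumes accuracy is the monitored measure and that a fixed allowable drop from the accuracy at deployment defines the limit; the name `needs_recalibration` and the default `max_drop` value are hypothetical.

```python
def needs_recalibration(baseline_accuracy, recent_accuracies, max_drop=0.05):
    """Return True when recent accuracy falls outside the allowed band.

    max_drop is a hypothetical threshold: the largest tolerated decrease
    from the accuracy measured at deployment time.
    """
    if not recent_accuracies:
        return False  # nothing observed yet, so nothing to act on
    recent = sum(recent_accuracies) / len(recent_accuracies)
    return (baseline_accuracy - recent) > max_drop
```

Running such a check on each new batch of scored cases is one simple form the monitoring mechanism could take.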