![Page 1: A day in the life of a functional data scientist - QCon … · A day in the life of a functional data scientist Richard Minerich, Director of R&D at Bayard Rock @Rickasaurus. Projecting](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/1.jpg)
A day in the life of a functional data scientist
Richard Minerich, Director of R&D at Bayard Rock
@Rickasaurus
![Page 2](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/2.jpg)
![Page 3](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/3.jpg)
Projecting onto a 2D Plane
![Page 4](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/4.jpg)
The Pairwise Entity Resolution Process
Blocking
• Two Datasets (Customer Data and Sanctions)
• Pairs of Somehow Similar Records
Scoring
• Pairs of Records
• Probability of Representing Same Entity
Review
• Records, Probability, Similarity Features
• True/False Labels (Mostly by Hand)
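The three stages above can be sketched as a small F# pipeline. This is only an illustration under stated assumptions: the `Record` and `ScoredPair` types, the first-letter blocking key, and the exact-match scoring are hypothetical stand-ins, not the system described in the talk.

```fsharp
open System

type Record = { Id: int; Name: string; Dob: string }

type ScoredPair =
    { Customer: Record
      Sanction: Record
      Features: float list   // similarity features shown to the reviewer
      Probability: float }   // estimated chance both records are the same entity

// Blocking key (assumption): first letter of the upper-cased name.
let blockKey (r: Record) =
    if r.Name.Length = 0 then ' ' else Char.ToUpperInvariant(r.Name.[0])

// Blocking: two datasets in, pairs of somehow similar records out.
let block (customers: Record list) (sanctions: Record list) =
    let byKey = sanctions |> List.groupBy blockKey |> Map.ofList
    customers
    |> List.collect (fun c ->
        match Map.tryFind (blockKey c) byKey with
        | Some candidates -> candidates |> List.map (fun s -> (c, s))
        | None -> [])

// Scoring: pairs in, probability out. The mean of two exact-match
// features stands in for a trained model here.
let score (c: Record, s: Record) =
    let same (a: string) (b: string) =
        if String.Equals(a, b, StringComparison.OrdinalIgnoreCase) then 1.0 else 0.0
    let features = [ same c.Name s.Name; same c.Dob s.Dob ]
    { Customer = c; Sanction = s; Features = features; Probability = List.average features }

// Review: only pairs at or above a threshold go to a human for a true/false label.
let reviewQueue (threshold: float) (pairs: (Record * Record) list) =
    pairs
    |> List.map score
    |> List.filter (fun p -> p.Probability >= threshold)
    |> List.sortByDescending (fun p -> p.Probability)
```

Blocking keeps the number of scored pairs far below the full cross product of the two datasets, which is why it comes first.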
![Page 5](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/5.jpg)
Blocking
![Page 6](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/6.jpg)
Scoring: Risk vs Probability (The Ideal)
[Plot axes: Likely to Launder Money; Probably the Same Person]
![Page 7](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/7.jpg)
The Reality (Dominated by Garbage)
[Plot annotations: Tiny Bump; Upper Threshold; values 937, 161 and 161,358]
![Page 8](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/8.jpg)
Let’s dig into a single point
Jimmy Cournoyer
El: 95 / SI: 16
![Page 9](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/9.jpg)
![Page 10](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/10.jpg)
Citation Network (Safe View)
![Page 11](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/11.jpg)
Relationship Network (Safe View)
![Page 12](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/12.jpg)
British Columbia
Rizzuto Crime Family
Jimmy “Cosmo” “Superman” Cournoyer
Quebec
New York/NYC
Bonanno Crime Family
John “Big Man” Venizelos
Reinvested in Cocaine
California
Flow of Drugs
Hells Angels
El Chapo
Sinaloa Cartel
![Page 13](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/13.jpg)
Jorge Hank Rhon
Family & Friends
$100s of Millions
Citibank, CH
Brother Murdered
![Page 14](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/14.jpg)
[Bar chart: % Time Spent (axis 0 to 0.6) on Munging Data, Redoing Work / Investigating Problems, and Fun Algorithms]
![Page 15](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/15.jpg)
Disgustingly Bad but Fairly Large Datasets
▪ Both Wide (many fields) and Tall (many records)
▪ From different systems (different encodings)
▪ Missing data
▪ Poorly merged data
▪ Extra data
▪ Non-unique IDs
Every client is awful in a completely different way.
| Field | Value |
| --- | --- |
| NAME | LARRY O BRIAN |
| STATE | CANADA |
| CITY | 121 Buffalo Drive, Montreal, Quebec H3G 1Z2 |
| ADDRESS | NULL |
| ZIP | 00000 |
| DOB | 10/24/80; 1/1/1979 |
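A record like the one above shows two recurring problems: placeholder values standing in for missing data, and several values packed into one field. A minimal F# sketch of handling both follows. The function names, the placeholder set, and the accepted date formats are assumptions made for illustration, not part of the speaker's tooling.

```fsharp
open System
open System.Globalization

// Treat empty strings and known placeholders (e.g. "NULL", "00000") as missing.
let tryClean (placeholders: Set<string>) (raw: string) =
    let v = if isNull raw then "" else raw.Trim()
    if v = "" || Set.contains (v.ToUpperInvariant()) placeholders then None else Some v

// Date formats assumed for this example: US-style month/day/year.
let dateFormats = [| "M/d/yyyy"; "M/d/yy" |]

// Split a multi-valued date field on ';' and keep every value that parses.
let parseDates (raw: string) =
    raw.Split(';')
    |> Array.choose (fun (s: string) ->
        match DateTime.TryParseExact(s.Trim(), dateFormats, CultureInfo.InvariantCulture, DateTimeStyles.None) with
        | true, d -> Some d
        | _ -> None)
    |> Array.toList
```

Returning `option` and `list` values keeps "missing" and "more than one" visible in the types instead of hiding them in sentinel strings.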
![Page 16](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/16.jpg)
SAM – Building for Bad Data
▪ Lazy Pure Functional Core
▪ Programmable Data Cleaning
▪ Programmable ETL
▪ Ad-Hoc Behaviors
All with an F# Core and Barb for scripting.
UI (C#) & Analysis (C#)
Glue (F# and Barb)
Data & Config In
Data Out
Algorithms (F#)
![Page 17](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/17.jpg)
Other Kinds of Problems (sometimes even my fault)
▪ Extra / Missing Data (e.g. incorrect subset or incorrect joins)
▪ Wrong version of data (e.g. bad sync in SQL)
▪ Bad configuration of dependencies
The data lives in a locked down environment and so feedback cycles are slow.
Lesson: Be Paranoid
![Page 18](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/18.jpg)
F# Tools From Bayard Rock
http://github.com/BayardRock
![Page 19](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/19.jpg)
FSharpWebIntellisense
https://github.com/BayardRock/FSharpWebIntellisense
![Page 20](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/20.jpg)
iFSharp Notebook
https://github.com/BayardRock/IfSharp
![Page 21](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/21.jpg)
Barb, a simple .net record query language
Name.Contains "John" and (Age > 20 or Weight > 200)
https://github.com/Rickasaurus/Barb
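For comparison, the same filter written as an ordinary F# predicate might look like the sketch below. It deliberately does not call Barb, since the library's API is not shown on the slide; the `Person` record is a hypothetical type used only to give the three fields a home.

```fsharp
// Hypothetical record with the three fields the query refers to.
type Person = { Name: string; Age: int; Weight: int }

// Plain F# equivalent of: Name.Contains "John" and (Age > 20 or Weight > 200)
let matchesQuery (p: Person) =
    p.Name.Contains("John") && (p.Age > 20 || p.Weight > 200)
```

The appeal of a query language here is that the filter can live in configuration or a script as text, while the predicate above has to be compiled into the application.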
![Page 22](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/22.jpg)
MITIE Dot Net (a wrapper for MIT’s MITIE)
https://github.com/BayardRock/MITIE-Dot-Net
A Pegasus Airlines plane landed at
an Istanbul airport Friday after a
passenger "said that there was a
bomb on board" and wanted the
plane to land in Sochi, Russia, the
site of the Winter Olympics, said
officials with Turkey's
Transportation Ministry.
| Tokens | Classification |
| --- | --- |
| Pegasus Airlines | ORGANIZATION |
| Istanbul | LOCATION |
| Sochi | LOCATION |
| Russia | LOCATION |
| Turkey | LOCATION |
| Transportation Ministry | ORGANIZATION |
![Page 23](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/23.jpg)
Other F# Community Tools (Not by Us)
▪ Data Type Providers (SQL, OData, CSV, etc.)
▪ Language Type Providers (R, Matlab, Python soon)
▪ Deedle (like Pandas but for F#)
▪ F# Charting
![Page 24](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/24.jpg)
The Magic of Type Providers
type Netflix = ODataService<"http://odata.netflix.com">
let netflix = Netflix.GetDataContext()
let avatarTitles = query { for t in netflix.Titles do
                           where (t.Name.Contains "Avatar")
                           sortBy t.Name
                           take 100 }
![Page 25](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/25.jpg)
How it works!
Type Provider
Compiler
Types
Erased Types
The World!
Type Providers! Libraries For Free!
![Page 26](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/26.jpg)
Deedle (Like Python’s pandas but for F#)
▪ Designed with Data Type Providers in Mind
▪ Interops with the R Type Provider
![Page 27](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/27.jpg)
But what about algorithmic code?
![Page 28](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/28.jpg)
Ranking vs Regression
▪ Regression - you’re trying to guess a number, only distance matters
▪ May do a very bad job at ordering
▪ In Ranking you’re trying to figure out some order, only order matters
▪ May do a very bad job at providing a meaningful number
Example: You’re a doctor with 20 spots open and 100 patients who want to see you today. Which method would be best for selecting 20?
![Page 29](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/29.jpg)
Regression
𝑦 = 𝑋𝛽 + 𝜀
y is labels
X is features
𝛽 is weights
𝜀 is errors
![Page 30](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/30.jpg)
“OLS” Regression via Gradient Descent in F#
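The code on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of fitting the weights 𝛽 in 𝑦 = 𝑋𝛽 + 𝜀 by batch gradient descent on mean squared error. It is not the speaker's code: the function names, the fixed learning rate, and the fixed iteration count are all assumptions.

```fsharp
// Dot product of two equally sized vectors.
let dot (a: float[]) (b: float[]) =
    Array.fold2 (fun (acc: float) (x: float) (y: float) -> acc + x * y) 0.0 a b

// One batch gradient-descent step on mean squared error.
// xs holds one feature row per sample, ys the matching labels.
let step (rate: float) (xs: float[][]) (ys: float[]) (beta: float[]) =
    let n = float xs.Length
    // residual_i = prediction_i - label_i
    let residuals = Array.map2 (fun x y -> dot x beta - y) xs ys
    beta
    |> Array.mapi (fun j b ->
        // Partial derivative of the mean squared error with respect to beta_j.
        let gradient =
            2.0 / n * Array.fold2 (fun (acc: float) (x: float[]) (r: float) -> acc + x.[j] * r) 0.0 xs residuals
        b - rate * gradient)

// Start from all-zero weights and repeat the step a fixed number of times.
let fit (rate: float) (iterations: int) (xs: float[][]) (ys: float[]) =
    let start = Array.create (xs.[0].Length) 0.0
    List.fold (fun beta _ -> step rate xs ys beta) start [ 1 .. iterations ]
```

A constant 1.0 as the first feature of every row gives the model an intercept. If the rate is too large for the data, the weights diverge instead of converging, so a real implementation would check the loss between steps.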
![Page 31](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/31.jpg)
Simple Ranking? You Can Use Regression.
▪ The features are the difference in would-be regression features
▪ The value to predict is the difference in rank
Select 2 labeled samples randomly => (x1,y1) (x2,y2)
x = x1 - x2
y = y1 - y2
| | Sample 1 | Sample 2 | Result |
| --- | --- | --- | --- |
| Names? | 1 | 1 | 0 |
| Addresses? | 1 | 0 | 1 |
| DOB? | 0 | 1 | -1 |
| Same Person? | 0 | 0 | 0 |
![Page 32](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/32.jpg)
Simple Ranking in F#
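The code on this slide is also missing from the transcript. The sketch below shows the pairwise-difference idea from the previous slide: draw two labeled samples, subtract features and labels, and hand the differences to any regression. The names `pairDifference`, `samplePairs`, and `rankByScore` are hypothetical, and this is not the speaker's implementation.

```fsharp
// Turn two labeled samples (x1, y1) and (x2, y2) into one
// regression example: x = x1 - x2, y = y1 - y2.
let pairDifference (x1: float[], y1: float) (x2: float[], y2: float) =
    (Array.map2 (-) x1 x2, y1 - y2)

// Draw `count` random pairs from the labeled samples and convert each one.
let samplePairs (rng: System.Random) (count: int) (samples: (float[] * float)[]) =
    Array.init count (fun _ ->
        let a = samples.[rng.Next(samples.Length)]
        let b = samples.[rng.Next(samples.Length)]
        pairDifference a b)

// Once weights have been fit on the differences, order items by their score.
// Only the order is meaningful; the score itself need not be a probability.
let rankByScore (weights: float[]) (items: float[][]) =
    items
    |> Array.sortByDescending (fun x ->
        Array.fold2 (fun (acc: float) (a: float) (w: float) -> acc + a * w) 0.0 x weights)
```

Applied to the two samples in the table on the previous slide, `pairDifference` yields features `[0; 1; -1]` and label `0`, matching the Result column.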
![Page 33](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/33.jpg)
Combined Ranking and Regression – D. Sculley
You can improve your regression with ranking, and your ranking with regression.
The best of both worlds!
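One way to write the combined objective down, as far as I recall Sculley's formulation (worth checking against the paper), is a weighted sum of a regression loss and a pairwise ranking loss. The notation here is assumed rather than taken from the slides: w is the weight vector, D the labeled examples, P the set of sampled pairs, α in [0, 1] the trade-off, and λ the regularization strength.

```latex
\min_{w} \; \alpha \, L(w, D) \;+\; (1 - \alpha) \, L(w, P) \;+\; \frac{\lambda}{2} \lVert w \rVert_2^2
```

With α = 1 this reduces to plain regression, and with α = 0 to pairwise ranking on differences as in the earlier slides.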
![Page 34](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/34.jpg)
Combined Ranking and Regression – D. Sculley @ Google, Inc
![Page 35](https://reader031.vdocument.in/reader031/viewer/2022021712/5b5cc4df7f8b9ac8618cee63/html5/thumbnails/35.jpg)
Thank You!
Check out the NYC F# User Group: http://www.meetup.com/nyc-fsharp
Find out more about F#: http://fsharp.org
Contact me on Twitter: @Rickasaurus