AUA Data Science Meetup
TRANSCRIPT
DAVID GEVORKYAN
@davidgev
davidgevorkyan
GRADUATED AUA IN 2008
WHAT IS BIG DATA?
FASHIONABLE TERM?
80% OF DATA EXISTING IN ANY ENTERPRISE IS UNSTRUCTURED DATA
STRUCTURED DATA
SEMI-STRUCTURED
UNSTRUCTURED DATA
RDBMS Data Warehousing
90% OF THE DATA IN THE WORLD TODAY HAS BEEN CREATED IN THE LAST TWO YEARS ALONE
Source: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
4 V'S OF BIG DATA
VOLUME (large amount of data)
VARIETY (sensors, video, audio, email, social)
VELOCITY (speed of data generation)
VERACITY (authenticity and/or accuracy)
SOLUTIONS REQUIRED
forces you to change the way you
• COLLECT
• TRANSPORT
• STORE
• MANAGE
• ANALYZE
• VISUALIZE
WHAT IS DATA SCIENCE?
DATA SCIENCE != STATISTICAL ANALYSIS
IT IS SCIENCE AND “ART” OF …
• EXPLORING THE UNKNOWN ABOUT DATA “make discoveries while swimming in the data”
• REFINING THE RESULTS FOR ACCURACY
• DERIVING ACTIONABLE INSIGHT
• CREATING DATA-DRIVEN PRODUCTS
WHO ARE DATA SCIENTISTS?
Drew Conway, 2010
BIG DATA SCIENCE TOOLS?
• Scala, Java, Python, R… (bonus: Clojure, Haskell, Erlang)
• Hadoop, HDFS, MapReduce… (bonus: Spark, Storm, Tez)
• Scalding, HBase, Pig, Hive… (bonus: Shark, Titan, Giraph)
• Flume, Sqoop, ETL, Webscrapers… (bonus: Hume)
• SQL, RDBMS, DW, OLAP… (bonus: SOLR, ElasticSearch)
• Knime, Weka, RapidMiner… (bonus: SciPy, NumPy, Pandas)
• D3.js, Kibana, ggplot2, Tableau… (bonus: Shiny, Flare, Datameer)
• SPSS, Matlab, SAS… (the enterprise man)
• NoSQL, MongoDB, Cassandra, CouchDB
• And Yes! … MS-Excel: the most used, most underrated DS tool
GOAL?
• Revenue, revenue, revenue
• Improve the customer experience
• Increase operational efficiency
• GE: Optimize maintenance intervals for industrial products
• Google: Refine search and ad-serving algorithms
• Zynga: Optimize the game experience for both long-term engagement and revenue
• Netflix: Movie recommendations
• Kaplan: Uncover effective learning strategies
• eHarmony: Create happy relationships
WHO ARE WE?
TRADITIONAL METHODS DO NOT WORK ANYMORE…
EHARMONY CREATES THE HAPPIEST, MOST PASSIONATE AND MOST FULFILLING RELATIONSHIPS*
* ACCORDING TO A RECENT STUDY
438 MARRIAGES PER DAY
THE DIFFERENCE?
Compatibility Matching System®
COMPATIBILITY MATCHING
AFFINITY MATCHING
MATCH DISTRIBUTION
UNIDIRECTIONAL USER DEFINED CRITERIA
Leo
Ian
Steve
Nicolette
BIDIRECTIONAL
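The contrast between unidirectional and bidirectional user defined criteria can be sketched in a few lines of Python. This is only an illustration, not eHarmony's implementation: the `Profile` fields, the ages, and the single age-range criterion are all hypothetical; only the names come from the slide. Unidirectional filtering keeps everyone who satisfies the seeker's criteria, while bidirectional filtering additionally requires that the seeker satisfies the candidate's criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    # All fields are hypothetical; a real profile would carry many criteria.
    name: str
    age: int
    min_age: int  # youngest partner this user will accept
    max_age: int  # oldest partner this user will accept


def accepts(seeker: Profile, candidate: Profile) -> bool:
    """True if the candidate falls inside the seeker's age range."""
    return seeker.min_age <= candidate.age <= seeker.max_age


def unidirectional(seeker: Profile, pool: list) -> list:
    """Only the seeker's criteria are checked."""
    return [c.name for c in pool if accepts(seeker, c)]


def bidirectional(seeker: Profile, pool: list) -> list:
    """Both sides' criteria must hold."""
    return [c.name for c in pool if accepts(seeker, c) and accepts(c, seeker)]


# Made-up data for illustration.
nicolette = Profile("Nicolette", age=30, min_age=28, max_age=40)
pool = [
    Profile("Leo", age=35, min_age=25, max_age=32),    # mutual fit
    Profile("Ian", age=29, min_age=20, max_age=27),    # Ian would not accept Nicolette
    Profile("Steve", age=45, min_age=25, max_age=35),  # Nicolette would not accept Steve
]

print(unidirectional(nicolette, pool))
print(bidirectional(nicolette, pool))
```

With this made-up data, the unidirectional filter keeps Leo and Ian, and the bidirectional filter keeps only Leo.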
150 questions
Personality Values Attributes Beliefs
Intellect Energy
Sociability Ambition
Kindness Curiosity
Humor Spirituality
COMPATIBILITY MATCHING
USER DEFINED CRITERIA
COMPATIBILITY MODELS
MONGODB
VOLDEMORT
MONGODB DATA STORE NEEDS
POWERFUL INDEXING MODELS
FAST MULTI-ATTRIBUTE SEARCHES
EASY TO MAINTAIN
60M+ QUERIES per day
MONGODB WINS
AUTO SCALING
BUILT-IN SHARDING
AUTO BALANCING
MMS
VOLDEMORT?
THAT NAME SOUNDS FAMILIAR
VOLDEMORT DATA STORE NEEDS
CRUD OPERATIONS
VARIED TRANSACTION SIZES
BILLION+ POTENTIAL MATCHES per day
VOLDEMORT WINS
AUTO REPLICATION
AUTO PARTITIONING
PLUGGABLE SERIALIZATION
AFFINITY MATCHING
65 30
3000 miles
[Chart: comm probability vs. distance in miles, axis from 0 to 4095]
[Chart: comm probability vs. height difference in cm, axis from -29 to 56; annotation: 4-8 in]
WORDS TO USE
SOME INSIGHT
DATA NEEDS FOR AFFINITY
50M+ REGISTERED USERS
103 ATTRIBUTES
107 DAILY MATCHES
250M+ PHOTOS
4B+ QUESTIONNAIRES ANSWERED
COMMUNICATION AGGREGATES
EVENT LISTENER SERVICE
USER ACTIVITY SERVICE
~5MS RESPONSE TIMES
10K EVENTS PER SECOND
USER SERVICE
HOURLY, DAILY TOTAL
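As a rough illustration of what hourly and daily communication aggregates could look like, here is a minimal Python sketch. The event shape (a user id plus a Unix timestamp), the function name `aggregate`, and the bucket-key format are assumptions made for this example; the slides do not describe how the event listener or user activity service is actually built.

```python
from collections import Counter
from datetime import datetime, timezone


def aggregate(events):
    """Roll (user_id, unix_timestamp) events up into hourly and daily counts.

    Hypothetical event shape; bucket keys are (user_id, UTC time prefix).
    """
    hourly = Counter()
    daily = Counter()
    for user_id, ts in events:
        t = datetime.fromtimestamp(ts, tz=timezone.utc)
        hourly[(user_id, t.strftime("%Y-%m-%dT%H"))] += 1
        daily[(user_id, t.strftime("%Y-%m-%d"))] += 1
    return hourly, daily


# Made-up events: three for u1 on day one, one for u2 on day two.
events = [
    ("u1", 0),      # 1970-01-01 00:00 UTC
    ("u1", 1800),   # 1970-01-01 00:30 UTC
    ("u1", 3600),   # 1970-01-01 01:00 UTC
    ("u2", 86400),  # 1970-01-02 00:00 UTC
]

hourly, daily = aggregate(events)
print(hourly[("u1", "1970-01-01T00")], daily[("u1", "1970-01-01")])
```

At something like 10K events per second, a production service would keep such counters in a shared store rather than in process memory; the sketch only shows the bucketing logic.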
OFFLINE BATCH JOBS
USER SERVICE
MAP-SIDE JOINS (TB) SCORING
1+GB Compressed Protocol Buffers
PAIRINGS SERVICE
750M Compressed Protocol Buffers
BILLION+ POTENTIAL MATCHES
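The idea behind a map-side join can be sketched in plain Python. This shows the in-memory (broadcast) variant, where one side of the join is small enough to be loaded as a lookup table inside every mapper so that no shuffle or reduce phase is needed. Whether the offline jobs in the talk use this variant or a merge join over pre-partitioned inputs is not stated, and the names `map_side_join`, `pairings`, and `users_by_id` are hypothetical.

```python
def map_side_join(pairings, users_by_id):
    """Join a stream of (user_a, user_b) pairings against an in-memory user table.

    Each pairing is resolved during the map step by dictionary lookup,
    so no shuffle or reduce phase is required. Pairings that reference
    an unknown user are skipped.
    """
    for a_id, b_id in pairings:
        a = users_by_id.get(a_id)
        b = users_by_id.get(b_id)
        if a is None or b is None:
            continue
        yield (a_id, b_id, a, b)


# Made-up data: a small user table and a stream of candidate pairings.
users_by_id = {"u1": {"age": 30}, "u2": {"age": 35}}
pairings = [("u1", "u2"), ("u1", "u9")]  # u9 is missing from the table

joined = list(map_side_join(pairings, users_by_id))
print(joined)
```

Each joined record carries both users' attributes, which is the shape a downstream scoring step would need.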
AMAZON EMR
AWS DIRECT CONNECT
256 NODES 50TB STORAGE
IN-HOUSE SEAMICRO
DATA RETRIEVAL LATENCY
LOW OPERATIONAL COST
LOW POWER CONSUMPTION
PREDICTABLE COMPLETION TIMES
MODEL RETRAINING
distcp
Protocol Buffers from Offline Jobs
MATCH DISTRIBUTION
Delivering the right matches at the right time to as many people as possible across the entire network
THANK YOU
QUESTIONS?
CREDITS:
Visual Elements From The Noun Project
http://thenounproject.com