Revolutionize Text Mining with Spark and Zeppelin
TRANSCRIPT
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Revolutionize Text Mining with Spark and Zeppelin
April 2017
Yanbo Liang
Apache Spark committer
Software engineer @ Hortonworks

Agenda
• Text mining workflow on Big Data
• Text mining with Spark and MLlib
• Spark and Zeppelin as the platform

Text Mining: Practical Applications
• Text classification
  – Spam filtering
  – Fraud detection
• Text clustering
• Sentiment analysis
• Entity extraction
• Recommendations
• Automatic labeling
• Contextual advertising

Traditional Text Mining
• Commercial software
  – IBM SPSS, RapidMiner, SAS
• Open source software
  – Gensim, KNIME, NLTK, sklearn, R

Text Mining on Big Data
Data Scientists | Software engineers

Why Apache Spark MLlib
• Scalable machine learning algorithms on top of Spark
  – Alternating Least Squares on Spotify data
    • 50+ million users x 30+ million songs, 50 billion ratings
    • For rank 10 with 10 iterations, ~1 hour running time
• Workflow utilities
  – ML pipeline
  – Model import/export
  – Cross validation

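To put the Spotify figures in perspective, the short calculation below works out how sparse that rating matrix is and how many factor values a rank-10 ALS model has to learn. It uses only the numbers quoted on the slide, taken as lower bounds; the variable names are ours.

```python
# Figures quoted on the slide ("50+ million" and "30+ million" taken as lower bounds).
users = 50_000_000
songs = 30_000_000
ratings = 50_000_000_000
rank = 10

# Fraction of the user x song matrix that is actually observed.
possible_cells = users * songs
density = ratings / possible_cells

# ALS learns one vector of length `rank` per user and per song.
factor_values = (users + songs) * rank

print(f"possible cells: {possible_cells:.2e}")
print(f"density:        {density:.2e}")
print(f"factor values:  {factor_values:,}")
```

Only about 3 in every 100,000 cells are filled, which is why a low-rank factorization is a natural fit for this kind of data.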
Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Data Science | Software engineering

Load data
Text | Label
I bought the game… | 4
Do NOT bother try… | 1
This shirt is awesome… | 5
never got it. Seller… | 1
I ordered this to… | 3
Dataset | Feature engineering | Model training | Model evaluation

Extract features
Text | Label | Words | Features
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …]
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …]
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …]
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …]
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …]
Dataset | Feature engineering | Model training | Model evaluation

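The Text, Words and Features columns can be illustrated with a small pure-Python sketch. This is not Spark code: the two review strings are hypothetical (the slide's texts are truncated), and the vocabulary ordering, most frequent word first with ties broken alphabetically, is an assumption made here so the output is deterministic.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def build_vocabulary(token_lists):
    """Give each distinct word an index, most frequent first (ties alphabetical)."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered)}

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for word in tokens:
        if word in vocabulary:
            vector[vocabulary[word]] += 1
    return vector

# Hypothetical reviews, not the truncated texts from the slide.
reviews = ["I bought the game", "I never got the game"]
words = [tokenize(r) for r in reviews]
vocabulary = build_vocabulary(words)
features = [count_vector(w, vocabulary) for w in words]

for text, w, f in zip(reviews, words, features):
    print(text, "|", w, "|", f)
```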
Fit a model
Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.84 |
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.62 |
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.95 |
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.71 |
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.74 |
Dataset | Feature engineering | Model training | Model evaluation

Evaluate
Text | Label | Words | Features | Probability | Prediction
I bought the game… | 4 | “i”, “bought”, … | [1, 0, 3, 9, …] | 0.84 |
Do NOT bother try… | 1 | “do”, “not”, … | [0, 0, 11, 0, …] | 0.62 |
This shirt is awesome… | 5 | “this”, “shirt”, … | [0, 2, 3, 1, …] | 0.95 |
never got it. Seller… | 1 | “never”, “got”, … | [1, 2, 0, 0, …] | 0.71 |
I ordered this to… | 3 | “i”, “ordered”, … | [1, 0, 0, 3, …] | 0.74 |
Dataset | Feature engineering | Model training | Model evaluation

Key abstraction of Spark ML pipeline
• Transformer
  – Feature transformers (e.g., HashingTF) and trained ML models (e.g., NaiveBayesModel).
• Estimator
  – ML algorithms for training models (e.g., NaiveBayes).
• Evaluator
  – These evaluate predictions and compute metrics, useful for tuning algorithm parameters (e.g., BinaryClassificationEvaluator).

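The three roles can be sketched in a few lines of plain Python. This is a toy illustration of the contract only, not Spark's classes: the names Lowercase, MajorityLabel, MajorityLabelModel and AccuracyEvaluator are hypothetical, and rows are plain dictionaries rather than DataFrames.

```python
from collections import Counter

class Lowercase:
    """Transformer: maps rows to rows without learning anything."""
    def transform(self, rows):
        return [dict(row, text=row["text"].lower()) for row in rows]

class MajorityLabelModel:
    """Trained model: also a transformer, it appends a prediction."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        return [dict(row, prediction=self.label) for row in rows]

class MajorityLabel:
    """Estimator: fit() learns from data and returns a model."""
    def fit(self, rows):
        counts = Counter(row["label"] for row in rows)
        return MajorityLabelModel(counts.most_common(1)[0][0])

class AccuracyEvaluator:
    """Evaluator: reduces predictions to a single metric."""
    def evaluate(self, rows):
        hits = sum(row["prediction"] == row["label"] for row in rows)
        return hits / len(rows)

# Hypothetical toy data.
rows = [
    {"text": "Great", "label": 1},
    {"text": "Awful", "label": 0},
    {"text": "Love it", "label": 1},
]
prepared = Lowercase().transform(rows)
model = MajorityLabel().fit(prepared)
scored = model.transform(prepared)
accuracy = AccuracyEvaluator().evaluate(scored)
print(accuracy)
```

The point of the split is that a fitted model exposes the same transform() shape as a feature transformer, so both kinds of stage can be chained in one pipeline.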
Spark’s Text Mining algorithms
• LDA for topic model
• Word2Vec: an unsupervised way to turn words into features based on their meaning
• CountVectorizer turns documents into vectors based on word count
• HashingTF-IDF calculates important words of a document with respect to the corpus
• And much more

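As a rough illustration of the HashingTF-IDF idea, the sketch below hashes tokens into a fixed number of buckets and then down-weights buckets that occur in many documents. It is a minimal pure-Python sketch under stated assumptions: Spark uses its own hash function, so `zlib.crc32` here is a stand-in; the bucket count of 16 and the sample documents are hypothetical; and the smoothed weight log((N + 1) / (df + 1)) is, to our knowledge, the form Spark's IDF documents.

```python
import math
import zlib

NUM_FEATURES = 16  # hypothetical; kept small so the vectors are easy to read

def hashing_tf(tokens, num_features=NUM_FEATURES):
    """Term frequencies indexed by a hash of the token (no vocabulary needed)."""
    vector = [0] * num_features
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        vector[index] += 1
    return vector

def idf_weights(tf_vectors):
    """Smoothed inverse document frequency per bucket: log((N + 1) / (df + 1))."""
    n_docs = len(tf_vectors)
    weights = []
    for i in range(len(tf_vectors[0])):
        doc_freq = sum(1 for v in tf_vectors if v[i] > 0)
        weights.append(math.log((n_docs + 1) / (doc_freq + 1)))
    return weights

def tf_idf(tf_vector, weights):
    """Scale each term frequency by its bucket's IDF weight."""
    return [tf * w for tf, w in zip(tf_vector, weights)]

# Hypothetical tokenized documents.
docs = [["spark", "text", "mining"], ["spark", "pipeline"], ["spark", "zeppelin"]]
tf_vectors = [hashing_tf(d) for d in docs]
weights = idf_weights(tf_vectors)
tfidf_vectors = [tf_idf(v, weights) for v in tf_vectors]
```

Because "spark" occurs in every document, its bucket gets weight log(4 / 4) = 0 and drops out of every vector, which is the "important with respect to the corpus" effect. Distinct words can collide in one bucket; that is the price of not keeping a vocabulary.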
MLlib Text Mining Pipeline - classification
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer / HashingTFIDF
StringIndexer
NaiveBayes / LogisticRegression / SVM / MLP
text classification

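The first stages of this pipeline can be mimicked in plain Python to show what each one contributes. This sketch only imitates the behaviour suggested by the stage names; it is not the Spark implementation. The stop-word list, the sample sentence and the label values are hypothetical, and ordering labels by descending frequency (ties alphabetical) is an assumption made here.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "the", "is", "it", "this", "to", "i"}  # hypothetical short list

def regex_tokenize(text, pattern=r"\W+"):
    """Split lowercased text on a regex and drop empty tokens."""
    return [t for t in re.split(pattern, text.lower()) if t]

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def fit_string_indexer(labels):
    """Map string labels to numeric indices, most frequent label first."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

tokens = remove_stop_words(regex_tokenize("This shirt is awesome!"))
label_index = fit_string_indexer(["pos", "neg", "pos", "neutral"])

print(tokens)
print(label_index)
```

The classifier at the end of the pipeline needs numeric labels, which is why a label-indexing step sits between feature extraction and model training.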
MLlib Text Mining Pipeline – topic model
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer / HashingTFIDF
LDA
topic model

MLlib Text Mining Pipeline - recommendation
Dataset
RegexTokenizer
Word2Vec
recommendation

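The slide does not say how word vectors become recommendations. One common approach, assumed here purely for illustration, is to average the word vectors of each document and recommend the item whose vector is closest by cosine similarity. The 3-dimensional vectors, the catalog and all function names below are hypothetical; a trained Word2Vec model would supply real vectors.

```python
import math

# Hypothetical word vectors standing in for a trained Word2Vec model.
WORD_VECTORS = {
    "game": [0.9, 0.1, 0.0],
    "play": [0.8, 0.2, 0.1],
    "shirt": [0.0, 0.9, 0.2],
    "wear": [0.1, 0.8, 0.3],
}

def document_vector(tokens):
    """Average the vectors of the known words in a document."""
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return [0.0, 0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(a, b):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def recommend(query_tokens, catalog):
    """Return the catalog item whose document vector is most similar to the query."""
    query = document_vector(query_tokens)
    return max(catalog, key=lambda name: cosine(query, document_vector(catalog[name])))

catalog = {"item_a": ["play", "game"], "item_b": ["wear", "shirt"]}
best = recommend(["game"], catalog)
print(best)
```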
MLlib Text Mining Pipeline
Dataset
RegexTokenizer
StopWordsRemover
CountVectorizer / HashingTFIDF
StringIndexer
NaiveBayes / LogisticRegression / SVM / MLP
LDA
Word2Vec
text classification | topic model | recommendation

Demo
• load the file contents and the categories
• extract feature vectors suitable for machine learning
• train a linear model to perform categorization
• use a grid search strategy to find a good configuration of both the feature extraction components and the classifier
https://github.com/yanboliang/dataworks-munich-2017

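The grid search step in the list above can be sketched independently of any ML library: enumerate every combination of candidate values, score each one, and keep the best. This is a minimal sketch, not the demo's code. The parameter names `num_features` and `reg_param` are hypothetical, and `toy_score` is a stand-in for what would normally be a cross-validated metric.

```python
from itertools import product

def grid(param_grid):
    """Yield one dict per combination of candidate parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))

def grid_search(param_grid, score):
    """Score every combination and return the best one with its score."""
    best_params, best_score = None, float("-inf")
    for params in grid(param_grid):
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score

# Hypothetical grid covering a feature-extraction setting and a classifier setting.
param_grid = {"num_features": [1000, 10000], "reg_param": [0.01, 0.1, 1.0]}

def toy_score(params):
    """Stand-in metric that peaks at num_features=10000, reg_param=0.1."""
    return -abs(params["num_features"] - 10000) / 10000 - abs(params["reg_param"] - 0.1)

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)
```

Searching feature-extraction and classifier settings in one grid matters because the best classifier setting can depend on how the features were built.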
Customizing ML Pipelines
• MLlib 2.1 includes:
  – 30+ feature transformers (Tokenizer, Word2Vec, …)
  – 25+ models (for classification, regression, clustering, …)
  – Model tuning & evaluation
• But some applications require customized
  – Transformers & Models

Options for customization
• Existing use cases:
  – spark-corenlp
  – spark-vlbfgs
• Extend abstractions
  – Transformer
  – Estimator & Model
  – Evaluator

Spark virtual environment
[Diagram: Data Scientist A, Data Scientist B; five boxes labeled Python2.7, three boxes labeled Python3.5]

Text Mining workflow
• Prototype (Python/R)
• Create Pipeline
  – Load dataset
  – Extract raw features
  – Transform features
  – Select key features
  – Fit and choose best models
• Re-implement Pipeline for production (Java/Scala)
• Deploy Pipeline
• Scoring
Data Science | Software engineering
Duplicated and error-prone

ML persistence
• Prototype (Python/R)
• Create Pipeline
• Load Pipeline (Java/Scala)
  – Model.load("s3n://…")
• Deploy in production
Data Science | Software engineering
Persist model or Pipeline: model.save("s3n://…")

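The save-then-load handoff can be illustrated with a toy model that writes its parameters to a language-neutral file and is rebuilt from it. This sketch is an analogy only, not Spark's persistence format: ThresholdModel, its JSON layout and the temporary file path are hypothetical, and a local temp directory stands in for the shared storage location elided on the slide.

```python
import json
import os
import tempfile

class ThresholdModel:
    """Hypothetical trained model: predicts 1 when a score exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, score):
        return 1 if score > self.threshold else 0

    def save(self, path):
        """Write the learned parameters as JSON so another runtime can read them."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"class": "ThresholdModel", "threshold": self.threshold}, f)

    @classmethod
    def load(cls, path):
        """Rebuild the model from the saved parameters."""
        with open(path, encoding="utf-8") as f:
            payload = json.load(f)
        return cls(payload["threshold"])

# The data science side saves; the engineering side loads and scores.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.json")
    ThresholdModel(0.5).save(path)
    restored = ThresholdModel.load(path)

print(restored.predict(0.62))
```

Because only parameters and metadata cross the boundary, the production side does not need to re-implement the training code, which removes the duplication flagged in the earlier workflow.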
Data scientists work with software engineers
Data Scientists | Software engineers
Explore data | Create pipeline | Find best params | Save model | Load model | Deploy in production | Scoring on batch/streaming data