Search in the Apache Hadoop Ecosystem: Thoughts from the Field

1 Search in the Apache Hadoop Ecosystem: Thoughts from the Field Open Source Search Conference, November 2013 Alex Moundalexis [email protected] @technmsg

Upload: alex-moundalexis

Post on 27-Jan-2015


Category:

Technology



DESCRIPTION

This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined to solve specific, sometimes very complex, problems. Drawing on case studies from the field, Mr. Moundalexis shows that rigid, one-size-fits-all traditional systems do not suit every problem, whereas combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.

TRANSCRIPT

Page 1: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

1

Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Open Source Search Conference, November 2013
Alex Moundalexis [email protected] @technmsg

Page 2: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

2

Thoughts of a Former SA

Page 3: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

3

Thoughts of a Former SA Field Guy

Page 4: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Disclaimer  

• Technologies, not products
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• I will likely mention “Cloudera Something”
• Cloudera “products” I reference are open source
  • Apache Licensed
  • Source code is on GitHub
    • https://github.com/cloudera

4

Page 5: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

What This Talk Isn’t About

• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
  • Depends heavily on data and workload
• Coding
  • Algorithms

5

Page 6: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

6  

“The answer to most Hadoop questions is it depends.”

Page 7: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

7

Quick and dirty, more time for use cases.

The Apache Hadoop Ecosystem

Page 8: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

8

Page 9: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Partial Ecosystem

9

[Diagram labels: Hadoop (batch processing, machine learning). Sources: external system, RDBMS / DWH, web server, device logs, via API access, log collection, and DB table import. Destinations: external system, user, RDBMS / DWH, via API access, DB table export, BI tool + JDBC/ODBC, Search, and SQL.]

Page 10: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

HDFS  

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

10

Page 11: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Lots of Commodity Machines

11

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]


Page 12: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html

12

Page 13: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Under the Covers

Page 14: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

You specify map() and reduce() functions. The framework does the rest.
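To make that division of labor concrete, here is a minimal single-process sketch in Python. It is not Hadoop code: `run_mapreduce` is a hypothetical stand-in for the framework, and the word-count job is an assumed example rather than anything from the talk.

```python
from collections import defaultdict

def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """reduce(): combine all values emitted for one key."""
    return word, sum(counts)

def run_mapreduce(lines, mapper, reducer):
    """Hypothetical stand-in for the framework: group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

word_counts = run_mapreduce(["search in hadoop", "hadoop search"], map_fn, reduce_fn)
```

Only `map_fn` and `reduce_fn` are job-specific. In a real cluster the grouping step is the distributed shuffle, which is where the framework earns its keep.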

Page 15: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache HBase

• Random, realtime read/write access
• Key/value columnar store
• (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html

15

Page 16: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Accumulo

• Random, realtime read/write access
• Key/value columnar store
• (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
• Adds cell-level security
• Implemented by National Security Agency
  • Donated to ASF
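A rough sketch of what cell-level security means in practice, written in plain Python rather than against the Accumulo API. It assumes each cell carries a set of required labels and that a reader sees a cell only when holding all of them. Real visibility expressions are richer than this AND-only simplification, and the labels and data below are made up.

```python
# Each cell: (row, column, required_labels, value). Labels are hypothetical.
CELLS = [
    ("patient1", "name", frozenset(), "J. Doe"),
    ("patient1", "diagnosis", frozenset({"medical"}), "flu"),
    ("patient1", "ssn", frozenset({"medical", "admin"}), "000-00-0000"),
]

def scan(cells, authorizations):
    """Return only the cells whose required labels are all held by the reader."""
    auths = set(authorizations)
    return [(row, col, value) for row, col, labels, value in cells if labels <= auths]

visible = scan(CELLS, {"medical"})
```

The point is that filtering happens per cell, not per table or per row, so two readers scanning the same row can get different columns back.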

16

Page 17: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Hive & Pig

• Abstraction of Hadoop’s Java API
  • Hive is SQL-based
  • Pig is more data-flow oriented
• Eases analysis using MapReduce

17

Page 18: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Impala

• SQL-based, but interactive response
• Backed by HDFS or HBase
• Allows for fast iteration/discovery
• Not as fault-tolerant as MapReduce

18

Page 19: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Sqoop & Flume

• Get your data in and out of HDFS
• Sqoop focuses on relational databases
• Flume focuses on log files

19

Page 20: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Hue

• Hadoop User Experience
• Hadoop is largely command line
• Hue provides a UI for end-users
• SDK to build your own apps on top

20

Page 21: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Mahout

• Machine learning algorithms that run on MapReduce
  • Clustering
  • Classification
  • Filtering
• I didn’t study these algorithms in school
• Data science people are excited
• Math people are excited
• I’m excited for them

21

Page 22: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Tika

• Content analysis toolkit
• Simply put, a lot of parsers
• Detect/extract metadata/text from documents
  • HTML
  • XML
  • Office
  • PDF
  • mbox
  • More…

22

Page 23: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache ZooKeeper

• Distributed systems are HARD
  • Everyone was trying to implement the same subsystems
  • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services
  • Configuration
  • Naming
  • Synchronization
  • Group Services

23

Page 24: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Oozie

• Workflow scheduling for Hadoop
• Like cron, but in directed graph fashion
• Out of box hooks:
  • MR
  • Pig
  • Hive
  • Sqoop
  • Impala
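The "directed graph" idea can be sketched with Python's standard `graphlib` module. This is not Oozie's workflow format or API. The action names below are hypothetical, and the sketch only shows how dependencies between actions determine a valid run order.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_cleanup": {"sqoop_import"},
    "hive_aggregate": {"pig_cleanup"},
    "mr_index": {"pig_cleanup"},
    "impala_report": {"hive_aggregate", "mr_index"},
}

def execution_order(dag):
    """Return the actions in an order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
```

Unlike a flat cron table, the graph makes it explicit that `hive_aggregate` and `mr_index` may run in either order, or in parallel, once `pig_cleanup` has finished.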

24

Page 25: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Sentry (incubating)

• Role-based access control for Hive/Impala/Solr
• Regulatory/compliance assurance

25

Page 26: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Morphlines

• In-memory transformations
  • Load, parse, transform, process
  • Records as name-value pairs w/ optional blob/pojo objects
• Java library, embedded in your codebase
• Used to ETL data from Flume and MR into Solr
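The load, parse, transform chain over name-value records can be illustrated in a few lines of Python. Morphlines itself is a Java library, so treat this purely as a sketch of the idea: every function name and the sample log line here are hypothetical.

```python
def read_line(record):
    """Load: treat the raw line as the record's message field."""
    return {"message": record["raw"].strip()}

def parse_kv(record):
    """Parse: split 'key=value' tokens into name-value pairs."""
    fields = dict(token.split("=", 1) for token in record["message"].split())
    return {**record, **fields}

def lowercase_level(record):
    """Transform: normalize one field before indexing."""
    return {**record, "level": record.get("level", "").lower()}

def run_pipeline(record, commands):
    """Pass the record through each command in order."""
    for command in commands:
        record = command(record)
    return record

doc = run_pipeline(
    {"raw": "level=WARN host=node1\n"},
    [read_line, parse_kv, lowercase_level],
)
```

Each command takes a record and returns a new one, which is what makes the chain easy to rearrange. The final record is the kind of flat document a search index expects.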

26

Page 27: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Lucene

• Java-based index and search
• Spellchecking
• Hit highlighting
• Tokenization
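For anyone new to indexing, the core data structure is an inverted index: a map from each term to the documents that contain it. The Python sketch below is a toy illustration of tokenization and lookup, not Lucene's implementation or API, and the sample documents are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "Hadoop stores data in HDFS",
    2: "Solr searches data",
    3: "HDFS is a filesystem",
}
index = build_index(docs)
hits = search(index, "hdfs data")
```

Because the same `tokenize` function runs at index time and at query time, "HDFS" and "hdfs" match. That symmetry is the reason tokenization matters.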

27

Page 28: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache Solr

• Enterprise search platform
• Based on Apache Lucene
• Full-text search
• Faceting
• NRT indexing
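Faceting is simply counting field values across the documents that match a query. A small Python sketch follows, with made-up documents and a naive substring match standing in for real full-text search. It does not use the Solr API.

```python
from collections import Counter

docs = [
    {"title": "HDFS tuning notes", "type": "kb_article"},
    {"title": "HDFS outage", "type": "ticket"},
    {"title": "HBase compaction", "type": "ticket"},
    {"title": "HDFS quota question", "type": "ticket"},
]

def facet_counts(documents, query, field):
    """Count values of `field` among documents whose title contains `query`."""
    matches = [d for d in documents if query.lower() in d["title"].lower()]
    return Counter(d[field] for d in matches)

facets = facet_counts(docs, "hdfs", "type")
```

The counts are what a search UI shows next to each filter, letting a user narrow results without knowing the data in advance.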

28

Page 29: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Apache SolrCloud

• Integration of Solr + ZooKeeper
• Provides for shard failover

29

Page 30: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Search

• Based on Apache Solr (incl Lucene and SolrCloud)
• Fault-tolerance: collections backed by HDFS or HBase
• Integration galore:
  • HBase/Flume/MapReduce w/ Lucene
  • Hue w/ Solr
  • Avro w/ Tika
  • HDFS w/ Solr/Lucene
  • Sentry w/ Solr

30

Page 31: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Search + Hue

31  

Page 32: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera Search + Hue

32  

Page 33: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

33

Apologies, I swiped some pretty slides from marketing…

Why Search?

Page 34: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Search Design Strategy

34

An Integrated Part of the Hadoop System

One pool of data
One security framework
One set of system resources
One management interface

[Diagram labels: Storage (HDFS: TEXT, RCFILE, PARQUET, AVRO, ETC.; HBase: RECORDS), Integration, Resource Management, Metadata. Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R).]

Page 35: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Benefits of Search Integration

35

Improved Big Data ROI
§ An interactive experience without technical knowledge
§ Single data set for multiple computing frameworks

Faster Time to Insight
§ Exploratory analysis, esp. unstructured data
§ Broad range of indexing options to accommodate needs

Cost Efficiency
§ Single scalable platform; no incremental investment
§ No need for separate systems, storage

Solid Foundations & Reliability
§ Solr in production environments for years
§ Hadoop-powered reliability and scalability

Page 36: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

36

So much software…

Making Decisions

Page 37: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

That’s a Lot of Software

• 21 packages, depending on how you count
• And there’s plenty more…
• How to decide what to use?

37

Page 38: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

38  

“The answer to most Hadoop questions is it depends.”

Page 39: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Some of the Big Issues

• Response time
• User interfaces
• Programming paradigm
• Input/output formats
• Use cases

39

Page 40: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Response Time

• MapReduce is batch oriented
  • Resilient to hardware failures
  • Robust scheduling options
• Impala is near-realtime
• HBase is realtime
  • Key/values are cached in memory
• Search can be (near-)realtime.
• Hybrid systems are common!

40

Page 41: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

User Interfaces

• Java
  • MapReduce, HBase
• SQL
  • Hive, Impala
• Shell
  • Pig
• Natural Language / Free Text
  • Search

41

Page 42: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Data Constraints

• MapReduce
  • Paradigm takes some getting used to
  • Processing must accommodate format
• HBase
  • Columnar key/value store
  • Hue makes this easier
• Search
  • Indexing and display
  • Hue makes this easier

42

Page 43: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Input/Output Formats

• Know what they are… optional.
• Don’t know? That’s okay.
• Schema on read.
• Be able to extract what you need
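"Schema on read" means the raw data is stored as-is and a structure is imposed only when someone reads it. A minimal Python sketch, assuming JSON log lines and a hypothetical `read_with_schema` helper:

```python
import json

RAW_LINES = [
    '{"user": "alice", "bytes": 512, "path": "/index.html"}',
    '{"user": "bob", "path": "/search", "agent": "curl"}',
]

def read_with_schema(lines, schema):
    """Apply a schema at read time: keep only the requested fields, with defaults."""
    for line in lines:
        record = json.loads(line)
        yield {name: record.get(name, default) for name, default in schema.items()}

rows = list(read_with_schema(RAW_LINES, {"user": None, "bytes": 0}))
```

The stored lines never change. A different question tomorrow just means reading them with a different schema, which is why knowing the format up front is optional.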

43

Page 44: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Lack of Use Case

• “Big Data” and Hadoop
  • They ENABLE you to solve problems
  • Won’t solve problems for you
  • Doesn’t know about your business logic
  • “Big” is bigger than you’re accustomed to…
• Have a plan
  • Bring your use cases
  • Bring your business questions

44

Page 45: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

45

One typical Hadoop use case.

Index Generation/Serving

Page 46: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

eBay – Cassini Project

• June 2012
  • 2B page views/day
  • 250M searches/day
  • 9 PB online
• Custom search indexes
  • Limited by field or time period

46

Page 47: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

eBay – Cassini Project

• MapReduce to generate indexes
  • Customer history
  • Item fields: name, price, descriptions, etc.
• Bulk import indexes into HBase, served
• 15 TB in HBase, 1.2 TB daily import into HBase
• Ranking algorithms can take into account
  • More history
  • More fields
  • More customer-specific details

47

Page 48: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

48

Some quick examples.

Search Use Cases

Page 49: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Search Use Cases

49

Powerful, proven search capabilities that let organizations:

Offer easy access to non-technical resources

Explore data prior to processing and modeling

Gain immediate access and find correlations in mission-critical data

Page 50: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Monsanto  

50

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch

Page 51: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

51

Cloudera: Internal Field Portal

Custom Aggregated Search

Page 52: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera – Internal Field Portal

• Single stop for field engineers
  • Mailing lists: public, private
  • Tickets: support, development, public ASF
  • Customer data: accounts, clusters, KB articles
  • Customer Clusters: configs, audits, logs, events
  • Books and papers
  • Discussion forums
• Dogfooding, yes
• Makes my life easier

52

Page 53: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera – Internal Field Portal

53  

Page 54: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Cloudera – Internal Field Portal

• Varied fetchers/observers for web/API content
• Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase
• Each collection has collection-specific filters/fields
• Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika
• Impala for analytics
• Future: Use MapReduce to ingest logs

54

Page 55: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

55

Patterns & Predictions: Durkheim Project

Risk Classification & Predictive Analysis

Page 56: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

56 Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

US Combat Deaths AFG 301

2012

Page 57: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

57 Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

US Combat Deaths AFG 301

US Military Suicides 349

2012

Page 58: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

58 Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

US Combat Deaths AFG 301

US Military Suicides 349

349 > 301

2012

Page 59: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

• Assessment of mental health risks
• Correlate veterans’ communications with suicide risk

59

Page 60: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

• Build machine learning algorithms on MapReduce
• Train using expert knowledge
  • Keywords
  • Patterns
• Algorithm detects and assigns risk scores
• In what medium?

60

Page 61: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

61 Image: http://www.flickr.com/photos/42586873@N00/3770782889/

Unstructured Clinical Notes

Page 62: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

• Phase 1
  • 3 cohorts: non-psychiatric, psychiatric, suicide-positive
  • 100 clinical profiles per cohort
  • 65% accurate in predicting suicide risk in control group
• Phase 2
  • Text analytics of clinical records, opt-in social media
  • Goal of 100,000 veteran participants
  • Represents a huge increase of data
• Traditional enterprise search couldn’t scale

62

Page 63: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

• Technologies
  • Hadoop
  • Search
    • Indexing of machine learning, backed by HBase for performance
    • Hue interface for non-technical users
    • Discovery of terms, keywords, risk factors in numerous facets
  • Impala
    • Deep SQL queries if/when interesting deviations are found
    • e.g. if the word “Molly” appeared in top 10 facets
    • Write some SQL to dig in, perhaps revise indexing scheme

63

Page 64: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Patterns & Predictions – Durkheim Project

• Currently
  • Monitoring
  • Analysis
• Future
  • Interventional study
  • Back our hopes with data…
• More detailed Case Study
  • http://goo.gl/3ZJMwS
  • http://durkheimproject.org/

64

Page 65: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

65

Parting thoughts… in no particular order.

Summary

Page 66: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Search Simplifies Interaction

66

Explore

Navigate

Correlate

Experts know MapReduce. Savvy people know SQL. Everyone knows Search.

Page 67: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Summary  

• With Hadoop, it depends.
• The tools are out there.
• Open source software
• Many interconnected pieces
• Many unexplored opportunities
• A thriving community awaits you…
• Data can make a difference.
• Search allows everyone to interact with data.
• This is a Big Deal.

67

Page 68: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

What’s Next?

• Download Hadoop!
• Already done that? Contribute…
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Clone our repos!
  • https://github.com/cloudera

68

Page 69: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

69

Preferably related to the talk…

Questions?

Page 70: Search in the Apache Hadoop Ecosystem: Thoughts from the Field

70

Thank You!
Alex Moundalexis [email protected] @technmsg
We’re hiring, kids! Well, not kids.