Search in the Apache Hadoop Ecosystem: Thoughts from the Field

Post on 27-Jan-2015

Category:

Technology

DESCRIPTION

This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined to solve specific and sometimes very complex problems. Drawing upon case studies from the field, Mr. Moundalexis demonstrates that rigid, one-size-fits-all traditional systems do not suit every problem, and that combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.

TRANSCRIPT

1

Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Open Source Search Conference, November 2013
Alex Moundalexis
alexm@clouderagovt.com
@technmsg

2

Thoughts of a Former SA

3

Thoughts of a Former SA Field Guy

Disclaimer

• Technologies, not products
• Cloudera builds software
  • most donated to Apache
  • some closed-source
• I will likely mention “Cloudera Something”
• Cloudera “products” I reference are open source
  • Apache Licensed
  • Source code is on GitHub
    • https://github.com/cloudera

4

What This Talk Isn’t About

• Deploying
  • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning
  • Depends heavily on data and workload
• Coding
  • Algorithms

5

6

“The answer to most Hadoop questions is it depends.”

7

Quick and dirty, more time for use cases.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce
• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

8

Partial Ecosystem

9

[Diagram: Hadoop at the center. Inputs: external system, RDBMS / DWH, web server, device logs (via API access, log collection, DB table import). Inside Hadoop: batch processing, machine learning, Search, SQL. Outputs: external system, user, RDBMS / DWH (via API access, DB table export, BI tool + JDBC/ODBC).]

HDFS

• Distributed, highly fault-tolerant filesystem
• Optimized for large streaming access to data
• Based on Google File System
  • http://research.google.com/archive/gfs.html

10

Lots of Commodity Machines

11

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

• Programming paradigm
• Batch oriented, not realtime
• Works well with distributed computing
• Lots of Java, but other languages supported
• Based on Google’s paper
  • http://research.google.com/archive/mapreduce.html

12

Under the Covers

You specify map() and reduce() functions.

The framework does the rest.
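
As a rough illustration of that division of labor, here is a minimal single-process sketch in Python. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` only imitates the map, shuffle, and reduce phases that the real framework distributes across a cluster.

```python
from collections import defaultdict


def map_fn(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce step: combine all the values emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, group by key (shuffle), then reduce."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


result = run_job(["hadoop stores data", "hadoop processes data"])
```

Only `map_fn` and `reduce_fn` carry the problem-specific logic; everything in `run_job` is the part the framework would own.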

Apache HBase

• Random, realtime read/write access
• Key/value columnar store
• (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html

15

Apache Accumulo

• Random, realtime read/write access
• Key/value columnar store
• (b|tr)illions of rows/columns
• Based on Google BigTable
  • http://research.google.com/archive/bigtable.html
• Adds cell-level security
• Implemented by National Security Agency
  • Donated to ASF
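
To make "cell-level security" concrete, the following Python sketch filters individual cells by security label at read time. It is an illustration only, not Accumulo's API: the `Cell` class and `scan` function are hypothetical. Accumulo's real visibility expressions allow boolean combinations of labels, while this sketch assumes the simplest case, where a reader must hold every label attached to a cell.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    row: str
    column: str
    value: str
    required_labels: frozenset  # simplification: the reader must hold all of these


def scan(cells, authorizations):
    """Return only the cells whose labels are all in the reader's authorizations."""
    auths = frozenset(authorizations)
    return [cell for cell in cells if cell.required_labels <= auths]


cells = [
    Cell("patient1", "name", "A. Smith", frozenset()),
    Cell("patient1", "diagnosis", "flu", frozenset({"medical"})),
    Cell("patient1", "billing", "overdue", frozenset({"finance", "admin"})),
]
visible = [cell.column for cell in scan(cells, {"medical"})]
```

Two readers scanning the same row can therefore see different columns, which is the property that separates cell-level security from table-level permissions.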

16

Apache Hive & Pig

• Abstraction of Hadoop’s Java API
  • Hive is SQL-based
  • Pig is more data-flow oriented
• Eases analysis using MapReduce

17

Cloudera Impala

• SQL-based, but interactive response
• Backed by HDFS or HBase
• Allows for fast iteration/discovery
• Not as fault-tolerant as MapReduce

18

Apache Sqoop & Flume

• Get your data in and out of HDFS
• Sqoop focuses on relational databases
• Flume focuses on log files

19

Cloudera Hue

• Hadoop User Experience
• Hadoop is largely command line
• Hue provides a UI for end-users
• SDK to build your own apps on top

20

Apache Mahout

• Machine learning algorithms that run on MapReduce
  • Clustering
  • Classification
  • Filtering
• I didn’t study these algorithms in school
  • Data science people are excited
  • Math people are excited
  • I’m excited for them

21

Apache Tika

• Content analysis toolkit
• Simply put, a lot of parsers
• Detect/extract metadata/text from documents
  • HTML
  • XML
  • Office
  • PDF
  • mbox
  • More…

22

Apache ZooKeeper

• Distributed systems are HARD
  • Everyone was trying to implement the same subsystems
  • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services
  • Configuration
  • Naming
  • Synchronization
  • Group Services

23

Apache Oozie

• Workflow scheduling for Hadoop
• Like cron, but in directed graph fashion
• Out of box hooks:
  • MR
  • Pig
  • Hive
  • Sqoop
  • Impala
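
The "directed graph" idea can be sketched in a few lines of Python. Oozie itself defines workflows in XML, so this is not Oozie syntax: the action names and the `execution_order` helper are hypothetical, and the sketch only shows how dependencies between actions determine a valid run order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_index": {"pig_clean"},
    "impala_refresh": {"hive_aggregate"},
}


def execution_order(graph):
    """Return one order in which every action runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
```

Unlike a cron table, nothing here is tied to a clock: `hive_aggregate` and `mr_index` both become runnable as soon as `pig_clean` finishes.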

24

Sentry (incubating)

• Role-based access control for Hive/Impala/Solr
• Regulatory/compliance assurance
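
Role-based access control boils down to two lookups: which roles a group holds, and which privileges a role grants. The Python sketch below shows that shape only. The groups, roles, resource names, and `is_allowed` function are all hypothetical, and none of this is Sentry's actual policy format or API.

```python
# Hypothetical roles and groups, for illustration only.
ROLE_PRIVILEGES = {
    "analyst": {("sales_db", "select")},
    "etl": {("sales_db", "select"), ("sales_db", "insert")},
}
GROUP_ROLES = {
    "marketing": {"analyst"},
    "data_eng": {"etl"},
}


def is_allowed(group, resource, action):
    """A request is allowed if any role held by the group grants the privilege."""
    roles = GROUP_ROLES.get(group, set())
    return any(
        (resource, action) in ROLE_PRIVILEGES.get(role, set()) for role in roles
    )


marketing_can_read = is_allowed("marketing", "sales_db", "select")
marketing_can_write = is_allowed("marketing", "sales_db", "insert")
```

Because privileges attach to roles rather than to individual users, an auditor can answer "who can read this?" by inspecting two small tables.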

25

Cloudera Morphlines

• In-memory transformations
  • Load, parse, transform, process
  • Records as name-value pairs w/ optional blob/pojo objects
• Java library, embedded in your codebase
• Used to ETL data from Flume and MR into Solr
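
The record-through-a-chain-of-commands idea can be shown with plain Python dictionaries. Morphlines itself is a Java library, so this is an analogy and not its API: the command functions and `run_pipeline` are hypothetical names, and the "LEVEL message" log format is an assumption made for the example.

```python
def parse_log_line(record):
    """Split a raw 'LEVEL message' line into separate fields."""
    level, _, message = record["raw"].partition(" ")
    return {**record, "level": level, "message": message}


def lowercase_level(record):
    """Normalize the level field so downstream filters can match it."""
    return {**record, "level": record["level"].lower()}


def drop_debug(record):
    """Returning None drops the record from the pipeline."""
    return None if record["level"] == "debug" else record


def run_pipeline(records, commands):
    """Pass each record through every command in order, in memory."""
    for record in records:
        for command in commands:
            record = command(record)
            if record is None:
                break
        else:
            yield record  # reached only if no command dropped the record


docs = list(
    run_pipeline(
        [{"raw": "ERROR disk full"}, {"raw": "DEBUG heartbeat"}],
        [parse_log_line, lowercase_level, drop_debug],
    )
)
```

Each surviving record is a flat set of name-value pairs, ready to be handed to an indexer as a document.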

26

Apache Lucene

• Java-based index and search
  • Spellchecking
  • Hit highlighting
  • Tokenization
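
For readers new to indexing, the core data structure is an inverted index: a map from each token to the documents that contain it. The Python sketch below is a toy version under simple assumptions (lowercase alphanumeric tokens, AND-only queries, no ranking). It is not Lucene's API, and `tokenize`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query token (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)


index = build_index(
    {1: "Hadoop stores data", 2: "Solr searches data", 3: "Hadoop and Solr"}
)
hits = search(index, "hadoop data")
```

Lookups touch only the postings for the query tokens, which is why search stays fast as the document count grows.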

27

Apache Solr

• Enterprise search platform
• Based on Apache Lucene
  • Full-text search
  • Faceting
  • NRT indexing
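
Faceting means counting how many matching documents fall under each value of a field, so users can narrow results by clicking a category. The Python sketch below shows the idea on an in-memory list. The documents, the `source` field, and the `facet_counts` function are made up for illustration, and this is not how Solr is queried.

```python
from collections import Counter


def facet_counts(docs, field):
    """Count how many matching documents carry each value of one field."""
    return Counter(doc[field] for doc in docs if field in doc)


# Hypothetical search hits for the query "disk".
hits = [
    {"title": "Disk failure on node 12", "source": "tickets"},
    {"title": "Re: disk sizing", "source": "mailing_list"},
    {"title": "Disk quota errors", "source": "tickets"},
]
counts = facet_counts(hits, "source")
```

A search UI would render these counts next to the results, for example as a sidebar listing each source with its hit count.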

28

Apache SolrCloud

• Integration of Solr + ZooKeeper
• Provides for shard failover

29

Cloudera Search

• Based on Apache Solr (incl Lucene and SolrCloud)
• Fault-tolerance: collections backed by HDFS or HBase
• Integration galore:
  • HBase/Flume/MapReduce w/ Lucene
  • Hue w/ Solr
  • Avro w/ Tika
  • HDFS w/ Solr/Lucene
  • Sentry w/ Solr

30

Cloudera Search + Hue

31

Cloudera Search + Hue

32

33

Apologies, I swiped some pretty slides from marketing…

Why Search?

Search Design Strategy

34

One pool of data
One security framework
One set of system resources
One management interface

An Integrated Part of the Hadoop System

[Diagram: platform layers Storage, Integration, Resource Management, and Metadata. Storage: HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.) and HBase (RECORDS). Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R).]

Benefits of Search Integration

35

Improved Big Data ROI
• An interactive experience without technical knowledge
• Single data set for multiple computing frameworks

Faster Time to Insight
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs

Cost Efficiency
• Single scalable platform; no incremental investment
• No need for separate systems, storage

Solid Foundations & Reliability
• Solr in production environments for years
• Hadoop-powered reliability and scalability

36

So much software…

Making Decisions

That’s a Lot of Software

• 21 packages, depending on how you count
  • And there’s plenty more…
• How to decide what to use?

37

38

“The answer to most Hadoop questions is it depends.”

Some of the Big Issues

• Response time
• User interfaces
• Programming paradigm
• Input/output formats
• Use cases

39

Response Time

• MapReduce is batch oriented
  • Resilient to hardware failures
  • Robust scheduling options
• Impala is near-realtime
• HBase is realtime
  • Key/values are cached in memory
• Search can be (near-)realtime.
• Hybrid systems are common!

40

User Interfaces

• Java
  • MapReduce, HBase
• SQL
  • Hive, Impala
• Shell
  • Pig
• Natural Language / Free Text
  • Search

41

Data Constraints

• MapReduce
  • Paradigm takes some getting used to
  • Processing must accommodate format
• HBase
  • Columnar key/value store
  • Hue makes this easier
• Search
  • Indexing and display
  • Hue makes this easier

42

Input/Output Formats

• Know what they are… optional.
• Don’t know? That’s okay.
• Schema on read.
  • Be able to extract what you need
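
"Schema on read" means the raw data is stored as-is and a schema is applied only when someone reads it. The Python sketch below illustrates this with JSON lines. The sample records, field names, and `read_with_schema` function are assumptions made for the example and are not tied to any particular Hadoop file format.

```python
import json

# Raw records are stored untouched; note the two lines do not share all fields.
RAW_LINES = [
    '{"user": "alice", "bytes": "512", "path": "/index.html"}',
    '{"user": "bob", "path": "/search", "agent": "curl"}',
]


def read_with_schema(lines, schema):
    """Apply field names and types at read time; missing fields become None."""
    for line in lines:
        raw = json.loads(line)
        yield {
            name: (cast(raw[name]) if name in raw else None)
            for name, cast in schema.items()
        }


rows = list(read_with_schema(RAW_LINES, {"user": str, "bytes": int}))
```

A different question tomorrow can bring a different schema, such as `{"path": str, "agent": str}`, to the same stored lines without reloading anything.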

43

Lack of Use Case

• “Big Data” and Hadoop
  • They ENABLE you to solve problems
  • Won’t solve problems for you
  • Don’t know about your business logic
  • “Big” is bigger than you’re accustomed to…
• Have a plan
  • Bring your use cases
  • Bring your business questions

44

45

One typical Hadoop use case.

Index Generation/Serving

eBay – Cassini Project

• June 2012
  • 2B page views/day
  • 250M searches/day
  • 9 PB online
• Custom search indexes
  • Limited by field or time period

46

eBay – Cassini Project

• MapReduce to generate indexes
  • Customer history
  • Item fields: name, price, descriptions, etc
• Bulk import indexes into HBase, served
  • 15 TB in HBase, 1.2 TB daily import into HBase
• Ranking algorithms can take into account
  • More history
  • More fields
  • More customer-specific details

47

48

Some quick examples.

Search Use Cases

Search Use Cases

49

Powerful, proven search capabilities that let organizations:

• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data

Monsanto

50

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch

51

Cloudera: Internal Field Portal

Custom Aggregated Search

Cloudera – Internal Field Portal

• Single stop for field engineers
  • Mailing lists: public, private
  • Tickets: support, development, public ASF
  • Customer data: accounts, clusters, KB articles
  • Customer Clusters: configs, audits, logs, events
  • Books and papers
  • Discussion forums
• Dogfooding, yes
• Makes my life easier

52

Cloudera – Internal Field Portal

53

Cloudera – Internal Field Portal

• Varied fetchers/observers for web/API content
  • Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase
  • Each collection has collection-specific filters/fields
  • Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika
• Impala for analytics
• Future: Use MapReduce to ingest logs

54

55

Patterns & Predictions: Durkheim Project

Risk Classification & Predictive Analysis

56

Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

2012
US Combat Deaths AFG: 301

57

2012
US Combat Deaths AFG: 301
US Military Suicides: 349

58

2012
US Combat Deaths AFG: 301
US Military Suicides: 349
349 > 301

Patterns & Predictions – Durkheim Project

• Assessment of mental health risks
• Correlate veterans’ communications with suicide risk

59

Patterns & Predictions – Durkheim Project

• Build machine learning algorithms on MapReduce
• Train using expert knowledge
  • Keywords
  • Patterns
• Algorithm detects and assigns risk scores
• In what medium?

60

Patterns & Predictions – Durkheim Project

61

Image: http://www.flickr.com/photos/42586873@N00/3770782889/

Unstructured Clinical Notes

Patterns & Predictions – Durkheim Project

• Phase 1
  • 3 cohorts: non-psychiatric, psychiatric, suicide-positive
  • 100 clinical profiles per cohort
  • 65% accurate in predicting suicide risk in control group
• Phase 2
  • Text analytics of clinical records, opt-in social media
  • Goal of 100,000 veteran participants
  • Represents a huge increase of data
• Traditional enterprise search couldn’t scale

62

Patterns & Predictions – Durkheim Project

• Technologies
  • Hadoop
  • Search
    • Indexing of machine learning, backed by HBase for performance
    • Hue interface for non-technical users
    • Discovery of terms, keywords, risk factors in numerous facets
  • Impala
    • Deep SQL queries if/when interesting deviations are found
    • e.g. if the word “Molly” appeared in top 10 facets
    • Write some SQL to dig in, perhaps revise indexing scheme

63

Patterns & Predictions – Durkheim Project

• Currently
  • Monitoring
  • Analysis
• Future
  • Interventional study
  • Back our hopes with data…
• More detailed Case Study
  • http://goo.gl/3ZJMwS
  • http://durkheimproject.org/

64

65

Parting thoughts… in no particular order.

Summary

Search Simplifies Interaction

66

Explore

Navigate

Correlate

Experts know MapReduce. Savvy people know SQL.

Everyone knows Search.

Summary

• With Hadoop, it depends.
• The tools are out there.
• Open source software
  • Many interconnected pieces
  • Many unexplored opportunities
  • A thriving community awaits you…
• Data can make a difference.
• Search allows everyone to interact with data.
  • This is a Big Deal.

67

What’s Next?

• Download Hadoop!
  • Already done that? Contribute…
• CDH available at www.cloudera.com
• Cloudera provides pre-loaded VMs
  • http://tiny.cloudera.com/quickstartvm
• Clone our repos!
  • https://github.com/cloudera

68

69

Preferably related to the talk…

Questions?

70

Thank You!
Alex Moundalexis
alexm@clouderagovt.com
@technmsg

We’re hiring, kids! Well, not kids.
