Page 1

Cloudera Search

Chris Putnam | Cloudera Systems Engineer

Page 2

© Cloudera, Inc. All rights reserved.

Cloudera at a Glance

Hint: not a cloud hosting company

Page 3

One Platform, Many Workloads

Batch, Interactive, and Real-Time. Leading performance and usability in one platform.

• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users

Process
• Ingest: Sqoop, Flume
• Transform: MapReduce, Hive, Pig, Spark

Discover
• Analytic Database: Impala
• Search: Solr

Model
• Machine Learning: SAS, R, Spark, Mahout

Serve
• NoSQL Database: HBase
• Streaming: Spark Streaming

Unlimited Storage: HDFS, HBase

System and Data Management: YARN, Cloudera Manager, Cloudera Navigator

Page 4

Open Source, Open Standards

Open Standards are just as important as Open Source. Why does it matter?

• Sustainable Value
• Vendor Portability
• Ecosystem Compatibility

Every project in CDH is an Open Standard.

Vendor Support

Component (Founder)          Cloudera  Pivotal  MapR  Amazon  IBM  Hortonworks
Impala (Cloudera)            ✔         ✖        ✔     ✔       ✖    ✖
Spark (UC Berkeley)          ✔         ✔        ✔     ✔       ✔    ✔
Hue (Cloudera)               ✔         ✖        ✔     ✔       ✖    ✔
Sentry (Cloudera)            ✔         ✔        ✔     ✖       ✔    ✖
Flume (Cloudera)             ✔         ✔        ✔     ✖       ✔    ✔
Parquet (Cloudera/Twitter)   ✔         ✔        ✔     ✔       ✔    ✖
Sqoop (Cloudera)             ✔         ✔        ✔     ✔       ✔    ✔
Falcon (Hortonworks)         ✖         ✖        ✖     ✖       ✖    ✔
Knox (Hortonworks)           ✖         ✖        ✖     ✖       ✖    ✔
Tez (Hortonworks)            ✖         ✖        ✔     ✖       ✖    ✔
Ranger (Hortonworks)         ✖         ✖        ✖     ✖       ✖    ✔
ORCfile (Hortonworks)        ✖         ✖        ✖     ✖       ✖    ✔

Page 5

Try It With Cloudera Live

cloudera.com/live

Featuring tutorials on:

CDH

Page 6

Why is Search Awesome?

• A few people can write code for Spark or MapReduce
• A larger number of people can write SQL queries
• Nearly everyone can use a search engine

Search makes your organization's data accessible to everyone

Page 7

Search as Part of a Workflow

With search on Hadoop, users can find data and do something with it, all in the same platform!

Page 8

Common Use Cases

• Threat detection
• Active archive / accessible global knowledge base
• Data accuracy
• Streamlined cross-data type aggregation
• Richer customer profiling / ecommerce experience
• Interactive market segmenting / customer identification
• Expedited data modeling

Page 9

What is Cloudera Search?

Page 10

Relationship Between Cloudera Search and Apache Solr

• Apache Solr is the foundation of Cloudera Search
    • Proven technology that powers much of the internet
    • Active open source community

• Cloudera Search adds many additional capabilities
    • Integration with HDFS, MapReduce, HBase, and Flume
    • Support for file formats widely used with Hadoop
    • Dynamic Web-based dashboard and Search interface with Hue
    • Fine-grained access control through integration with Apache Sentry

Page 11

The Heritage of Solr Search

ZooKeeper

Doug Cutting, Cloudera Chief Architect

Page 12

Cloudera Search Stack

• HDFS: Storage
• Lucene: Text Search Engine Library
• Solr: NoSQL Search Platform
• ZooKeeper: Configuration & Synchronization
• Extraction, Mapping: Tika, Morphlines etc.
• SolrCloud: Distributed Search Components
• Querying API, Indexing API: User Services

Page 13

Documents, Fields, Queries and Terms
Common Terms and Concepts in Solr

Query: A set of terms the user is interested in.

Document: Similar to a row in a database table. Documents are flexible in that a single file may contain multiple documents.

Title     Author      Date       Summary    Body
Game of   George R.   8/6/1996   Long ago   An ancien

(Meta-data)

Page 14

Index
What is an Index

Document:

ID   Name    Title     Bonus
A    Alice   Manager   $5,000

Schema:

Id: string   Name: string   Title: string   Bonus: int

Index: Data structures optimized for quick lookups

Name            Title
Alice: (a)      Analyst: (d)
Bruce: (b)      Engineer: (b)
Carol: (c)      Manager: (a, c)
David: (d)

Indexing: Process of capturing metadata from input and creating documents and indexes
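The Name and Title lookups on this slide can be reproduced in a few lines of Python. This is a minimal sketch of the idea, not how Lucene actually stores an index: the function name `build_inverted_index` is an assumption for illustration, the document IDs follow the slide's (a) to (d) labels, and the Bonus field is left out because the slide only shows Alice's value.

```python
from collections import defaultdict

# Documents keyed by ID; names and titles are read back from the slide's index.
documents = {
    "a": {"Name": "Alice", "Title": "Manager"},
    "b": {"Name": "Bruce", "Title": "Engineer"},
    "c": {"Name": "Carol", "Title": "Manager"},
    "d": {"Name": "David", "Title": "Analyst"},
}

def build_inverted_index(docs):
    """Map field -> value -> list of IDs of the documents holding that value."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, value in docs[doc_id].items():
            index[field][value].append(doc_id)
    return index

employee_index = build_inverted_index(documents)
# IDs of every document whose Title is Manager
print(employee_index["Title"]["Manager"])
```

Answering "who is a Manager?" becomes a dictionary lookup rather than a scan over every document, which is the sense in which an index is optimized for quick lookups.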

Page 15

Collections and Shards

[Diagram: a Collection holds a Configuration and two shards, Shard 1 and Shard 2, each with its own Index]

Sharding: Breaking the index into pieces which are then distributed amongst the cluster. This technique improves scalability and response time.

Collection: Collections are the discrete unit of search deployments. Nodes can host multiple collections.
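To illustrate sharding, the sketch below assigns documents to shards by hashing their IDs. It is a simplified stand-in under stated assumptions: SolrCloud has its own document router, which is not reproduced here, and the names `shard_for` and `distribute` as well as the `doc-N` IDs are hypothetical.

```python
import hashlib

def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard from a stable hash of the document ID (simplified routing)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards

def distribute(doc_ids, num_shards):
    """Group document IDs into per-shard lists."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards

shards = distribute([f"doc-{i}" for i in range(1000)], num_shards=2)
for shard, ids in shards.items():
    print(f"shard {shard + 1}: {len(ids)} documents")
```

Because the hash is stable, the same document always lands on the same shard, and each shard ends up indexing only a fraction of the collection.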

Page 16

How Queries are Served

1. The client request is given to any of the cluster members running Solr.
2. The node receiving the request distributes the query to other members if needed. (Each node consulted during the query returns results for its one shard.)
3. The initial node returns the results to the client.
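The three steps can be sketched as a scatter-gather loop. This is an illustration only: the shard contents, the term-frequency scoring, and the function names are assumptions, and the fan-out here is sequential, whereas a real cluster would query nodes over the network in parallel.

```python
# Hypothetical per-shard data: doc ID -> text. Each node serves one shard.
SHARDS = [
    {"a": "alice manager", "b": "bruce engineer"},
    {"c": "carol manager", "d": "david analyst"},
]

def search_shard(shard, term):
    """Step 2 on one node: score the local documents by term frequency."""
    hits = []
    for doc_id, text in shard.items():
        score = text.split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits

def search(term, rows=10):
    """Steps 1 to 3 on the receiving node: fan out, merge, return the top rows."""
    partial = [hit for shard in SHARDS for hit in search_shard(shard, term)]
    ranked = sorted(partial, key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in ranked[:rows]]

print(search("manager"))
```

The client only ever talks to one node; merging the per-shard results is that node's job.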

Page 17

Data Ingest / Index Creation

Page 18

Indexing in Cloudera Search

• Near Real-Time Indexing
• Batch Indexing
• HBase Indexing

Extraction and Mapping

• Flume
• Morphlines
• Tika

Page 19

Near Real-Time Indexing

• As events occur, they are picked up by a Flume agent and passed to the Morphline Solr Sink and, optionally, also to HDFS.
• The Morphline Solr Sink creates or updates a Solr index from the events.
• Events are searchable after being added to the Solr index.

[Diagram labels: Flume Pipeline, Events, Morphline Solr Sink, HDFS, Optional Raw Event Stored in HDFS]

Page 20

Streamlined Extraction and Mapping

Cloudera Morphlines

• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads

[Diagram: syslog -> Flume Agent -> Solr sink -> Command: readLine -> Command: grok -> Command: loadSolr -> Solr]

[Data passed along the chain: Event -> Record -> Record -> Record -> Document]
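The readLine, grok, and loadSolr chain can be imitated in plain Python to show how an event becomes records and then a document. This is a sketch under stated assumptions, not the Morphlines API: the function names, the simplified syslog pattern, the sample log line, and the list standing in for Solr are all hypothetical.

```python
import re

# Simplified syslog pattern; real grok dictionaries are far richer.
SYSLOG = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ [\d:]{8}) (?P<host>\S+) "
    r"(?P<program>[^:\[]+)(?:\[(?P<pid>\d+)\])?: (?P<message>.*)"
)

def read_line(event: bytes):
    """Like readLine: turn each line of an event body into a record."""
    for line in event.decode("utf-8").splitlines():
        yield {"message": line}

def grok(record):
    """Like grok: extract named fields with a regex; None if the line does not match."""
    match = SYSLOG.match(record["message"])
    if match is None:
        return None
    return {key: value for key, value in match.groupdict().items() if value is not None}

def load_solr(record, index):
    """Stand-in for loadSolr: append to a list instead of sending to Solr."""
    index.append(record)

def run_morphline(event, index):
    """Run the three commands in order, dropping records that fail to parse."""
    for record in read_line(event):
        parsed = grok(record)
        if parsed is not None:
            load_solr(parsed, index)

solr_index = []
run_morphline(b"Feb  4 10:46:14 syslog-host sshd[607]: listening on 0.0.0.0 port 22", solr_index)
print(solr_index[0]["program"], solr_index[0]["pid"])
```

Each command takes a record and hands a transformed record to the next, which is what makes a chain reusable across different ingest paths.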

Page 21

Batch Indexing

1. Data is stored in HDFS
2. Data is read by the MapReduce index job
3. An index is created
4. The index is stored back in HDFS as part of the collection

[Diagram: HDFS and the MapReduce job, with arrows numbered 1 to 4]

*Note: Spark can be used in lieu of MapReduce with the Crunch Indexer Tool
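The four steps map onto a map phase and a reduce phase. The sketch below shows only that shape, with an in-memory dictionary standing in for HDFS; the file paths, function names, and whitespace tokenization are assumptions, and the real indexing job builds Solr index shards rather than a Python dictionary.

```python
from collections import defaultdict
from itertools import chain

# Step 1: stand-in for files stored in HDFS (path -> content). Paths are hypothetical.
files = {
    "/data/a.txt": "search on hadoop",
    "/data/b.txt": "batch indexing on hadoop",
}

def map_phase(path, text):
    """Step 2: read one input and emit (term, path) pairs."""
    return [(term, path) for term in set(text.split())]

def reduce_phase(pairs):
    """Step 3: group the pairs by term into sorted posting lists."""
    postings = defaultdict(set)
    for term, path in pairs:
        postings[term].add(path)
    return {term: sorted(paths) for term, paths in postings.items()}

pairs = chain.from_iterable(map_phase(path, text) for path, text in files.items())
batch_index = reduce_phase(pairs)  # step 4 would write this back to HDFS
print(batch_index["hadoop"])
```

Because each input is mapped independently, the work can be spread across the cluster, which is why batch indexing suits large archives.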

Page 22

Searchable Real-Time Data

Secondary Indexes without Performance Impact

• Planet-sized tabular data
• Immediate access & updates
• Fast & flexible info discovery

[Diagram labels: interactive load, Data Updates, HBase, HDFS, Event Listener, Lily HBase Indexer, SolrCloud (four Solr servers)]
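One way to read "without performance impact" is that indexing happens off the write path: the table only records an event, and a separate listener turns events into index updates later. The sketch below illustrates that pattern with toy classes; `Table`, `Indexer`, and `drain` are hypothetical names, not the HBase or Lily HBase Indexer API.

```python
from collections import deque

class Table:
    """Toy stand-in for an HBase table that logs every write as an event."""
    def __init__(self):
        self.rows = {}
        self.events = deque()  # stand-in for the stream of update events

    def put(self, row_key, values):
        self.rows[row_key] = values
        self.events.append((row_key, values))  # the write path does no indexing

class Indexer:
    """Toy stand-in for the indexer: consumes events and updates a search index."""
    def __init__(self, table):
        self.table = table
        self.index = {}

    def drain(self):
        """Process all pending events, oldest first."""
        while self.table.events:
            row_key, values = self.table.events.popleft()
            self.index[row_key] = values

table = Table()
indexer = Indexer(table)
table.put("row1", {"name": "Alice"})
print("row1" in indexer.index)  # written, but not yet indexed
indexer.drain()
print("row1" in indexer.index)  # searchable once the event is processed
```

The trade-off is a short delay between a write and its visibility in search results.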

Page 23

Simple, Customizable Search Interface

Hue

• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
• Hadoop data types
• Maps, dashboards, timelines
• Index Designer
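Faceted drill down boils down to two operations: counting the values of a field across the current results, then narrowing the results to one chosen value. The sketch below shows both on a hypothetical result set modeled on the employee records from the index slide; the function names are assumptions, and a real dashboard would request facet counts from Solr instead of computing them client-side.

```python
from collections import Counter

# Hypothetical result set, modeled on the employee records from the index slide.
docs = [
    {"name": "Alice", "title": "Manager"},
    {"name": "Bruce", "title": "Engineer"},
    {"name": "Carol", "title": "Manager"},
    {"name": "David", "title": "Analyst"},
]

def facet_counts(documents, field):
    """Count how many documents carry each value of a field."""
    return Counter(doc[field] for doc in documents)

def drill_down(documents, field, value):
    """Narrow the result set to documents matching the chosen facet value."""
    return [doc for doc in documents if doc[field] == value]

print(facet_counts(docs, "title").most_common(1))
managers = drill_down(docs, "title", "Manager")
print([doc["name"] for doc in managers])
```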

Page 24

Architecture Overview

[Diagram components: Data, Flume, HDFS, HBase Cluster, MapReduce Batch Indexing, SolrCloud Cluster(s), Cloudera Manager, End User Client App (Hue)]

[Diagram flow labels: Raw, filtered, or annotated data; Data to be indexed; Indexed data; GoLive updates; Replication Events to be indexed; Data; Search queries]

Page 25

Use Cases

Page 26

Monsanto

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch

Page 27

Patterns and Predictions

Proactive healthcare for returning military veterans

Identify patterns in social media and perform analytics on term usage to improve mental health predictive capabilities

Before: Social media data sets too large; traditional enterprise search

Now: Near real-time correlation of medical records, notes, social media

Page 28

Manufacturing and Supply Chain

Improving efficiency by identifying and addressing issues in near real-time

Search-driven enterprise data hub empowering 360-degree view of product quality and performance across the supply chain

Before: Diverse, disparate, and inconsistent quality data incompatible with RDBMS

Now: Rapidly index all raw data; relevant, interactive analysis in seconds; 1.5B+ documents for one customer; annual aggregate savings of USD 15-25M

Page 29

Near Real-Time Indexing

[Diagram labels: Flume Pipeline, Tweets, Flume Twitter API source, Morphline Solr Sink, HDFS]

agent.sources.twitterSrc.type = org.apache.flume.source.twitter.TwitterSource

Page 30

Thank You