
Cloudera Search

Chris Putnam | Cloudera Systems Engineer

© Cloudera, Inc. All rights reserved.

Cloudera at a Glance

Hint: not a cloud hosting company


One Platform, Many Workloads

Batch, Interactive, and Real-Time. Leading performance and usability in one platform.

•  End-to-end analytic workflows
•  Access more data
•  Work with data in new ways
•  Enable new users

Process
•  Ingest: Sqoop, Flume
•  Transform: MapReduce, Hive, Pig, Spark

Discover
•  Analytic Database: Impala
•  Search: Solr

Model
•  Machine Learning: SAS, R, Spark, Mahout

Serve
•  NoSQL Database: HBase
•  Streaming: Spark Streaming

Unlimited Storage: HDFS, HBase

System and Data Management: YARN, Cloudera Manager, Cloudera Navigator


Open Source, Open Standards

Open Standards are just as important as Open Source. Why does it matter?

•  Sustainable Value
•  Vendor Portability
•  Ecosystem Compatibility

Every project in CDH is an Open Standard.

Vendor Support

Component (Founder)         Cloudera  Pivotal  MapR  Amazon  IBM  Hortonworks
Impala (Cloudera)           ✔         ✖        ✔     ✔       ✖    ✖
Spark (UC Berkeley)         ✔         ✔        ✔     ✔       ✔    ✔
Hue (Cloudera)              ✔         ✖        ✔     ✔       ✖    ✔
Sentry (Cloudera)           ✔         ✔        ✔     ✖       ✔    ✖
Flume (Cloudera)            ✔         ✔        ✔     ✖       ✔    ✔
Parquet (Cloudera/Twitter)  ✔         ✔        ✔     ✔       ✔    ✖
Sqoop (Cloudera)            ✔         ✔        ✔     ✔       ✔    ✔
Falcon (Hortonworks)        ✖         ✖        ✖     ✖       ✖    ✔
Knox (Hortonworks)          ✖         ✖        ✖     ✖       ✖    ✔
Tez (Hortonworks)           ✖         ✖        ✔     ✖       ✖    ✔
Ranger (Hortonworks)        ✖         ✖        ✖     ✖       ✖    ✔
ORCfile (Hortonworks)       ✖         ✖        ✖     ✖       ✖    ✔


Try It With Cloudera Live

cloudera.com/live

Featuring tutorials on:

CDH


Why is Search Awesome

• A few people can write code for Spark or MapReduce
• A larger number of people can write SQL queries
• Nearly everyone can use a search engine

Search makes your organization's data accessible to everyone


Search as Part of a Workflow

With search on Hadoop, users can find data and do something with it – in the same platform!


Common Use Cases

• Threat detection
• Active archive / accessible global knowledge base
• Data accuracy
• Streamlined cross-data type aggregation
• Richer customer profiling / ecommerce experience
• Interactive market segmenting / customer identification
• Expedited data modeling


What is Cloudera Search?


Relationship Between Cloudera Search and Apache Solr

• Apache Solr is the foundation of Cloudera Search
  • Proven technology that powers much of the internet
  • Active open source community

• Cloudera Search adds many additional capabilities
  • Integration with HDFS, MapReduce, HBase, and Flume
  • Support for file formats widely used with Hadoop
  • Dynamic Web-based dashboard and Search interface with Hue
  • Fine-grained access control through integration with Apache Sentry


The Heritage of Solr Search

Zookeeper

Doug Cutting – Cloudera Chief Architect


Cloudera Search Stack

HDFS: Storage

Lucene: Text Search Engine Library

Solr: NoSQL Search Platform

Zookeeper: Configuration & Synchronization

Extraction Mapping: Tika, Morphlines etc.

SolrCloud: Distributed Search Components

Querying API, Indexing API: User Services


Documents, Fields, Queries and Terms

Common Terms and Concepts in Solr

Query – A query is composed of the terms the user is interested in.

Document – Similar to a row in a database table. Documents are flexible in that a single file may contain multiple documents.

Title     Author      Date       Summary    Body
Game of   George R.   8/6/1996   Long ago   An ancien

Meta-data


Index

What is an Index

Document –

ID: A   Name: Alice   Title: Manager   Bonus: $5,000

Index – Data structures optimized for quick lookups

Name            Title
Alice: (a)      Analyst: (d)
Bruce: (b)      Engineer: (b)
Carol: (c)      Manager: (a, c)
David: (d)

Schema –

Id: string   Name: string   Title: string   Bonus: int

Indexing – Process of capturing metadata from input and creating documents and indexes
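
The Name and Title lookups on this slide can be reproduced with a small inverted index. The following is a minimal Python sketch, not how Lucene or Solr store an index internally. The rows for documents b, c and d are inferred from the index entries on the slide, and the Bonus field is left out because only Alice's value is given. `DOCUMENTS` and `build_index` are hypothetical names.

```python
from collections import defaultdict

# Sample documents inferred from the slide's index entries (illustrative only).
DOCUMENTS = {
    "a": {"Name": "Alice", "Title": "Manager"},
    "b": {"Name": "Bruce", "Title": "Engineer"},
    "c": {"Name": "Carol", "Title": "Manager"},
    "d": {"Name": "David", "Title": "Analyst"},
}


def build_index(documents):
    """Return {field: {term: [document IDs containing that term]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(documents):
        for field, value in documents[doc_id].items():
            index[field][value].append(doc_id)
    return index


index = build_index(DOCUMENTS)
print(index["Title"]["Manager"])  # ['a', 'c']
print(index["Name"]["Bruce"])     # ['b']
```

A lookup by term is a single dictionary access, which is why the slide describes an index as optimized for quick lookups.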


Collections and Shards

Configuration Index

Collection

Shard 1

Index

Shard 2

Sharding – Breaking the index into pieces which are then distributed amongst the cluster. This technique improves scalability and response time.

Collection – Collections are the discrete unit of search deployments. Nodes can host multiple collections.
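
One simple way to picture sharding is to hash each document ID and use the result to pick a shard. The sketch below assumes a CRC32 hash and a modulo step purely for illustration. The slides do not describe SolrCloud's own document routing, which may work differently. `shard_for` and `partition` are hypothetical helpers.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard deterministically from the document ID (illustrative hash)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_ids, num_shards):
    """Group document IDs by the shard they are assigned to."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


shards = partition(["a", "b", "c", "d"], num_shards=2)
for shard_number, members in shards.items():
    print(f"Shard {shard_number + 1}: {members}")
```

Because the assignment depends only on the document ID, any node can work out which shard holds a given document without asking the others.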


How Queries are Served

1.  The client request is given to any of the cluster members running Solr

2.  The node receiving the request distributes the query to other members if needed (each node consulted during the query returns results for its one shard)

3.  The initial node returns results to the client
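
The three steps above follow a scatter-gather pattern. It can be sketched in Python with in-memory dictionaries standing in for shards. The shard contents, scores and function names below are made up for illustration, and real SolrCloud nodes exchange requests over the network rather than through function calls.

```python
# Each shard is a tiny index: term -> list of (doc_id, score). Values are invented.
SHARDS = [
    {"hadoop": [("doc1", 0.9), ("doc4", 0.4)]},
    {"hadoop": [("doc2", 0.7)], "solr": [("doc3", 0.8)]},
]


def search_shard(shard, term):
    """Step 2: a consulted node returns the results for its own shard."""
    return shard.get(term, [])


def distributed_search(shards, term, rows=10):
    """Steps 1-3: receive the request, fan out to the shards, merge, reply."""
    hits = []
    for shard in shards:
        hits.extend(search_shard(shard, term))
    hits.sort(key=lambda hit: hit[1], reverse=True)  # best score first
    return hits[:rows]


results = distributed_search(SHARDS, "hadoop")
print(results)  # [('doc1', 0.9), ('doc2', 0.7), ('doc4', 0.4)]
```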


Data Ingest / Index Creation


Indexing in Cloudera Search

• Near Real-Time Indexing
• Batch Indexing
• HBase Indexing

Extraction and Mapping

• Flume
• Morphlines
• Tika


Near Real-Time Indexing

HDFS

Events

Morphline Solr Sink

Optional: raw events stored in HDFS

As events occur, they are picked up by a Flume agent and passed to the Morphline Solr Sink and, optionally, to HDFS.

The Morphline Solr Sink creates or updates a Solr index from the events.

Events are searchable after being added to the Solr index.

Flume Pipeline


Streamlined Extraction and Mapping

Cloudera Morphlines

•  Simple and flexible data transformation

•  Reusable across multiple index workloads

•  Over time, extend and re-use across platform workloads

Pipeline: syslog → Flume Agent → Solr sink (Command: readLine → Command: grok → Command: loadSolr) → Solr

Data passed between stages: Event → Record → Record → Record → Document
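
The chain of commands can be pictured as a series of small functions, each taking a record and handing a transformed record to the next. The Python sketch below only borrows the command names from the slide. It is not the Morphlines API, and real morphlines are defined in configuration files rather than Python. The syslog pattern and the sample event are assumptions made for illustration.

```python
import re

# Simplified syslog pattern, assumed for this example only.
SYSLOG_PATTERN = re.compile(
    r"^<(?P<priority>\d+)>"
    r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<message>.*)$"
)


def read_line(event_body: bytes):
    """Like readLine: turn an event body into one record per line."""
    for line in event_body.decode("utf-8").splitlines():
        yield {"message": line}


def grok(record):
    """Like grok: extract named fields with a pattern; None if no match."""
    match = SYSLOG_PATTERN.match(record["message"])
    return match.groupdict() if match else None


def load_solr(record, sink):
    """Like loadSolr: hand the finished document to the index (a list here)."""
    sink.append(record)


def run_pipeline(event_body, sink):
    for record in read_line(event_body):
        parsed = grok(record)
        if parsed is not None:
            load_solr(parsed, sink)


solr_docs = []
run_pipeline(b"<34>Oct 11 22:14:15 host1 su: failed login\n", solr_docs)
print(solr_docs[0]["host"])  # host1
```

Keeping each step small and independent is what makes a chain like this reusable across different indexing workloads.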


Batch Indexing

HDFS Map Reduce Job

1.  Data is stored in HDFS
2.  Data is read by the MapReduce index job
3.  An index is created
4.  The index is stored back in HDFS as part of the collection

*Note: Spark can be used in lieu of MapReduce with the Crunch Indexer Tool
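
The batch flow can be sketched as a map step, a sort, and a reduce step. The Python below runs everything in one process with a dictionary standing in for data stored in HDFS. It illustrates the MapReduce pattern only, not the actual indexer job, and the sample documents and function names are hypothetical.

```python
from itertools import groupby


def map_doc(doc_id, text):
    """Map: emit one (term, doc_id) pair per distinct term in the document."""
    for term in sorted(set(text.lower().split())):
        yield term, doc_id


def reduce_term(term, doc_ids):
    """Reduce: collect every document ID that contains the term."""
    return term, sorted(doc_ids)


def batch_index(stored_docs):
    # Steps 1-2: read the stored data and run the map step over it.
    pairs = [
        pair
        for doc_id, text in stored_docs.items()
        for pair in map_doc(doc_id, text)
    ]
    pairs.sort()  # stand-in for the shuffle/sort between map and reduce
    # Step 3: build the index, one reduce call per term.
    index = {}
    for term, group in groupby(pairs, key=lambda pair: pair[0]):
        key, postings = reduce_term(term, [doc_id for _, doc_id in group])
        index[key] = postings
    return index  # Step 4 would write this back to storage


stored_docs = {"d1": "Search on Hadoop", "d2": "Hadoop batch indexing"}
batch_result = batch_index(stored_docs)
print(batch_result["hadoop"])  # ['d1', 'd2']
```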


Searchable Real-Time Data

HDFS

HBase

interactive load

Solr server Solr server Solr server Solr server SolrCloud

Event Listener

planet-sized tabular data + immediate access & updates = fast & flexible info discovery

Secondary Indexes without Performance Impact

Data Updates

Lily HBase Indexer


Simple, Customizable Search Interface

Hue

•  Simple UI
•  Navigated, faceted drill down
•  Customizable display
•  Full text search, standard Solr API and query language
•  Hadoop data types
•  Maps, dashboards, timelines
•  Index Designer
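
Since the search interface uses the standard Solr API and query language, a faceted search can be expressed as an ordinary Solr select request. The Python sketch below only builds such a URL. `q`, `rows`, `wt`, `facet` and `facet.field` are standard Solr query parameters, while the host, port, collection name (`logs`) and field name (`host`) are assumptions chosen for illustration.

```python
from urllib.parse import urlencode


def build_facet_query(base_url, collection, text, facet_field, rows=10):
    """Build a Solr select URL for a full-text query with one facet field."""
    params = {
        "q": text,                   # full-text query
        "rows": rows,                # number of documents to return
        "wt": "json",                # response format
        "facet": "true",             # turn faceting on
        "facet.field": facet_field,  # field to count values for
    }
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection and field names.
url = build_facet_query("http://localhost:8983", "logs", "failed login", "host")
print(url)
```

The facet counts returned for a field are what a dashboard can use to offer drill-down choices.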


Architecture Overview

Data

End User Client App (Hue)

Flume

HDFS

Raw, filtered, or annotated data

SolrCloud Cluster(s) Data to be indexed

Indexed data

MapReduce Batch Indexing

GoLive updates

HBase Cluster Replication Events to be indexed

Data

Cloudera Manager

Search queries


Use Cases


Monsanto

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch


Patterns and Predictions

Proactive healthcare for returning military veterans

Identify patterns in social media and perform analytics on term usage to improve mental health predictive capabilities

Before: Social media data sets too large; traditional enterprise search

Now: Near real-time correlation of medical records, notes, social media


Manufacturing and Supply Chain

Improving efficiency by identifying and addressing issues in near real-time

Search-driven enterprise data hub empowering 360-degree view of product quality and performance across the supply chain

Before: Diverse, disparate, and inconsistent quality data incompatible with RDBMS

Now: Rapidly index all raw data; relevant, interactive analysis in seconds; 1.5B+ documents for one customer; annual aggregate savings of USD 15-25M


Near Real-Time Indexing

HDFS

Tweets

Morphline Solr Sink

Flume Pipeline

Flume Twitter API source

agent.sources.twitterSrc.type = org.apache.flume.source.twitter.TwitterSource

Thank You