Cloudera Search
Chris Putnam | Cloudera Systems Engineer
2 © Cloudera, Inc. All rights reserved.
Cloudera at a Glance
Hint: Not a cloud hosting company
One Platform, Many Workloads
Batch, Interactive, and Real-Time. Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
Process
• Ingest: Sqoop, Flume
• Transform: MapReduce, Hive, Pig, Spark
Discover
• Analytic Database: Impala
• Search: Solr
Model
• Machine Learning: SAS, R, Spark, Mahout
Serve
• NoSQL Database: HBase
• Streaming: Spark Streaming
Unlimited Storage: HDFS, HBase
System and Data Management: YARN, Cloudera Manager, Cloudera Navigator
Open Source, Open Standards
Open Standards are just as important as Open Source. Why does it matter?
• Sustainable Value
• Vendor Portability
• Ecosystem Compatibility
Every project in CDH is an Open Standard.
Vendor Support
Component (Founder)         Cloudera  Pivotal  MapR  Amazon  IBM  Hortonworks
Impala (Cloudera)           ✔         ✖        ✔     ✔       ✖    ✖
Spark (UC Berkeley)         ✔         ✔        ✔     ✔       ✔    ✔
Hue (Cloudera)              ✔         ✖        ✔     ✔       ✖    ✔
Sentry (Cloudera)           ✔         ✔        ✔     ✖       ✔    ✖
Flume (Cloudera)            ✔         ✔        ✔     ✖       ✔    ✔
Parquet (Cloudera/Twitter)  ✔         ✔        ✔     ✔       ✔    ✖
Sqoop (Cloudera)            ✔         ✔        ✔     ✔       ✔    ✔
Falcon (Hortonworks)        ✖         ✖        ✖     ✖       ✖    ✔
Knox (Hortonworks)          ✖         ✖        ✖     ✖       ✖    ✔
Tez (Hortonworks)           ✖         ✖        ✔     ✖       ✖    ✔
Ranger (Hortonworks)        ✖         ✖        ✖     ✖       ✖    ✔
ORCfile (Hortonworks)       ✖         ✖        ✖     ✖       ✖    ✔
Try It With Cloudera Live
cloudera.com/live
Featuring tutorials on:
CDH
Why is Search Awesome?
• A few people can write code for Spark or MapReduce
• A larger number of people can write SQL queries
• Nearly everyone can use a search engine
Search makes your organization's data accessible to everyone
Search as Part of a Workflow
With search on Hadoop, users can find data and do something with it, all in the same platform!
Common Use Cases
• Threat detection
• Active archive / accessible global knowledge base
• Data accuracy
• Streamlined cross-data type aggregation
• Richer customer profiling / ecommerce experience
• Interactive market segmenting / customer identification
• Expedited data modeling
What is Cloudera Search?
Relationship Between Cloudera Search and Apache Solr
• Apache Solr is the foundation of Cloudera Search
  • Proven technology that powers much of the internet
  • Active open source community
• Cloudera Search adds many additional capabilities
  • Integration with HDFS, MapReduce, HBase, and Flume
  • Support for file formats widely used with Hadoop
  • Dynamic Web-based dashboard and Search interface with Hue
  • Fine-grained access control through integration with Apache Sentry
The Heritage of Solr Search
Zookeeper
Doug Cutting – Cloudera Chief Architect
Cloudera Search Stack
• Querying API, Indexing API: User Services
• SolrCloud: Distributed Search Components
• Extraction, Mapping: Tika, Morphlines etc.
• Zookeeper: Configuration & Synchronization
• Solr: NoSQL Search Platform
• Lucene: Text Search Engine Library
• HDFS: Storage
Documents, Fields, Queries and Terms
Common Terms and Concepts in Solr
Query – A query is composed of the terms the user is interested in.
Document – Similar to a row in a database table. Documents are flexible in that a single file may contain multiple documents.
Metadata –
  Title    Author     Date      Summary   Body
  Game of  George R.  8/6/1996  Long ago  An ancien
What is an Index?
Document –
  ID: A   Name: Alice   Title: Manager   Bonus: $5,000
Schema –
  Id: string   Name: string   Title: string   Bonus: int
Index – Data structures optimized for quick lookups
  Name
    Alice: (a)
    Bruce: (b)
    Carol: (c)
    David: (d)
  Title
    Analyst: (d)
    Engineer: (b)
    Manager: (a, c)
Indexing – Process of capturing metadata from input and creating documents and indexes
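The Name and Title lookups on this slide can be sketched as a small inverted index. This is a minimal Python illustration, not how Solr or Lucene store an index internally. The four documents are read off the slide's index entries, and the names build_index and lookup are made up for the example.

```python
from collections import defaultdict

# Documents from the slide; only the Name and Title fields are indexed here.
documents = {
    "a": {"Name": "Alice", "Title": "Manager"},
    "b": {"Name": "Bruce", "Title": "Engineer"},
    "c": {"Name": "Carol", "Title": "Manager"},
    "d": {"Name": "David", "Title": "Analyst"},
}


def build_index(docs):
    """Map each field to {term: list of IDs of the documents containing it}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id in sorted(docs):
        for field, term in docs[doc_id].items():
            index[field][term].append(doc_id)
    return index


def lookup(index, field, term):
    """Return the IDs of documents whose field equals term (empty if none)."""
    return list(index.get(field, {}).get(term, []))


index = build_index(documents)
print(lookup(index, "Title", "Manager"))  # ['a', 'c']
```

A lookup is a dictionary access per field and term, so it does not scan the documents themselves, which is the point of "optimized for quick lookups".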
Collections and Shards
[Figure: a Collection made up of a Configuration and an Index, with the Index split into Shard 1 and Shard 2]
Sharding – Breaking the index into pieces which are then distributed amongst the cluster. This technique improves scalability and response time.
Collection – Collections are the discrete unit of search deployments. Nodes can host multiple collections.
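One way to picture sharding is to hash each document ID and use the result to pick a shard. The Python sketch below is a simplification under that assumption. SolrCloud's own document routing is more involved and is not reproduced here, and the names shard_for and distribute are hypothetical.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Pick a shard number from a stable hash of the document ID."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def distribute(doc_ids, num_shards):
    """Assign every document ID to exactly one shard."""
    shards = {n: [] for n in range(num_shards)}
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards


# Hypothetical document IDs spread over the two shards shown on the slide.
placement = distribute(["a", "b", "c", "d"], num_shards=2)
print(placement)
```

Because the hash is stable, the same document ID always lands on the same shard, so an update to a document reaches the shard that already holds it.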
How Queries are Served
1. Client request is given to any of the cluster members running Solr
2. The node receiving the request distributes the query to other members if needed (each node consulted during the query returns results for its one shard)
3. The initial node returns the results to the client
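The three steps above can be sketched as a scatter-gather function. This is a toy Python model, assuming each shard is an in-memory dictionary. It does not use the Solr API, and SHARDS, search_shard and distributed_search are names invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents: document ID -> text.
SHARDS = [
    {"a": "alice manager", "c": "carol manager"},
    {"b": "bruce engineer", "d": "david analyst"},
]


def search_shard(shard, term):
    """Step 2: each node returns matches only from its own shard."""
    return [doc_id for doc_id, text in shard.items() if term in text.split()]


def distributed_search(shards, term):
    """Steps 1 and 3: the receiving node fans the query out and merges results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda shard: search_shard(shard, term), shards))
    return sorted(doc_id for partial in partials for doc_id in partial)


print(distributed_search(SHARDS, "manager"))  # ['a', 'c']
```

The client only talks to one node. Which node that is does not matter, since any of them can play the coordinating role.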
Data Ingest / Index Creation
Indexing in Cloudera Search
• Near Real-Time Indexing
• Batch Indexing
• HBase Indexing
Extraction and Mapping
• Flume
• Morphlines
• Tika
Near Real-Time Indexing
[Figure: Flume pipeline. Events go to the Morphline Solr Sink; the raw events are optionally stored in HDFS]
• As events occur, they are picked up by a Flume agent and passed to the Morphline Solr Sink and optionally also to HDFS
• The Morphline Solr Sink updates or creates a Solr index from the events
• Events are searchable after being added to the Solr index
Streamlined Extraction and Mapping
Cloudera Morphlines
• Simple and flexible data transformation
• Reusable across multiple index workloads
• Over time, extend and re-use across platform workloads
[Figure: syslog → Flume Agent → Solr sink (Command: readLine → Command: grok → Command: loadSolr) → Solr. The data moves through the chain as Event → Record → Record → Record → Document]
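The readLine, grok and loadSolr chain can be imitated in a few lines of Python to show how an event becomes records and then a document. This is only an analogy: real morphlines are defined in a configuration file and run inside the Flume sink, not in Python. The sample syslog line, the regular expression and the list standing in for Solr are all assumptions made for the example.

```python
import re

# Assumed syslog layout: timestamp, host, program[pid]: message.
SYSLOG = re.compile(
    r"(?P<timestamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"(?P<program>\w+)\[(?P<pid>\d+)\]: (?P<message>.*)"
)


def read_line(event):
    """Like readLine: turn a raw event body into one record per non-empty line."""
    for line in event.splitlines():
        if line.strip():
            yield {"message": line}


def grok(records):
    """Like grok: extract named fields; records that do not match are dropped."""
    for record in records:
        match = SYSLOG.match(record["message"])
        if match:
            yield match.groupdict()


def load_solr(records, index):
    """Stand-in for loadSolr: append each finished document to a list."""
    for record in records:
        index.append(record)


solr_index = []
event = "Feb  4 10:46:14 host1 sshd[607]: listening on port 22\n"
load_solr(grok(read_line(event)), solr_index)
print(solr_index[0]["program"])  # sshd
```

Each command takes the records produced by the previous one, which is what makes a chain reusable: swapping the grok pattern changes the extraction without touching the reading or loading steps.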
Batch Indexing
[Figure: HDFS and a MapReduce job, with arrows numbered 1 to 4]
1. Data is stored in HDFS
2. Data is read by the MapReduce index job
3. An index is created
4. The index is stored back in HDFS as part of the collection
*Note: Spark can be used in lieu of MapReduce with the Crunch Indexer Tool
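The idea of building an index with a map step and a reduce step can be sketched in plain Python. This is a conceptual illustration only. It does not use Hadoop or the Cloudera indexing tools, the two input documents are hypothetical stand-ins for data stored in HDFS, and the result is a plain dictionary, not a Solr index.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in set(text.lower().split())]


def reduce_phase(pairs):
    """Group the emitted pairs into term -> sorted document IDs."""
    index = defaultdict(set)
    for term, doc_id in pairs:
        index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical input standing in for files stored in HDFS.
stored_data = {"doc1": "search on hadoop", "doc2": "hadoop batch indexing"}
pairs = chain.from_iterable(map_phase(i, t) for i, t in stored_data.items())
batch_index = reduce_phase(pairs)
print(batch_index["hadoop"])  # ['doc1', 'doc2']
```

The map step works on one document at a time, so it can run in parallel across the cluster, and only the grouping step needs to see pairs from more than one document.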
Searchable Real-Time Data
Secondary Indexes without Performance Impact
• Planet-sized tabular data
• Immediate access & updates
• Fast & flexible info discovery
[Figure: diagram with HDFS, HBase (interactive load, data updates), the Lily HBase Indexer (event listener), and four Solr servers forming SolrCloud]
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language
• Hadoop data types
• Maps, dashboards, timelines
• Index Designer
Architecture Overview
[Figure: architecture diagram. Labeled components: Data, Flume, HDFS (raw, filtered, or annotated data), MapReduce Batch Indexing, SolrCloud Cluster(s), HBase Cluster, Cloudera Manager, End User Client App (Hue). Labeled flows: data to be indexed, indexed data, GoLive updates, replication events to be indexed, search queries]
Use Cases
Monsanto
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch
Patterns and Predictions
Proactive healthcare for returning military veterans
Identify patterns in social media and perform analytics on term usage to improve mental health predictive capabilities
Before: Social media data sets too large; traditional enterprise search
Now: Near real-time correlation of medical records, notes, social media
Manufacturing and Supply Chain
Improving efficiency by identifying and addressing issues in near real-time
Search-driven enterprise data hub empowering 360-degree view of product quality and performance across the supply chain
Before: Diverse, disparate, and inconsistent quality data incompatible with RDBMS
Now: Rapidly index all raw data; relevant, interactive analysis in seconds; 1.5B+ documents for one customer; annual aggregate savings of USD 15-25M
Near Real-Time Indexing
[Figure: Flume pipeline. Tweets arrive through the Flume Twitter API source and go to the Morphline Solr Sink and to HDFS]
agent.sources.twitterSrc.type = org.apache.flume.source.twitter.TwitterSource
Thank You