search in the apache hadoop ecosystem: thoughts from the field

Search in the Apache Hadoop Ecosystem: Thoughts from the Field Open Source Search Conference, November 2013 Alex Moundalexis alexm@clouderagovt.com @technmsg

Thoughts of a Former SA

Thoughts of a Former SA Field Guy

Disclaimer

•  Technologies, not products •  Cloudera builds software

•  most donated to Apache •  some closed-source

•  I will likely mention “Cloudera Something” •  Cloudera “products” I reference are open source

•  Apache Licensed •  Source code is on GitHub

•  https://github.com/cloudera

What This Talk Isn’t About

•  Deploying •  Puppet, Chef, Ansible, homegrown scripts, intern labor

•  Sizing & Tuning •  Depends heavily on data and workload

•  Coding •  Algorithms

“The answer to most Hadoop questions is it depends.”

Quick and dirty, more time for use cases.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop •  HDFS •  MapReduce

•  Today, dozens of interrelated components •  I/O •  Processing •  Specialty Applications •  Configuration •  Workflow

Partial Ecosystem

[Diagram: Hadoop at the center. Inputs: external system, RDBMS / DWH, web server, device logs. Ingest paths: API access, log collection, DB table import. Inside Hadoop: batch processing, machine learning. Outputs: external system (API access), RDBMS / DWH (DB table export), BI tool + JDBC/ODBC, Search.]

HDFS

•  Distributed, highly fault-tolerant filesystem •  Optimized for large streaming access to data •  Based on Google File System

•  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [ OSCON ’07 ]

MapReduce (MR)

•  Programming paradigm •  Batch oriented, not realtime •  Works well with distributed computing •  Lots of Java, but other languages supported •  Based on Google’s paper

•  http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions.

The framework does the rest.
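To make that split concrete, here is a minimal single-process sketch of a word count. It is not Hadoop's Java API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` only imitates the grouping-by-key step that the real framework performs across many machines.

```python
from collections import defaultdict

def map_fn(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Combine all values that were emitted for one key."""
    return word, sum(counts)

def run_job(lines):
    """Stand-in for the framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

result = run_job(["search in hadoop", "hadoop search at scale"])
print(result)
```

Only the two small functions carry the business logic; scheduling, data movement, and retries are what the framework would add.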

Apache HBase

•  Random, realtime read/write access •  Key/value columnar store •  (b|tr)illions of rows/columns •  Based on Google BigTable

•  http://research.google.com/archive/bigtable.html

Apache Accumulo

•  Random, realtime read/write access •  Key/value columnar store •  (b|tr)illions of rows/columns •  Based on Google BigTable

•  http://research.google.com/archive/bigtable.html

•  Adds cell-level security •  Implemented by National Security Agency

•  Donated to ASF

Apache Hive & Pig

•  Abstraction of Hadoop’s Java API •  Hive is SQL-based •  Pig is more data-flow oriented

•  Eases analysis using MapReduce

Cloudera Impala

•  SQL-based, but interactive response •  Backed by HDFS or HBase •  Allows for fast iteration/discovery •  Not as fault-tolerant as MapReduce

Apache Sqoop & Flume

•  Get your data in and out of HDFS •  Sqoop focuses on relational databases •  Flume focuses on log files

Cloudera Hue

•  Hadoop User Experience •  Hadoop is largely command line •  Hue provides a UI for end-users •  SDK to build your own apps on top

Apache Mahout

•  Machine learning algorithms that run on MapReduce •  Clustering •  Classification •  Filtering

•  I didn’t study these algorithms in school •  Data science people are excited •  Math people are excited •  I’m excited for them

Apache Tika

•  Content analysis toolkit •  Simply put, a lot of parsers •  Detect/extract metadata/text from documents

•  HTML •  XML •  Office •  PDF •  mbox •  More…

Apache ZooKeeper

•  Distributed systems are HARD •  Everyone was trying to implement the same subsystems •  Bugs lead to race conditions, other bad things

•  ZK: Highly reliable distributed coordination services •  Configuration •  Naming •  Synchronization •  Group Services

Apache Oozie

•  Workflow scheduling for Hadoop •  Like cron, but in directed graph fashion •  Out of box hooks:

•  MR •  Pig •  Hive •  Sqoop •  Impala
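The "directed graph" idea can be sketched in a few lines. This is not Oozie's workflow definition format: the action names below are hypothetical, and the sketch only shows how dependencies between actions determine a valid run order. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_index": {"pig_clean"},
    "impala_report": {"hive_aggregate", "mr_index"},
}

def run_order(graph):
    """Return one order in which every action runs after its dependencies."""
    return list(TopologicalSorter(graph).static_order())

order = run_order(workflow)
print(order)
```

Unlike a flat cron schedule, nothing downstream starts until the steps it depends on have finished.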

Sentry (incubating)

•  Role-based access control for Hive/Impala/Solr •  Regulatory/compliance assurance

Cloudera Morphlines

•  In-memory transformations •  Load, parse, transform, process •  Records as name-value pairs w/ optional blob/pojo objects

•  Java library, embedded in your codebase •  Used to ETL data from Flume and MR into Solr
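The record-through-a-chain-of-commands idea might look roughly like the sketch below. Morphlines itself is a Java library driven by a configuration file; this Python version is only a conceptual stand-in, and the field names (`body`, `message`, `level`, `text`) and command functions are hypothetical.

```python
def parse(record):
    """Decode the raw blob into a text field."""
    record["message"] = record.pop("body").decode("utf-8").strip()
    return record

def split_fields(record):
    """Split 'LEVEL text' into two named fields."""
    level, _, text = record["message"].partition(" ")
    record["level"], record["text"] = level, text
    return record

def drop_debug(record):
    """Returning None drops the record from the pipeline."""
    return None if record["level"] == "DEBUG" else record

def run_pipeline(record, commands):
    """Pass one record through each command in turn, stopping if it is dropped."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record

pipeline = [parse, split_fields, drop_debug]
kept = run_pipeline({"body": b"ERROR disk full\n"}, pipeline)
dropped = run_pipeline({"body": b"DEBUG heartbeat\n"}, pipeline)
print(kept, dropped)
```

A record that survives the chain would then be handed to the indexer.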

Apache Lucene

•  Java-based index and search •  Spellchecking •  Hit highlighting •  Tokenization
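For readers new to the idea, a toy version of "tokenize, index, search" is sketched below. It is not Lucene's API or its index format: `tokenize`, `build_index`, and `search` are hypothetical helpers, and real analyzers do far more (stemming, stop words, scoring).

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))

docs = {
    1: "Hadoop stores data in HDFS.",
    2: "Solr searches data.",
    3: "HBase sits on HDFS",
}
index = build_index(docs)
hits = search(index, "hdfs data")
print(hits)
```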

Apache Solr

•  Enterprise search platform •  Based on Apache Lucene

•  Full-text search •  Faceting •  NRT indexing
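Faceting is easiest to see with a tiny example: count how many matching documents fall under each value of a field. The sketch below is not Solr's query API; the documents and the `source` field are made up for illustration.

```python
from collections import Counter

def facet_counts(hits, field):
    """Count matching documents per value of one field, skipping docs without it."""
    return Counter(doc[field] for doc in hits if field in doc)

# Hypothetical search results.
hits = [
    {"id": 1, "source": "mailing-list"},
    {"id": 2, "source": "ticket"},
    {"id": 3, "source": "mailing-list"},
    {"id": 4},
]
facets = facet_counts(hits, "source")
print(facets.most_common())
```

The counts are what a search UI shows next to each filter, so a user can narrow results without writing a query.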

Apache SolrCloud

•  Integration of Solr + ZooKeeper •  Provides for shard failover

Cloudera Search

•  Based on Apache Solr (incl Lucene and SolrCloud) •  Fault-tolerance: collections backed by HDFS or HBase •  Integration galore:

•  HBase/Flume/MapReduce w/ Lucene •  Hue w/ Solr •  Avro w/ Tika •  HDFS w/ Solr/Lucene •  Sentry w/ Solr

Cloudera Search + Hue

Apologies, I swiped some pretty slides from marketing…

Why Search?

Search Design Strategy

One pool of data

One security framework

One set of system resources

One management interface

An Integrated Part of the Hadoop System

[Diagram: Storage (HDFS, HBase; text, RCFile, Parquet, Avro, etc. records), Integration, and Resource Management layers shared by the Engines: Batch Processing (MapReduce, Hive & Pig), Interactive SQL (Cloudera Impala), Interactive Search (Cloudera Search), Machine Learning (Mahout), Math & Statistics (SAS, R).]

Benefits of Search Integration

Improved Big Data ROI •  An interactive experience without technical knowledge •  Single data set for multiple computing frameworks

Faster Time to Insight •  Exploratory analysis, esp. unstructured data •  Broad range of indexing options to accommodate needs

Cost Efficiency •  Single scalable platform; no incremental investment •  No need for separate systems, storage

Solid Foundations & Reliability •  Solr in production environments for years •  Hadoop-powered reliability and scalability

So much software…

Making Decisions

That’s a Lot of Software

•  21 packages, depending on how you count •  And there’s plenty more…

•  How to decide what to use?

“The answer to most Hadoop questions is it depends.”

Some of the Big Issues

•  Response time •  User interfaces •  Programming paradigm •  Input/output formats •  Use cases

Response Time

•  MapReduce is batch oriented •  Resilient to hardware failures •  Robust scheduling options

•  Impala is near-realtime •  HBase is realtime

•  Key/values are cached in memory

•  Search can be (near-)realtime.

•  Hybrid systems are common!

User Interfaces

•  Java •  MapReduce, HBase

•  SQL •  Hive, Impala

•  Shell •  Pig

•  Natural Language / Free Text •  Search

Data Constraints

•  MapReduce •  Paradigm takes some getting used to •  Processing must accommodate format

•  HBase •  Columnar key/value store •  Hue makes this easier

•  Search •  Indexing and display •  Hue makes this easier

Input/Output Formats

•  Know what they are… optional. •  Don’t know? That’s okay. •  Schema on read.

•  Be able to extract what you need
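"Schema on read" can be sketched as follows: the raw lines are stored untouched, and field names are applied only when someone reads them. The log layout, the `read_with_schema` helper, and the field names are all assumptions made for this example.

```python
# Raw data lands as-is; nothing is validated or reshaped at write time.
RAW_LINES = [
    "2013-11-05 10:15:02 GET /search?q=hadoop 200",
    "2013-11-05 10:15:07 GET /index.html 404",
    "corrupted line",
]

def read_with_schema(lines, fields):
    """Apply field names at read time; skip lines that do not fit the schema."""
    for line in lines:
        parts = line.split()
        if len(parts) == len(fields):
            yield dict(zip(fields, parts))

# Hypothetical schema chosen by the reader, not by whoever wrote the data.
access_schema = ("date", "time", "method", "path", "status")
records = list(read_with_schema(RAW_LINES, access_schema))
paths = [r["path"] for r in records if r["status"] == "200"]
print(paths)
```

A different reader could apply a different schema to the same lines and extract only what it needs.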

Lack of Use Case

•  “Big Data” and Hadoop •  They ENABLE you to solve problems •  Won’t solve problems for you •  Doesn’t know about your business logic •  “Big” is bigger than you’re accustomed to…

•  Have a plan •  Bring your use cases •  Bring your business questions

One typical Hadoop use case.

Index Generation/Serving

eBay – Cassini Project

•  June 2012 •  2B page views/day •  250M searches/day •  9 PB online

•  Custom search indexes •  Limited by field or time period

eBay – Cassini Project

•  MapReduce to generate indexes •  Customer history •  Item fields: name, price, descriptions, etc.

•  Bulk import indexes into HBase, served •  15 TB in HBase, 1.2 TB daily import into HBase •  Ranking algorithms can take into account

•  More history •  More fields •  More customer-specific details

Some quick examples.

Search Use Cases

Powerful, proven search capabilities that let organizations:

Offer easy access to non-technical resources

Explore data prior to processing and modeling

Gain immediate access and find correlations in mission-critical data

Monsanto

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch

Cloudera: Internal Field Portal

Custom Aggregated Search

Cloudera – Internal Field Portal

•  Single stop for field engineers •  Mailing lists: public, private •  Tickets: support, development, public ASF •  Customer data: accounts, clusters, KB articles •  Customer Clusters: configs, audits, logs, events •  Books and papers •  Discussion forums

•  Dogfooding, yes •  Makes my life easier

Cloudera – Internal Field Portal

•  Varied fetchers/observers for web/API content •  Content is retrieved via Flume, Sqoop

•  Search indexes and replicates into HBase •  Each collection has collection-specific filters/fields •  Provides title, content snippet, link to original

•  Morphlines extracts books and papers using Tika •  Impala for analytics

•  Future: Use MapReduce to ingest logs

Patterns & Predictions: Durkheim Project

Risk Classification & Predictive Analysis

Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

US Combat Deaths AFG 301

US Military Suicides 349

349 > 301

Patterns & Predictions – Durkheim Project

•  Assessment of mental health risks •  Correlate veterans’ communications with suicide risk

•  Build machine learning algorithms on MapReduce •  Train using expert knowledge

•  Keywords •  Patterns

•  Algorithm detects and assigns risk scores •  In what medium?

Image: http://www.flickr.com/photos/42586873@N00/3770782889/

Unstructured Clinical Notes

•  Phase 1 •  3 cohorts: non-psychiatric, psychiatric, suicide-positive •  100 clinical profiles per cohort •  65% accurate in predicting suicide risk in control group

•  Phase 2 •  Text analytics of clinical records, opt-in social media •  Goal of 100,000 veteran participants •  Represents a huge increase of data

•  Traditional enterprise search couldn’t scale

•  Technologies •  Hadoop •  Search

•  Indexing of machine learning, backed by HBase for performance •  Hue interface for non-technical users •  Discovery of terms, keywords, risk factors in numerous facets

•  Impala •  Deep SQL queries if/when interesting deviations are found •  e.g. if the word “Molly” appeared in top 10 facets •  Write some SQL to dig in, perhaps revise indexing scheme

•  Currently •  Monitoring •  Analysis

•  Future •  Interventional study •  Back our hopes with data…

•  More detailed Case Study •  http://goo.gl/3ZJMwS •  http://durkheimproject.org/

Parting thoughts… in no particular order.

Summary

Search Simplifies Interaction

Explore

Navigate

Correlate

Experts know MapReduce. Savvy people know SQL.

Everyone knows Search.

Summary

•  With Hadoop, it depends. •  The tools are out there. •  Open source software

•  Many interconnected pieces •  Many unexplored opportunities •  A thriving community awaits you…

•  Data can make a difference. •  Search allows everyone to interact with data.

•  This is a Big Deal.

What’s Next?

•  Download Hadoop! •  Already done that? Contribute…

•  CDH available at www.cloudera.com •  Cloudera provides pre-‐loaded VMs

•  http://tiny.cloudera.com/quickstartvm

•  Clone our repos! •  https://github.com/cloudera

Questions?

Preferably related to the talk…

Thank You! Alex Moundalexis alexm@clouderagovt.com @technmsg We’re hiring, kids! Well, not kids.
