Search in the Apache Hadoop Ecosystem: Thoughts from the Field
Open Source Search Conference, November 2013
Alex Moundalexis
@technmsg


Thoughts of a Former SA Field Guy

Disclaimer

•  Technologies, not products
•  Cloudera builds software
   •  most donated to Apache
   •  some closed-source
•  I will likely mention “Cloudera Something”
•  Cloudera “products” I reference are open source
   •  Apache Licensed
   •  Source code is on GitHub
      •  https://github.com/cloudera

What This Talk Isn’t About

•  Deploying
   •  Puppet, Chef, Ansible, homegrown scripts, intern labor
•  Sizing & Tuning
   •  Depends heavily on data and workload
•  Coding
•  Algorithms

“The answer to most Hadoop questions is ‘it depends.’”

Quick and dirty, more time for use cases.

The Apache Hadoop Ecosystem

Why “Ecosystem?”

•  In the beginning, just Hadoop
   •  HDFS
   •  MapReduce
•  Today, dozens of interrelated components
   •  I/O
   •  Processing
   •  Specialty Applications
   •  Configuration
   •  Workflow

Partial Ecosystem

[Diagram: data flowing into and out of Hadoop. Labels: external system, RDBMS / DWH, web server, device logs, API access, log collection, DB table import, batch processing, machine learning, DB table export, BI tool + JDBC/ODBC, user, Search, SQL]

HDFS

•  Distributed, highly fault-tolerant filesystem
•  Optimized for large streaming access to data
•  Based on Google File System
   •  http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)

•  Programming paradigm
•  Batch oriented, not realtime
•  Works well with distributed computing
•  Lots of Java, but other languages supported
•  Based on Google’s paper
   •  http://research.google.com/archive/mapreduce.html

Under the Covers

You specify map() and reduce() functions.

The framework does the rest.
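To make the split of responsibilities concrete, here is a minimal single-process sketch of a word count in Python. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` only imitates what the framework would do (run the mappers, group values by key, run the reducers) on one machine.

```python
from collections import defaultdict
from itertools import chain


def map_fn(line):
    """map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def reduce_fn(word, counts):
    """reduce(): combine every count that was emitted for one word."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, shuffle (group by key), reduce."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["the quick brown fox", "the lazy dog", "The end"])
print(result)
```

Only `map_fn` and `reduce_fn` carry the job's logic. On a real cluster the grouping step also moves data between machines and reruns failed tasks, which this sketch leaves out.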

Apache HBase

•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
   •  http://research.google.com/archive/bigtable.html

Apache Accumulo

•  Random, realtime read/write access
•  Key/value columnar store
•  (b|tr)illions of rows/columns
•  Based on Google BigTable
   •  http://research.google.com/archive/bigtable.html
•  Adds cell-level security
•  Implemented by National Security Agency
   •  Donated to ASF

Apache Hive & Pig

•  Abstraction of Hadoop’s Java API
   •  Hive is SQL-based
   •  Pig is more data-flow oriented
•  Eases analysis using MapReduce

Cloudera Impala

•  SQL-based, but interactive response
•  Backed by HDFS or HBase
•  Allows for fast iteration/discovery
•  Not as fault-tolerant as MapReduce

Apache Sqoop & Flume

•  Get your data in and out of HDFS
   •  Sqoop focuses on relational databases
   •  Flume focuses on log files

Cloudera Hue

•  Hadoop User Experience
•  Hadoop is largely command line
•  Hue provides a UI for end-users
•  SDK to build your own apps on top

Apache Mahout

•  Machine learning algorithms that run on MapReduce
   •  Clustering
   •  Classification
   •  Filtering
•  I didn’t study these algorithms in school
   •  Data science people are excited
   •  Math people are excited
   •  I’m excited for them

Apache Tika

•  Content analysis toolkit
•  Simply put, a lot of parsers
•  Detect/extract metadata/text from documents
   •  HTML
   •  XML
   •  Office
   •  PDF
   •  mbox
   •  More…

Apache ZooKeeper

•  Distributed systems are HARD
   •  Everyone was trying to implement the same subsystems
   •  Bugs lead to race conditions, other bad things
•  ZK: Highly reliable distributed coordination services
   •  Configuration
   •  Naming
   •  Synchronization
   •  Group Services

Apache Oozie

•  Workflow scheduling for Hadoop
•  Like cron, but in directed graph fashion
•  Out of box hooks:
   •  MR
   •  Pig
   •  Hive
   •  Sqoop
   •  Impala
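The “directed graph” idea can be sketched in a few lines of Python. This is not Oozie's workflow format (Oozie workflows are defined in XML); the action names below are hypothetical, and the standard-library `graphlib` module (Python 3.9+) stands in for the scheduler's dependency resolution.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it waits on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_index": {"pig_clean"},
    "sqoop_export": {"hive_aggregate", "mr_index"},
}


def run_order(dag):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(dag).static_order())


order = run_order(workflow)
print(order)
```

Unlike a flat cron schedule, the graph says that `hive_aggregate` and `mr_index` may run in either order (or in parallel), but neither may start before `pig_clean` finishes.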

Sentry (incubating)

•  Role-based access control for Hive/Impala/Solr
•  Regulatory/compliance assurance

Cloudera Morphlines

•  In-memory transformations
•  Load, parse, transform, process
•  Records as name-value pairs w/ optional blob/pojo objects
•  Java library, embedded in your codebase
•  Used to ETL data from Flume and MR into Solr
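The load, parse, transform chain can be pictured as a list of small commands, each taking a record and passing it on. The Python sketch below only illustrates that shape; it is not the Morphlines API (a Java library), and the field names (`raw`, `message`, `level`) and command functions are made up for the example.

```python
def read_line(record):
    """Load: move the raw payload into a named field."""
    record["message"] = record.pop("raw").strip()
    return record


def parse_fields(record):
    """Parse: split a 'timestamp level text' line into named fields."""
    timestamp, level, text = record["message"].split(" ", 2)
    record.update(timestamp=timestamp, level=level, text=text)
    return record


def drop_debug(record):
    """Transform: returning None drops the record from the pipeline."""
    return None if record["level"] == "DEBUG" else record


def run_pipeline(record, commands):
    """Pass one record through each command in order, stopping if it is dropped."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


commands = [read_line, parse_fields, drop_debug]
kept = run_pipeline({"raw": "2013-11-05T10:00:00 ERROR disk full\n"}, commands)
dropped = run_pipeline({"raw": "2013-11-05T10:00:01 DEBUG heartbeat\n"}, commands)
print(kept, dropped)
```

The record stays a plain set of name-value pairs throughout, so the output of the last command is already in a shape that could be handed to an indexer.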

Apache Lucene

•  Java-based index and search
   •  Spellchecking
   •  Hit highlighting
   •  Tokenization
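For readers new to search, the core idea behind “index and search” is an inverted index: a map from each token to the documents that contain it. The Python sketch below is a toy illustration of tokenization and lookup, not Lucene's API or on-disk format; `tokenize`, `build_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map every token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """AND query: ids of the documents that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return sorted(set.intersection(*postings)) if postings else []


docs = {
    1: "Hadoop stores data in HDFS",
    2: "Solr searches data",
    3: "HDFS is a filesystem",
}
index = build_index(docs)
print(search(index, "hdfs data"))
```

A real engine adds scoring, stemming, positions for phrase queries and highlighting, and much more, but the lookup still starts from this token-to-documents map.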

Apache Solr

•  Enterprise search platform
•  Based on Apache Lucene
   •  Full-text search
   •  Faceting
   •  NRT indexing
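Faceting is easiest to understand as counting: for the documents that match a query, how many fall under each value of a field? The Python sketch below shows that counting on a hypothetical list of hits; it is not Solr code. In Solr itself, facets are requested with query parameters such as `facet=true` and `facet.field`.

```python
from collections import Counter


def facet_counts(hits, field, top_n=10):
    """Count how often each value of `field` occurs among the matching docs."""
    counts = Counter()
    for doc in hits:
        value = doc.get(field)
        values = value if isinstance(value, list) else [value]
        counts.update(v for v in values if v is not None)
    return counts.most_common(top_n)


# Hypothetical search hits; the field names are illustrative only.
hits = [
    {"type": "ticket", "product": "HBase"},
    {"type": "ticket", "product": "Impala"},
    {"type": "mail", "product": "HBase"},
    {"type": "ticket"},
]
print(facet_counts(hits, "type"))
print(facet_counts(hits, "product"))
```

The counts are what a search UI typically renders as clickable filters next to the result list.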

Apache SolrCloud

•  Integration of Solr + ZooKeeper
•  Provides for shard failover

Cloudera Search

•  Based on Apache Solr (incl Lucene and SolrCloud)
•  Fault-tolerance: collections backed by HDFS or HBase
•  Integration galore:
   •  HBase/Flume/MapReduce w/ Lucene
   •  Hue w/ Solr
   •  Avro w/ Tika
   •  HDFS w/ Solr/Lucene
   •  Sentry w/ Solr

Cloudera Search + Hue

Apologies, I swiped some pretty slides from marketing…

Why Search?

Search Design Strategy


One pool of data

One security framework

One set of system resources

One management interface

An Integrated Part of the Hadoop System

[Diagram: layers of the integrated stack]

•  Storage: HDFS, HBase (TEXT, RCFILE, PARQUET, AVRO, ETC.; RECORDS)
•  Integration
•  Resource Management
•  Metadata
•  Engines
   •  Batch Processing: MapReduce, Hive & Pig
   •  Interactive SQL: Cloudera Impala
   •  Interactive Search: Cloudera Search
   •  Machine Learning: Mahout
   •  Math & Statistics: SAS, R
   •  …

Benefits of Search Integration

Improved Big Data ROI
•  An interactive experience without technical knowledge
•  Single data set for multiple computing frameworks

Faster Time to Insight
•  Exploratory analysis, esp. unstructured data
•  Broad range of indexing options to accommodate needs

Cost Efficiency
•  Single scalable platform; no incremental investment
•  No need for separate systems, storage

Solid Foundations & Reliability
•  Solr in production environments for years
•  Hadoop-powered reliability and scalability

So much software…

Making Decisions

That’s a Lot of Software

•  21 packages, depending on how you count
•  And there’s plenty more…
•  How to decide what to use?

“The answer to most Hadoop questions is ‘it depends.’”

Some of the Big Issues

•  Response time
•  User interfaces
•  Programming paradigm
•  Input/output formats
•  Use cases

Response Time

•  MapReduce is batch oriented
   •  Resilient to hardware failures
   •  Robust scheduling options
•  Impala is near-realtime
•  HBase is realtime
   •  Key/values are cached in memory
•  Search can be (near-)realtime.
•  Hybrid systems are common!

User Interfaces

•  Java
   •  MapReduce, HBase
•  SQL
   •  Hive, Impala
•  Shell
   •  Pig
•  Natural Language / Free Text
   •  Search

Data Constraints

•  MapReduce
   •  Paradigm takes some getting used to
   •  Processing must accommodate format
•  HBase
   •  Columnar key/value store
   •  Hue makes this easier
•  Search
   •  Indexing and display
   •  Hue makes this easier

Input/Output Formats

•  Know what they are… optional.
   •  Don’t know? That’s okay.
   •  Schema on read.
•  Be able to extract what you need
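“Schema on read” means the raw data is stored as-is and a structure is imposed only when somebody reads it. A minimal Python sketch of the idea follows; the sample lines, field names, and `read_with_schema` helper are all hypothetical, chosen only to show two different readers extracting what they need from the same raw data.

```python
import json

# Raw lines stored as-is; nothing was validated when they were written.
RAW_LINES = [
    '{"user": "alice", "bytes": "512", "path": "/index.html"}',
    '{"user": "bob", "bytes": "2048", "agent": "curl"}',
    "not json at all",
]


def read_with_schema(lines, schema):
    """Apply `schema` (field name -> converter) at read time; skip bad lines."""
    for line in lines:
        try:
            raw = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(raw, dict):
            continue
        row = {name: convert(raw[name]) for name, convert in schema.items() if name in raw}
        if row:
            yield row


# Two readers, two schemas, one copy of the data.
usage = list(read_with_schema(RAW_LINES, {"user": str, "bytes": int}))
paths = list(read_with_schema(RAW_LINES, {"path": str}))
print(usage, paths)
```

Neither reader had to agree on a format up front, and the line that fits no schema stays in storage until some later reader knows what to do with it.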

Lack of Use Case

•  “Big Data” and Hadoop
   •  They ENABLE you to solve problems
   •  Won’t solve problems for you
   •  Don’t know about your business logic
   •  “Big” is bigger than you’re accustomed to…
•  Have a plan
   •  Bring your use cases
   •  Bring your business questions

One typical Hadoop use case.

Index Generation/Serving

eBay – Cassini Project

•  June 2012
   •  2B page views/day
   •  250M searches/day
   •  9 PB online
•  Custom search indexes
   •  Limited by field or time period

eBay – Cassini Project

•  MapReduce to generate indexes
   •  Customer history
   •  Item fields: name, price, descriptions, etc.
•  Bulk import indexes into HBase for serving
   •  15 TB in HBase, 1.2 TB daily import into HBase
   •  Ranking algorithms can take into account
      •  More history
      •  More fields
      •  More customer-specific details

Some quick examples.

Search Use Cases

Powerful, proven search capabilities that let organizations:

•  Offer easy access to non-technical resources
•  Explore data prior to processing and modeling
•  Gain immediate access and find correlations in mission-critical data

Monsanto

Scalable, efficient image search for analysis and research

Track plant characteristics throughout their lifecycle

Before: Manual attribute extraction and search queries within database

Now: Parse and index images at acquisition and on demand, index archived images in batch

Cloudera: Internal Field Portal

Custom Aggregated Search

Cloudera – Internal Field Portal

•  Single stop for field engineers
   •  Mailing lists: public, private
   •  Tickets: support, development, public ASF
   •  Customer data: accounts, clusters, KB articles
   •  Customer Clusters: configs, audits, logs, events
   •  Books and papers
   •  Discussion forums
•  Dogfooding, yes
•  Makes my life easier

•  Varied fetchers/observers for web/API content
•  Content is retrieved via Flume, Sqoop
•  Search indexes and replicates into HBase
   •  Each collection has collection-specific filters/fields
   •  Provides title, content snippet, link to original
•  Morphlines extracts books and papers using Tika
•  Impala for analytics
•  Future: Use MapReduce to ingest logs

Patterns & Predictions: Durkheim Project

Risk Classification & Predictive Analysis

Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/

2012

US Combat Deaths (AFG): 301

US Military Suicides: 349

349 > 301

Patterns & Predictions – Durkheim Project

•  Assessment of mental health risks
•  Correlate veterans’ communications with suicide risk
•  Build machine learning algorithms on MapReduce
•  Train using expert knowledge
   •  Keywords
   •  Patterns
•  Algorithm detects and assigns risk scores
•  In what medium?


Image: http://www.flickr.com/photos/42586873@N00/3770782889/

Unstructured Clinical Notes


•  Phase 1
   •  3 cohorts: non-psychiatric, psychiatric, suicide-positive
   •  100 clinical profiles per cohort
   •  65% accurate in predicting suicide risk in control group
•  Phase 2
   •  Text analytics of clinical records, opt-in social media
   •  Goal of 100,000 veteran participants
   •  Represents a huge increase of data
      •  Traditional enterprise search couldn’t scale

•  Technologies
   •  Hadoop
   •  Search
      •  Indexing of machine learning, backed by HBase for performance
      •  Hue interface for non-technical users
      •  Discovery of terms, keywords, risk factors in numerous facets
   •  Impala
      •  Deep SQL queries if/when interesting deviations are found
      •  e.g. if the word “Molly” appeared in top 10 facets
      •  Write some SQL to dig in, perhaps revise indexing scheme

•  Currently
   •  Monitoring
   •  Analysis
•  Future
   •  Interventional study
   •  Back our hopes with data…
•  More detailed Case Study
   •  http://goo.gl/3ZJMwS
   •  http://durkheimproject.org/

Parting thoughts… in no particular order.

Summary

Search Simplifies Interaction

Explore

Navigate

Correlate

Experts know MapReduce. Savvy people know SQL. Everyone knows Search.

Summary

•  With Hadoop, it depends.
•  The tools are out there.
   •  Open source software
   •  Many interconnected pieces
   •  Many unexplored opportunities
   •  A thriving community awaits you…
•  Data can make a difference.
   •  Search allows everyone to interact with data.
   •  This is a Big Deal.

What’s Next?

•  Download Hadoop!
   •  Already done that? Contribute…
•  CDH available at www.cloudera.com
•  Cloudera provides pre-loaded VMs
   •  http://tiny.cloudera.com/quickstartvm
•  Clone our repos!
   •  https://github.com/cloudera

Preferably related to the talk…

Questions?

Thank You!
Alex Moundalexis
@technmsg
We’re hiring, kids! Well, not kids.
