Search in the Apache Hadoop Ecosystem: Thoughts from the Field
DESCRIPTION
This presentation describes the Hadoop ecosystem and gives examples of how these open source tools are combined to solve specific and sometimes very complex problems. Drawing on case studies from the field, Mr. Moundalexis demonstrates that rigid, one-size-fits-all traditional systems fall short, while combinations of tools in the Apache Hadoop ecosystem provide a versatile and flexible platform for integrating, finding, and analyzing information.
TRANSCRIPT
1
Search in the Apache Hadoop Ecosystem: Thoughts from the Field Open Source Search Conference, November 2013 Alex Moundalexis [email protected] @technmsg
2
Thoughts of a Former SA
3
Thoughts of a Former SA Field Guy
Disclaimer
• Technologies, not products • Cloudera builds software
• most donated to Apache • some closed-source
• I will likely mention “Cloudera Something” • Cloudera “products” I reference are open source
• Apache Licensed • Source code is on GitHub
• https://github.com/cloudera
4
What This Talk Isn’t About
• Deploying • Puppet, Chef, Ansible, homegrown scripts, intern labor
• Sizing & Tuning • Depends heavily on data and workload
• Coding • Algorithms
5
6
“The answer to most Hadoop questions is it depends.”
7
Quick and dirty, more time for use cases.
The Apache Hadoop Ecosystem
Why “Ecosystem?”
• In the beginning, just Hadoop • HDFS • MapReduce
• Today, dozens of interrelated components • I/O • Processing • Specialty Applications • Configuration • Workflow
8
Partial Ecosystem
9
[Diagram: Hadoop at the center of a data flow.
Inputs: external system (API access), web server and device logs (log collection), RDBMS / DWH (DB table import).
Inside Hadoop: batch processing, machine learning.
Outputs: external system (API access), RDBMS / DWH (DB table export), user (Search, SQL, BI tool + JDBC/ODBC).]
HDFS
• Distributed, highly fault-tolerant filesystem • Optimized for large streaming access to data • Based on Google File System
• http://research.google.com/archive/gfs.html
10
Lots of Commodity Machines
11
Image: Yahoo! Hadoop cluster [OSCON ’07]
MapReduce (MR)
• Programming paradigm • Batch oriented, not realtime • Works well with distributed computing • Lots of Java, but other languages supported • Based on Google’s paper
• http://research.google.com/archive/mapreduce.html
12
Under the Covers
You specify map() and reduce() functions.
The framework does the rest.
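To make that division of labor concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API: `map_fn`, `reduce_fn`, and `run_job` are hypothetical names, and `run_job` only imitates the grouping step that the real framework performs across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values that were emitted for one key."""
    return word, sum(counts)


def run_job(lines):
    """Stand-in for the framework: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


word_counts = run_job(["Hadoop stores data", "Hadoop processes data"])
print(word_counts)
```

Only `map_fn` and `reduce_fn` carry the job-specific logic; everything in `run_job` is the part the framework takes off your hands.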
Apache HBase
• Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable
• http://research.google.com/archive/bigtable.html
15
Apache Accumulo
• Random, realtime read/write access • Key/value columnar store • (b|tr)illions of rows/columns • Based on Google BigTable
• http://research.google.com/archive/bigtable.html
• Adds cell-level security • Implemented by National Security Agency
• Donated to ASF
16
Apache Hive & Pig
• Abstraction of Hadoop’s Java API • Hive is SQL-based • Pig is more data-flow oriented
• Eases analysis using MapReduce
17
Cloudera Impala
• SQL-based, but interactive response • Backed by HDFS or HBase • Allows for fast iteration/discovery • Not as fault-tolerant as MapReduce
18
Apache Sqoop & Flume
• Get your data in and out of HDFS • Sqoop focuses on relational databases • Flume focuses on log files
19
Cloudera Hue
• Hadoop User Experience • Hadoop is largely command line • Hue provides a UI for end-users • SDK to build your own apps on top
20
Apache Mahout
• Machine learning algorithms that run on MapReduce • Clustering • Classification • Filtering
• I didn’t study these algorithms in school • Data science people are excited • Math people are excited • I’m excited for them
21
Apache Tika
• Content analysis toolkit • Simply put, a lot of parsers • Detect/extract metadata/text from documents
• HTML • XML • Office • PDF • mbox • More…
22
Apache ZooKeeper
• Distributed systems are HARD • Everyone was trying to implement the same subsystems • Bugs lead to race conditions, other bad things
• ZK: Highly reliable distributed coordination services • Configuration • Naming • Synchronization • Group Services
23
Apache Oozie
• Workflow scheduling for Hadoop • Like cron, but in directed graph fashion • Out of box hooks:
• MR • Pig • Hive • Sqoop • Impala
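The "cron, but in directed graph fashion" idea can be sketched with Python's standard `graphlib` module. This is an illustration only: Oozie defines its workflows in XML rather than Python, and the action names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mr_index": {"pig_clean"},
    "sqoop_export": {"hive_aggregate", "mr_index"},
}


def execution_order(graph):
    """Return one valid run order in which every dependency comes first."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(workflow)
print(order)
```

Unlike a flat cron table, the graph makes dependencies explicit: `sqoop_export` cannot start until both `hive_aggregate` and `mr_index` have finished, and a cyclic definition is rejected with `graphlib.CycleError`.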
24
Sentry (incubating)
• Role-based access control for Hive/Impala/Solr • Regulatory/compliance assurance
25
Cloudera Morphlines
• In-memory transformations • Load, parse, transform, process • Records as name-value pairs w/ optional blob/POJO objects
• Java library, embedded in your codebase • Used to ETL data from Flume and MR into Solr
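The load, parse, transform flow over name-value records can be pictured as a chain of small commands. The sketch below is plain Python, not the Morphlines library or its configuration format; the command names and the log layout ("LEVEL host text") are assumptions made for illustration.

```python
def read_line(raw):
    """Load: wrap a raw log line in a record of name-value pairs."""
    return {"message": raw}


def parse_fields(record):
    """Parse: split 'LEVEL host text...' into named fields."""
    level, host, text = record["message"].split(" ", 2)
    return {**record, "level": level, "host": host, "text": text}


def drop_debug(record):
    """Transform: returning None drops the record from the pipeline."""
    return None if record["level"] == "DEBUG" else record


def run_pipeline(raw, commands):
    """Pass one input through each command in order, stopping if it is dropped."""
    record = raw
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


COMMANDS = [read_line, parse_fields, drop_debug]
LINES = ["ERROR web01 disk full", "DEBUG web02 heartbeat ok"]

docs = [doc for doc in (run_pipeline(line, COMMANDS) for line in LINES) if doc is not None]
print(docs)
```

Each surviving record is a flat dictionary, which is the shape a search index wants as its input document.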
26
Apache Lucene
• Java-based index and search • Spellchecking • Hit highlighting • Tokenization
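For readers new to search internals, the core ideas of tokenization, indexing, and hit highlighting fit in a few lines. This is a toy inverted index in Python, not Lucene's API or its on-disk format; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Tokenization: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every query token."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


def highlight(text, query):
    """Hit highlighting: wrap each matched query term in <b> tags."""
    terms = tokenize(query)
    if not terms:
        return text
    pattern = r"\b(" + "|".join(re.escape(term) for term in terms) + r")\b"
    return re.sub(pattern, r"<b>\1</b>", text, flags=re.IGNORECASE)


DOCS = {
    1: "Hadoop stores data in HDFS",
    2: "Solr builds on Lucene",
    3: "Lucene indexes data",
}
index = build_index(DOCS)
hits = search(index, "lucene data")
print(hits, highlight(DOCS[1], "hdfs"))
```

The lookup cost depends on the number of query terms rather than the number of documents, which is why an index is built up front instead of scanning the text at query time.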
27
Apache Solr
• Enterprise search platform • Based on Apache Lucene
• Full-text search • Faceting • NRT indexing
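Faceting means counting the values of a field across the documents that match a query, so users can drill down. A rough Python sketch of the idea follows; it does not use Solr, and the documents, field names, and substring matching are assumptions for illustration.

```python
from collections import Counter

# Hypothetical documents with a few structured fields.
DOCS = [
    {"title": "HDFS overview", "type": "pdf", "year": 2012},
    {"title": "HDFS tuning notes", "type": "html", "year": 2013},
    {"title": "Solr faceting", "type": "pdf", "year": 2013},
]


def facet_counts(docs, query, field):
    """Count values of `field` across documents whose title contains the query."""
    matches = [doc for doc in docs if query.lower() in doc["title"].lower()]
    return Counter(doc[field] for doc in matches)


by_type = facet_counts(DOCS, "hdfs", "type")
by_year = facet_counts(DOCS, "", "year")  # an empty query matches every document
print(by_type, by_year)
```

A search UI would typically render these counts next to the result list as clickable filters.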
28
Apache SolrCloud
• Integration of Solr + ZooKeeper • Provides for shard failover
29
Cloudera Search
• Based on Apache Solr (incl. Lucene and SolrCloud) • Fault-tolerance: collections backed by HDFS or HBase • Integration galore:
• HBase/Flume/MapReduce w/ Lucene • Hue w/ Solr • Avro w/ Tika • HDFS w/ Solr/Lucene • Sentry w/ Solr
30
Cloudera Search + Hue
31
Cloudera Search + Hue
32
33
Apologies, I swiped some pretty slides from marketing…
Why Search?
Search Design Strategy
34
One pool of data
One security framework
One set of system resources
One management interface
An Integrated Part of the Hadoop System
[Diagram: layers of the integrated platform]
Storage: HDFS (TEXT, RCFILE, PARQUET, AVRO, ETC.), HBase (RECORDS)
Integration
Resource Management
Metadata
Engines: Batch Processing (MAPREDUCE, HIVE & PIG), Interactive SQL (CLOUDERA IMPALA), Interactive Search (CLOUDERA SEARCH), Machine Learning (MAHOUT), Math & Statistics (SAS, R), …
Benefits of Search Integration
35
Improved Big Data ROI • An interactive experience without technical knowledge • Single data set for multiple computing frameworks
Faster Time to Insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs
Cost Efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage
Solid Foundations & Reliability • Solr in production environments for years • Hadoop-powered reliability and scalability
36
So much software…
Making Decisions
That’s a Lot of Software
• 21 packages, depending on how you count • And there’s plenty more…
• How to decide what to use?
37
38
“The answer to most Hadoop questions is it depends.”
Some of the Big Issues
• Response time • User interfaces • Programming paradigm • Input/output formats • Use cases
39
Response Time
• MapReduce is batch oriented • Resilient to hardware failures • Robust scheduling options
• Impala is near-realtime • HBase is realtime
• Key/values are cached in memory
• Search can be (near-)realtime.
• Hybrid systems are common!
40
User Interfaces
• Java • MapReduce, HBase
• SQL • Hive, Impala
• Shell • Pig
• Natural Language / Free Text • Search
41
Data Constraints
• MapReduce • Paradigm takes some getting used to • Processing must accommodate format
• HBase • Columnar key/value store • Hue makes this easier
• Search • Indexing and display • Hue makes this easier
42
Input/Output Formats
• Know what they are… optional. • Don’t know? That’s okay. • Schema on read.
• Be able to extract what you need
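"Schema on read" means the raw data is stored untouched and structure is applied only when someone reads it. Below is a minimal Python sketch, assuming JSON-per-line input; the sample lines, the field names, and `read_with_schema` are hypothetical.

```python
import json

# Raw lines as they might land in storage: no schema was enforced on write.
RAW_LINES = [
    '{"user": "alice", "action": "search", "ms": 42}',
    '{"user": "bob", "action": "click"}',
    "not json at all",
]


def read_with_schema(lines, fields):
    """Apply a schema at read time: extract only the requested fields."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # the stored line is untouched; this reader just skips it
        if isinstance(record, dict):
            yield {name: record.get(name) for name in fields}


rows = list(read_with_schema(RAW_LINES, ["user", "ms"]))
print(rows)
```

A different reader can ask for different fields from the same stored lines later, which is the sense in which knowing the format up front is optional.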
43
Lack of Use Case
• “Big Data” and Hadoop • They ENABLE you to solve problems • Won’t solve problems for you • Doesn’t know about your business logic • “Big” is bigger than you’re accustomed to…
• Have a plan • Bring your use cases • Bring your business questions
44
45
One typical Hadoop use case.
Index Generation/Serving
eBay – Cassini Project
• June 2012 • 2B page views/day • 250M searches/day • 9 PB online
• Custom search indexes • Limited by field or time period
46
eBay – Cassini Project
• MapReduce to generate indexes • Customer history • Item fields: name, price, descriptions, etc.
• Bulk import indexes into HBase, served • 15 TB in HBase, 1.2 TB daily import into HBase • Ranking algorithms can take into account
• More history • More fields • More customer-specific details
47
48
Some quick examples.
Search Use Cases
Search Use Cases
49
Powerful, proven search capabilities that let organizations:
• Offer easy access to non-technical resources
• Explore data prior to processing and modeling
• Gain immediate access and find correlations in mission-critical data
Monsanto
50
Scalable, efficient image search for analysis and research
Track plant characteristics throughout their lifecycle
Before: Manual attribute extraction and search queries within database
Now: Parse and index images at acquisition and on demand, index archived images in batch
51
Cloudera: Internal Field Portal
Custom Aggregated Search
Cloudera – Internal Field Portal
• Single stop for field engineers • Mailing lists: public, private • Tickets: support, development, public ASF • Customer data: accounts, clusters, KB articles • Customer Clusters: configs, audits, logs, events • Books and papers • Discussion forums
• Dogfooding, yes • Makes my life easier
52
Cloudera – Internal Field Portal
53
Cloudera – Internal Field Portal
• Varied fetchers/observers for web/API content • Content is retrieved via Flume, Sqoop
• Search indexes and replicates into HBase • Each collection has collection-specific filters/fields • Provides title, content snippet, link to original
• Morphlines extracts books and papers using Tika • Impala for analytics
• Future: Use MapReduce to ingest logs
54
55
Patterns & Predictions: Durkheim Project
Risk Classification & Predictive Analysis
Image: http://www.flickr.com/photos/soldiersmediacenter/4598169027/
2012
US Combat Deaths AFG: 301
US Military Suicides: 349
349 > 301
Patterns & Predictions – Durkheim Project
• Assessment of mental health risks • Correlate veterans’ communications with suicide risk
59
Patterns & Predictions – Durkheim Project
• Build machine learning algorithms on MapReduce • Train using expert knowledge
• Keywords • Patterns
• Algorithm detects and assigns risk scores • In what medium?
60
Patterns & Predictions – Durkheim Project
61
Image: http://www.flickr.com/photos/42586873@N00/3770782889/
Unstructured Clinical Notes
Patterns & Predictions – Durkheim Project
• Phase 1 • 3 cohorts: non-psychiatric, psychiatric, suicide-positive • 100 clinical profiles per cohort • 65% accurate in predicting suicide risk in control group
• Phase 2 • Text analytics of clinical records, opt-in social media • Goal of 100,000 veteran participants • Represents a huge increase of data
• Traditional enterprise search couldn’t scale
62
Patterns & Predictions – Durkheim Project
• Technologies • Hadoop • Search
• Indexing of machine learning, backed by HBase for performance • Hue interface for non-technical users • Discovery of terms, keywords, risk factors in numerous facets
• Impala • Deep SQL queries if/when interesting deviations are found • e.g. if the word “Molly” appeared in top 10 facets • Write some SQL to dig in, perhaps revise indexing scheme
63
Patterns & Predictions – Durkheim Project
• Currently • Monitoring • Analysis
• Future • Interventional study • Back our hopes with data…
• More detailed Case Study • http://goo.gl/3ZJMwS • http://durkheimproject.org/
64
65
Parting thoughts… in no particular order.
Summary
Search Simplifies Interaction
66
Explore
Navigate
Correlate
Experts know MapReduce. Savvy people know SQL.
Everyone knows Search.
Summary
• With Hadoop, it depends. • The tools are out there. • Open source software
• Many interconnected pieces • Many unexplored opportunities • A thriving community awaits you…
• Data can make a difference. • Search allows everyone to interact with data.
• This is a Big Deal.
67
What’s Next?
• Download Hadoop! • Already done that? Contribute…
• CDH available at www.cloudera.com • Cloudera provides pre-‐loaded VMs
• http://tiny.cloudera.com/quickstartvm
• Clone our repos! • https://github.com/cloudera
68
69
Preferably related to the talk…
Questions?
70
Thank You! Alex Moundalexis [email protected] @technmsg We’re hiring, kids! Well, not kids.