The First Class Integration of Solr with Hadoop


Upload: lucenerevolution

Post on 27-Jan-2015


Category:

Technology



DESCRIPTION

Presented by Mark Miller, Software Developer, Cloudera. Apache Lucene/Solr committer Mark Miller talks about how Solr has been integrated into the Hadoop ecosystem to provide full-text search at "Big Data" scale. This talk gives an overview of how Cloudera has tackled integrating Solr into the Hadoop ecosystem and highlights some of the design decisions and future plans. Learn how Solr is getting 'cozy' with Hadoop, which contributions are going to which project, and how you can take advantage of these integrations to use Solr efficiently at "Big Data" scale. Learn how you can run Solr directly on HDFS, build indexes with MapReduce, load Solr via Flume in 'Near Realtime', and much more.

TRANSCRIPT

Page 1: The First Class Integration of Solr with Hadoop
Page 2: The First Class Integration of Solr with Hadoop

THE FIRST CLASS INTEGRATION OF SOLR WITH HADOOP

Mark Miller (Cloudera)

Page 3: The First Class Integration of Solr with Hadoop

WHO AM I?

• Cloudera employee, Lucene/Solr committer, Lucene PMC member, Apache member
• First job out of college was in the newspaper archiving business.
• First full-time employee at LucidWorks - a startup around Lucene/Solr.
• Spent a couple of years as “Core” engineering manager, reporting to the VP of engineering.

Page 4: The First Class Integration of Solr with Hadoop

• Very fast and feature-rich ‘core’ search engine library.
• Compact and powerful, Lucene is an extremely popular full-text search library.
• Provides low-level APIs for analyzing, indexing, and searching text, along with a myriad of related features.
• Just the core - either you write the ‘glue’ or use a higher-level search engine built with Lucene.

Page 5: The First Class Integration of Solr with Hadoop

• Solr (pronounced "solar") is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. - Wikipedia

Page 6: The First Class Integration of Solr with Hadoop

SEARCH ON HADOOP HISTORY

• Katta
• Blur
• SolBase
• HBASE-3529
• SOLR-1301
• SOLR-1045
• Ad-Hoc

Page 7: The First Class Integration of Solr with Hadoop

...

Page 8: The First Class Integration of Solr with Hadoop

THE PLAN: STRENGTHEN THE FAMILY BONDS

• No need to build something radically new - we have the pieces we need.
• Focus on integration points.
• Create high quality, first class integrations and contribute the work to the projects involved.
• Focus on integration and quality first - then performance and scale.

Page 9: The First Class Integration of Solr with Hadoop

SOLRCLOUD

Page 10: The First Class Integration of Solr with Hadoop

SOLR INTEGRATION

• Read and write directly to HDFS
• First class custom Directory support in Solr
• Support Solr replication on HDFS
• Other improvements around usability and configuration

Page 11: The First Class Integration of Solr with Hadoop

READ AND WRITE DIRECTLY TO HDFS

• Lucene did not historically support append-only filesystems.
• “Flexible Indexing” brought support for append-only filesystems.
• Lucene has supported append-only filesystems by default since 4.2.

Page 12: The First Class Integration of Solr with Hadoop

LUCENE DIRECTORY ABSTRACTION

• It’s how Lucene interacts with index files.
• Solr uses the Lucene library and offers DirectoryFactory.

class Directory {
    listAll();
    createOutput(file, context);
    openInput(file, context);
    deleteFile(file);
    makeLock(file);
    clearLock(file);
    …
}
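To make the abstraction above concrete, here is a minimal Python sketch (Lucene itself is Java). It mirrors four of the listed methods with an in-memory backend; the class names `Directory`, `InMemoryDirectory` and `_Output`, and the snake_case method names, are illustrative assumptions and not Lucene's actual API. The point is only that index code talks to a small file-like interface, so the storage behind it (local disk, memory, HDFS) can be swapped.

```python
from abc import ABC, abstractmethod
import io


class Directory(ABC):
    """Toy analogue of the Directory idea: a flat namespace of index files."""

    @abstractmethod
    def list_all(self): ...

    @abstractmethod
    def create_output(self, name): ...

    @abstractmethod
    def open_input(self, name): ...

    @abstractmethod
    def delete_file(self, name): ...


class _Output(io.BytesIO):
    """Write buffer that publishes its bytes to the directory when closed."""

    def __init__(self, files, name):
        super().__init__()
        self._files = files
        self._name = name

    def close(self):
        if not self.closed:
            self._files[self._name] = self.getvalue()
        super().close()


class InMemoryDirectory(Directory):
    """Hypothetical backend that keeps every file as a bytes object in a dict."""

    def __init__(self):
        self._files = {}

    def list_all(self):
        return sorted(self._files)

    def create_output(self, name):
        # Files are written once and never modified in place.
        if name in self._files:
            raise FileExistsError(name)
        return _Output(self._files, name)

    def open_input(self, name):
        # BytesIO supports seek(), standing in for random-access reads.
        return io.BytesIO(self._files[name])

    def delete_file(self, name):
        del self._files[name]
```

A caller would write with `with directory.create_output("segments_1") as out: out.write(data)` and read back with `directory.open_input("segments_1").read()`.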

Page 13: The First Class Integration of Solr with Hadoop

PUTTING THE INDEX IN HDFS

• Solr relies on the filesystem cache to operate at full speed.
• HDFS is not known for its random access speed.
• Apache Blur has already solved this with an HdfsDirectory that works on top of a BlockDirectory.
• The “block cache” caches the hot blocks of the index off heap (direct byte array) and takes the place of the filesystem cache.
• We contributed back optional ‘write’ caching.
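The block cache idea can be sketched in a few lines of Python. This is a simplified illustration, not the Blur or Solr implementation: it keeps blocks as ordinary Python objects rather than off heap, uses plain LRU eviction, and the names `BlockCache` and `read_block` are assumptions made for the example.

```python
from collections import OrderedDict


class BlockCache:
    """LRU cache of fixed-size blocks in front of a slow random-access store."""

    def __init__(self, read_block, block_size=8192, max_blocks=1024):
        # read_block is a callable (file_name, block_index) -> bytes
        self._read_block = read_block
        self.block_size = block_size
        self._max_blocks = max_blocks
        self._blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _get_block(self, name, index):
        key = (name, index)
        if key in self._blocks:
            self._blocks.move_to_end(key)      # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        self.misses += 1
        data = self._read_block(name, index)   # slow path: go to the store
        self._blocks[key] = data
        if len(self._blocks) > self._max_blocks:
            self._blocks.popitem(last=False)   # evict least recently used
        return data

    def read(self, name, offset, length):
        """Read `length` bytes at `offset`, assembling them from cached blocks."""
        out = bytearray()
        end = offset + length
        while offset < end:
            index, start = divmod(offset, self.block_size)
            block = self._get_block(name, index)
            chunk = block[start:start + (end - offset)]
            if not chunk:
                break                          # read ran past the end of the file
            out += chunk
            offset += len(chunk)
        return bytes(out)
```

Repeated reads of the same hot region are then served from memory, which is the role the local filesystem cache would normally play.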

Page 14: The First Class Integration of Solr with Hadoop

PUTTING THE TRANSACTIONLOG IN HDFS

• HdfsUpdateLog added - extends UpdateLog
• Triggered by setting the UpdateLog dataDir to something that starts with hdfs:/ - no additional configuration necessary.
• Same extensive testing as used on UpdateLog

Page 15: The First Class Integration of Solr with Hadoop

RUNNING SOLR ON HDFS

• Set DirectoryFactory to HdfsDirectoryFactory and set the dataDir to a location in HDFS.
• Set LockType to ‘hdfs’
• Use an UpdateLog dataDir location that begins with ‘hdfs:/’
• Or:

java -Dsolr.directoryFactory=HdfsDirectoryFactory \
     -Dsolr.lockType=solr.HdfsLockFactory \
     -Dsolr.updatelog=hdfs://host:port/path \
     -jar start.jar

Page 16: The First Class Integration of Solr with Hadoop

SOLR REPLICATION ON HDFS

• While Solr has exposed a pluggable DirectoryFactory for a long time now, it was really quite limited.
• Most glaring, only a local filesystem based Directory would work with replication.
• There were also other, more minor areas that relied on a local filesystem Directory implementation.

Page 17: The First Class Integration of Solr with Hadoop

FUTURE SOLR REPLICATION ON HDFS

• Take advantage of “distributed filesystem” and allow for something similar to HBase regions.
• If a node goes down, the data is still available in HDFS - allow for that index to be automatically served by a node that is still up if it has the capacity.

[Diagram: Solr Node, Solr Node, Solr Node; HDFS]

Page 18: The First Class Integration of Solr with Hadoop

• Leader reads and writes index files to HDFS
• Replicas only read from HDFS, write to /dev/null

[Diagram: Leader, Replica, Replica; HDFS]

Page 19: The First Class Integration of Solr with Hadoop

MAP REDUCE INDEX BUILDING

• Scalable index creation via MapReduce
• Many initial ‘homegrown’ implementations sent documents from the reducers to SolrCloud over HTTP.
• To really scale, you want the reducers to create the indexes in HDFS and then load them up with Solr.
• The ideal implementation will allow using as many reducers as are available in your Hadoop cluster, and then merge the indexes down to the correct number of ‘shards’.

Page 20: The First Class Integration of Solr with Hadoop

MR INDEX BUILDING

[Diagram: three mappers (“Mapper: Parse input”); two “Index” steps; “Arbitrary reducing steps of indexing and merging”; two “End-Reducer” steps]
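The "many reducers, then merge down to the shard count" idea can be sketched as a small single-process Python simulation. It is not the actual MapReduce job: the "index" is a toy inverted index (term to document ids), `zlib.crc32` is a stand-in hash rather than Solr's own routing hash, and every function name below is hypothetical. What it shows is that if each reducer only ever receives documents of one shard, its partial index can later be merged into that shard without re-routing any document.

```python
from collections import defaultdict
import zlib


def shard_of(doc_id, num_shards):
    """Shard a document belongs to (stand-in hash, not Solr's router)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def partition(doc_id, num_shards, fanout):
    """Pick a reducer so that every reducer holds documents of exactly one shard."""
    h = zlib.crc32(doc_id.encode("utf-8"))
    shard = h % num_shards
    sub = (h // num_shards) % fanout   # spread one shard over `fanout` reducers
    return shard * fanout + sub


def build_index(docs):
    """Reducer step: build a tiny inverted index (term -> set of doc ids)."""
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge(indexes):
    """Merge step: union the postings of several partial indexes."""
    merged = defaultdict(set)
    for index in indexes:
        for term, ids in index.items():
            merged[term] |= ids
    return merged


def build_shards(docs, num_shards, fanout):
    """Use num_shards * fanout reducers, then merge down to num_shards indexes."""
    buckets = defaultdict(list)
    for doc_id, text in docs:          # "map" phase: route parsed documents
        buckets[partition(doc_id, num_shards, fanout)].append((doc_id, text))
    partial = {r: build_index(b) for r, b in buckets.items()}
    return [
        merge(ix for r, ix in partial.items() if r // fanout == s)
        for s in range(num_shards)
    ]
```

With `build_shards(docs, num_shards=2, fanout=4)`, up to eight reducers work in parallel and the result is still exactly two shard indexes.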

Page 21: The First Class Integration of Solr with Hadoop

SOLRCLOUD AWARE

• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• Which URLs to GoLive to.
• The schema to use when building indexes.
• Match hash -> shard assignments of a Solr cluster.
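Matching hash -> shard assignments amounts to mapping each document's hash into the range owned by a shard. The Python sketch below assumes a 32-bit hash space split into contiguous, near-equal ranges and uses `zlib.crc32` as a stand-in hash; Solr's own router and hash function are not reproduced here, and in a real job the ranges would come from the cluster state in ZooKeeper rather than from `split_hash_space`. Both function names are hypothetical.

```python
import bisect
import zlib


def split_hash_space(num_shards, bits=32):
    """Split an unsigned hash space into contiguous, near-equal (lo, hi) ranges."""
    size = 1 << bits
    bounds = [size * i // num_shards for i in range(num_shards + 1)]
    return [(bounds[i], bounds[i + 1] - 1) for i in range(num_shards)]


def shard_for(doc_id, ranges):
    """Index of the range that contains the document's hash."""
    h = zlib.crc32(doc_id.encode("utf-8"))   # stand-in hash, 0 <= h < 2**32
    starts = [lo for lo, _ in ranges]
    return bisect.bisect_right(starts, h) - 1
```

An offline index builder that uses the same ranges and the same hash as the live cluster produces shard indexes that line up with where the cluster would have routed each document.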

Page 22: The First Class Integration of Solr with Hadoop

GOLIVE

• After building your indexes with MapReduce, how do you deploy them to your Solr cluster?
• We want it to be easy - so we built the GoLive option.
• GoLive allows you to easily merge the indexes you have created atomically into a live running Solr cluster.
• Paired with the ZooKeeper Aware ability, this allows you to simply point your MapReduce job to your Solr cluster and it will automatically discover how many shards to build and what locations to deliver the final indexes to in HDFS.

Page 23: The First Class Integration of Solr with Hadoop

FLUME SOLR SYNC

• Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.

Page 24: The First Class Integration of Solr with Hadoop

FLUME SOLR SYNC

[Diagram: “Logs” and “Other” sources; two Flume agents; HDFS; Solr]

Page 25: The First Class Integration of Solr with Hadoop

SOLRCLOUD AWARE

• Can ‘inspect’ ZooKeeper to learn about the Solr cluster.
• Which URLs to send data to.
• The schema for the collection being indexed to.

Page 26: The First Class Integration of Solr with Hadoop

HBASE INTEGRATION

• Collaboration between NGData & Cloudera
• NGData are the creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as an HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates the Morphlines library for ETL of rows
• AL2 licensed on GitHub: https://github.com/ngdata

Page 27: The First Class Integration of Solr with Hadoop

[Diagram: HDFS; HBase; “interactive load”; Indexer(s); “Triggers on updates”; five Solr servers]

Page 28: The First Class Integration of Solr with Hadoop

MORPHLINES

• A morphline is a configuration file that allows you to define ETL transformation pipelines.
• Extract content from input files, transform content, load content (e.g., to Solr)
• Uses Tika to extract content from a large variety of input documents
• Part of the CDK (Cloudera Development Kit)

Page 29: The First Class Integration of Solr with Hadoop

• Open source framework for simple ETL
• Ships as part of the Cloudera Development Kit (CDK)
• It’s a Java library
• AL2 licensed on GitHub: https://github.com/cloudera/cdk
• Similar to Unix pipelines
• Configuration over coding
• Supports common Hadoop formats
    • Avro
    • Sequence file
    • Text
    • Etc…

[Diagram: syslog -> Flume Agent -> Solr Sink (Command: readLine -> Command: grok -> Command: loadSolr) -> Solr]
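The "similar to Unix pipelines" idea, a chain of small commands that each take records and pass them on, can be sketched with Python generators. This is only an analogy: Morphlines is a Java library driven by a configuration file, and the function names `read_line`, `extract`, `load` and `run_pipeline`, as well as the list used as a sink, are hypothetical stand-ins for commands such as readLine, grok and loadSolr.

```python
import re


def read_line(lines):
    """Turn each input line into a record with a single 'message' field."""
    for line in lines:
        yield {"message": line.rstrip("\n")}


def extract(records, field, pattern):
    """Add the named groups of a regex match as fields; drop non-matching records."""
    regex = re.compile(pattern)
    for record in records:
        m = regex.search(record[field])
        if m:
            yield {**record, **m.groupdict()}


def load(records, sink):
    """Final command: hand each record to a sink (a list here, Solr in a morphline)."""
    count = 0
    for record in records:
        sink.append(record)
        count += 1
    return count


def run_pipeline(lines, sink):
    """Chain the commands; each stage consumes the output of the previous one."""
    records = read_line(lines)
    records = extract(records, "message", r"(?P<program>\w+)\[(?P<pid>\d+)\]")
    return load(records, sink)
```

Calling `run_pipeline(open("messages.log"), sink)` (the file name is made up) would stream lines through all three stages without holding the whole file in memory.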

Page 30: The First Class Integration of Solr with Hadoop

• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line records, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using Apache Tika

Page 31: The First Class Integration of Solr with Hadoop

• Scripting support for dynamic Java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested container file formats
• Etc…

Page 32: The First Class Integration of Solr with Hadoop

MORPHLINES EXAMPLE CONFIG

morphlines : [
  {
    id : morphline1
    importCommands : ["com.cloudera.**", "org.apache.solr.**"]
    commands : [
      { readLine {} }
      {
        grok {
          dictionaryFiles : [/tmp/grok-dictionaries]
          expressions : {
            message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
          }
        }
      }
      { loadSolr {} }
    ]
  }
]

Example Input
<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22

Output Record
syslog_pri:164
syslog_timestamp:Feb  4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22
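To see what the grok expression does to the example input, here is a rough Python equivalent. The sub-patterns are simplified approximations of the grok dictionary entries (POSINT, SYSLOGTIMESTAMP, SYSLOGHOST, DATA, GREEDYDATA), written by hand for this sketch rather than copied from the real dictionaries, and `parse_syslog` is a hypothetical helper name.

```python
import re

# Hand-written approximations of the grok patterns used in the config above.
SYSLOG = re.compile(
    r"<(?P<syslog_pri>[1-9][0-9]*)>"                                   # POSINT
    r"(?P<syslog_timestamp>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) "  # SYSLOGTIMESTAMP
    r"(?P<syslog_hostname>[\w.:-]+) "                                  # SYSLOGHOST
    r"(?P<syslog_program>.*?)"                                         # DATA (lazy)
    r"(?:\[(?P<syslog_pid>[1-9][0-9]*)\])?: "                          # optional [pid]
    r"(?P<syslog_message>.*)"                                          # GREEDYDATA
)


def parse_syslog(line):
    """Return the extracted fields as a dict, or None if the line does not match."""
    m = SYSLOG.match(line)
    return m.groupdict() if m else None


record = parse_syslog(
    "<164>Feb  4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22"
)
for name, value in record.items():
    print(f"{name}:{value}")
```

For lines without a bracketed pid, the optional group is skipped and `syslog_pid` comes back as `None`.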

Page 33: The First Class Integration of Solr with Hadoop

HUE INTEGRATION

• Hue
• Simple UI
• Navigated, faceted drill down
• Customizable display
• Full text search, standard Solr API and query language

Page 34: The First Class Integration of Solr with Hadoop

CLOUDERA SEARCH

• https://ccp.cloudera.com/display/SUPPORT/Downloads
• Or Google “cloudera search download”

Page 35: The First Class Integration of Solr with Hadoop

Mark Miller, Cloudera @heismark