Solr + Hadoop: Interactive Search for Hadoop

Post on 27-Aug-2014

TRANSCRIPT

  • Solr + Hadoop: Interactive Search for Hadoop. Gregory Chanan (gchanan AT cloudera.com), OC Big Data Meetup, 07/16/14
  • Agenda: Big Data and Search (setting the stage); Cloudera Search Architecture; Component Deep Dive; Security; Conclusion
  • Why Search? Hadoop for everyone. Typical case: ingest data to a storage engine (HDFS, HBase, etc.), then process the data (MapReduce, Hive, Impala). Experts know MapReduce. Savvy people know SQL. Everyone knows Search!
  • Why Search? An integrated part of the Hadoop system: one pool of data; one security framework; one set of system resources; one management interface
  • Benefits of Search: improved Big Data ROI; an interactive experience without technical knowledge; faster time to insight; exploratory analysis, especially of unstructured data; broad range of indexing options to accommodate needs; cost efficiency; single scalable platform with no incremental investment; no need for separate systems or storage
  • What is Cloudera Search? Full-text, interactive search with faceted navigation. Apache Solr integrated with CDH: established, mature search with a vibrant community; in production environments for years. Open source: 100% Apache, 100% Solr; standard Solr APIs. Batch, near real-time, and on-demand indexing. Available for CDH4 and CDH5
  • Apache Hadoop. Apache HDFS: distributed file system; high reliability; high throughput. Apache MapReduce: parallel, distributed programming model; allows processing of large datasets; fault tolerant
  • Apache Lucene: full-text search library (indexing, querying); traditional inverted index; batch and incremental indexing. We are using version 4.4 in the current release
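As a rough illustration of the inverted index idea named on this slide (not Lucene's actual data structures or API), the Python sketch below maps each term to the set of document ids that contain it, accepts documents one at a time, and answers multi-term queries by intersecting posting sets. The class and method names are hypothetical.

```python
from collections import defaultdict


class TinyInvertedIndex:
    """Toy inverted index: term -> set of document ids (illustrative only)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        # Incremental indexing: each call adds one document's terms.
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, query):
        # AND semantics: a document must contain every query term.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyInvertedIndex()
index.add(1, "Hadoop stores data in HDFS")
index.add(2, "Solr searches data stored in HDFS")
index.add(3, "Lucene is a search library")
print(index.search("data hdfs"))
```

A real search library adds analysis (tokenizing, stemming), relevance scoring, and on-disk segment files on top of this basic lookup structure.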
  • Apache Solr: search service built using Lucene; ships with Lucene (same TLP at Apache). Provides XML/HTTP/JSON/Python/Ruby APIs for indexing, querying, and administration. Also a rich web admin GUI via HTTP
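To make the HTTP API concrete, here is a minimal sketch that builds a query URL for Solr's standard `/select` handler with JSON output and one facet field. The host, the collection name `logs`, and the field names are assumptions for illustration; the sketch only constructs the URL and does not contact a server.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_select_url(base_url, collection, query, facet_field=None, rows=10):
    """Build a URL for Solr's /select search handler."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_field:
        # Faceting returns value counts for the given field alongside the hits.
        params += [("facet", "true"), ("facet.field", facet_field)]
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_select_url(
    "http://localhost:8983", "logs", "syslog_program:sshd",
    facet_field="syslog_hostname",
)
print(url)
print(parse_qs(urlsplit(url).query)["q"])
```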
  • Apache SolrCloud: provides distributed search capability; part of Solr (not a separate library/codebase). Shards provide scalability: partition the index for size, replicate for query performance. Uses ZooKeeper for coordination: no split-brain issues; simplifies operations
  • SolrCloud Architecture: updates are automatically sent to the correct shard. Replicas handle queries and forward updates to the leader. The leader indexes the document for the shard and forwards the index notation to itself and any replicas.
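The update path described on this slide can be sketched as follows: hash the document id to find the owning shard, hand the update to that shard's leader, and have the leader apply it to every replica. This is a toy model under stated assumptions, not SolrCloud code: CRC32 modulo the shard count is a stand-in (SolrCloud's own router assigns hash ranges to shards), and the `Shard` and `Cluster` classes are hypothetical.

```python
import zlib


class Shard:
    def __init__(self, name, replica_count):
        self.name = name
        # One dict per core; index 0 plays the leader role in this sketch.
        self.cores = [dict() for _ in range(replica_count)]

    def leader_index(self, doc_id, doc):
        # The leader indexes first, then forwards the update to the other replicas.
        for core in self.cores:
            core[doc_id] = doc


class Cluster:
    def __init__(self, shard_count, replica_count):
        self.shards = [Shard(f"shard{i + 1}", replica_count) for i in range(shard_count)]

    def shard_for(self, doc_id):
        # Stand-in hash: CRC32 of the id modulo the shard count.
        return self.shards[zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)]

    def update(self, doc_id, doc):
        # Whichever node receives the update, it ends up at the owning shard's leader.
        shard = self.shard_for(doc_id)
        shard.leader_index(doc_id, doc)
        return shard.name


cluster = Cluster(shard_count=2, replica_count=2)
owner = cluster.update("doc-42", {"title": "hello"})
print(owner)
```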
  • SolrCloud Architecture Visual representation via admin UI
  • Distributed Search on Hadoop [Diagram labels: Flume, Hue UI, Custom UI, Custom App; Solr servers forming SolrCloud (query, index); Hadoop cluster: MR, HDFS, HBase, ZK]
  • Agenda: Big Data and Search (setting the stage); Cloudera Search Architecture; Component Deep Dive (Indexing, ETL with morphlines, Querying); Security; Conclusion
  • Indexing. Near Real Time (NRT): Flume, HBase Indexer. Batch: MapReduceIndexerTool, HBaseBatchIndexer
  • Near Real Time Indexing with Flume. Solr and Flume: data ingest at scale; flexible extraction and mapping; indexing at data ingest. [Diagram labels: Log File, Other Log File, Flume Agent, Indexer, HDFS]
  • Apache Flume - MorphlineSolrSink. A Flume Source receives/gathers events. A Flume Channel carries the event: MemoryChannel or reliable FileChannel. A Flume Sink sends the events on to the next location. Flume MorphlineSolrSink: integrates the Cloudera Morphlines library (ETL, more on that in a bit); does batching; results are sent to Solr for indexing
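The source, channel, and sink roles can be pictured with a small model. The sketch below is not Flume's API: the class names are hypothetical, the channel is a plain in-memory queue, and a Python list stands in for Solr. It only shows how a sink might drain a channel in batches before handing events to an indexer.

```python
from collections import deque


class MemoryChannel:
    """Buffers events between a source and a sink (in memory, so not durable)."""

    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take_batch(self, max_size):
        batch = []
        while self.events and len(batch) < max_size:
            batch.append(self.events.popleft())
        return batch


class BatchingSink:
    """Drains the channel in batches and hands each batch to an indexing callback."""

    def __init__(self, channel, index_batch, batch_size):
        self.channel = channel
        self.index_batch = index_batch
        self.batch_size = batch_size

    def process(self):
        batch = self.channel.take_batch(self.batch_size)
        if batch:
            self.index_batch(batch)
        return len(batch)


indexed = []  # stands in for Solr
channel = MemoryChannel()
sink = BatchingSink(channel, indexed.append, batch_size=2)

# The "source" here is just a loop that puts log lines on the channel.
for line in ["event a", "event b", "event c"]:
    channel.put(line)

# Keep processing until the channel is empty.
while sink.process():
    pass
print(indexed)
```

Batching trades a little latency for fewer round trips to the indexer, which is why a sink would group events rather than send them one by one.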
  • Near Real Time Indexing of Apache HBase [Diagram labels: HDFS, HBase, interactive load, HBase Indexer(s), Replication, Solr servers, Search; captions: planet-sized tabular data; immediate access & updates; fast & flexible information discovery; Big Data; Data Management]
  • Lily HBase Indexer. Collaboration between NGData & Cloudera; NGData are the creators of the Lily data management platform. The Lily HBase Indexer is a service that acts as an HBase replication listener. HBase replication features, such as filtering, are supported. Replication updates trigger indexing of the updated rows. Integrates the Cloudera Morphlines library for ETL of rows. AL2 licensed, on GitHub: https://github.com/ngdata
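The idea that replication updates trigger indexing of the affected rows can be sketched in a few lines. This is a conceptual model only: the event dictionaries are a hypothetical shape (the real indexer receives HBase replication edits and maps them through a morphline), and a dict keyed by row key stands in for the Solr index.

```python
def apply_replication_event(index, event):
    """Apply one replicated row change to a dict that stands in for a Solr index."""
    row_key = event["row"]
    if event["type"] == "put":
        # Merge updated cells into the existing document, or create one.
        doc = index.setdefault(row_key, {"id": row_key})
        doc.update(event["cells"])
    elif event["type"] == "delete":
        index.pop(row_key, None)
    else:
        raise ValueError(f"unknown event type: {event['type']}")


solr_index = {}
events = [
    {"type": "put", "row": "user1", "cells": {"name": "Ada"}},
    {"type": "put", "row": "user2", "cells": {"name": "Bob"}},
    {"type": "put", "row": "user1", "cells": {"city": "London"}},
    {"type": "delete", "row": "user2"},
]
for event in events:
    apply_replication_event(solr_index, event)
print(solr_index)
```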
  • Scalable Batch Indexing. Solr and MapReduce: flexible, scalable batch indexing; start serving new indices with no downtime; on-demand indexing, cost-efficient re-indexing. [Diagram labels: Files, Indexer, Index shard, Solr server, HDFS]
  • MapReduce Indexer: a MapReduce job with two parts. 1) Scan HDFS for files to be indexed: much like Unix find (see HADOOP-8989); output is an NLineInputFormat-formatted file. 2) Mapper/Reducer indexing step: the mapper extracts content via Cloudera Morphlines; the reducer indexes documents via an embedded Solr server. Originally based on SOLR-1301, with many modifications to enable linear scalability
  • MapReduce Indexer go-live. Cloudera created this to bridge the gap between NRT (low latency, expensive) and batch (high latency, cheap at scale) indexing. Results of the MR indexing operation are immediately merged into a live SolrCloud serving cluster: no downtime for users; no NRT expense; linear scale-out to the size of your MR cluster
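The two-part job can be mimicked in plain Python to show the data flow. Everything here is a stand-in: a nested dict plays the role of HDFS, the "mapper" simply wraps file content in a document, CRC32 modulo the shard count is an assumed partitioner, and a dict per shard replaces the embedded Solr server. None of these names come from the actual tool.

```python
import zlib
from collections import defaultdict


def scan_files(tree, prefix=""):
    """Step 1: walk the directory tree and list file paths, much like Unix find."""
    paths = []
    for name, node in sorted(tree.items()):
        path = f"{prefix}/{name}"
        if isinstance(node, dict):
            paths.extend(scan_files(node, path))
        else:
            paths.append(path)
    return paths


def read_file(tree, path):
    node = tree
    for part in path.strip("/").split("/"):
        node = node[part]
    return node


def run_job(tree, shard_count):
    """Step 2: map each listed file to a document, then build one index per shard."""
    by_shard = defaultdict(list)
    for path in scan_files(tree):  # one path per input line
        doc = {"id": path, "text": read_file(tree, path)}  # mapper: extract content
        shard = zlib.crc32(path.encode("utf-8")) % shard_count  # assumed partitioner
        by_shard[shard].append(doc)
    # Reducer: each shard's documents are indexed independently of the others.
    return {shard: {d["id"]: d for d in docs} for shard, docs in by_shard.items()}


hdfs = {"logs": {"a.log": "alpha", "b.log": "beta"}, "readme.txt": "gamma"}
shard_indexes = run_job(hdfs, shard_count=2)
print(scan_files(hdfs))
```

Because each reducer builds its shard without coordinating with the others, adding reducers (and shards) is what lets this style of job scale out.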
  • HBase + MapReduce: run a MapReduce job over HBase tables; same architecture as running over HDFS; similar to HBase's CopyTable; support for go-live
  • Cloudera Morphlines: open source framework for simple ETL. Simplify ETL: built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages); configuration over coding. Standardize ETL. Ships as part of Kite SDK, formerly Cloudera Developer Kit (CDK). It's a Java library. AL2 licensed, on GitHub: https://github.com/kite-sdk
  • Cloudera Morphlines Architecture: logs, tweets, social media, HTML, images, PDF, text; anything you want to index. Flume, MR Indexer, HBase Indexer, etc., or your application! Morphlines can be embedded in any application. [Diagram labels: Morphline Library, Solr servers, SolrCloud]
  • Extraction and Mapping: modeled after Unix pipelines (records instead of lines); simple and flexible data transformation; reusable across multiple index workloads; over time, extend and re-use across platform workloads. [Diagram: syslog -> Flume Agent -> Solr sink, where the Morphline Library runs Command: readLine -> Command: grok -> Command: loadSolr -> Solr; an Event becomes a Record between commands and finally a Document]
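The pipeline-of-commands model can be sketched as a list of functions that each take a record and return a transformed record, or nothing to drop it. This is a conceptual sketch in Python, not the Morphlines Java API: `read_line` and `make_loader` loosely mirror the readLine and loadSolr commands, while `split_key_value` is a hypothetical stand-in for an extraction command such as grok.

```python
def read_line(record):
    # Turn the raw event body into a record with a "message" field.
    return {"message": record["body"].rstrip("\n")}


def split_key_value(record):
    # Hypothetical stand-in for an extraction command such as grok.
    key, sep, value = record["message"].partition("=")
    if not sep:
        return None  # drop records that do not match
    return {**record, "key": key.strip(), "value": value.strip()}


def make_loader(sink):
    def load(record):
        sink.append(record)  # stands in for loading a document into Solr
        return record
    return load


def run_pipeline(commands, record):
    """Pass a record through each command in order; stop if one returns None."""
    for command in commands:
        record = command(record)
        if record is None:
            return None
    return record


loaded = []
pipeline = [read_line, split_key_value, make_loader(loaded)]
run_pipeline(pipeline, {"body": "host = web01\n"})
run_pipeline(pipeline, {"body": "no separator here\n"})
print(loaded)
```

Because each command only sees a record and returns a record, the same chain can be reused unchanged whether the events come from Flume, a batch job, or an application.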
  • Morphline Example: syslog with grok

      morphlines : [
        {
          id : morphline1
          importCommands : ["com.cloudera.**", "org.apache.solr.**"]
          commands : [
            { readLine {} }
            {
              grok {
                dictionaryFiles : [/tmp/grok-dictionaries]
                expressions : {
                  message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}"""
                }
              }
            }
            { loadSolr {} }
          ]
        }
      ]

    Example Input:

      <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22.

    Output Record:

      syslog_pri:164
      syslog_timestamp:Feb 4 10:46:14
      syslog_hostname:syslog
      syslog_program:sshd
      syslog_pid:607
      syslog_message:listening on 0.0.0.0 port 22.
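To show what the grok expression extracts, the sketch below rewrites it as an ordinary Python regular expression with named groups and applies it to the example line. Two assumptions are made: the input line carried a `<164>` priority prefix (which the `syslog_pri:164` field in the output record suggests), and the character classes are simplified approximations of the grok dictionary patterns, not their real definitions.

```python
import re

# Simplified stand-ins for the grok dictionary patterns used in the expression.
SYSLOG_LINE = re.compile(
    r"<(?P<syslog_pri>\d+)>"
    r"(?P<syslog_timestamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<syslog_hostname>\S+) "
    r"(?P<syslog_program>.*?)"
    r"(?:\[(?P<syslog_pid>\d+)\])?: "
    r"(?P<syslog_message>.*)"
)

line = "<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22."
match = SYSLOG_LINE.match(line)
record = match.groupdict() if match else None

if record:
    for field, value in record.items():
        print(f"{field}:{value}")
```

A line that does not match yields `None`, which corresponds to the record being dropped before it reaches the loading step.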
  • Current Command Library: integrate with and load into Apache Solr; flexible log file analysis; single-line records, multi-line records, CSV files; regex-based pattern matching and extraction; Integrati