Discover HDP 2.1: Apache Solr for Hadoop Search

Upload: hortonworks

Post on 27-Aug-2014

Category: Software


DESCRIPTION

Apache Solr is the open source platform for searching data stored in Hadoop. Solr powers search on many of the world's largest Internet sites, enabling powerful full-text search and near real-time indexing. Whether users search for tabular, text, geo-location or sensor data in Hadoop, they find it quickly with Apache Solr. Hortonworks Data Platform 2.1 includes Apache Solr. In this deck from their 30-minute webinar, Rohit Bakhshi, Hortonworks product manager, and Paul Codding, Hortonworks solution engineer, describe how Solr works within HDP's YARN-based architecture.

TRANSCRIPT

Page 1: Discover HDP 2.1: Apache Solr for Hadoop Search

Page 1 © Hortonworks Inc. 2014

Discover HDP 2.1: Apache Solr for Hadoop Search

Hortonworks. We do Hadoop.

Page 2: Discover HDP 2.1: Apache Solr for Hadoop Search

Speakers

Justin Sears

Hortonworks Product Marketing Manager

Rohit Bakhshi

Hortonworks Senior Product Manager & PM for Apache Hadoop & Apache Solr in Hortonworks Data Platform

Paul Codding

Hortonworks Solution Engineer, focused on customer success with Apache Storm & Apache Solr

Page 3: Discover HDP 2.1: Apache Solr for Hadoop Search

Agenda

•  Overview of Apache Solr and Hadoop Search

•  Hadoop Search Demo

•  Q & A

Page 4: Discover HDP 2.1: Apache Solr for Hadoop Search

A Modern Data Architecture

•  Applications: Business Analytics, Custom Applications, Packaged Applications

•  Dev & Data Tools: Build & Test

•  Operations Tools: Provision, Manage & Monitor

•  Data System

– Repositories: RDBMS, EDW, MPP

– Enterprise Hadoop: Governance & Integration, Security, Operations, Data Access, Data Management

•  Sources

– OLTP, ERP, CRM Systems

– Documents, Emails

– Web Logs, Click Streams

– Social Networks

– Machine Generated

– Sensor Data

– Geolocation Data

Page 5: Discover HDP 2.1: Apache Solr for Hadoop Search

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

•  Governance & Integration

– Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

•  Data Access

– Batch: Map Reduce

– Script: Pig

– SQL: Hive/Tez, HCatalog

– NoSQL: HBase, Accumulo

– Stream: Storm

– Search: Solr

– Others: In-Memory Analytics, ISV engines

•  Data Management

– YARN: Data Operating System

– HDFS (Hadoop Distributed File System), nodes 1 to N

•  Security

– Authentication, Authorization, Accounting, Data Protection

– Storage: HDFS  Resources: YARN  Access: Hive, …  Pipeline: Falcon  Cluster: Knox

•  Operations

– Provision, Manage & Monitor: Ambari, Zookeeper

– Scheduling: Oozie

Page 6: Discover HDP 2.1: Apache Solr for Hadoop Search

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

•  Governance & Integration

– Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

•  Data Access

– Batch: Map Reduce

– Script: Pig

– SQL: Hive/Tez, HCatalog

– NoSQL: HBase, Accumulo

– Stream: Storm

– Search: Solr

– Others: In-Memory Analytics, ISV engines

•  Data Management

– YARN: Data Operating System

– HDFS (Hadoop Distributed File System), nodes 1 to N

•  Security

– Authentication, Authorization, Accounting, Data Protection

– Storage: HDFS  Resources: YARN  Access: Hive, …  Pipeline: Falcon  Cluster: Knox

•  Operations

– Provision, Manage & Monitor: Ambari, Zookeeper

– Scheduling: Oozie

Page 7: Discover HDP 2.1: Apache Solr for Hadoop Search

Agenda

•  Overview

•  Features

•  Q & A

Page 8: Discover HDP 2.1: Apache Solr for Hadoop Search

Search: Overview

Expanded Data Access Interfaces to Hadoop

•  BATCH: MapReduce

•  INTERACTIVE: Tez

•  STREAMING: Storm

•  ONLINE: HBase, Accumulo

•  SEARCH: Solr

YARN: Cluster Resource Management

HDFS: Redundant, Reliable Storage

Page 9: Discover HDP 2.1: Apache Solr for Hadoop Search

Search: Overview

Apache Solr: Open source enterprise search for Hadoop and HDP

•  Open architecture: In the community, for the community

•  Simple, powerful UI for advanced search applications

•  High performance indexing & sub-second search times over billions of documents

•  Deep Integration Roadmap with HDP

LucidWorks: Hortonworks partner for search

•  Enterprise support provided in partnership with LucidWorks

•  9 committers total (7 PMC) for Apache Solr

Page 10: Discover HDP 2.1: Apache Solr for Hadoop Search

Agenda

•  Overview

•  Features

•  Q & A

Page 11: Discover HDP 2.1: Apache Solr for Hadoop Search

Open Source Components for HDP Search

Comprehensive enterprise search using open source technologies:

•  High-Performance Indexing

•  Powerful, Accurate & Efficient Search Algorithms

•  Ranked & Field searching

•  Flexible faceting, highlighting, joins & result grouping

•  Pluggable ranking models

+

•  Advanced Full-Text Search Capabilities

•  Optimized for High Volume Web Traffic

•  Standards Based Open Interfaces - XML, JSON and HTTP

•  Comprehensive HTML Administration Interfaces

•  Server statistics exposed over JMX for monitoring

•  Linearly scalable
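The HTTP and JSON interfaces listed above, together with faceting and highlighting, can be illustrated with a small sketch that only builds a query URL and sends nothing over the network. The parameter names (`q`, `wt`, `rows`, `facet`, `facet.field`, `hl`, `hl.fl`) are standard Solr request parameters. The host, the collection name `documents`, and the field names `text` and `category` are assumptions made for illustration and do not come from the slides.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query,
                     facet_fields=(), highlight_fields=(), rows=10):
    """Build a Solr /select URL that asks for JSON results,
    optional facet counts, and optional highlighted snippets."""
    params = [("q", query), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting: one facet.field parameter per field to count values for.
        params.append(("facet", "true"))
        params.extend(("facet.field", field) for field in facet_fields)
    if highlight_fields:
        # Highlighting: hl.fl takes a comma-separated list of fields.
        params.append(("hl", "true"))
        params.append(("hl.fl", ",".join(highlight_fields)))
    return f"{base_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Hypothetical host, collection, and field names.
url = build_select_url(
    "http://localhost:8983/solr",
    "documents",
    "text:hadoop",
    facet_fields=["category"],
    highlight_fields=["text"],
)
print(url)
```

Fetching the resulting URL with any HTTP client against a running Solr instance would return a JSON response; the exact response contents depend on the schema and data, so none are shown here.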

Page 12: Discover HDP 2.1: Apache Solr for Hadoop Search

Scalable Indexing of Data in HDFS

•  Ingest: MapReduce job

– CSV

– Microsoft Office files

– Grok (log data)

– Zip

– Solr XML

– Seq files

– WARC

•  Processing: Apache Pig

– Write your own Pig scripts to index content

– Pig for preprocessing and joining

– Output the resulting datasets to Solr

[Diagram: HDFS (Raw Documents) → MapReduce or Pig Job → Solr (Lucene Indexes)]
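The CSV ingest path above can be sketched at the level of a single record transformation. This is not the MapReduce job the slide refers to; it is a minimal single-process illustration of turning CSV rows into the list-of-documents JSON shape that Solr's update handler accepts. The column names and the choice of `id` as the unique key are assumptions for illustration.

```python
import csv
import io
import json


def csv_to_solr_docs(csv_text, id_field="id"):
    """Convert CSV text into a list of Solr-style documents (dicts).

    Rows without a value in id_field are skipped, and empty cells
    are dropped so that no empty fields are sent to the index.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row in reader:
        if not row.get(id_field):
            continue  # a document needs a unique key to be indexed
        docs.append({key: value for key, value in row.items() if key and value})
    return docs


# Hypothetical sample data; real input would be read from files in HDFS.
sample = (
    "id,title,category\n"
    "1,Intro to Solr,search\n"
    "2,YARN basics,\n"
    ",missing id,misc\n"
)
docs = csv_to_solr_docs(sample)
payload = json.dumps(docs)  # body for an HTTP POST to a collection's /update handler
print(payload)
```

In a distributed job, each mapper would apply the same per-row conversion to its own input split, which is what lets the ingest step scale with the size of the data.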

Page 13: Discover HDP 2.1: Apache Solr for Hadoop Search

Search: Reference Architecture

[Diagram: MapReduce Indexing Job running over HDFS (Hadoop Distributed File System)]

Page 14: Discover HDP 2.1: Apache Solr for Hadoop Search

HDP Search Demo

Paul Codding

Page 15: Discover HDP 2.1: Apache Solr for Hadoop Search

Agenda

•  Overview

•  Features

•  Q & A

Page 16: Discover HDP 2.1: Apache Solr for Hadoop Search

Learn More About Hadoop Search

Hortonworks.com/hadoop/solr/

Register for the remaining 2 Discover HDP 2.1 Webinars

Hortonworks.com/webinars

Next Webinar: Apache Storm for Stream Data Processing in Hadoop

Thursday, June 19, 10am Pacific

Page 17: Discover HDP 2.1: Apache Solr for Hadoop Search

Thank you!