New-Age Search through Apache Solr


Upload: edureka

Post on 13-Aug-2015


Category: Technology



TRANSCRIPT

Slide 1

New-Age Search through Apache Solr

View Apache Solr course details at www.edureka.co/apache-solr

For Queries: Post on Twitter @edurekaIN: #askEdureka, or post on Facebook /edurekaIN

For more details please contact us: US: 1800 275 9730 (toll free), INDIA: +91 88808 62004, Email Us: sales@edureka.co

Slide 2

How it Works

LIVE Online Class

Class Recording in LMS

24/7 Post Class Support

Module Wise Quiz

Project Work

Verifiable Certificate

Slide 3

Objectives

At the end of this module, you will be able to understand:

The need for a search engine in enterprise-grade applications

The objectives & challenges of a search engine

How indexing & searching are handled in Lucene

Solr and its architecture

Near Real Time search with Solr

Leveraging Solr capabilities with Hadoop

Solr with YARN

Job opportunities for Solr developers

Slide 4

Why Do I Need Search Engines?

Slide 5

Search Engine: Why do I need them?

1. Text-Based Search

2. Filter

3. Documents

Slide 6

Search Engine - What it should be

If you need a storage engine to search records/documents using text-based keywords, it should support the following features:

1. Should be optimized for faster text searches

2. Should have a flexible schema

3. Should support sorting of documents

4. Web Scale - Should be optimized for reads

5. Should be document oriented

Slide 7

Cleartrip Spatial Search

Slide 8

What is Lucene?

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications

Used by LinkedIn, Twitter, … and many more (see http://wiki.apache.org/lucene-java/PoweredBy)

Scalable & High-performance Indexing

Powerful, Accurate and Efficient Search Algorithms

Cross-Platform Solution

» Open Source & 100% pure Java

» Implementations in other programming languages available that are index-compatible

Doug Cutting, "Creator"

Slide 9

Indexing - How it works

Document - 1 ("D1"): I like edureka courses

Document - 2 ("D2"): Edureka teaches big data courses

Document - 3 ("D3"): Edureka helps learn new technologies easily

"edureka" = D1, D2, D3

"courses" = D1, D2

"teaches" = D2

"big" = D2

"data" = D2

"helps" = D3

"edureka"
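The term-to-document mapping on this slide is an inverted index. As a minimal sketch of the idea (plain Python, not Lucene's actual implementation), the three example documents can be indexed as follows. The `tokenize` and `build_inverted_index` helpers are hypothetical names introduced only for illustration, and the lowercasing tokenizer is an assumed stand-in for a real analyzer.

```python
import re
from collections import defaultdict

# The three example documents from the slide.
documents = {
    "D1": "I like edureka courses",
    "D2": "Edureka teaches big data courses",
    "D3": "Edureka helps learn new technologies easily",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens (toy analyzer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(documents)
print(index["edureka"])  # every document mentions "edureka"
print(index["courses"])  # only D1 and D2 mention "courses"
```

Looking up a term is then a single dictionary access rather than a scan over every document, which is why this structure suits text search.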

Slide 10

Lucene - Writing to Index

[Figure: a Document made up of Fields, together with Analyzer, IndexWriter and Directory]

Classes used when indexing documents with Lucene

Slide 11

Lucene - Searching In Index

[Figure: Expression, QueryParser, Analyzer (text fragments), Query object, IndexSearcher]

QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching
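To make the parsing step concrete, here is a toy sketch in Python of turning a textual expression into a query object and running it against an index. It is not Lucene's QueryParser: the `TermQuery` and `BooleanQuery` classes, the `analyze`, `parse` and `search` functions, and the left-to-right handling of `AND`/`OR` are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    term: str


@dataclass(frozen=True)
class BooleanQuery:
    operator: str  # "AND" or "OR"
    left: object
    right: object


def analyze(fragment):
    """Toy analyzer: normalize a text fragment to a lowercase term."""
    return fragment.strip().lower()


def parse(expression):
    """Parse 'a AND b OR c' strictly left to right into nested query objects."""
    tokens = expression.split()
    query = TermQuery(analyze(tokens[0]))
    for i in range(1, len(tokens) - 1, 2):
        operator, fragment = tokens[i].upper(), tokens[i + 1]
        if operator not in ("AND", "OR"):
            raise ValueError(f"unsupported operator: {tokens[i]}")
        query = BooleanQuery(operator, query, TermQuery(analyze(fragment)))
    return query


def search(query, index):
    """Evaluate a query object against a term -> set-of-document-ids index."""
    if isinstance(query, TermQuery):
        return set(index.get(query.term, set()))
    left, right = search(query.left, index), search(query.right, index)
    return left & right if query.operator == "AND" else left | right


# A small hand-written index, assumed for the example.
index = {
    "edureka": {"D1", "D2", "D3"},
    "courses": {"D1", "D2"},
    "helps": {"D3"},
}
query = parse("Edureka AND courses")
print(query)
print(sorted(search(query, index)))
```

The point of the split is that the searcher only ever sees query objects, so the text syntax can change without touching the search code.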

Slide 12

What is Solr?

Solr is an open source enterprise search server / web application

Solr uses the Lucene search library and extends it

Solr exposes Lucene Java APIs as RESTful services

You put documents in it (called indexing) via XML, JSON, CSV or binary over HTTP

You query it via HTTP GET and receive XML, JSON, CSV or binary results

Slide 13

Solr Key Features

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML, JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing and Adaptable with XML Configuration

Linearly scalable, with auto index replication

Extensible Plugin Architecture

Slide 14

Solr Architecture

Slide 15

Solr Search Process

[Figure: Request Handler, Query Parser, Response Writer, Index]

qt: selects a RequestHandler for a query using /select (by default, the DisMaxRequestHandler is used)

defType: selects a query parser for the query (by default, uses whatever has been configured for the RequestHandler)

qf: selects which fields to query in the index (by default, all fields are required)

wt: selects a response writer for formatting the query response

fq: filters the query by applying an additional query to the initial query's results; caches the results

rows: specifies the number of rows to be displayed at one time

start: specifies an offset (by default 0) into the query results where the returned response should begin
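As a rough illustration of how these request parameters combine, the sketch below builds a `/select` URL in Python without sending it. The base URL follows the localhost address used elsewhere in the deck. The `build_select_url` helper, the field names `title`, `body` and `category`, and the choice of the `edismax` parser are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance


def build_select_url(q, fields, filter_query, rows=10, start=0):
    """Assemble a /select URL from the parameters described on the slide."""
    params = [
        ("q", q),                 # the main query text
        ("defType", "edismax"),   # query parser (assumed choice)
        ("qf", " ".join(fields)), # fields to query
        ("fq", filter_query),     # filter applied to the main query's results
        ("wt", "json"),           # response writer
        ("rows", rows),           # page size
        ("start", start),         # offset of the first returned row
    ]
    return f"{SOLR_BASE}/select?{urlencode(params)}"


url = build_select_url(
    "apache solr", ["title", "body"], "category:technology", rows=5, start=10
)
print(url)

# Decode the query string again to check that the values survive encoding.
decoded = parse_qs(urlsplit(url).query)
print(decoded["fq"][0])
```

Paging through results is then a matter of keeping `rows` fixed and increasing `start` by `rows` for each page.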

Slide 16

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed: additions and updates to documents are seen in near real time

http://localhost:8983/solr/update?stream.body=<add><doc><field name="id">testdoc</field></doc></add>&commit=true
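The update URL on this slide carries raw XML in its query string, so in practice the body has to be percent-encoded. The sketch below shows one way to build such a URL in Python without sending it. `build_add_url` is a hypothetical helper, and depending on the Solr version and configuration, the `stream.body` parameter may need to be enabled before the server accepts it.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from xml.sax.saxutils import escape

SOLR_BASE = "http://localhost:8983/solr"  # address used on the slide


def build_add_url(doc_id, commit=True):
    """Build an update URL that adds one document with the given id."""
    # Escape XML special characters in the id before embedding it.
    body = f'<add><doc><field name="id">{escape(doc_id)}</field></doc></add>'
    params = {"stream.body": body, "commit": "true" if commit else "false"}
    # urlencode percent-encodes the XML so it is safe inside a URL.
    return f"{SOLR_BASE}/update?{urlencode(params)}"


url = build_add_url("testdoc")
print(url)

# Decoding the query string gives back the original XML body.
decoded = parse_qs(urlsplit(url).query)
print(decoded["stream.body"][0])
```

With `commit=true` the change becomes searchable as soon as the request completes, which is what makes the example a convenient NRT smoke test.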

Slide 17

Real-Time Get

The realtime get feature allows retrieval (by unique key) of the latest version of any document without the associated cost of reopening a searcher.

This is primarily useful when using Solr as a NoSQL data store and not just a search index.
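The difference between a search and a real-time get can be modelled with a small toy store, sketched below in Python. This is a conceptual model only, not Solr's implementation: the `TinyStore` class and its method names are invented for illustration, with a dictionary of pending updates standing in for an update log and a second dictionary standing in for the view held by the open searcher.

```python
class TinyStore:
    """Toy model: a committed searcher view plus a log of pending updates."""

    def __init__(self):
        self._searcher_view = {}  # what searches can see (state at last commit)
        self._update_log = {}     # updates received since the last commit

    def add(self, doc):
        """Record an update; it is not visible to search until commit()."""
        self._update_log[doc["id"]] = doc

    def commit(self):
        """Fold pending updates into a new searcher view (the costly reopen)."""
        self._searcher_view = {**self._searcher_view, **self._update_log}
        self._update_log = {}

    def search(self, field, value):
        """Search only sees documents in the committed view."""
        return [d for d in self._searcher_view.values() if d.get(field) == value]

    def realtime_get(self, doc_id):
        """Look up by unique key, checking pending updates first."""
        if doc_id in self._update_log:
            return self._update_log[doc_id]
        return self._searcher_view.get(doc_id)


store = TinyStore()
store.add({"id": "testdoc", "title": "first version"})
store.commit()
store.add({"id": "testdoc", "title": "second version"})  # not committed yet

print(store.realtime_get("testdoc")["title"])    # latest version, no reopen
print(store.search("title", "second version"))   # search cannot see it yet
```

Lookup by key can be served from the pending updates, while field searches have to wait for the next commit, which is the trade-off the slide describes.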

Slide 18

Leveraging Solr Capabilities with Hadoop

Solr provides fast, efficient, powerful full-text search and near real-time indexing, and SolrCloud adds flexible distributed search and indexing with features such as automatic failover.

Hence it is well suited as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

We can do scalable indexing using a Hadoop MapReduce or Pig job and then load the indexed data into Solr.

Solr integrates easily with all the major Hadoop distributions, such as Cloudera, Hortonworks and MapR.
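The MapReduce indexing idea can be sketched in a few lines of single-process Python. This only imitates the map, shuffle and reduce phases in memory and does not use Hadoop or Solr. The function names are hypothetical, and the input reuses the three example documents from the indexing slide.

```python
from collections import defaultdict
from itertools import chain


def map_phase(doc_id, text):
    """Mapper: emit one (term, doc_id) pair per distinct term in a document."""
    return [(term, doc_id) for term in sorted(set(text.lower().split()))]


def shuffle(pairs):
    """Group mapper output by term, as the framework would between phases."""
    grouped = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term, doc_ids):
    """Reducer: turn the grouped document ids into a sorted postings list."""
    return term, sorted(set(doc_ids))


# Stand-in for raw files stored in HDFS.
raw_files = {
    "D1": "I like edureka courses",
    "D2": "Edureka teaches big data courses",
    "D3": "Edureka helps learn new technologies easily",
}

mapped = chain.from_iterable(map_phase(d, t) for d, t in raw_files.items())
index = dict(reduce_phase(term, ids) for term, ids in shuffle(mapped).items())
print(index["courses"])
```

Because each mapper call only needs one document and each reducer call only needs one term, both phases can be spread across many machines, which is what makes the approach scale.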

Slide 19

Scalable Indexing

[Figure: input data (PDF, Word and HTML raw files); HDFS (Hadoop Distributed File System); MapReduce indexing job; indexed raw files; Lucene; three Solr instances; search web app (query/response)]

Slide 20

Solr with YARN

Slide 21

Job trends for Apache Solr

Slide 22

Disclaimer

Criteria and guidelines mentioned in this presentation may change. Please visit our website for the latest and additional information on Apache Solr.

Slide 23

Course Topics

Module 1

» Introduction to Apache Lucene

Module 2

» Exploring Lucene

Module 3

» Introduction to Apache Solr

Module 4

» Solr Indexing

Module 5

» Solr Searching

Module 6

» Solr Extended Features

Module 7

» Solr Cloud & Administration

Module 8

» Final Project

Page 2: New-Age Search through Apache Solr

Slide 2

LIVE Online Class

Class Recording in LMS

247 Post Class Support

Module Wise Quiz

Project Work

Verifiable Certificate

wwwedurekacoapache-solr

How it Works

Slide 3 wwwedurekacoapache-solr

Objectives

At the end of this module you will be able to understand

The need for search engine for enterprise grade applications

The objectives amp challenges of search engine

How is Indexing amp Searching Handled in Lucene

Solr and its Architecture

Near Real Time Search with Solr

Leveraging Solr Capabilities with Hadoop

Solr with YARN

About job opportunity for Solr Developers

Slide 4Slide 4Slide 4 wwwedurekacoapache-solr

Why Do I Need Search Engines

Slide 5Slide 5Slide 5 wwwedurekacoapache-solr

Search Engine Why do I need them

1 Text Based Search

2 Filter

3 Documents

1

2

3

Slide 6Slide 6Slide 6 wwwedurekacoapache-solr

Search Engine ndash What it should be

If you need a storage engine to search records documents using text-based keywords it should support following

features

1 Should be optimized for faster text searches

2 Should have flexible schema

3 Should support sorting of documents

4 Web Scale - Should be optimized for reads

5 Should be document oriented

Slide 7Slide 7Slide 7 wwwedurekacoapache-solr

Cleartrip Spatial Search

Slide 8Slide 8Slide 8 wwwedurekacoapache-solr

What is Lucene

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications

Used by LinkedIn Twitter hellip and many more (see httpwikiapacheorglucene-javaPoweredBy )

Scalable amp High-performance Indexing

Powerful Accurate and Efficient Search Algorithms

Cross-Platform Solution

raquo Open Source amp 100 pure Java

raquo Implementations in other programming languages available that are index-compatible

Doug Cutting ldquoCreatorrdquo

Slide 9Slide 9Slide 9 wwwedurekacoapache-solr

Indexing ndash How it works

I like edureka coursesEdureka teaches big

data coursesEdureka helps learn new

technologies easily

Document - 1 (ldquoD1rdquo) Document - 2 (ldquoD2rdquo) Document - 3 (ldquoD3rdquo)

ldquoedurekardquo = D1 D2 D3ldquocoursesrdquo = D1 D2ldquoteachesrdquo = D2ldquobigrdquo = D2ldquodatardquo = D2ldquohelpsrdquo = D3

ldquoedurekardquo

Slide 10Slide 10Slide 10 wwwedurekacoapache-solr

Lucene ndash Writing to Index

Field

Field

Field

Field

Analyzer IndexWriter Directory

Document

Classes used when indexing documents with Lucene

Slide 11Slide 11Slide 11 wwwedurekacoapache-solr

Lucene ndash Searching In Index

QueryParser

Analyzer

IndexSearcherExpressionQuery object

Text fragments

Query Parser translates a textual expression from the end into an arbitrarily complex query for searching

Slide 12Slide 12Slide 12 wwwedurekacoapache-solr

Solr is an open source enterprise search server web application

Solr Uses the Lucene Search Library and extends it

Solr exposes lucene Java APIrsquos as RESTful services

You put documents in it (called indexing) via XML JSON CSV or binary over HTTP

You query it via HTTP GET and receive XML JSON CSV or binary results

What is Solr

Slide 13Slide 13Slide 13 wwwedurekacoapache-solr

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing and Adaptable with XML Configuration

Linearly scalable auto index replication auto Extensible Plugin Architecture

Solr Key Features

Slide 14Slide 14Slide 14 wwwedurekacoapache-solr

Solr Architecture

Slide 15Slide 15Slide 15 wwwedurekacoapache-solr

Request Handler

Query ParserResponse

Writer

Index

qt selects a RequestHandler for a query usingselect(by default the DisMaxRequestHandler is used)

defType selects a query parser for the query(by default uses whatever has been configured for the RequestHandler)

qf selects which fields to queryin the index(by default all fields are required)

wt selects a response writer for formatting the query response

fq filters query by applying an additional query to the initial queryrsquos results caches the results

Rows specifies the number of rows to be displayed at one time

Start specifies an offset(by default 0) into the query results where the returned response should begin

Solr Search Process

Slide 16Slide 16Slide 16 wwwedurekacoapache-solr

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed additions and updates to documents are seen in near real time

httplocalhost8983solrupdatestreambody=ltaddgtltdocgtltfieldname=idgttestdocltfieldgtltdocgtltaddgtampcommit=true

Slide 17Slide 17Slide 17 wwwedurekacoapache-solr

Real-Time Get

The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher

This is primarily useful when using Solr as a NoSQL data store and not just a search index

Slide 18Slide 18Slide 18 wwwedurekacoapache-solr

Leveraging Solr Capabilities with Hadoop

Solr provides us fast efficient powerful full-text search and near real-time indexing and SolrCloud is flexible

distributed search and indexing and will do things like automatic fail over etc

Hence its very suitable as NoSQL replacement for traditional databases in many situations especially when the size of

the data exceeds what is reasonable with a typical RDBMS

We can do scalable indexing using Hadoop MapReduce or PIG job and then load the indexed data in Solr

In all the major Hadoop distribution like Cloudera Hortonworks MapR you can integrate Solr easily

Slide 19Slide 19Slide 19 wwwedurekacoapache-solr

PDF

Word

HTML

Raw Files

Lucene

SolR SolR SolR

Query Response

Search Web App

MapReduce Indexing Job

Raw Files Indexed

HDFS(Hadoop Distributed File System)

Scalable Indexing

Input Data

Slide 20Slide 20Slide 20 wwwedurekacoapache-solr

Solr with YARN

Slide 21Slide 21Slide 21 wwwedurekacoapache-solr

Job trends for Apache Solr

Slide 22Slide 22Slide 22 wwwedurekacoapache-solr

Disclaimer

Criteria and guidelines mentioned in this presentation may change Please visit our website for latest and additional information on Apache Solr

Slide 23Slide 23Slide 23 wwwedurekacoapache-solr

Course Topics

Module 5

raquo Solr Searching

Module 6

raquo Solr Extended Features

Module 7

raquo Solr Cloud amp Administration

Module 8

raquo Final Project

Module 1

raquo Introduction to Apache Lucene

Module 2

raquo Exploring Lucene

Module 3

raquo Introduction to Apache Solr

Module 4

raquo Solr Indexing

Page 3: New-Age Search through Apache Solr

Slide 3 wwwedurekacoapache-solr

Objectives

At the end of this module you will be able to understand

The need for search engine for enterprise grade applications

The objectives amp challenges of search engine

How is Indexing amp Searching Handled in Lucene

Solr and its Architecture

Near Real Time Search with Solr

Leveraging Solr Capabilities with Hadoop

Solr with YARN

About job opportunity for Solr Developers

Slide 4Slide 4Slide 4 wwwedurekacoapache-solr

Why Do I Need Search Engines

Slide 5Slide 5Slide 5 wwwedurekacoapache-solr

Search Engine Why do I need them

1 Text Based Search

2 Filter

3 Documents

1

2

3

Slide 6Slide 6Slide 6 wwwedurekacoapache-solr

Search Engine ndash What it should be

If you need a storage engine to search records documents using text-based keywords it should support following

features

1 Should be optimized for faster text searches

2 Should have flexible schema

3 Should support sorting of documents

4 Web Scale - Should be optimized for reads

5 Should be document oriented

Slide 7Slide 7Slide 7 wwwedurekacoapache-solr

Cleartrip Spatial Search

Slide 8Slide 8Slide 8 wwwedurekacoapache-solr

What is Lucene

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications

Used by LinkedIn Twitter hellip and many more (see httpwikiapacheorglucene-javaPoweredBy )

Scalable amp High-performance Indexing

Powerful Accurate and Efficient Search Algorithms

Cross-Platform Solution

raquo Open Source amp 100 pure Java

raquo Implementations in other programming languages available that are index-compatible

Doug Cutting ldquoCreatorrdquo

Slide 9Slide 9Slide 9 wwwedurekacoapache-solr

Indexing ndash How it works

I like edureka coursesEdureka teaches big

data coursesEdureka helps learn new

technologies easily

Document - 1 (ldquoD1rdquo) Document - 2 (ldquoD2rdquo) Document - 3 (ldquoD3rdquo)

ldquoedurekardquo = D1 D2 D3ldquocoursesrdquo = D1 D2ldquoteachesrdquo = D2ldquobigrdquo = D2ldquodatardquo = D2ldquohelpsrdquo = D3

ldquoedurekardquo

Slide 10Slide 10Slide 10 wwwedurekacoapache-solr

Lucene ndash Writing to Index

Field

Field

Field

Field

Analyzer IndexWriter Directory

Document

Classes used when indexing documents with Lucene

Slide 11Slide 11Slide 11 wwwedurekacoapache-solr

Lucene ndash Searching In Index

QueryParser

Analyzer

IndexSearcherExpressionQuery object

Text fragments

Query Parser translates a textual expression from the end into an arbitrarily complex query for searching

Slide 12Slide 12Slide 12 wwwedurekacoapache-solr

Solr is an open source enterprise search server web application

Solr Uses the Lucene Search Library and extends it

Solr exposes lucene Java APIrsquos as RESTful services

You put documents in it (called indexing) via XML JSON CSV or binary over HTTP

You query it via HTTP GET and receive XML JSON CSV or binary results

What is Solr

Slide 13Slide 13Slide 13 wwwedurekacoapache-solr

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing and Adaptable with XML Configuration

Linearly scalable auto index replication auto Extensible Plugin Architecture

Solr Key Features

Slide 14Slide 14Slide 14 wwwedurekacoapache-solr

Solr Architecture

Slide 15Slide 15Slide 15 wwwedurekacoapache-solr

Request Handler

Query ParserResponse

Writer

Index

qt selects a RequestHandler for a query usingselect(by default the DisMaxRequestHandler is used)

defType selects a query parser for the query(by default uses whatever has been configured for the RequestHandler)

qf selects which fields to queryin the index(by default all fields are required)

wt selects a response writer for formatting the query response

fq filters query by applying an additional query to the initial queryrsquos results caches the results

Rows specifies the number of rows to be displayed at one time

Start specifies an offset(by default 0) into the query results where the returned response should begin

Solr Search Process

Slide 16Slide 16Slide 16 wwwedurekacoapache-solr

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed additions and updates to documents are seen in near real time

httplocalhost8983solrupdatestreambody=ltaddgtltdocgtltfieldname=idgttestdocltfieldgtltdocgtltaddgtampcommit=true

Slide 17Slide 17Slide 17 wwwedurekacoapache-solr

Real-Time Get

The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher

This is primarily useful when using Solr as a NoSQL data store and not just a search index

Slide 18Slide 18Slide 18 wwwedurekacoapache-solr

Leveraging Solr Capabilities with Hadoop

Solr provides us fast efficient powerful full-text search and near real-time indexing and SolrCloud is flexible

distributed search and indexing and will do things like automatic fail over etc

Hence its very suitable as NoSQL replacement for traditional databases in many situations especially when the size of

the data exceeds what is reasonable with a typical RDBMS

We can do scalable indexing using Hadoop MapReduce or PIG job and then load the indexed data in Solr

In all the major Hadoop distribution like Cloudera Hortonworks MapR you can integrate Solr easily

Slide 19Slide 19Slide 19 wwwedurekacoapache-solr

PDF

Word

HTML

Raw Files

Lucene

SolR SolR SolR

Query Response

Search Web App

MapReduce Indexing Job

Raw Files Indexed

HDFS(Hadoop Distributed File System)

Scalable Indexing

Input Data

Slide 20Slide 20Slide 20 wwwedurekacoapache-solr

Solr with YARN

Slide 21Slide 21Slide 21 wwwedurekacoapache-solr

Job trends for Apache Solr

Slide 22Slide 22Slide 22 wwwedurekacoapache-solr

Disclaimer

Criteria and guidelines mentioned in this presentation may change Please visit our website for latest and additional information on Apache Solr

Slide 23Slide 23Slide 23 wwwedurekacoapache-solr

Course Topics

Module 5

raquo Solr Searching

Module 6

raquo Solr Extended Features

Module 7

raquo Solr Cloud amp Administration

Module 8

raquo Final Project

Module 1

raquo Introduction to Apache Lucene

Module 2

raquo Exploring Lucene

Module 3

raquo Introduction to Apache Solr

Module 4

raquo Solr Indexing

Page 4: New-Age Search through Apache Solr

Slide 4Slide 4Slide 4 wwwedurekacoapache-solr

Why Do I Need Search Engines

Slide 5Slide 5Slide 5 wwwedurekacoapache-solr

Search Engine Why do I need them

1 Text Based Search

2 Filter

3 Documents

1

2

3

Slide 6Slide 6Slide 6 wwwedurekacoapache-solr

Search Engine ndash What it should be

If you need a storage engine to search records documents using text-based keywords it should support following

features

1 Should be optimized for faster text searches

2 Should have flexible schema

3 Should support sorting of documents

4 Web Scale - Should be optimized for reads

5 Should be document oriented

Slide 7Slide 7Slide 7 wwwedurekacoapache-solr

Cleartrip Spatial Search

Slide 8Slide 8Slide 8 wwwedurekacoapache-solr

What is Lucene

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications

Used by LinkedIn Twitter hellip and many more (see httpwikiapacheorglucene-javaPoweredBy )

Scalable amp High-performance Indexing

Powerful Accurate and Efficient Search Algorithms

Cross-Platform Solution

raquo Open Source amp 100 pure Java

raquo Implementations in other programming languages available that are index-compatible

Doug Cutting ldquoCreatorrdquo

Slide 9Slide 9Slide 9 wwwedurekacoapache-solr

Indexing ndash How it works

I like edureka coursesEdureka teaches big

data coursesEdureka helps learn new

technologies easily

Document - 1 (ldquoD1rdquo) Document - 2 (ldquoD2rdquo) Document - 3 (ldquoD3rdquo)

ldquoedurekardquo = D1 D2 D3ldquocoursesrdquo = D1 D2ldquoteachesrdquo = D2ldquobigrdquo = D2ldquodatardquo = D2ldquohelpsrdquo = D3

ldquoedurekardquo

Slide 10Slide 10Slide 10 wwwedurekacoapache-solr

Lucene ndash Writing to Index

Field

Field

Field

Field

Analyzer IndexWriter Directory

Document

Classes used when indexing documents with Lucene

Slide 11Slide 11Slide 11 wwwedurekacoapache-solr

Lucene ndash Searching In Index

QueryParser

Analyzer

IndexSearcherExpressionQuery object

Text fragments

Query Parser translates a textual expression from the end into an arbitrarily complex query for searching

Slide 12Slide 12Slide 12 wwwedurekacoapache-solr

Solr is an open source enterprise search server web application

Solr Uses the Lucene Search Library and extends it

Solr exposes lucene Java APIrsquos as RESTful services

You put documents in it (called indexing) via XML JSON CSV or binary over HTTP

You query it via HTTP GET and receive XML JSON CSV or binary results

What is Solr

Slide 13Slide 13Slide 13 wwwedurekacoapache-solr

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing and Adaptable with XML Configuration

Linearly scalable auto index replication auto Extensible Plugin Architecture

Solr Key Features

Slide 14Slide 14Slide 14 wwwedurekacoapache-solr

Solr Architecture

Slide 15Slide 15Slide 15 wwwedurekacoapache-solr

Request Handler

Query ParserResponse

Writer

Index

qt selects a RequestHandler for a query usingselect(by default the DisMaxRequestHandler is used)

defType selects a query parser for the query(by default uses whatever has been configured for the RequestHandler)

qf selects which fields to queryin the index(by default all fields are required)

wt selects a response writer for formatting the query response

fq filters query by applying an additional query to the initial queryrsquos results caches the results

Rows specifies the number of rows to be displayed at one time

Start specifies an offset(by default 0) into the query results where the returned response should begin

Solr Search Process

Slide 16Slide 16Slide 16 wwwedurekacoapache-solr

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed additions and updates to documents are seen in near real time

httplocalhost8983solrupdatestreambody=ltaddgtltdocgtltfieldname=idgttestdocltfieldgtltdocgtltaddgtampcommit=true

Slide 17Slide 17Slide 17 wwwedurekacoapache-solr

Real-Time Get

The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher

This is primarily useful when using Solr as a NoSQL data store and not just a search index

Slide 18Slide 18Slide 18 wwwedurekacoapache-solr

Leveraging Solr Capabilities with Hadoop

Solr provides us fast efficient powerful full-text search and near real-time indexing and SolrCloud is flexible

distributed search and indexing and will do things like automatic fail over etc

Hence its very suitable as NoSQL replacement for traditional databases in many situations especially when the size of

the data exceeds what is reasonable with a typical RDBMS

We can do scalable indexing using Hadoop MapReduce or PIG job and then load the indexed data in Solr

In all the major Hadoop distribution like Cloudera Hortonworks MapR you can integrate Solr easily

Slide 19Slide 19Slide 19 wwwedurekacoapache-solr

PDF

Word

HTML

Raw Files

Lucene

SolR SolR SolR

Query Response

Search Web App

MapReduce Indexing Job

Raw Files Indexed

HDFS(Hadoop Distributed File System)

Scalable Indexing

Input Data

Slide 20Slide 20Slide 20 wwwedurekacoapache-solr

Solr with YARN

Slide 21Slide 21Slide 21 wwwedurekacoapache-solr

Job trends for Apache Solr

Slide 22Slide 22Slide 22 wwwedurekacoapache-solr

Disclaimer

Criteria and guidelines mentioned in this presentation may change Please visit our website for latest and additional information on Apache Solr

Slide 23Slide 23Slide 23 wwwedurekacoapache-solr

Course Topics

Module 5

raquo Solr Searching

Module 6

raquo Solr Extended Features

Module 7

raquo Solr Cloud amp Administration

Module 8

raquo Final Project

Module 1

raquo Introduction to Apache Lucene

Module 2

raquo Exploring Lucene

Module 3

raquo Introduction to Apache Solr

Module 4

raquo Solr Indexing

Page 5: New-Age Search through Apache Solr

Slide 5Slide 5Slide 5 wwwedurekacoapache-solr

Search Engine Why do I need them

1 Text Based Search

2 Filter

3 Documents

1

2

3

Slide 6Slide 6Slide 6 wwwedurekacoapache-solr

Search Engine ndash What it should be

If you need a storage engine to search records documents using text-based keywords it should support following

features

1 Should be optimized for faster text searches

2 Should have flexible schema

3 Should support sorting of documents

4 Web Scale - Should be optimized for reads

5 Should be document oriented

Slide 7Slide 7Slide 7 wwwedurekacoapache-solr

Cleartrip Spatial Search

Slide 8Slide 8Slide 8 wwwedurekacoapache-solr

What is Lucene

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications

Used by LinkedIn Twitter hellip and many more (see httpwikiapacheorglucene-javaPoweredBy )

Scalable amp High-performance Indexing

Powerful Accurate and Efficient Search Algorithms

Cross-Platform Solution

raquo Open Source amp 100 pure Java

raquo Implementations in other programming languages available that are index-compatible

Doug Cutting ldquoCreatorrdquo

Slide 9Slide 9Slide 9 wwwedurekacoapache-solr

Indexing ndash How it works

I like edureka coursesEdureka teaches big

data coursesEdureka helps learn new

technologies easily

Document - 1 (ldquoD1rdquo) Document - 2 (ldquoD2rdquo) Document - 3 (ldquoD3rdquo)

ldquoedurekardquo = D1 D2 D3ldquocoursesrdquo = D1 D2ldquoteachesrdquo = D2ldquobigrdquo = D2ldquodatardquo = D2ldquohelpsrdquo = D3

ldquoedurekardquo

Slide 10Slide 10Slide 10 wwwedurekacoapache-solr

Lucene ndash Writing to Index

Field

Field

Field

Field

Analyzer IndexWriter Directory

Document

Classes used when indexing documents with Lucene

Slide 11Slide 11Slide 11 wwwedurekacoapache-solr

Lucene ndash Searching In Index

QueryParser

Analyzer

IndexSearcherExpressionQuery object

Text fragments

Query Parser translates a textual expression from the end into an arbitrarily complex query for searching

Slide 12Slide 12Slide 12 wwwedurekacoapache-solr

Solr is an open source enterprise search server web application

Solr Uses the Lucene Search Library and extends it

Solr exposes lucene Java APIrsquos as RESTful services

You put documents in it (called indexing) via XML JSON CSV or binary over HTTP

You query it via HTTP GET and receive XML JSON CSV or binary results

What is Solr

Slide 13Slide 13Slide 13 wwwedurekacoapache-solr

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing and Adaptable with XML Configuration

Linearly scalable auto index replication auto Extensible Plugin Architecture

Solr Key Features

Slide 14Slide 14Slide 14 wwwedurekacoapache-solr

Solr Architecture

Slide 15Slide 15Slide 15 wwwedurekacoapache-solr

Request Handler

Query ParserResponse

Writer

Index

qt selects a RequestHandler for a query usingselect(by default the DisMaxRequestHandler is used)

defType selects a query parser for the query(by default uses whatever has been configured for the RequestHandler)

qf selects which fields to queryin the index(by default all fields are required)

wt selects a response writer for formatting the query response

fq filters query by applying an additional query to the initial queryrsquos results caches the results

Rows specifies the number of rows to be displayed at one time

Start specifies an offset(by default 0) into the query results where the returned response should begin

Solr Search Process

Slide 16Slide 16Slide 16 wwwedurekacoapache-solr

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed additions and updates to documents are seen in near real time

httplocalhost8983solrupdatestreambody=ltaddgtltdocgtltfieldname=idgttestdocltfieldgtltdocgtltaddgtampcommit=true

Slide 17Slide 17Slide 17 wwwedurekacoapache-solr

Real-Time Get

The realtime get feature allows retrieval (by unique-key) of the latest version of any documents without the associated cost of reopening a searcher

This is primarily useful when using Solr as a NoSQL data store and not just a search index

Slide 18Slide 18Slide 18 wwwedurekacoapache-solr

Leveraging Solr Capabilities with Hadoop

Solr provides us fast efficient powerful full-text search and near real-time indexing and SolrCloud is flexible

distributed search and indexing and will do things like automatic fail over etc

Hence its very suitable as NoSQL replacement for traditional databases in many situations especially when the size of

the data exceeds what is reasonable with a typical RDBMS

We can do scalable indexing using Hadoop MapReduce or PIG job and then load the indexed data in Solr

In all the major Hadoop distribution like Cloudera Hortonworks MapR you can integrate Solr easily

Slide 19

Scalable Indexing

[Diagram labels: Input Data (PDF, Word, HTML raw files); HDFS (Hadoop Distributed File System); MapReduce Indexing Job; Raw Files Indexed; Lucene; Solr (three instances); Search Web App; Query / Response]

Slide 20

Solr with YARN

Slide 21

Job trends for Apache Solr

Slide 22

Disclaimer

Criteria and guidelines mentioned in this presentation may change. Please visit our website for the latest and additional information on Apache Solr.

Slide 23

Course Topics

Module 1: Introduction to Apache Lucene

Module 2: Exploring Lucene

Module 3: Introduction to Apache Solr

Module 4: Solr Indexing

Module 5: Solr Searching

Module 6: Solr Extended Features

Module 7: Solr Cloud & Administration

Module 8: Final Project

Slide 7

Cleartrip Spatial Search

Slide 8

What is Lucene

Lucene is a powerful Java search library that lets you easily add search or Information Retrieval (IR) to applications.

Used by LinkedIn, Twitter, … and many more (see http://wiki.apache.org/lucene-java/PoweredBy)

Scalable & High-performance Indexing

Powerful, Accurate and Efficient Search Algorithms

Cross-Platform Solution

» Open Source & 100% pure Java

» Implementations in other programming languages available that are index-compatible

Doug Cutting, "Creator"

Slide 9

Indexing – How it works

Document - 1 ("D1"): I like edureka courses

Document - 2 ("D2"): Edureka teaches big data courses

Document - 3 ("D3"): Edureka helps learn new technologies easily

"edureka" = D1, D2, D3

"courses" = D1, D2

"teaches" = D2

"big" = D2

"data" = D2

"helps" = D3

"edureka"
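The term-to-document mapping on this slide is an inverted index. A minimal sketch of building one from the three example documents follows. Lowercasing and splitting on whitespace stand in for a real analyzer, and the function name is an assumption.

```python
from collections import defaultdict

documents = {
    "D1": "I like edureka courses",
    "D2": "Edureka teaches big data courses",
    "D3": "Edureka helps learn new technologies easily",
}


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures a document is listed once per term.
        for term in set(docs[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


index = build_inverted_index(documents)
print(index["edureka"])  # ['D1', 'D2', 'D3']
print(index["courses"])  # ['D1', 'D2']
```

Looking up a term is then a single dictionary access, which is why searching does not need to scan every document.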

Slide 10

Lucene – Writing to Index

[Diagram labels: Document (made up of Fields), Analyzer, IndexWriter, Directory]

Classes used when indexing documents with Lucene
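The roles of these classes can be imitated in a few lines. The sketch below is plain Python and is not the Lucene API: it only borrows the class names from the slide to show how a document's fields could pass through an analyzer and a writer into a store.

```python
import re


def analyzer(text):
    """Analyzer role: turn raw text into lowercase tokens."""
    return re.findall(r"\w+", text.lower())


class Directory:
    """Directory role: where the index is stored (here, an in-memory dict)."""

    def __init__(self):
        self.postings = {}


class IndexWriter:
    """IndexWriter role: analyze each field of a document, record postings."""

    def __init__(self, directory, analyze):
        self.directory = directory
        self.analyze = analyze

    def add_document(self, doc_id, fields):
        # A document is modelled as a dict of field name -> field text.
        for field_name, value in fields.items():
            for token in self.analyze(value):
                key = (field_name, token)
                self.directory.postings.setdefault(key, set()).add(doc_id)


directory = Directory()
writer = IndexWriter(directory, analyzer)
writer.add_document("D1", {"title": "Solr Indexing",
                           "body": "I like edureka courses"})
print(sorted(directory.postings[("body", "edureka")]))  # ['D1']
```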

Slide 11

Lucene – Searching In Index

[Diagram labels: Expression, QueryParser, Analyzer, Text fragments, Query object, IndexSearcher]

QueryParser translates a textual expression from the end user into an arbitrarily complex query for searching.
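To make the idea of parsing concrete, here is a toy parser that handles only `term AND term` and `term OR term` chains, evaluated left to right against a small hand-written index. It is a sketch, not Lucene's QueryParser, and the function names are assumptions.

```python
index = {
    "edureka": ["D1", "D2", "D3"],
    "courses": ["D1", "D2"],
    "teaches": ["D2"],
    "helps": ["D3"],
}


def parse_query(expression):
    """Split 'a AND b OR c' into terms and the operators between them."""
    tokens = expression.split()
    terms = [t.lower() for t in tokens[0::2]]
    operators = [t.upper() for t in tokens[1::2]]
    if len(terms) != len(operators) + 1:
        raise ValueError("expected: term (AND|OR term)*")
    if any(op not in ("AND", "OR") for op in operators):
        raise ValueError("only AND and OR are supported")
    return terms, operators


def search(index, expression):
    """Evaluate the parsed query left to right using set operations."""
    terms, operators = parse_query(expression)
    result = set(index.get(terms[0], ()))
    for op, term in zip(operators, terms[1:]):
        postings = set(index.get(term, ()))
        result = result & postings if op == "AND" else result | postings
    return sorted(result)


print(search(index, "edureka AND courses"))  # ['D1', 'D2']
print(search(index, "teaches OR helps"))     # ['D2', 'D3']
```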

Slide 12

What is Solr

Solr is an open source enterprise search server (a web application).

Solr uses the Lucene search library and extends it.

Solr exposes Lucene Java APIs as RESTful services.

You put documents in it (called indexing) via XML, JSON, CSV or binary over HTTP.

You query it via HTTP GET and receive XML, JSON, CSV or binary results.
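As a small illustration of indexing over HTTP, the sketch below prepares a JSON update request with the Python standard library but does not send it. The core name `courses`, the document fields, and the helper name are assumptions for this example.

```python
import json
from urllib.request import Request


def build_index_request(core_url, docs):
    """Prepare an HTTP POST that sends a list of documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        core_url + "/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request(
    "http://localhost:8983/solr/courses",  # hypothetical core
    [{"id": "D1", "title": "I like edureka courses"}],
)
print(request.get_method(), request.full_url)
# With a Solr server running, urllib.request.urlopen(request) would send it.
```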

Slide 13

Solr Key Features

Advanced Full-Text Search Capabilities

Optimized for High Volume Web Traffic

Standards Based Open Interfaces - XML, JSON and HTTP

Comprehensive HTML Administration Interfaces

Server statistics exposed over JMX for monitoring

Near Real-time indexing

Adaptable with XML Configuration

Linearly scalable, auto index replication, auto failover and recovery

Extensible Plugin Architecture

Slide 14

Solr Architecture

Page 12: New-Age Search through Apache Solr

Slide 12Slide 12Slide 12 wwwedurekacoapache-solr

Solr is an open source enterprise search server web application

Solr Uses the Lucene Search Library and extends it

Solr exposes lucene Java APIrsquos as RESTful services

You put documents in it (called indexing) via XML JSON CSV or binary over HTTP

You query it via HTTP GET and receive XML JSON CSV or binary results

What is Solr

Slide 13

Solr Key Features

Advanced full-text search capabilities

Optimized for high-volume web traffic

Standards-based open interfaces: XML, JSON, and HTTP

Comprehensive HTML administration interfaces

Server statistics exposed over JMX for monitoring

Near real-time indexing

Adaptable with XML configuration

Linearly scalable, with automatic index replication and automatic failover

Extensible plugin architecture

Slide 14

Solr Architecture

Slide 15

Solr Search Process

[Figure labels: Request Handler, Query Parser, Response Writer, Index]

qt selects a request handler for a query sent to /select (by default the standard request handler is used).

defType selects a query parser for the query (by default it uses whatever has been configured for the request handler).

qf selects which fields to query in the index. It applies to the DisMax and eDisMax parsers; if it is omitted, the default field (df) is used.

wt selects a response writer for formatting the query response.

fq filters the query by applying an additional query to the initial query's results; Solr caches the filter results.

rows specifies the number of rows to be returned at one time.

start specifies an offset (by default 0) into the query results where the returned response should begin.
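
The parameters above can be combined into a single request URL. The following is a small sketch that only builds the URL; the core name "mycore", the field names, and the helper build_select_url are assumptions made for illustration, not part of the original slides.

```python
from urllib.parse import urlencode


def build_select_url(base_url, q, fq=None, qf=None, def_type=None,
                     start=0, rows=10, wt="json"):
    """Build a /select URL from the common Solr query parameters."""
    params = [("q", q)]
    if def_type:
        params.append(("defType", def_type))  # which query parser to use
    if qf:
        params.append(("qf", qf))             # fields to query (DisMax/eDisMax)
    for filter_query in fq or []:
        params.append(("fq", filter_query))   # one fq per filter
    params += [("start", start), ("rows", rows), ("wt", wt)]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names, for illustration only.
url = build_select_url(
    "http://localhost:8983/solr/mycore",
    q="apache solr",
    def_type="edismax",
    qf="title body",
    fq=["type:pdf"],
    rows=5,
)
print(url)
```

Passing each filter as its own fq parameter lets Solr cache the filters independently of the main query.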

Slide 16

Near Real-Time Search

Near Real Time (NRT) search means that documents are available for search almost immediately after being indexed; additions and updates to documents are seen in near real time.

http://localhost:8983/solr/update?stream.body=<add><doc><field name="id">testdoc</field></doc></add>&commit=true
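
The same kind of update can be sent as a JSON POST. The sketch below only constructs the request object and does not contact a server; the core name "mycore" and the helper build_update_request are assumptions. It uses Solr's softCommit parameter, which is typically what makes new documents visible quickly in an NRT setup, whereas commit=true forces a hard commit.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, docs, soft_commit=True):
    """Build (but do not send) a JSON update request for Solr."""
    # softCommit makes documents searchable without a full flush to disk;
    # commit=true requests a hard commit instead.
    params = {"softCommit": "true"} if soft_commit else {"commit": "true"}
    url = f"{base_url}/update?{urllib.parse.urlencode(params)}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core name; the document mirrors the "testdoc" example above.
req = build_update_request(
    "http://localhost:8983/solr/mycore", [{"id": "testdoc"}]
)
print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing req to urllib.request.urlopen against a running Solr instance.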

Slide 17

Real-Time Get

The realtime get feature allows retrieval (by unique key) of the latest version of any document without the associated cost of reopening a searcher.

This is primarily useful when using Solr as a NoSQL data store and not just a search index.
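
As a rough illustration, a realtime get is a lookup by unique key against the /get handler, using the id parameter for one document or ids for several. The core name "mycore" and the helper realtime_get_url are assumptions for this sketch, and the handler generally needs the update log to be enabled in the Solr configuration.

```python
from urllib.parse import urlencode


def realtime_get_url(base_url, ids):
    """Build a realtime get URL for one or more unique keys."""
    if len(ids) == 1:
        params = {"id": ids[0]}
    else:
        # Several keys are passed as one comma-separated "ids" value.
        params = {"ids": ",".join(ids)}
    return f"{base_url}/get?{urlencode(params)}"


base = "http://localhost:8983/solr/mycore"  # hypothetical core name
print(realtime_get_url(base, ["testdoc"]))
print(realtime_get_url(base, ["doc1", "doc2"]))
```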

Slide 18

Leveraging Solr Capabilities with Hadoop

Solr provides fast, efficient, powerful full-text search and near real-time indexing, and SolrCloud adds flexible distributed search and indexing, with features such as automatic failover.

Hence it is well suited as a NoSQL replacement for traditional databases in many situations, especially when the size of the data exceeds what is reasonable with a typical RDBMS.

We can do scalable indexing using a Hadoop MapReduce or Pig job and then load the indexed data into Solr.

Solr integrates easily with all the major Hadoop distributions, such as Cloudera, Hortonworks, and MapR.
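
The idea behind such an indexing job can be sketched in plain Python, without Hadoop: a map step turns each raw input record into a Solr-style document, and the documents are grouped into batches for loading. The field names "id" and "content", the sample paths, and both helper functions are assumptions for illustration; a real job would run the map step in parallel across the cluster.

```python
from itertools import islice


def to_solr_doc(path, text):
    """Map step: convert one raw file into a document ready for indexing."""
    return {"id": path, "content": text.strip()}


def batched(docs, size):
    """Group documents into lists of at most `size` for bulk loading."""
    iterator = iter(docs)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Illustrative input records (path, raw text); not real data.
raw_files = [
    ("/data/a.txt", " first file \n"),
    ("/data/b.txt", "second file"),
    ("/data/c.txt", "third file"),
]

batches = list(batched((to_solr_doc(p, t) for p, t in raw_files), size=2))
for batch in batches:
    print(len(batch), [doc["id"] for doc in batch])
```

Loading in batches rather than one document at a time keeps the number of update requests small.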

Slide 19

Scalable Indexing

[Figure labels: Input Data (PDF, Word, HTML raw files), HDFS (Hadoop Distributed File System), MapReduce Indexing Job, Raw Files Indexed, Lucene, Solr (three instances), Query/Response, Search Web App]

Slide 20

Solr with YARN

Slide 21

Job trends for Apache Solr

Slide 22

Disclaimer

Criteria and guidelines mentioned in this presentation may change. Please visit our website for the latest and additional information on Apache Solr.

Slide 23

Course Topics

Module 1

» Introduction to Apache Lucene

Module 2

» Exploring Lucene

Module 3

» Introduction to Apache Solr

Module 4

» Solr Indexing

Module 5

» Solr Searching

Module 6

» Solr Extended Features

Module 7

» Solr Cloud & Administration

Module 8

» Final Project
