full-text search with nosql...

39
16.05.2013 Full-text Search with NoSQL Technologies NoSQL Search Roadshow 2013, Berlin Kai Spichale

Upload: trinhdieu

Post on 06-Feb-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

16.05.2013

Full-text Search with NoSQL Technologies

NoSQL Search Roadshow 2013, Berlin

Kai Spichale

Page 2: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

About me

► Kai Spichale

► Software Engineer at adesso AG

► NoSQL, Full-text searching, Spring, Java EE

16.05.2013 1

► adesso is among Germany‘s top IT service providers

► Consulting and software development focus

► More than 1,000 members of staff

► Some of the most important customers are Allianz, Hannover Rück, Westdeutsche Lotterie, Zurich Versicherung, DEVK, and DAK

Page 3: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Motivation

NoSQL

► Exponential data growth

► Semi-structured data

► More connections

► 80 percent of business-relevant information is in unstructured form

Search

► Shift in data access:

> More full-text search

> Higher user expectations

► Keyword search and link directories become impractical

16.05.2013 2

Page 4: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 3

Page 5: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Full-text search

► Techniques for searching documents in collections

► grep-like naive approach:

> Serial scanning is slow

> No negation

> No distinction between phrase and keyword search

► Build inverted index

> Term Document

> Contains references to documents for each token

16.05.2013 4

Page 6: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Lucene

► Java lib for full-text searches

► De facto standard for open source software

► Attributes:

> Application-agnostic

> Scalable, high performance

► Features:

> Ranked searching

> Multiple query types, faceting

> Sorting

> Multi-Index searching

16.05.2013 5

Page 7: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Text Analysis

16.05.2013 6

Extraction

Parsing

Character

Filter

Tokenizer

Token Filter

Documents

de.GermanAnalyzer:

StandardTokenizer > StandardFilter

> LowerCaseFilter > StopFilter > GermanStemFilter

Inverteted

Index

Page 8: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Text Analysis

16.05.2013 7

ID Term Document

1 come 2

2 dog 1

3 eat 1

4 exception 3

5 first 2

5 food 1

6 own 1

7 prove 3

8 rule 3

9 serve 2

10 your 1

Eat your

own dog

food.

First

come,

first

served.

The

exception

proves the

rule.

a

and

around

every

for

from

in

is

it

not

on

one

the

to

under

Stop word List

Page 9: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Query types

Type Example

Term

(MUST, MUST_NOT, SHOULD)

+adesso –italy

Phrase „foo bar“

Wildcard fo*a?

Fuzzy fobar~

Range [A TO Z]

16.05.2013 8

Page 10: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 9

Page 11: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

NoSQL and Search

One size fits all approach

► Which NoSQL store satisfies our requirements best?

► Is full-text search supported?

5/16/2013 10

Data Structure Access Patterns

Volume Performance

Availability Updates

Consistency

Page 12: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

NoSQL and Search

5/16/2013 11

Let‘s take a closer look on:

► MongoDB

► Neo4j

► Apache Cassandra

► Apache Hadoop

Page 13: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Document-oriented Databases

{ "_id" : ObjectId(„42"),

"firstname" : "John",

"lastname" : "Lennon",

"address" : { "city" : "Liverpool",

"street" : "251 Menlove Avenue“ }

}

5/16/2013 12

► Designed for storing and retrieving documents

► Semi-structured content such as BSON documents

Page 14: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

MongoDB

► Supports ad-hoc CRUD operations

db.things.find({firstname:"John"})

► Server-side execution of JavaScript

► Aggregations, MapReduce

► Simple keyword search with multikey indexes:

> Index array content as separate entries

16.05.2013 13

{ article : “some long text",

_keywords : [ “some" , “long" , “text“]

}

Page 15: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

MongoDB

► Version 2.4 supports text indexes

► Language-specific stemming based on Snowball

► Still a beta feature

16.05.2013 14

db.foo.runCommand(“text“, {search: “adesso –italy”, language: “english”})

Page 16: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

MongoDB

► Mongo Connector integrates MongoDB with another system (backup MongoDB cluster, Solr, elasticsearch,)

► System architecture with separate search engine possible

16.05.2013 15

MongoDB SolrMongo

Connector

1 2 3 4 5

update synccreatedocument index search

Page 17: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Choosing the Right Approach

MongoDB MongoDB

+ Search Engine

Search Engine

No result set merging

Complex queries with

aggregations

Simple text search

(but experimental text

index)

Full-text search with

faceting

Complex queries with

aggregations

Result set merging

Increased complexity

(ops, dev)

No result set merging

Full-text search with

faceting

Backup?

Aggregations?

16.05.2013 16

Page 18: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Graph Databases

► Stored data is represented as graph structures

> Nodes

> Edges (Relationships)

> Properties

► Universal datamodel

► Traversing

► Example: Neo4j

5/16/2013 17

id=1

name=“John“

id=2

name=“George“

id=3

name=“Paul“

friend friend

Page 19: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Neo4j

► Traversing

> Visiting nodes by following relationships

> Breadth- and depth-first traversing

> Gremlin, Cypher

Result = George

5/16/2013 18

START john=node:peoplesearch(name=‘John’)

MATCH john<-[:friend]->afriend RETURN afriend

Page 20: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Neo4j

► Database itself is a natural index consisting of its edges and nodes

> Example: „name“, „city“

► Auto indexing keeps track of property changes

16.05.2013 19

personRepository.findByPropertyValue("name", "John");

Page 21: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Neo4j

► The default separate index engine used is Apache Lucene

16.05.2013 20

Index<PropertyContainer> index = template.getIndex("peoplesearch");

index.query("name", "Jo*");

@NodeEntity class Person {

@Indexed(indexName="peoplesearch", indexType=IndexType.FULLTEXT) private String name;

..

}

Page 22: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Wide Column Store

► Google BigTable: „a sparse, distributed multi-dimensional sorted map“

► Data is organized in rows, column families, and columns

► Ideal for sharding (horizontal partitioning)

16.05.2013 21

jlennon

pmccart

gharris

name

„Lennon“

name

„McCartney“

name

„Harrison“

state

„UK“

address

„Liverpool ..“

address

„Liverpool ..“

different columns per row

unique

row keys

Page 23: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Cassandra

► BigTable clone

► Distributed Hash Table (Amazon Dynamo)

► Eventual consistency (configurable levels)

► Cassandra Query Language (CQL) = SQL dialect without joins

► Hadoop integration

5/16/2013 22

SELECT name FROM user WHERE firstname=„John“;

Page 24: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Cassandra

► Solandra = Solr using Cassandra as backend

► DataStax Enterprise Search

> One local Solr instance per Cassandra node

> Integration is based on secondary index API

> CQL supports Solr Queries

> Cassandra’s ring information is used to

construct Solr distributed search queries

16.05.2013 23

SELECT title FROM solr WHERE solr_query=‘name:jo*';

Page 25: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

16.05.2013 24

Page 26: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Hadoop

► Hadoop:

> Framework for distributed processing of large data sets in computer clusters

> Distributed filesystem + MapReduce implementation

► Scalable and reliable platform of a comprehensive data analysis ecosystem

16.05.2013 25

Page 27: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Hadoop MapReduce

5/16/2013 26

Persistent Data

Map Map Map Map

Transient Data

Persistent Data

Reduce Reduce Reduce

► Map Phase:

> Records are processed by map function

► Shuffle Phase:

> Distributed sort and grouping

► Reduce Phase:

> Intermediate results are processed by reduce function

Page 28: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Hadoop MapReduce

► Data is processed by mappers and reducers

5/16/2013 27

map(k, v) -> [(K1,V1), (K2,V2), ... ]

reduce(Kn, [Vi, Vj, …]) ->

(Km, R)

Data

Mapper

Shuffle Reducer Result

Page 29: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

What kind of problems does MapReduce solve?

► Problems processed without reducer

> Searching

> File converting

> Sorting

> Map-side join

► Problem processed with reducer

> Grouping and aggregation

> Reduce-side join

► More complex problems:

> Solved by combinations of multiple MapReduce jobs

5/16/2013 28

Page 30: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Hadoop MapReduce: Searching

► Search document including „A“

5/16/2013 29

1

2 1

Documents

1: A,B,C

3

4

5

Mapper emits only documents that fit

the searching criteria

4

5

2: D,E

3: B,E

4: A,D

5: A,C,E

Result = 1, 4, 5

Page 31: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Hadoop MapReduce: Indexing

16.05.2013 30

HDFS

MapReduce

Job

Lucene Lucene

Index

► HDFS:

> Stores raw data

► Mapper:

> Extracts text (creates e.g. SolrInputDocument)

> Calls Lucene for indexing (calls e.g. StreamingUpdateSolrServer)

Page 32: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Hadoop MapReduce: Indexing

16.05.2013 31

1

2

Daten

1: text

3

4

5

2: text

3: text

4: text

5: text

Mapper

@Override public void map( LongWritable key, Text val, OutputCollector<NullWritable, NullWritable> output, Reporter reporter) throws IOException { st = new StringTokenizer(val.toString()); lineCounter = 0; while (st.hasMoreTokens()) { doc= new SolrInputDocument(); doc.addField("id", fileName + key.toString() + lineCounter++); doc.addField("txt", st.nextToken()); try { server.add(doc); } catch (Exception exp) { … } }}

Page 33: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Tika

16.05.2013 32

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

► Extracts metadata and structured text content

> HTML, MS Office documents, PDF, etc.

► Stream parser can process large files

Page 34: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Solr / elasticsearch

5/16/2013 33

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

ela

sticse

arc

h

► Lucene is only a libary, not a standalone search engine

► Complete search engines:

> Solr

> ElasticSearch

Page 35: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Flume

5/16/2013 34

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

Flume

Web Server,Applikations

ela

sticse

arc

h

► Distributed service for collecting, aggregating and moving large amounts of data (e.g. log data)

► Streaming techniques

► Fault tolerant

Page 36: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Alternatives

5/16/2013 35

HDFS

MapReduce

Job

Lucene Lucene

Index

Tika

Flume

Web Server,Apps, DBs

Crawler DistCp Sqoop

ela

sticse

arc

h

► Nutch Crawler creates one entry in CrawlDB per URL

► Hadoop DistCp copies data within and between hadoop systems

► Apache Sqoop transfers bulk data between Hadoop and RDMBS

Page 37: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Apache Hadoop

5/16/2013 36

Web Content,Intranet

Loading

Tools

Hadoop

Search Analysis Export Visualization

► Fundamental mismatch:

> MapReduce for batch processing

> Lucene for interactive searching

► MapReduce for indexing large datasets

► Basis for (offline) BigData solutions

Page 38: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Summary

► More semi-structured data

► Increasing relevance of full-text searching

► Combination of NoSQL and Lucene:

> MongoDB: integration via MongoDB Connector

> Neo4j: native Lucene integration

> Cassandra: Datastax‘s Solr integration

> Hadoop: indexing large datasets with MapReduce

► Alternative: search engine as document-oriented database

5/16/2013 37

Page 39: Full-text Search with NoSQL Technologiesnosqlroadshow.com/dl/NoSQL-Berlin-2013/Presentations/Fulltext... · Full-text Search with NoSQL Technologies ... grep-like naive approach:

Thank you for your attention!

16.05.2013 38