16.05.2013

Full-text Search with NoSQL Technologies

NoSQL Search Roadshow 2013, Berlin

Kai Spichale

About me

► Kai Spichale

► Software Engineer at adesso AG

► NoSQL, Full-text searching, Spring, Java EE

► adesso is among Germany's top IT service providers

► Consulting and software development focus

► More than 1,000 members of staff

► Some of the most important customers are Allianz, Hannover Rück, Westdeutsche Lotterie, Zurich Versicherung, DEVK, and DAK

Motivation

► Exponential data growth

► Semi-structured data

► More connections

► An estimated 80 percent of business-relevant information is unstructured

Search

► Shift in data access:

> More full-text search

> Higher user expectations

► Keyword search and link directories become impractical

Agenda

► Lucene full-text search

► NoSQL:

> Architectural drivers

> MongoDB

> Neo4j

> Apache Cassandra

> Apache Hadoop

► Summary

Full-text search

► Techniques for searching documents in collections

► grep-like naive approach:

> Serial scanning is slow

> No negation

> No distinction between phrase and keyword search

► Build inverted index

> Term → Document

> Contains references to documents for each token
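To make the inverted index concrete, here is a minimal sketch in plain Java. It is not Lucene's actual data structure: the class name `InvertedIndexSketch` and its methods are illustrative assumptions, and text analysis is reduced to lower-casing and splitting on non-alphanumeric characters.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy inverted index: maps each term to the sorted IDs of the documents containing it.
class InvertedIndexSketch {

    // Document IDs are assigned by list position, starting at 1.
    static Map<String, SortedSet<Integer>> build(List<String> documents) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < documents.size(); i++) {
            String text = documents.get(i).toLowerCase(Locale.ROOT);
            // Simplistic tokenization: split on anything that is not a letter or digit.
            for (String token : text.split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, t -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    // A lookup is a single map access instead of a serial scan over all documents.
    static SortedSet<Integer> lookup(Map<String, SortedSet<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        Map<String, SortedSet<Integer>> index = build(Arrays.asList("dog eats food", "dog sleeps"));
        System.out.println(lookup(index, "dog"));   // prints [1, 2]
        System.out.println(lookup(index, "food"));  // prints [1]
    }
}
```

Because each term points directly at its documents, negation and multi-term queries become set operations on these ID sets rather than repeated scans.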

Apache Lucene

► Java library for full-text search

► De facto standard for full-text search in open source software

► Attributes:

> Application-agnostic

> Scalable, high performance

► Features:

> Ranked searching

> Multiple query types, faceting

> Sorting

> Multi-Index searching

Text Analysis

Documents > Extraction > Parsing > Character Filter > Tokenizer > Token Filter > Inverted Index

de.GermanAnalyzer:

StandardTokenizer > StandardFilter > LowerCaseFilter > StopFilter > GermanStemFilter

Text Analysis

ID  Term       Document
1   come       2
2   dog        1
3   eat        1
4   exception  3
5   first      2
6   food       1
7   own        1
8   prove      3
9   rule       3
10  serve      2
11  your       1

Example documents:

1: Eat your own dog food.
2: First come, first served.
3: The exception proves the rule.

► Words on the stop word list (such as "the") are not indexed
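The analysis chain (tokenizer, lower-case filter, stop filter, stem filter) can be sketched in plain Java without Lucene. Everything here is an assumption for illustration: the class `AnalysisChainSketch`, the three-word stop list, and above all the `stem` method, which is naive suffix stripping and not the algorithm used by GermanStemFilter or Snowball.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analysis chain: tokenizer > lower-case filter > stop filter > stem filter.
class AnalysisChainSketch {

    // Assumed stop word list; the list shown on the slide is not fully recoverable.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "an"));

    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {   // tokenizer
            String term = token.toLowerCase(Locale.ROOT);       // lower-case filter
            if (term.isEmpty() || STOP_WORDS.contains(term)) {  // stop filter
                continue;
            }
            terms.add(stem(term));                              // stem filter
        }
        return terms;
    }

    // Naive suffix stripping for illustration only; real stemmers apply far richer rules.
    static String stem(String term) {
        if (term.length() > 3 && term.endsWith("ed")) {
            return term.substring(0, term.length() - 1);  // "served" -> "serve"
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);  // "proves" -> "prove"
        }
        return term;
    }
}
```

Under these assumptions, `analyze("The exception proves the rule.")` yields the terms `exception`, `prove`, and `rule`, which matches the entries listed for document 3 in the term table.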

Query types

Type                               Example
Boolean (MUST, MUST_NOT, SHOULD)   +adesso -italy
Phrase                             "foo bar"
Wildcard                           fo*a?
Fuzzy                              fobar~
Range                              [A TO Z]
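How a boolean query such as `+adesso -italy` can be answered from an inverted index is sketched below in plain Java. This is a simplified illustration, not Lucene's query execution: the class `BooleanQuerySketch`, the postings map, and the document IDs are assumptions, and SHOULD clauses and ranking are left out.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Evaluates required (+) and prohibited (-) terms on term -> document-ID postings.
class BooleanQuerySketch {

    static Set<Integer> search(Map<String, Set<Integer>> postings,
                               List<String> must, List<String> mustNot) {
        Set<Integer> result = null;
        for (String term : must) {
            Set<Integer> docs = postings.getOrDefault(term, new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(docs);   // first MUST term seeds the result
            } else {
                result.retainAll(docs);         // further MUST terms intersect
            }
        }
        if (result == null) {
            result = new TreeSet<>();           // without MUST terms this sketch matches nothing
        }
        for (String term : mustNot) {
            result.removeAll(postings.getOrDefault(term, new TreeSet<>()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: "adesso" occurs in documents 1-3, "italy" only in document 2.
        Map<String, Set<Integer>> postings = new HashMap<>();
        postings.put("adesso", new TreeSet<>(Arrays.asList(1, 2, 3)));
        postings.put("italy", new TreeSet<>(Arrays.asList(2)));
        System.out.println(search(postings, Arrays.asList("adesso"), Arrays.asList("italy")));  // prints [1, 3]
    }
}
```

Negation, which the grep-like approach lacks, reduces here to a set difference on document IDs.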

NoSQL and Search

One size fits all approach

► Which NoSQL store satisfies our requirements best?

► Is full-text search supported?

► Architectural drivers: data structure, access patterns, volume, performance, availability, updates, consistency

NoSQL and Search

Let's take a closer look at:

► MongoDB

► Neo4j

► Apache Cassandra

► Apache Hadoop

Document-oriented Databases

{ "_id" : ObjectId("42"),
  "firstname" : "John",
  "lastname" : "Lennon",
  "address" : { "city" : "Liverpool",
                "street" : "251 Menlove Avenue" } }

► Designed for storing and retrieving documents

► Semi-structured content such as BSON documents

MongoDB

► Supports ad-hoc CRUD operations

db.things.find({firstname:"John"})

► Server-side execution of JavaScript

► Aggregations, MapReduce

► Simple keyword search with multikey indexes:

> Index array content as separate entries

{ article : "some long text",
  _keywords : [ "some", "long", "text" ] }
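On the application side, the `_keywords` array has to be derived from the article text before the document is stored. A minimal sketch in plain Java follows; the class `KeywordDocumentSketch` and the tokenization rule are assumptions, and no MongoDB driver is involved.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Builds a document with a derived _keywords array that a multikey index could cover.
class KeywordDocumentSketch {

    static Map<String, Object> withKeywords(String article) {
        // LinkedHashSet keeps first-seen order and drops duplicate keywords.
        Set<String> keywords = new LinkedHashSet<>();
        for (String token : article.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                keywords.add(token);
            }
        }
        List<String> keywordList = new ArrayList<>(keywords);
        Map<String, Object> document = new LinkedHashMap<>();
        document.put("article", article);
        document.put("_keywords", keywordList);
        return document;
    }
}
```

In the mongo shell of that era, such a field could presumably be indexed with `db.things.ensureIndex({_keywords: 1})` and matched with `db.things.find({_keywords: "long"})`; each array element then becomes a separate index entry.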

MongoDB

► Version 2.4 supports text indexes

► Language-specific stemming based on Snowball

► Still a beta feature

db.foo.runCommand("text", {search: "adesso -italy", language: "english"})

MongoDB

► Mongo Connector integrates MongoDB with another system (backup MongoDB cluster, Solr, elasticsearch)

► System architecture with separate search engine possible

[Figure: MongoDB > Mongo Connector > Solr, with steps numbered 1 to 5; labels: update, sync, create document, index, search]

Choosing the Right Approach

► MongoDB:

> No result set merging

> Complex queries with aggregations

> Simple text search (but experimental text index)

► MongoDB + Search Engine:

> Full-text search with faceting

> Complex queries with aggregations

> Result set merging

> Increased complexity (ops, dev)

► Search Engine:

> No result set merging

> Full-text search with faceting

> Backup?

> Aggregations?

Graph Databases

► Stored data is represented as graph structures

> Nodes

> Edges (Relationships)

> Properties

► Universal data model

► Traversing

► Example: Neo4j

[Figure: three nodes with the properties name="John", name="George", and name="Paul", connected by two "friend" relationships]

► Traversing

> Visiting nodes by following relationships

> Breadth- and depth-first traversing

> Gremlin, Cypher

START john=node:peoplesearch(name='John')

MATCH john<-[:friend]->afriend RETURN afriend

Result = George
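What "visiting nodes by following relationships" means can be sketched with an in-memory graph in plain Java. This is not the Neo4j API: the class `FriendGraphSketch` is hypothetical, and the two edges (John and George, George and Paul) are an assumption chosen so that John's direct friends come out as George, in line with the query result above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

// In-memory friend graph with a one-hop lookup and a breadth-first traversal.
class FriendGraphSketch {

    private final Map<String, List<String>> friends = new HashMap<>();

    // "friend" is treated as undirected, so the edge is stored in both directions.
    void addFriendship(String a, String b) {
        friends.computeIfAbsent(a, k -> new ArrayList<>()).add(b);
        friends.computeIfAbsent(b, k -> new ArrayList<>()).add(a);
    }

    // Direct neighbours, comparable to a one-hop match on the friend relationship.
    List<String> friendsOf(String name) {
        return friends.getOrDefault(name, Collections.emptyList());
    }

    // Breadth-first traversal: returns nodes in the order they are first reached.
    List<String> breadthFirst(String start) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<String> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            for (String next : friendsOf(current)) {
                if (visited.add(next)) {   // add() returns false for already visited nodes
                    queue.add(next);
                }
            }
        }
        return new ArrayList<>(visited);
    }

    public static void main(String[] args) {
        FriendGraphSketch graph = new FriendGraphSketch();
        // Assumed edges; the slide figure does not show which nodes are connected.
        graph.addFriendship("John", "George");
        graph.addFriendship("George", "Paul");
        System.out.println(graph.friendsOf("John"));     // prints [George]
        System.out.println(graph.breadthFirst("John"));  // prints [John, George, Paul]
    }
}
```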

► Database itself is a natural index consisting of its edges and nodes

> Example: "name", "city"

► Auto indexing keeps track of property changes

personRepository.findByPropertyValue("name", "John");

► Separate indexes are backed by Apache Lucene by default

@NodeEntity
class Person {
    // Full-text index "peoplesearch" on the name property
    @Indexed(indexName = "peoplesearch", indexType = IndexType.FULLTEXT)
    private String name;
}

// Wildcard query against the full-text index
Index<PropertyContainer> index = template.getIndex("peoplesearch");
index.query("name", "Jo*");

Wide Column Store

► Google BigTable: "a sparse, distributed multi-dimensional sorted map"

► Data is organized in rows, column families, and columns

► Ideal for sharding (horizontal partitioning)

[Figure: three rows with unique row keys (jlennon, pmccart, gharris) holding "Lennon", "McCartney", and "Harrison"; the rows carry different columns, e.g. an address column ("Liverpool ..") in two rows and "UK" in one]
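The "sparse, sorted map" idea can be sketched as a map of maps in plain Java. This is a single-process illustration, not Cassandra's or BigTable's storage engine; the class `WideColumnSketch` and the column names used with it are assumptions.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Row key -> (column name -> value); rows are sorted by key and need not share columns.
class WideColumnSketch {

    private final SortedMap<String, SortedMap<String, String>> rows = new TreeMap<>();

    void put(String rowKey, String column, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(column, value);
    }

    // Returns null when the row or column is absent: missing cells are simply not stored.
    String get(String rowKey, String column) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row == null ? null : row.get(column);
    }

    // Range scan over the sorted row keys, from fromKey (inclusive) to toKey (exclusive).
    SortedMap<String, SortedMap<String, String>> scan(String fromKey, String toKey) {
        return rows.subMap(fromKey, toKey);
    }
}
```

For example, after `put("jlennon", "address", "Liverpool")` and `put("pmccart", "lastname", "McCartney")`, the row `pmccart` has no `address` column at all, which is what "different columns per row" refers to. Sorted row keys are also what makes contiguous key ranges easy to assign to shards.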

Apache Cassandra

► Data model: BigTable clone

► Distribution: distributed hash table (Amazon Dynamo)

► Eventual consistency (configurable levels)

► Cassandra Query Language (CQL) = SQL dialect without joins

► Hadoop integration

SELECT name FROM user WHERE firstname='John';

Apache Cassandra

► Solandra = Solr using Cassandra as backend

► DataStax Enterprise Search

> One local Solr instance per Cassandra node

> Integration is based on secondary index API

> CQL supports Solr Queries

> Cassandra's ring information is used to construct Solr distributed search queries

SELECT title FROM solr WHERE solr_query='name:jo*';

Apache Hadoop

► Hadoop:

> Framework for distributed processing of large data sets in computer clusters

> Distributed filesystem + MapReduce implementation

► Scalable and reliable platform at the center of a comprehensive data analysis ecosystem

Hadoop MapReduce

[Figure: Persistent Data > Map (×4) > Transient Data > Reduce (×3) > Persistent Data]

► Map Phase:

> Records are processed by map function

► Shuffle Phase:

> Distributed sort and grouping

► Reduce Phase:

> Intermediate results are processed by reduce function

Hadoop MapReduce

► Data is processed by mappers and reducers

map(k, v) -> [(K1, V1), (K2, V2), ...]

reduce(Kn, [Vi, Vj, …]) -> (Km, R)

Mapper > Shuffle > Reducer > Result
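The three phases can be imitated in a single process with plain Java collections. The sketch below uses word counting as the example problem; it does not use the Hadoop API, and the class `MapReduceSketch` with its method names is an assumption made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Single-process imitation of map, shuffle, and reduce; Hadoop distributes each phase.
class MapReduceSketch {

    // map(k, v) -> [(K1, V1), (K2, V2), ...]: emits (word, 1) for every word of a line.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Shuffle: sorts by key and groups all values that share a key.
    static SortedMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // reduce(Kn, [Vi, Vj, ...]) -> (Km, R): sums the counts collected for one word.
    static Map.Entry<String, Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return new SimpleEntry<>(key, sum);
    }

    static SortedMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            intermediate.addAll(map(i, lines.get(i)));   // map phase
        }
        SortedMap<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            Map.Entry<String, Integer> reduced = reduce(group.getKey(), group.getValue());
            result.put(reduced.getKey(), reduced.getValue());   // reduce phase output
        }
        return result;
    }
}
```

Calling `MapReduceSketch.run` on the lines `"a b a"` and `"b c"` counts `a` twice, `b` twice, and `c` once. The intermediate list corresponds to the transient data between the map and reduce phases.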

What kind of problems does MapReduce solve?

► Problems processed without reducer

> Searching

> File converting

> Sorting

> Map-side join

► Problems processed with reducer

> Grouping and aggregation

> Reduce-side join

► More complex problems:

> Solved by combinations of multiple MapReduce jobs

Hadoop MapReduce: Searching

► Search for documents containing "A"

► Mapper emits only documents that match the search criteria

Documents:

1: A,B,C
2: D,E
3: B,E
4: A,D
5: A,C,E

Result = 1, 4, 5
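The search example needs no reducer because each document can be checked on its own. A minimal sketch in plain Java, outside Hadoop, with the hypothetical class `MapOnlySearchSketch` and the five documents from the slide:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Map-only "job": every record is checked independently, so no shuffle or reduce is needed.
class MapOnlySearchSketch {

    // Emits the document ID if the comma-separated content contains the term, else nothing.
    static List<Integer> map(int docId, String content, String term) {
        List<Integer> emitted = new ArrayList<>();
        if (Arrays.asList(content.split(",")).contains(term)) {
            emitted.add(docId);
        }
        return emitted;
    }

    // Runs the map function over all documents and concatenates what the mappers emit.
    static List<Integer> search(Map<Integer, String> documents, String term) {
        List<Integer> result = new ArrayList<>();
        for (Map.Entry<Integer, String> doc : documents.entrySet()) {
            result.addAll(map(doc.getKey(), doc.getValue(), term));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, String> documents = new LinkedHashMap<>();
        documents.put(1, "A,B,C");
        documents.put(2, "D,E");
        documents.put(3, "B,E");
        documents.put(4, "A,D");
        documents.put(5, "A,C,E");
        System.out.println(search(documents, "A"));  // prints [1, 4, 5]
    }
}
```

This is still a serial scan per mapper; the speed-up in Hadoop comes from running many mappers in parallel over different parts of the data.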

Hadoop MapReduce: Indexing

[Figure: a MapReduce box connected to two Lucene boxes]

► HDFS:

> Stores raw data

► Mapper:

> Extracts text (creates e.g. SolrInputDocument)

> Calls Lucene for indexing (calls e.g. StreamingUpdateSolrServer)

Hadoop MapReduce: Indexing

[Figure: input records 1 to 5 (text) passed to the Mapper]

// fileName and server (a Solr client) are fields initialized elsewhere in the mapper class
@Override
public void map(LongWritable key, Text val,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
    StringTokenizer st = new StringTokenizer(val.toString());
    int lineCounter = 0;
    while (st.hasMoreTokens()) {
        // One Solr document per token; the id combines file name, input key, and a counter
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fileName + key.toString() + lineCounter++);
        doc.addField("txt", st.nextToken());
        try {
            server.add(doc);
        } catch (Exception exp) {
            /* … */
        }
    }
}

Apache Tika

► Extracts metadata and structured text content

> HTML, MS Office documents, PDF, etc.

► Stream parser can process large files

Apache Solr / elasticsearch

► Lucene is only a library, not a standalone search engine

► Complete search engines:

> Solr

> ElasticSearch

Apache Flume

[Figure: data sources (web servers, applications) added to the MapReduce and Lucene architecture]

► Distributed service for collecting, aggregating and moving large amounts of data (e.g. log data)

► Streaming techniques

► Fault tolerant

Alternatives

[Figure: data sources (web servers, apps, DBs) connected via Crawler, DistCp, and Sqoop]

► Nutch Crawler creates one entry in CrawlDB per URL

► Hadoop DistCp copies data within and between Hadoop clusters

► Apache Sqoop transfers bulk data between Hadoop and RDBMS

Apache Hadoop

[Figure: Web content, intranet > Loading > Hadoop > Search, Analysis, Export, Visualization]

► Fundamental mismatch:

> MapReduce for batch processing

> Lucene for interactive searching

► MapReduce for indexing large datasets

► Basis for (offline) big data solutions

Summary

► More semi-structured data

► Increasing relevance of full-text searching

► Combination of NoSQL and Lucene:

> MongoDB: integration via Mongo Connector

> Neo4j: native Lucene integration

> Cassandra: DataStax's Solr integration

> Hadoop: indexing large datasets with MapReduce

► Alternative: search engine as document-oriented database

Thank you for your attention!
