1 © Copyright 2012 EMC Corporation. All rights reserved.

Searching Large XML Databases using Lucene

Amsterdam, September 19, 2012

Petr Pleshachkov, EMC [email protected], September 19, 2012


My Background

Petr Pleshachkov, Principal Software Engineer

xDB/xPlore team in Rotterdam
– My site: EMC Netherlands
– Other xPlore/xDB sites: Pleasanton (California), Shanghai (China), and Grenoble (France)

Areas of expertise:
– Semistructured data management
– Databases: transaction management, query optimization, full-text search

Academia & Research:
– PhD in Computer Science, ISP RAS


Agenda

Overview of EMC Documentum xDB/xPlore

Integration of Lucene into xDB

xDB transaction model & Lucene transaction management

Performance analysis

Future optimizations


Introducing Documentum xPlore

• EMC Documentum is a leading supplier of Enterprise Content Management software

• xPlore provides ‘Integrated Search’ for Documentum
– but is built as a standalone search engine to replace FAST Instream
– Highly deployed across Documentum environments worldwide (70+ countries)

• xPlore Search Engine is built over EMC xDB, Lucene, and leading content extraction and linguistic analysis software


Key values xDB brings to xPlore

Flexible, hierarchical query & data models

Joins

High throughput, low-latency indexing
– See documents within seconds after saving

Leverage B-tree indexes when appropriate
– Lucene doesn’t fit all uses

Rich, innovative query language

Enterprise, single unified database

Why build a search engine over an XML database?


Documentum xDB

Formerly XHive database
– 100% Java
– XML stored in persistent DOM format
▪ Each XML node can be located through a 64-bit identifier
▪ Structure mapped to pages
▪ Easy to operate on gigabyte-sized XML files

Full Transactional Database

Query Language: XQuery

Indexing & Optimization
– Palette of index options the optimizer can pick from
– At its simplest: indexLookup(key) -> node id

Backup/Restore, scalability, multi-node architecture
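
The "indexLookup(key) -> node id" idea can be illustrated with a minimal sketch. It assumes an ordered in-memory map standing in for the B-tree; the class name `ValueIndexSketch` and its methods are hypothetical and are not part of the xDB API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for a value index; not the xDB API.
final class ValueIndexSketch {
    // Ordered map playing the role of the B-tree: key -> ids of matching nodes.
    private final TreeMap<String, List<Long>> entries = new TreeMap<>();

    void put(String key, long nodeId) {
        entries.computeIfAbsent(key, k -> new ArrayList<>()).add(nodeId);
    }

    // indexLookup(key) -> node ids; an unknown key yields an empty list.
    List<Long> indexLookup(String key) {
        return entries.getOrDefault(key, Collections.<Long>emptyList());
    }
}
```

The 64-bit node identifiers returned by the lookup are what the database would then resolve to stored XML nodes.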


xDB Data Storage Model

An XML document can be thought of as a collection of elements and attributes (‘XML nodes’)

This node structure can be represented as a tree (the DOM model)

[Figure: a tree of XML nodes A, B, C, D, E, and the same nodes A B C D E laid out on a database page]


Libraries & Indexes

Scope of index covers all XML files in all sub-libraries

[Figure: nested libraries containing XML files A, B, C. Legend: X-Hive/xDB Library, X-Hive/xDB Index, X-Hive/xDB XML file]


Lucene Integration

Both value and full-text queries supported
– XML SubPaths mapped into Lucene fields
– Tokenized and value-based indexes available

Composite key queries supported
– Lucene index is much more flexible than B-tree composite indexes
– Skip Lists


Multipath Index Definition

<PLAY>
  <ACT>
    <SCENE>
      <SPEECH>
        <SPEAKER>BRUTUS</SPEAKER>
        <LINE>I am not gamesome: I do lack some part</LINE>
        <p><LINE>Listen great things</LINE></p>
      </SPEECH>
      <SPEECH>
        <SPEAKER>CASSIUS</SPEAKER>
        <LINE>Then, Brutus, I have much mistook your passion;</LINE>
        <LINE>By means whereof this breast of mine hath buried</LINE>
        <p><LINE>Thoughts of great value, worthy cogitations.</LINE></p>
      </SPEECH>
    </SCENE>
  </ACT>
</PLAY>

INDEX ROOT PATH: //SPEECH

SubPath1: (/SPEAKER, VALUE_COMPARISON)

SubPath2: (//LINE, FULL_TEXT_SEARCH)


Mapping to Native Lucene Structures

Lucene Document 1
– /SPEAKER/txt: NOT_ANALYZED, STORE.NO (BRUTUS)
– /LINE/tkn: ANALYZED, STORE.NO (I am not gamesome: I do lack some part)
– /p/LINE/tkn: ANALYZED, STORE.NO (Listen great things)
– XHIVE_NODE: NOT_ANALYZED, STORE.YES (1430532)

Lucene Document 2
– /SPEAKER/txt: NOT_ANALYZED, STORE.NO (CASSIUS)
– /LINE/tkn: ANALYZED, STORE.NO (By means whereof this breast of mine hath buried)
– /LINE/tkn: ANALYZED, STORE.NO (Then, Brutus, I have much mistook your passion;)
– /p/LINE/tkn: ANALYZED, STORE.NO (Thoughts of great value, worthy cogitations.)
– XHIVE_NODE: NOT_ANALYZED, STORE.YES (1430537)
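
The mapping from one SPEECH element to field names and values can be sketched as follows. This is an illustration only: it uses the JDK's DOM parser instead of xDB or Lucene, hard-codes the two SubPaths of the example index, and the class and method names (`MultipathMappingSketch`, `toFields`, `parse`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class MultipathMappingSketch {
    // Builds field name -> values for one index root element (one SPEECH).
    static Map<String, List<String>> toFields(Element root, long nodeId) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(root, "", fields);
        add(fields, "XHIVE_NODE", Long.toString(nodeId));
        return fields;
    }

    // Walks the element tree and records values for the two example SubPaths.
    private static void collect(Element parent, String parentPath,
                                Map<String, List<String>> fields) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            Element e = (Element) n;
            String path = parentPath + "/" + e.getTagName();
            if (path.equals("/SPEAKER")) {
                // SubPath1: (/SPEAKER, VALUE_COMPARISON) -> untokenized field
                add(fields, path + "/txt", e.getTextContent());
            } else if (path.endsWith("/LINE")) {
                // SubPath2: (//LINE, FULL_TEXT_SEARCH) -> tokenized field
                add(fields, path + "/tkn", e.getTextContent());
            }
            collect(e, path, fields);
        }
    }

    private static void add(Map<String, List<String>> fields, String name, String value) {
        fields.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Helper for the example: parses a string and returns its root element.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception ex) {
            throw new IllegalStateException("cannot parse XML", ex);
        }
    }
}
```

A LINE nested under p produces the separate field /p/LINE/tkn, which is why one //LINE SubPath can expand into several Lucene fields.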


Lucene Inverted List

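Term Dictionary

[Figure: term dictionary with posting lists. The fields /LINE/tkn and /p/LINE/tkn hold tokens such as great, lack, passion; /SPEAKER/txt holds brutus and cassius; XHIVE_NODE holds 1430532 and 1430537. Each term points to a posting list of Lucene document ids: {1}, {2}, or {1, 2}]

Document Store

Doc ID   XHIVE_NODE
1        1430532
2        1430537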
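
How a term dictionary with posting lists arises from the two example documents can be sketched in a few lines. This is a toy model, not Lucene's implementation: the tokenizer (lower-casing and splitting on non-alphanumerics) and the class name `InvertedListSketch` are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedListSketch {
    // "field:term" -> sorted ids of the documents containing that term.
    private final Map<String, TreeSet<Integer>> postings = new TreeMap<>();

    // Analyzed field: lower-case the text and split it into tokens.
    void addAnalyzed(int docId, String field, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                addTerm(docId, field, token);
            }
        }
    }

    // Not-analyzed field: the whole value is indexed as a single term.
    void addTerm(int docId, String field, String term) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    // Returns a copy of the posting list, or an empty set for an unknown term.
    TreeSet<Integer> lookup(String field, String term) {
        TreeSet<Integer> hit = postings.get(field + ":" + term);
        return hit == null ? new TreeSet<Integer>() : new TreeSet<Integer>(hit);
    }
}
```

With the two SPEECH documents indexed, the term great in /p/LINE/tkn maps to both document ids, while each speaker value maps to a single document.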


Lucene Query Mapping

for $SPEECH score $s in collection('col1')//SPEECH[SPEAKER='CASSIUS' and //LINE contains text 'great']
order by $s
return $SPEECH

BooleanQuery(
    TermQuery1,
    BooleanQuery(TermQuery2, TermQuery3, BooleanClause.Occur.SHOULD),
    BooleanClause.Occur.MUST)
TermQuery1 = TermQuery(new Term("/speaker/txt", "CASSIUS"))
TermQuery2 = TermQuery(new Term("/line/tkn", "great"))
TermQuery3 = TermQuery(new Term("/p/line/tkn", "great"))
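
The semantics of the mapped query can be sketched as set operations on posting lists: the inner SHOULD clauses form a union, and the outer MUST clauses form an intersection. The class `BooleanQuerySketch` is hypothetical and deliberately avoids the Lucene API; scoring is left out.

```java
import java.util.Set;
import java.util.TreeSet;

final class BooleanQuerySketch {
    // Two SHOULD clauses with no MUST clause: a document matches if either matches.
    static Set<Integer> should(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    // Two MUST clauses: a document matches only if both clauses match.
    static Set<Integer> must(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a);
        out.retainAll(b);
        return out;
    }
}
```

For the example data, the speaker term matches document 2, great matches documents 1 and 2 through the /p/LINE field, and the combined query therefore yields document 2 only.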


Lucene SubIndexes

Each user transaction creates a separate Lucene subIndex

Transaction performs all the updates in its own index

The delete operation does not physically touch subIndexes created by other transactions

A pair (minLSN, maxLSN) is associated with each subIndex, which is used to construct a global index snapshot
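
One way the (minLSN, maxLSN) pairs could drive snapshot construction is sketched below. The visibility rule used here (a subIndex is visible once its maxLSN is at or before the snapshot LSN) is an assumption made for illustration; the slides do not spell out the exact rule, and the class `SubIndexSnapshotSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SubIndexSnapshotSketch {
    static final class SubIndex {
        final String name;
        final long minLsn;
        final long maxLsn;

        SubIndex(String name, long minLsn, long maxLsn) {
            this.name = name;
            this.minLsn = minLsn;
            this.maxLsn = maxLsn;
        }
    }

    // Assumed rule: a subIndex belongs to the snapshot when all of its
    // changes were logged at or before the snapshot LSN.
    static List<String> visibleAt(List<SubIndex> all, long snapshotLsn) {
        List<String> visible = new ArrayList<>();
        for (SubIndex s : all) {
            if (s.maxLsn <= snapshotLsn) {
                visible.add(s.name);
            }
        }
        return visible;
    }
}
```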



Blacklists

The delete operation of a transaction:
– Physically deletes the document from the transaction’s own subIndex
– Adds a pair (subIndexMinLSN, NODE_ID) to the blacklist structure

The index view constructor applies blacklists to eliminate deleted documents

Periodically, a merge operation merges small subIndexes into a bigger one and physically deletes documents.
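
Applying a blacklist while building an index view can be sketched as a filter over the hits of each subIndex. The pair (subIndexMinLSN, NODE_ID) comes from the slide; the string encoding of the pair and the class `BlacklistSketch` are hypothetical simplifications.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class BlacklistSketch {
    // Each entry encodes one (subIndexMinLSN, NODE_ID) pair as "minLsn:nodeId".
    private final Set<String> entries = new HashSet<>();

    // Called when a transaction deletes a document that lives in another subIndex.
    void add(long subIndexMinLsn, long nodeId) {
        entries.add(subIndexMinLsn + ":" + nodeId);
    }

    // Drops the hits of one subIndex that have been blacklisted.
    List<Long> filter(long subIndexMinLsn, List<Long> hits) {
        List<Long> kept = new ArrayList<>();
        for (long nodeId : hits) {
            if (!entries.contains(subIndexMinLsn + ":" + nodeId)) {
                kept.add(nodeId);
            }
        }
        return kept;
    }
}
```

Because the entry is keyed by the subIndex as well as the node id, the same node re-indexed in a newer subIndex stays visible.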


xDB transaction management

ARIES-based ACID transactions
– Every page has a Log Sequence Number (pageLSN)
– Buffer manager tracks dirty pages using RecLSNs
– Log ALL updates on a per-page basis, including updates performed during rollbacks
– Periodically, an asynchronous thread runs the checkpoint procedure
– The recovery procedure:
▪ Repeat the history: redo all the updates since the last successful checkpoint
▪ Undo incomplete transactions
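
The two recovery steps above can be sketched in a heavily simplified form. The sketch assumes one integer value per page, a log that starts at the last checkpoint, and a known set of committed transactions; it omits the analysis pass and the compensation log records of real ARIES. All names (`AriesRecoverySketch`, `LogRecord`, `Page`) are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AriesRecoverySketch {
    static final class LogRecord {
        final long lsn;
        final int txId;
        final int pageId;
        final int before;
        final int after;

        LogRecord(long lsn, int txId, int pageId, int before, int after) {
            this.lsn = lsn;
            this.txId = txId;
            this.pageId = pageId;
            this.before = before;
            this.after = after;
        }
    }

    static final class Page {
        long pageLsn; // LSN of the last update applied to this page
        int value;    // simplified page content
    }

    static void recover(Map<Integer, Page> pages, List<LogRecord> log, Set<Integer> committed) {
        // Redo: repeat history, skipping updates the page already contains.
        for (LogRecord r : log) {
            Page p = pages.computeIfAbsent(r.pageId, k -> new Page());
            if (p.pageLsn < r.lsn) {
                p.value = r.after;
                p.pageLsn = r.lsn;
            }
        }
        // Undo: roll back incomplete transactions, newest record first.
        for (int i = log.size() - 1; i >= 0; i--) {
            LogRecord r = log.get(i);
            if (!committed.contains(r.txId)) {
                pages.get(r.pageId).value = r.before;
            }
        }
    }
}
```

The pageLSN comparison in the redo pass is what makes replaying the log safe when a page was already flushed with some of the updates applied.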


xDB transaction isolation

READ_WRITE transactions follow the two-phase locking rule:
– Expanding phase: locks are acquired and no locks are released
– Shrinking phase: locks are released and no locks are acquired

READ_ONLY transactions do not acquire any locks!
– The data snapshot at the moment of transaction start is used
– Using log records, we undo recent changes on the page level
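
The lock-free snapshot read can be sketched as walking a page's log backwards and undoing every change that is newer than the reader's snapshot. The sketch assumes one integer value per page and a per-page log ordered by ascending LSN; `SnapshotReadSketch` and `Change` are hypothetical names.

```java
import java.util.List;

final class SnapshotReadSketch {
    static final class Change {
        final long lsn;   // LSN of the update
        final int before; // page value before the update

        Change(long lsn, int before) {
            this.lsn = lsn;
            this.before = before;
        }
    }

    // pageLog must describe a single page and be ordered by ascending LSN.
    static int readAsOf(int currentValue, List<Change> pageLog, long snapshotLsn) {
        int value = currentValue;
        for (int i = pageLog.size() - 1; i >= 0; i--) {
            Change c = pageLog.get(i);
            if (c.lsn <= snapshotLsn) {
                break; // everything older is part of the snapshot
            }
            value = c.before; // undo a change made after the snapshot
        }
        return value;
    }
}
```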


How to integrate Lucene into the transactional xDB database?

Old Solution (xDB 10.1/10.2 releases)
– All Lucene files are stored in a separate directory
– A new transaction model for Lucene indexes is implemented
– Lucene does not use the xDB buffer pool
– Backup/restore and replication do not use xDB mechanisms

New Solution (xDB 10.3)
– All Lucene files are stored in an xDB data segment
– The xDB transaction model is used, since all the updates go through xDB data pages
– Backup/restore and replication are supported automatically


Lucene Index Access Model

New LIDirectoryImpl class is implemented (extends Directory class)

LIDirectory class stores all files in xDB blob objects

LIIndexInput class extends BufferedIndexInput
– void readInternal(byte[] b, int offset, int len)
▪ Reads data from the blob
▪ The blob object is buffered on the xDB buffer management level

LIIndexOutput class extends BufferedIndexOutput
– void flushBuffer(byte[] b, int offset, int len)
▪ Writes Lucene data to the blob object
▪ The operation is logged automatically on the buffer manager level
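
What a blob-backed readInternal has to do can be sketched without Lucene or xDB on the classpath. The class below only mirrors the readInternal(byte[] b, int offset, int len) signature named on the slide; the fixed-size in-memory pages and the name `PagedBlobInputSketch` are assumptions, and the real LIIndexInput would go through the xDB buffer manager instead.

```java
import java.util.Arrays;

final class PagedBlobInputSketch {
    private final byte[][] pages; // blob content split into pages
    private final int pageSize;
    private long position;        // current read position within the blob

    PagedBlobInputSketch(byte[] content, int pageSize) {
        this.pageSize = pageSize;
        int count = (content.length + pageSize - 1) / pageSize;
        this.pages = new byte[count][];
        for (int i = 0; i < count; i++) {
            int start = i * pageSize;
            pages[i] = Arrays.copyOfRange(content, start,
                    Math.min(start + pageSize, content.length));
        }
    }

    void seek(long pos) {
        position = pos;
    }

    // Copies len bytes starting at the current position into b[offset..],
    // crossing page boundaries as needed.
    void readInternal(byte[] b, int offset, int len) {
        while (len > 0) {
            int pageNo = (int) (position / pageSize);
            int inPage = (int) (position % pageSize);
            if (pageNo >= pages.length || inPage >= pages[pageNo].length) {
                throw new IllegalStateException("read past end of blob");
            }
            int n = Math.min(len, pages[pageNo].length - inPage);
            System.arraycopy(pages[pageNo], inPage, b, offset, n);
            offset += n;
            len -= n;
            position += n;
        }
    }
}
```

Since every read is served page by page, caching and logging can happen at the page level without Lucene being aware of it.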


Lucene Index Access Model (cont’d)

[Figure: the Indexer and Queries use IndexWriter and IndexReader (with Lucene caches) on top of LIDirectoryImpl; LIIndexOutput (flushBuffer) and LIIndexInput (readInternal) write and read Lucene blob objects through buffered data pages]


Lucene SubIndex Storage Model

[Figure: storage layout with an LIDirectoryStore (directory page), two LiFileEntryStore entries (BlobStore pages), and their blob pages and blob tails]


Lucene Index Master Record (MIR)

[Figure: the MIR lists subindexes SI_1, SI_2, SI_3, …, SI_N, with their directory objects and blob objects]

• Tracks information about all subindexes and their state
• Represented as a B-tree concurrent index
• Used for Lucene index view construction
• Updated concurrently by ingest transactions and merging/cleaning tasks
• Periodically, asynchronous tasks merge subIndexes into a bigger one


SubIndexes Merging

[Figure: merge tree over subIndexes (nodes B, C, D, E, F, G, H), showing the Final Index and the New Final Index]
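
A merge step can be sketched as combining the documents of several subIndexes while physically dropping the blacklisted ones. Two points are assumptions rather than facts from the slides: the merged subIndex is given the LSN range spanning its parts, and blacklist entries are encoded as "minLsn:nodeId" strings. The class `SubIndexMergeSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class SubIndexMergeSketch {
    static final class SubIndex {
        final long minLsn;
        final long maxLsn;
        final List<Long> nodeIds; // documents, identified by their node ids

        SubIndex(long minLsn, long maxLsn, List<Long> nodeIds) {
            this.minLsn = minLsn;
            this.maxLsn = maxLsn;
            this.nodeIds = nodeIds;
        }
    }

    // Merges the parts into one subIndex, leaving out blacklisted documents.
    static SubIndex merge(List<SubIndex> parts, Set<String> blacklist) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        List<Long> kept = new ArrayList<>();
        for (SubIndex part : parts) {
            min = Math.min(min, part.minLsn);
            max = Math.max(max, part.maxLsn);
            for (long nodeId : part.nodeIds) {
                if (!blacklist.contains(part.minLsn + ":" + nodeId)) {
                    kept.add(nodeId);
                }
            }
        }
        return new SubIndex(min, max, kept);
    }
}
```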


Ingest performance analysis (in seconds)

                        Ingest 10000 docs   Ingest 50000 docs   Ingest 100000 docs
xDB 10.3 (pre-release)  180.956             1009.459            2149.636
xDB 10.2                205.068             1015.937            2526.601


Query performance analysis (response time in ms)

Q1 series: queries with range and 3 value-comparison conditions
Q2 series: queries with full-text and 2 value-comparison conditions

                        Q1       Q2
xDB 10.3 (pre-release)  7.088    10.08
xDB 10.2                7.713    14.013


Future optimizations

Reduce number of separate subIndexes

Final/NonFinal merge optimizations

Advanced buffer management techniques

Concurrent Lucene MultiPath Index