
WHITE PAPER

BY DANIEL HARDMAN

AND NATHAN GEORGE

2011

Comparing Oracle and Perfect Search Technologies

SUMMARY: Explores ways that Oracle 11g's full text

engine and Perfect Search's indexing technology

complement one another. Compares and contrasts

performance of each solution on a massive set of

structured and unstructured data.

Disclaimer

© 2011 Perfect Search Corporation. All rights reserved. This white paper is for informational purposes only and may contain

typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any

kind. Reproduction of this material in any manner whatsoever without the express written permission of Perfect Search Corporation is

strictly forbidden. Perfect Search and the Perfect Search logo are trademarks of Perfect Search Corporation. Other trademarks and

trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Perfect

Search Corporation disclaims any proprietary interest in trademarks and trade names other than its own.


Background

Oracle has a highly sophisticated, feature-rich full text engine. Perfect Search has index and

query technology that radically improves performance and scalability, especially for

unstructured data. Because Perfect Search's technology is not currently packaged with some

supporting components that Oracle offers, this evaluation is based on the premise that the two

technologies are not natural competitors, but rather useful complements to one another. This

paper shows how Perfect Search's speed adds significant value to an Oracle platform.

While a large number of advanced optimizations are possible using advanced configuration

techniques, this paper focuses on several practical basic optimizations. Accordingly, this study

organized data in Oracle using well-known best practices, and made only cursory changes to a

standard Perfect Search Appliance. Details about assumptions and how they affected

performance are provided later in this paper.

Test Environment

For simplicity and maximum congruence, all tests for both Oracle and Perfect Search were done

on the same computer. The tests utilized a 64-bit Linux machine with 32 GB of RAM. It had the

latest version of CentOS 64-bit, and all recommended updates. The system had eight 1 TB 7200

RPM disks in three RAID0 arrays and two quad-core, hyper-threaded 2.27 GHz Xeon processors.

The computer did not have any third-party applications installed other than some text editors

and basic scripting and development tools.

While we were analyzing Oracle, no Perfect Search software was running; while we were

analyzing Perfect Search, we stopped Oracle.

Test Parameters

We compared data query performance from two separate data sets:

1. Patent grants

This unstructured corpus contains 5.5 million xml documents from the USPTO (3.5 million patent

grants and 2 million patent applications between 1978 and 2006). Each document contains

categorical information, document tracking data, an abstract, inventor and examiner names, a

description of the invention, and formal claims. Documents are commonly 50-100 KB; 1 MB and

larger documents appear occasionally. The combined size of all documents is approximately

500 GB.

A sample document, edited for brevity, follows:



This data was organized into a single table, patent_data, as follows:

CREATE TABLE "PSA"."PATENT_DATA" (
    "ID" NUMBER,
    "PATH" VARCHAR2(100),
    "DATA" "XMLTYPE"."XMLTYPE",
    CONSTRAINT "PK_ID" PRIMARY KEY ("ID") VALIDATE
)
TABLESPACE "PSADATA"
PCTFREE 10 INITRANS 1 MAXTRANS 255
STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
LOGGING NOCOMPRESS
XMLTYPE COLUMN "DATA" STORE AS CLOB (
    TABLESPACE "PSADATA"
    CHUNK 8192
    STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
    PCTVERSION 10
    NOCACHE LOGGING
)

The PATH field contains the fully qualified path to the original xml file in the file system. The

DATA field contains the full xml document. We used Oracle Text to index the DATA field.

Two tests were performed. First, using Oracle 11.2.0.1, all 3.5 million patent grants were

indexed and compared to a Perfect Search index of the same 3.5 million documents. A

second set of tests compared Oracle 11.2.0.2 with the big IO option to Perfect Search using all

5.5 million patent documents. The second set of tests also compared single threaded

performance to multithreaded performance. The multithreaded tests were run by spawning

10 parallel threads in the test application with each thread running one-tenth of the query file.
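The multithreaded arrangement described above can be sketched as follows. This is a minimal Python illustration, not the JDBC program used in the study. `run_query` is a hypothetical stand-in for whatever sends one query to the engine, and splitting the file into contiguous slices is an assumption, since the paper does not say how the query file was divided.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_parts(queries, parts=10):
    """Split queries into `parts` contiguous slices of near-equal size."""
    base, extra = divmod(len(queries), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        slices.append(queries[start:end])
        start = end
    return slices


def run_query(query):
    """Hypothetical placeholder: a real harness would send the query to the engine."""
    return len(query)


def run_slice(queries):
    """Run one thread's share of the query file, one query after another."""
    return [run_query(q) for q in queries]


def run_multithreaded(queries, threads=10):
    """Give each of `threads` workers one slice and collect results in order."""
    slices = split_into_parts(queries, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        results = list(pool.map(run_slice, slices))
    return [r for part in results for r in part]
```

Each worker processes its slice serially, so the degree of parallelism stays fixed at the thread count regardless of how many queries the file holds.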

Oracle 11.2.0.1 with default settings took about 38 hours to create the text index of 3.5 million

patent grants. After building the index, ctx_adm's 'MAX_INDEX_MEMORY' parameter was

increased to 2147483648, and we ran a full index optimization using the command:

exec ctx_ddl.optimize_index( 'IDX_PATENT_DATA', 'FULL' );



This command took 61 hours to complete. Using the Oracle 11.2.0.2 Big IO setting, Oracle took

about 76 hours to create the text index for both the grants and applications. Creating an

index using the big IO option includes the index optimization, so an extra optimization pass

was not necessary. Perfect Search took 26 hours to both import and index all 5.5 million patent

documents. (Perfect Search data import and indexing were not separated, so an import vs.

index metric is not available.) Build speeds for Perfect Search could be increased by

improving the indexing parallelization.

Without the Big IO option, the total size on disk was roughly comparable. With the Big IO option,

Oracle used 68% more disk space than Perfect Search.

Build, Patents

Oracle 11.2.0.1

Data              rows (millions)   size (GB)        load time
                  3.5               354.8            21:30:39

Indexes                             size (GB)        index time
$I                                  176.5            39:17:02
$K                                  0.09
$R                                  0.05
full optimize                                        61:11:48
index total                         176.7            100:28:50

Total                               532              121:59:39

Oracle 11.2.0.2 with Big IO (5.5 million rows)

Data                                size (GB)        load time
Id and path                         0.88             >24 hours
Xml data                            668.78
Data total                          669.66

Indexes                             size (GB)        index time
$I                                  119.75
$R                                  less than 1 MB
$X                                  7.04
Index total                         126.79           76:00:00

Total                               796.45           >100 hours

Perfect Search

Data              rows (millions)   size (GB)        load + index time
                  5.5               362.0            (combined)

Indexes                             size (GB)
combined                            110.0

Total                               472              26:00:00
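The 68% disk-space figure quoted before the table can be reproduced from the two totals in it. The short Python check below is only an arithmetic illustration; the constant and function names are ours, and the values are the data-plus-index totals reported for Oracle 11.2.0.2 with Big IO and for Perfect Search.

```python
# Totals (data + indexes, in GB) as reported in the build table.
ORACLE_BIG_IO_TOTAL_GB = 796.45
PERFECT_SEARCH_TOTAL_GB = 472.0


def percent_more(larger, smaller):
    """Return how much bigger `larger` is than `smaller`, as a percentage."""
    return (larger / smaller - 1.0) * 100.0


extra = percent_more(ORACLE_BIG_IO_TOTAL_GB, PERFECT_SEARCH_TOTAL_GB)
print(f"Oracle with Big IO used about {extra:.1f}% more disk space")
```

The result lands between 68% and 69%, consistent with the figure in the text.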



We built the test queries from terms that would likely

be used when searching the USPTO web site.

2. SSDI (Social Security Death Index)

This structured corpus contains information on about 80 million deceased individuals, including

their full names, SSN, birth and death dates, and place(s) of residence. The data is relatively

normalized, but not perfect; for example, there are a few last names that begin with

punctuation characters, a few first names that include "Sr." (which should be parsed as a suffix

instead), and so forth.

The original data is in tab-delimited text files. We imported this into a single table in Oracle:

CREATE TABLE "PSA"."SSDI_DATA" (
    "SSN" CHAR(9),
    "FIRST_NAME" VARCHAR2(50),
    "MIDDLE_INITIAL" CHAR(1),
    "LAST_NAME" VARCHAR2(50),
    "SUFFIX" VARCHAR2(10),
    "BIRTH_DATE" CHAR(10),
    "DEATH_DATE" CHAR(10),
    "STATE_ISSUED" CHAR(2),
    "RES_ZIP" CHAR(5),
    "RES_STATE" CHAR(2),
    "PAY_ZIP" CHAR(5),
    "PAY_STATE" CHAR(2),
    CONSTRAINT "SSN_PK" PRIMARY KEY ("SSN") VALIDATE
)
ORGANIZATION INDEX
TABLESPACE "PSADATA"
INITRANS 2 MAXTRANS 255
STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
LOGGING NOCOMPRESS

Importantly, we did not attempt to fully normalize the SSDI data. A 3NF schema would require

location fields to be handled differently because zip codes and states are somewhat

redundant, but for maximum performance we did not want to require Oracle to do a join.

Additionally, to improve Oracle's performance, the handling of dates was also somewhat non-

standard. In the original SSDI corpus, date components are separate fields (year, month, day).

Oracle has excellent date composition/decomposition functions, but they impose a

performance penalty over raw field value matching, and they require a fully specified value,

whereas some SSDI date values are only approximate ("January 1896" or even "1896") and

therefore unsuitable for standard manipulation. On the other hand, keeping date components

separate complicates ordering and range calculations. The best compromise was to store dates

as strings in ISO 8601 format (yyyy-mm-dd), truncating them as necessary. This enabled ordering

using a simple text sort and made range calculation a simple matter.
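To illustrate why truncated ISO 8601 strings make ordering and range checks simple, here is a small Python sketch. The sample values are invented for illustration and are not taken from the SSDI corpus; `in_range` is a hypothetical helper.

```python
def in_range(date_str, start, end):
    """True if start <= date_str < end, using plain string comparison."""
    return start <= date_str < end


# Invented sample values: full dates, a year-month, and a bare year.
dates = ["1896", "1896-01", "1895-12-31", "1896-01-15", "1897-03-02"]

# A plain text sort puts the values in chronological order, with a
# truncated value sorting just before the fuller dates it is a prefix of.
ordered = sorted(dates)

# Everything dated within 1896, whatever its precision.
in_1896 = [d for d in dates if in_range(d, "1896", "1897")]
```

Because the most significant component comes first and every component is zero-padded, lexicographic order matches chronological order, which is the property the schema relies on.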

As with the patent data, this data set required roughly the same amount of time to import and

was roughly the same size, on both systems:



Build, SSDI

Oracle

Data                  rows (millions)   size (GB)   load time
                      80                            15:05:18

Indexes                                 size (GB)   index time (mm:ss.s)
pk_ssn                                              09:22.0
ssdi_clustered                                      02:58.6
ssdi_by_birth_date                                  03:09.3
ssdi_by_lower_ln_fn                                 02:27.5
ssdi_by_pay_zip                                     01:08.4
ssdi_by_res_zip                                     01:48.7
ssdi_by_ln~_fn~                                     02:49.6
ssdi_by_death_date                                  02:20.5
ssdi_by_soundex                                     02:00.0
ssdi_by_lower_ln                                    02:30.4
index total                                         0:30:35

Total                                   18.7        15:35:53

Perfect Search

Data                  rows (millions)   size (GB)   load + index time
                      80                34.0        (combined)

Indexes                                 size (GB)
Combined                                3.4

Total                                   37.4        40:00:00 [1]

It is worth noting that the ratio of data-to-index-size is quite different. Perfect Search's larger raw

data size is probably caused by a lack of compression in its repository; its small index reflects the

fact that we made no attempt to create composite keys or otherwise optimize.

This particular data set resembles data that's routinely searched by genealogists. Accordingly,

we based our experiment plan for the SSDI data on actual logs from worldvitalrecords.com.

Because genealogical searches are typically oriented around last name, first name, and date

ranges, we chose to use a clustered index on last name, first name, birth date, and death date.

[1] The Perfect Search load + index time for SSDI is from an appliance tuned for retrieval. In an indexer-only configuration, the current time is 3:00:00, and features now in beta have lowered it further.



Except for the clustered indexes, we placed all other indexes in a separate file group from the

data for ease of analysis.

3. Generated Data (GenData)

We also attempted to index a third, computer-generated data set of 55 million xml

records. Perfect Search finished the index in approximately 72 hours, but the Oracle index with

the big IO option was projected to take over a month to complete.

Query Specifications

We wanted to compare query performance in both types of data, controlling for factors such as

the following:

Complexity of the query

Compilation versus the use of stored procedures

Existence of a cache

Ranked versus arbitrary search results

Accordingly, we developed a set of query batches for each data set, and stored these batches

as script files. We built large enough batches to place a load on each system for at least a few

seconds at a time; typically batches consisted of hundreds or thousands of queries. After running

through these query sets it became apparent that the test overhead was dominating the

performance in many cases, so query files in the range of tens-of-thousands of queries were

used to ensure that overhead was at least ten times smaller than the query time, even for multi-

threaded query cases. Generally, each query in a given batch was different (although we did

not prove perfect uniqueness). For SSDI, we also produced a mixed batch, in which queries of all

types were randomly ordered to reflect likely real-world usage patterns.

To control for caching, we ran each batch under two scenarios. To capture clean cache

(uncached) performance, we stopped all database services, synced the disks, and forced the

Linux kernel to drop all its caches. This also eliminated the possibility that the OS had cached

parts of the file system before the query run. Uncached performance is thus a worst-case

scenario.
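The paper does not list the exact commands used to clear the caches. A common way to do this on Linux is to flush dirty pages and then write to `/proc/sys/vm/drop_caches`; the Python sketch below assumes that mechanism and root privileges, and the function name and parameter are ours.

```python
import os


def drop_linux_caches(control_file="/proc/sys/vm/drop_caches"):
    """Flush dirty pages, then ask the kernel to drop its caches.

    Assumes Linux, and root privileges when used with the default path.
    """
    os.sync()  # write dirty pages to disk first so they can be dropped
    with open(control_file, "w") as handle:
        handle.write("3\n")  # 3 = page cache plus dentries and inodes
```

The control file is a parameter only so the function can be exercised against an ordinary file without root access. Database services would still need to be stopped separately before the call, as described above.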

After we had run a batch uncached, we re-ran the same batch to get performance with

pre-populated operating system and program caches (best-case performance). As expected,

we saw speed boosts in the Perfect Search system, but on the Oracle system large query runs

seemed to negate much of the performance gain.

In the case of the patent data, we scanned docs and produced sets of words and phrases that

we knew occurred. For example, one query from our 3-term patent query batch (arbitrary rather

than ranked) was:

SELECT * FROM (
    SELECT /* FIRST_ROWS(10) */ id, path
    FROM patent_data
    WHERE CONTAINS( data, 'filler AND bridge AND adapter') > 0
)
WHERE ROWNUM < 11;



The 1-term, 2-term, 10-term, and 20-term equivalents of this query take a similar form. The ranked

phrase search batch included queries like this:

SELECT * FROM (
    SELECT /* FIRST_ROWS(10) */ id, path, score(1)
    FROM patent_data
    WHERE CONTAINS( data, 'about(polymer backbone at least one free isocynate group)', 1) > 0
    ORDER BY score(1)
)
WHERE ROWNUM < 11;

The use of the nested query construct was intended to limit exhaustive table scans, as

recommended in Tuning Oracle Text.

SSDI queries were generated by parsing actual logs from worldvitalrecords.com (a Perfect

Search customer). We wrote scripts to generate SQL equivalents of the most common query

types in those logs, and used actual values from those logs as input parameters. This means that

a fair number of the queries could return either no results or an overly large result set, as in "real

world" usage. Also, a few query values are repeated, presumably because a customer double-

clicked a submit button on the WVR web site or searched for the same thing in more than one

repository (WVR has ~13000 "databases" with data similar to SSDI).

Some representative SSDI queries include:

SELECT * FROM SSDI
WHERE metaphone.genprimkey(first_name, 5) = metaphone.genprimkey('Edward', 5)
  AND metaphone.genprimkey(last_name, 5) = metaphone.genprimkey('Martel', 5)
  AND rownum < 11;

SELECT * FROM SSDI
WHERE ssn = '236225018';

SELECT * FROM SSDI
WHERE (res_zip = '02148' OR pay_zip = '02148')
  AND ((birth_date > '1976' AND birth_date < '1979')
    OR (death_date > '1976' AND death_date < '1979'))
  AND rownum < 11;

SELECT * FROM SSDI
WHERE lower(first_name) = 'jacob' AND lower(last_name) = 'oskins'
  AND (state_issued = 'IN' OR res_state = 'IN' OR pay_state = 'IN')
  AND rownum < 11;

We analyzed the estimated execution plans for the queries to confirm that Oracle was using

available indexes in reasonable ways. However, we did not optimize every case, only the

preponderance of common ones.

The SSDI and first set of Patent results were executed using scripts in sqlplus (sql files). We used the

following wrapper script to run each sql script and parsed the elapsed times reported by Oracle

from its log as the basis of our performance numbers.

set timing on;
set autotrace traceonly;
spool &1.out.txt
@&1
spool off
exit

Reference for Tuning Oracle Text, cited above: http://download.oracle.com/docs/cd/B28359_01/text.111/b28303/aoptim.htm
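Parsing the spooled elapsed times might look like the sketch below. It assumes the log contains lines of the form `Elapsed: HH:MM:SS.ss`, which is how sqlplus normally reports timing when `set timing on` is active; the function names are ours, and this is not the script used in the study.

```python
import re

# Matches timing lines such as "Elapsed: 00:00:01.25".
ELAPSED = re.compile(r"Elapsed:\s*(\d+):(\d{2}):(\d{2})\.(\d+)")


def elapsed_ms(line):
    """Return the elapsed time on one log line in milliseconds, or None."""
    match = ELAPSED.search(line)
    if match is None:
        return None
    hours, minutes, seconds = (int(g) for g in match.groups()[:3])
    # Pad or trim the fractional digits to exactly three (milliseconds).
    millis = int((match.group(4) + "00")[:3])
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


def total_ms(log_text):
    """Sum every elapsed time found in a spooled log."""
    times = (elapsed_ms(line) for line in log_text.splitlines())
    return sum(t for t in times if t is not None)
```

A query reported as `00:00:00.00` contributes 0 ms to the total under this scheme.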

For the second set of Oracle tests, including the multi-threaded tests, a simple JDBC program

was used to execute the queries.

Perfect Search scripts (*.txt) were executed using psutil.exe, a testing tool that calls Perfect

Search's query engine over its web service interface. Psutil processes queries serially and

measures elapsed clock time with microsecond precision.

Assumptions

A few ambiguities required us to make assumptions in our analysis. On the Oracle side, the

resolution of timing data was a tenth of a second. Where elapsed time was reported as

“00:00:00.00”, rather than estimate a millisecond query time for such queries, Oracle was

credited with a query that took 0 ms. For the second set of queries we used the end-to-end

timing where the clock was started when the test tool began, and stopped when the final query

finished executing. This was to avoid high-precision clock problems across multiple cores.
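End-to-end timing of this kind can be sketched in a few lines of Python. This is an illustration only, not the test tool used in the study: `run_query` is a hypothetical callable that executes one query, and the injectable `clock` parameter is our addition so the function can be checked without waiting on real time.

```python
import time


def timed_batch(queries, run_query, clock=time.perf_counter):
    """Run every query in order and time the whole batch with one clock.

    Returns (elapsed_seconds, queries_per_second).
    """
    start = clock()
    for query in queries:
        run_query(query)
    elapsed = clock() - start
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return elapsed, qps
```

Reading a single clock at the start and end of the batch avoids comparing per-query timestamps taken on different cores, at the cost of folding any harness overhead into the measurement.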

Testing Observations

When running the small query sets (fewer than 1,000 queries), Perfect Search returned results in less

time than it took the testing tool to start its process and establish a

network connection. To combat this, we moved to larger query runs where the query time

would be at least ten times larger than the overhead time. This gave us a new problem: what

took less than 3.5 hours on Perfect Search took Oracle 13.5 days to finish.

Oracle appeared to do well caching data on small query runs, but as the query runs became

larger the caching effect was much less noticeable. For example, the complete clean-cache

run took Oracle just over 6.9 days, and the populated cache run took 6.5 days, only a 6%

performance increase. Caching continued to increase Perfect Search performance on the

50,000 query sets. Though operating system statistics revealed that some index sets started to be

swapped in and out of memory, most sets were still cached. On the 500 query runs operating

system statistics showed that all sets continued to be in the operating system's file system cache.

Results

In general, we found that Perfect Search's query performance on unstructured data was at least

ten times faster than the current Oracle text engine. On a non-ranked search for a single patent

term, Oracle returned nearly 18 queries per second. However, Perfect Search achieved 299

queries per second on the same batch (Patent Graph 3, 1 term).

Both engines paid a cost for increased complexity, but the ratio remained similar. For 10-term

queries in arbitrary order, Oracle handled 0.4 qps, while Perfect Search returned 40 (Patent

Graph 1, 10 terms). The narrowest gap was on ranked searches for 10-term queries: 0.181 qps for

Oracle versus 2 for Perfect Search (Patent Graph 2, 10 terms).
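The ratios behind these comparisons can be made explicit with a short calculation. The Python sketch below uses only the queries-per-second figures quoted in the text; the Oracle single-term figure is rounded up to 18 because the text says "nearly 18", and the labels are ours.

```python
# (Oracle qps, Perfect Search qps) as quoted in the Results text.
RESULTS = {
    "1 term, non-ranked": (18.0, 299.0),
    "10 terms, arbitrary order": (0.4, 40.0),
    "10 terms, ranked": (0.181, 2.0),
}


def speedup(oracle_qps, perfect_search_qps):
    """Return how many times faster Perfect Search was than Oracle."""
    return perfect_search_qps / oracle_qps


for label, (oracle, perfect) in RESULTS.items():
    print(f"{label}: {speedup(oracle, perfect):.1f}x")
```

Every ratio comes out above ten, with the ranked 10-term case the smallest, which matches the "at least ten times faster" summary and the "narrowest gap" remark.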



The first two graphs illustrate Oracle 11.2.0.1 numbers without the Big IO feature. The next four

graphs show Oracle 11.2.0.2 with big IO.

Perfect Search might also add some value in the traditional RDBMS sweet spot of structured

data, but the performance difference is less dramatic. Perfect Search was roughly twice as fast

on a simple last name match (466 qps for Oracle versus 1,079 qps for Perfect Search), the most common query in the WVR logs

(SSDI Graph, *ln). However, Perfect Search is dramatically faster on some of the more exotic,

criteria-heavy searches. Across all query types in the mixed batch of 10,000 queries from actual

customer logs, Perfect Search outperformed Oracle without the Big IO option by more than 50

times (1 qps for Oracle versus 64 qps for Perfect Search). With the Big IO option, Oracle performance improves, but Perfect Search

still outperforms it by at least an order of magnitude.



Patent Graph 1: Ranked search results on 3.5 million patent grants comparing Oracle 11.2.0.1 to Perfect Search



Patent Graph 2: Unranked (Boolean) search results on 3.5 million patent grants comparing Oracle 11.2.0.1 to

Perfect Search



Patent Graph 3: Comparing Oracle 11.2.0.2 with Big IO to Perfect Search on 5.5 million patent documents starting

each query set with a clean file system cache.



Patent Graph 4: Comparing Oracle 11.2.0.2 with Big IO to Perfect Search on 5.5 million patent documents. Each

query set was run once prior to timing the second run in order to populate the file system cache.



Patent Graph 5: Comparing queries that return both the first 10 results and total count on 5.5 million patent

documents starting each query set with a clean file system cache.



Patent Graph 6: Comparing queries that return both the first 10 results and total count on 5.5 million patent

documents. Each query set was run once prior to timing the second run in order to populate the file system cache.