WHITE PAPER
BY DANIEL HARDMAN
AND NATHAN GEORGE
2011
Comparing Oracle and Perfect Search Technologies
SUMMARY: Explores ways that Oracle 11g's full text
engine and Perfect Search's indexing technology
complement one another. Compares and contrasts the
performance of each solution on massive sets of
structured and unstructured data.
Disclaimer
© 2011 Perfect Search Corporation. All rights reserved. This white paper is for informational purposes only and may contain
typographical errors and technical inaccuracies. The content is provided as is, without express or implied warranties of any
kind. Reproduction of this material in any manner whatsoever without the express written permission of Perfect Search Corporation is
strictly forbidden. Perfect Search and the Perfect Search logo are trademarks of Perfect Search Corporation. Other trademarks and
trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Perfect
Search Corporation disclaims any proprietary interest in trademarks and trade names other than its own.
COMPARING ORACLE AND PERFECT SEARCH TECHNOLOGIES
Background
Oracle has a highly sophisticated, feature-rich full text engine. Perfect Search has index and
query technology that radically improves performance and scalability, especially for
unstructured data. Because Perfect Search's technology is not currently packaged with some
supporting components that Oracle offers, this evaluation is based on the premise that the two
technologies are not natural competitors, but rather useful complements to one another. This
paper shows how Perfect Search's speed adds significant value to an Oracle platform.
While many further optimizations are possible using advanced configuration
techniques, this paper focuses on several practical, basic optimizations. Accordingly, this study
organized data in Oracle using well-known best practices, and made only cursory changes to a
standard Perfect Search Appliance. Details about assumptions and how they affected
performance are provided later in this paper.
Test Environment
For simplicity and maximum congruence, all tests for both Oracle and Perfect Search were done
on the same computer. The tests utilized a 64-bit Linux machine with 32 GB of RAM. It had the
latest version of 64-bit CentOS and all recommended updates. The system had eight 1 TB 7200
RPM disks in three RAID0 arrays and two quad-core, hyper-threaded 2.27 GHz Xeon processors.
The computer did not have any third-party applications installed other than some text editors
and basic scripting and development tools.
While we were analyzing Oracle, no Perfect Search software was running; while we were
analyzing Perfect Search, we stopped Oracle.
Test Parameters
We compared data query performance from two separate data sets:
1. Patent grants
This unstructured corpus contains 5.5 million XML documents from the USPTO (3.5 million patent
grants and 2 million patent applications between 1978 and 2006). Each document contains
categorical information, document tracking data, an abstract, inventor and examiner names, a
description of the invention, and formal claims. Documents are commonly 50-100 KB; documents of
1 MB and larger appear occasionally. The combined size of all documents is approximately
500 GB.
A sample document, edited for brevity, follows:
This data was organized into a single table, patent_data, as follows:
CREATE TABLE "PSA"."PATENT_DATA" (
  "ID" NUMBER,
  "PATH" VARCHAR2(100),
  "DATA" "XMLTYPE"."XMLTYPE",
  CONSTRAINT "PK_ID" PRIMARY KEY ("ID") VALIDATE
)
TABLESPACE "PSADATA" PCTFREE 10 INITRANS 1 MAXTRANS 255
STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
LOGGING NOCOMPRESS
XMLTYPE COLUMN "DATA" STORE AS CLOB (
  TABLESPACE "PSADATA" CHUNK 8192
  STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
  PCTVERSION 10 NOCACHE LOGGING
)
The PATH field contains the fully qualified path to the original XML file in the file system. The
DATA field contains the full XML document. We used Oracle Text to index the DATA field.
Two sets of tests were performed. First, using Oracle 11.2.0.1, all 3.5 million patent grants were
indexed and compared to a Perfect Search index of the same 3.5 million documents. A
second set of tests compared Oracle 11.2.0.2 with the Big IO option to Perfect Search using all
5.5 million patent documents. The second set of tests also compared single-threaded
performance to multithreaded performance. The multithreaded tests were run by spawning
10 parallel threads in the test application, with each thread running one-tenth of the query file.
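The multithreaded arrangement described above can be sketched roughly as follows. This is an illustrative Python sketch, not the test application used in the study (which is not shown in this paper); `split_into_chunks`, `run_chunk`, `run_multithreaded`, and the `run_query` callback are hypothetical names introduced here.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(queries, parts):
    """Split queries into `parts` contiguous, near-equal slices."""
    size, extra = divmod(len(queries), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` slices take one additional query each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(queries[start:end])
        start = end
    return chunks


def run_chunk(chunk, run_query):
    """Run one slice serially and return how many queries were executed."""
    for query in chunk:
        run_query(query)
    return len(chunk)


def run_multithreaded(queries, run_query, threads=10):
    """Give each of `threads` workers roughly one-tenth of the query file."""
    chunks = split_into_chunks(queries, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(lambda c: run_chunk(c, run_query), chunks))
```

In a real harness, `run_query` would submit the query to the engine under test; here it is left as a parameter so the partitioning logic can be checked on its own.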
Oracle 11.2.0.1 with default settings took about 38 hours to create the text index of 3.5 million
patent grants. After building the index, ctx_adm's 'MAX_INDEX_MEMORY' parameter was
increased to 2147483648, and we ran a full index optimization using the command:
exec ctx_ddl.optimize_index( 'IDX_PATENT_DATA', 'FULL' );
This command took 61 hours to complete. Using the Oracle 11.2.0.2 Big IO setting, Oracle took
about 76 hours to create the text index for both the grants and applications. Creating an
index using the Big IO option includes the index optimization, so an extra optimization pass
was not necessary. Perfect Search took 26 hours to both import and index all 5.5 million patent
documents. (Perfect Search data import and indexing were not separated, so an import vs.
index metric is not available.) Build speeds for Perfect Search could be increased by
improving the indexing parallelization.
Without the Big IO option, the data size was roughly comparable. With the Big IO option,
Oracle used 68% more disk space than Perfect Search.
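The disk-space comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration in Python; the variable names are introduced here, and the figures are the data and index sizes reported in the Build, Patents figures of this paper.

```python
# Sizes in GB, taken from the patent build figures in this paper.
oracle_big_io_data_gb = 669.66
oracle_big_io_index_gb = 126.79
perfect_search_data_gb = 362.0
perfect_search_index_gb = 110.0

oracle_total_gb = oracle_big_io_data_gb + oracle_big_io_index_gb
perfect_search_total_gb = perfect_search_data_gb + perfect_search_index_gb

# Fraction of additional disk space used by Oracle with Big IO.
extra_fraction = oracle_total_gb / perfect_search_total_gb - 1
print(f"Oracle total: {oracle_total_gb:.2f} GB")
print(f"Perfect Search total: {perfect_search_total_gb:.2f} GB")
print(f"Extra disk space: {extra_fraction:.1%}")
```

The ratio comes out between 68% and 69%, in line with the figure quoted above.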
Build, Patents

Oracle 11.2.0.1
  Data              rows (millions)   size (GB)   load time
                    3.5               354.8       21:30:39
  Indexes                             size (GB)   index time
  $I                                  176.5       39:17:02
  $K                                  0.09
  $R                                  0.05
  full optimize                                   61:11:48
  index total                         176.7       100:28:50
  Total                               532         121:59:39

Oracle 11.2.0.2 with Big IO
  Data              rows (millions)   size (GB)   load time
  Id and path       5.5               0.88        >24 hours
  XML data                            668.78
  data total                          669.66
  Indexes                             size (GB)   index time
  $I                                  119.75
  $R                                  (less than 1 MB)
  $X                                  7.04
  index total                         126.79      76:00:00
  Total                               796.45      >100 hours

Perfect Search
  Data              rows (millions)   size (GB)   load + index time
                    5.5               362.0       (combined)
  Indexes                             size (GB)
  combined                            110.0
  Total                               472         26:00:00
We built the test queries from terms that would likely be used when searching the USPTO
web site.
2. SSDI (Social Security Death Index)
This structured corpus contains information on about 80 million deceased individuals, including
their full names, SSN, birth and death dates, and place(s) of residence. The data is relatively
normalized, but not perfect; for example, there are a few last names that begin with
punctuation characters, a few first names that include "Sr." (which should be parsed as a suffix
instead), and so forth.
The original data is in tab-delimited text files. We imported this into a single table in Oracle:
CREATE TABLE "PSA"."SSDI_DATA" (
  "SSN" CHAR(9),
  "FIRST_NAME" VARCHAR2(50),
  "MIDDLE_INITIAL" CHAR(1),
  "LAST_NAME" VARCHAR2(50),
  "SUFFIX" VARCHAR2(10),
  "BIRTH_DATE" CHAR(10),
  "DEATH_DATE" CHAR(10),
  "STATE_ISSUED" CHAR(2),
  "RES_ZIP" CHAR(5),
  "RES_STATE" CHAR(2),
  "PAY_ZIP" CHAR(5),
  "PAY_STATE" CHAR(2),
  CONSTRAINT "SSN_PK" PRIMARY KEY ("SSN") VALIDATE
)
ORGANIZATION INDEX
TABLESPACE "PSADATA" INITRANS 2 MAXTRANS 255
STORAGE ( INITIAL 64K BUFFER_POOL DEFAULT)
LOGGING NOCOMPRESS
Importantly, we did not attempt to fully normalize the SSDI data. A 3NF schema would require
location fields to be handled differently because zip codes and states are somewhat
redundant, but for maximum performance we did not want to require Oracle to do a join.
Additionally, to improve Oracle's performance, the handling of dates was also somewhat non-
standard. In the original SSDI corpus, date components are separate fields (year, month, day).
Oracle has excellent date composition/decomposition functions, but they impose a
performance penalty over raw field value matching, and they require a fully specified value,
whereas some SSDI date values are only approximate ("January 1896" or even "1896") and
therefore unsuitable for standard manipulation. On the other hand, keeping date components
separate complicates ordering and range calculations. The best compromise was to store dates
as strings in ISO 8601 format (yyyy-mm-dd), truncating them as necessary. This enabled ordering
using a simple text sort and made range calculation a simple matter.
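The effect of storing truncated ISO 8601 strings can be illustrated with a short sketch. This is an illustrative Python example, not the import code used in the study; `iso_date` and `in_range` are hypothetical helpers introduced here.

```python
def iso_date(year, month=None, day=None):
    """Build a yyyy[-mm[-dd]] string, truncated to the known precision."""
    text = f"{year:04d}"
    if month is not None:
        text += f"-{month:02d}"
        if day is not None:
            text += f"-{day:02d}"
    return text


def in_range(value, low, high):
    """Plain string comparison, like birth_date > '1976' AND birth_date < '1979'."""
    return low < value < high


# Approximate and exact dates sort correctly with an ordinary text sort.
dates = [iso_date(1977, 3, 14), iso_date(1896, 1), iso_date(1896)]
print(sorted(dates))
print(in_range(iso_date(1977, 3, 14), "1976", "1979"))
```

One consequence of plain string comparison is worth noting: a strict lower bound of '1976' excludes a value that is exactly '1976', but includes '1976-05-01', because a shorter string sorts before any longer string that begins with it.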
As with the patent data, this data set required roughly the same amount of time to import and
was roughly the same size, on both systems:
Build, SSDI

Oracle
  Data                  rows (millions)   size (GB)   load time
                        80                            15:05:18
  Indexes                                 size (GB)   index time
  pk_ssn                                              09:22.0
  ssdi_clustered                                      02:58.6
  ssdi_by_birth_date                                  03:09.3
  ssdi_by_lower_ln_fn                                 02:27.5
  ssdi_by_pay_zip                                     01:08.4
  ssdi_by_res_zip                                     01:48.7
  ssdi_by_ln~_fn~                                     02:49.6
  ssdi_by_death_date                                  02:20.5
  ssdi_by_soundex                                     02:00.0
  ssdi_by_lower_ln                                    02:30.4
  index total                                         0:30:35
  Total                                   18.7        15:35:53

Perfect Search
  Data                  rows (millions)   size (GB)   load + index time
                        80                34.0        (combined)
  Indexes                                 size (GB)
  combined                                3.4
  Total                                   37.4        40:00:00 (1)
It is worth noting that the ratio of data-to-index-size is quite different. Perfect Search's larger raw
data size is probably caused by a lack of compression in its repository; its small index reflects the
fact that we made no attempt to create composite keys or otherwise optimize.
This particular data set resembles data that's routinely searched by genealogists. Accordingly,
we based our experiment plan for the SSDI data on actual logs from worldvitalrecords.com.
Because genealogical searches are typically oriented around last name, first name, and date
ranges, we chose to use a clustered index on last name, first name, birth date, and death date.
(1) The Perfect Search load + index time is from an appliance tuned for retrieval. In an indexer-only configuration, the current time is 3:00:00, and features currently in beta reduce it further.
Except for the clustered indexes, we placed all other indexes in a separate file group from the
data for ease of analysis.
3. Generated Data (GenData)
We also attempted to index a third, computer-generated data set of 55 million XML
records. Perfect Search finished the index in approximately 72 hours, but the Oracle index with
the Big IO option was projected to take over a month to complete.
Query Specifications
We wanted to compare query performance in both types of data, controlling for factors such as
the following:
Complexity of the query
Compilation versus the use of stored procedures
Existence of a cache
Ranked versus arbitrary search results
Accordingly, we developed a set of query batches for each data set, and stored these batches
as script files. We built large enough batches to place a load on each system for at least a few
seconds at a time; typically batches consisted of hundreds or thousands of queries. After running
through these query sets, it became apparent that test overhead was dominating the measured
performance in many cases, so we switched to query files with tens of thousands of queries
to ensure that overhead was at least ten times smaller than the query time, even for
multithreaded query cases. Generally, each query in a given batch was different (although we did
not prove perfect uniqueness). For SSDI, we also produced a mixed batch, in which queries of all
types were randomly ordered to reflect likely real-world usage patterns.
To control for caching, we ran each batch under two scenarios. To capture clean cache
(uncached) performance, we stopped all database services, synced the disks, and forced the
Linux kernel to drop all its caches. This also eliminated the possibility that the OS had cached
parts of the file system before the query run. Uncached performance is thus a worst-case
scenario.
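The exact commands used for the clean-cache step are not listed in this paper. A minimal sketch of the usual Linux procedure, written in Python, might look like the following; `clean_cache_steps`, `run_steps`, and the service stop command passed in are assumptions made for illustration. Writing 3 to /proc/sys/vm/drop_caches is the standard kernel interface for dropping the page cache, dentries, and inodes, and it requires root.

```python
import subprocess

DROP_CACHES_PATH = "/proc/sys/vm/drop_caches"


def clean_cache_steps(stop_services_cmd):
    """Return the ordered shell commands for a clean-cache run."""
    return [
        stop_services_cmd,               # hypothetical: stop the database services
        "sync",                          # flush dirty pages to disk first
        f"echo 3 > {DROP_CACHES_PATH}",  # drop page cache, dentries, and inodes
    ]


def run_steps(steps, dry_run=True):
    """Print the steps, or execute them when dry_run is False (needs root)."""
    for step in steps:
        if dry_run:
            print(step)
        else:
            subprocess.run(step, shell=True, check=True)


# The service command below is a placeholder, not the one used in the study.
run_steps(clean_cache_steps("service example-db stop"))
```

Running `sync` before dropping caches matters because the kernel only discards clean pages; dirty pages must be written out first.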
After we had run a batch uncached, we re-ran the same batch to measure performance with
pre-populated operating system and program caches (best-case performance). As expected,
we saw speed boosts in the Perfect Search system, but on the Oracle system large query runs
seemed to negate much of the performance gain.
In the case of the patent data, we scanned docs and produced sets of words and phrases that
we knew occurred. For example, one query from our 3-term patent query batch (arbitrary rather
than ranked) was:
SELECT * FROM ( SELECT /* FIRST_ROWS(10) */ id, path FROM patent_data
WHERE CONTAINS( data, 'filler AND bridge AND adapter') > 0 ) WHERE
ROWNUM < 11;
The 1-term, 2-term, 10-term, and 20-term equivalents of this query take a similar form. The ranked
phrase search batch included queries like this:
SELECT * FROM ( SELECT /* FIRST_ROWS(10) */ id, path, score(1) FROM
patent_data WHERE CONTAINS( data, 'about(polymer backbone at least one
free isocynate group)', 1) > 0 ORDER BY score(1) ) WHERE ROWNUM < 11;
The use of the nested query construct was intended to limit exhaustive table scans, as
recommended in Tuning Oracle Text.
SSDI queries were generated by parsing actual logs from worldvitalrecords.com (a Perfect
Search customer). We wrote scripts to generate SQL equivalents of the most common query
types in those logs, and used actual values from those logs as input parameters. This means that
a fair number of the queries could return either no results or an overly large result set, as in "real
world" usage. Also, a few query values are repeated, presumably because a customer double-
clicked a submit button on the WVR web site or searched for the same thing in more than one
repository (WVR has ~13000 "databases" with data similar to SSDI).
Some representative SSDI queries include:
SELECT * FROM SSDI
WHERE metaphone.genprimkey(first_name, 5) = metaphone.genprimkey('Edward', 5)
AND metaphone.genprimkey(last_name, 5) = metaphone.genprimkey('Martel', 5)
AND rownum < 11;

SELECT * FROM SSDI
WHERE ssn = '236225018';

SELECT * FROM SSDI
WHERE (res_zip = '02148' OR pay_zip = '02148')
AND ((birth_date > '1976' AND birth_date < '1979')
  OR (death_date > '1976' AND death_date < '1979'))
AND rownum < 11;

SELECT * FROM SSDI
WHERE lower(first_name) = 'jacob' AND lower(last_name) = 'oskins'
AND (state_issued = 'IN' OR res_state = 'IN' OR pay_state = 'IN')
AND rownum < 11;
We analyzed the estimated execution plans for the queries to confirm that Oracle was using
available indexes in reasonable ways. However, we did not optimize every case--only the
preponderance of common ones.
The SSDI and first set of Patent results were executed using scripts in sqlplus (sql files). We used the
following wrapper script to run each sql script and parsed the elapsed times reported by Oracle
from its log as the basis of our performance numbers.
set timing on;
set autotrace traceonly;
spool &1.out.txt
@&1
spool off
exit
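The parsing step is not shown in this paper. As a rough sketch in Python, assuming the spooled log contains the "Elapsed: HH:MM:SS.ss" lines that SQL*Plus prints when timing is on, the elapsed times could be extracted like this; `elapsed_seconds` and the sample log text are hypothetical.

```python
import re

# Matches lines such as "Elapsed: 00:01:02.50".
ELAPSED = re.compile(r"^Elapsed:\s*(\d+):(\d{2}):(\d{2})\.(\d{2})$")


def elapsed_seconds(spool_text):
    """Return one elapsed time in seconds for each 'Elapsed:' line."""
    times = []
    for line in spool_text.splitlines():
        match = ELAPSED.match(line.strip())
        if match:
            hours, minutes, seconds, hundredths = (int(g) for g in match.groups())
            times.append(hours * 3600 + minutes * 60 + seconds + hundredths / 100)
    return times


# Made-up log excerpt for illustration only.
sample_log = "Elapsed: 00:00:00.00\n10 rows selected.\nElapsed: 00:01:02.50\n"
print(elapsed_seconds(sample_log))
```

Summing the returned list gives the total query time for a batch, from which queries per second follow directly.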
For the second set of Oracle tests, including the multi-threaded tests, a simple JDBC program
was used to execute the queries.
Perfect Search scripts (*.txt) were executed using psutil.exe, a testing tool that calls Perfect
Search's query engine over its web service interface. Psutil processes queries serially and
measures elapsed clock time with microsecond precision.
Assumptions
A few ambiguities required us to make assumptions in our analysis. On the Oracle side, the
resolution of timing data was a tenth of a second. Where elapsed time was reported as
“00:00:00.00”, rather than estimate a millisecond query time for such queries, Oracle was
credited with a query that took 0 ms. For the second set of queries we used the end-to-end
timing where the clock was started when the test tool began, and stopped when the final query
finished executing. This was to avoid high-precision clock problems across multiple cores.
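End-to-end timing of this kind can be sketched as follows. This is an illustrative Python example, not the JDBC test tool used in the study; `time_batch`, `run_query`, and the injectable `clock` parameter are names introduced here.

```python
import time


def time_batch(queries, run_query, clock=time.perf_counter):
    """Time a whole batch with a single clock and return (elapsed_seconds, qps)."""
    start = clock()           # started once, before the first query
    for query in queries:
        run_query(query)
    elapsed = clock() - start  # stopped once, after the final query
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return elapsed, qps
```

Reading one clock twice for the whole batch, rather than once per query, avoids summing many small per-query readings that may have been taken on different cores.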
Testing Observations
When running the small query sets (fewer than 1,000 queries), Perfect Search returned results in less
time than the testing tool needed to start its benchmarking process and establish a
network connection. To compensate, we moved to larger query runs in which the query time
would be at least ten times larger than the overhead time. This created a new problem: what
took less than 3.5 hours on Perfect Search took Oracle 13.5 days to finish.
Oracle appeared to cache data well on small query runs, but as the query runs became
larger, the caching effect was much less noticeable. For example, the complete clean-cache
run took Oracle just over 6.9 days, and the populated cache run took 6.5 days, only a 6%
performance increase. Caching continued to increase Perfect Search performance on the
50,000 query sets. Though operating system statistics revealed that some index sets started to be
swapped in and out of memory, most sets were still cached. On the 500 query runs, operating
system statistics showed that all sets continued to be in the operating system's file system cache.
Results
In general, we found that Perfect Search's query performance on unstructured data was at least
ten times faster than the current Oracle text engine. On a non-ranked search for a single patent
term, Oracle returned nearly 18 queries per second. However, Perfect Search achieved 299
queries per second on the same batch (Patent Graph 3, 1 term).
Both engines paid a cost for increased complexity, but the ratio remained similar. For 10-term
queries in arbitrary order, Oracle handled 0.4 qps, while Perfect Search returned 40 (Patent
Graph 1, 10 terms). The narrowest gap was on ranked searches for 10-term queries: 0.181 qps for
Oracle versus 2 for Perfect Search (Patent Graph 2, 10 terms).
The first two graphs illustrate Oracle 11.2.0.1 numbers without the Big IO feature. The next four
graphs show Oracle 11.2.0.2 with Big IO.
Perfect Search might also add some value in the traditional RDBMS sweet spot of structured
data, but the performance difference is less dramatic. Perfect Search was roughly twice as fast
on a simple last name match (466 versus 1079 qps)--the most common query in the WVR logs
(SSDI Graph, *ln). However, Perfect Search is dramatically faster on some of the more exotic,
criteria-heavy searches. Across all query types in the mixed batch of 10,000 queries from actual
customer logs, Perfect Search outperformed Oracle without the Big IO option by more than 50
times (1 versus 64 qps). With the Big IO option, Oracle performance improves, but Perfect Search
still outperforms it by at least an order of magnitude.
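The throughput figures quoted in this section can be turned into speedup ratios with a few lines of code. The Python sketch below is only an illustration; the dictionary labels and the `speedup` helper are introduced here, and the numbers are the rounded queries-per-second values stated in the text.

```python
# (Oracle qps, Perfect Search qps) as quoted in the Results text.
results = {
    "patents, 1 term, unranked": (18, 299),
    "patents, 10 terms, unranked": (0.4, 40),
    "patents, 10 terms, ranked": (0.181, 2),
    "SSDI, last name match": (466, 1079),
    "SSDI, mixed batch": (1, 64),
}


def speedup(oracle_qps, perfect_search_qps):
    """Ratio of Perfect Search throughput to Oracle throughput."""
    return perfect_search_qps / oracle_qps


for name, (oracle_qps, perfect_search_qps) in results.items():
    print(f"{name}: {speedup(oracle_qps, perfect_search_qps):.1f}x")
```

Because the inputs are rounded figures from the text, the resulting ratios are approximate.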
Patent Graph 1: Ranked search results on 3.5 million patent grants comparing Oracle 11.2.0.1 to Perfect Search
Patent Graph 2: Unranked (Boolean) search results on 3.5 million patent grants comparing Oracle 11.2.0.1 to
Perfect Search
Patent Graph 3: Comparing Oracle 11.2.0.2 with Big IO to Perfect Search on 5.5 million patent documents starting
each query set with a clean file system cache.
Patent Graph 4: Comparing Oracle 11.2.0.2 with Big IO to Perfect Search on 5.5 million patent documents. Each
query set was run once prior to timing the second run in order to populate the file system cache.
Patent Graph 5: Comparing queries that return both the first 10 results and total count on 5.5 million patent
documents starting each query set with a clean file system cache.
Patent Graph 6: Comparing queries that return both the first 10 results and total count on 5.5 million patent
documents. Each query set was run once prior to timing the second run in order to populate the file system cache.