search lucene

Post on 04-Jul-2015

4.543 Views

Category:

Technology

1 Downloads

Preview:

Click to see full reader

DESCRIPTION

Zoe Slattery's slides from PHPNW08: The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. From the point of view of the application developer there are a number of choices to make, the first is whether to use a complete packaged solution or whether to use one of the available information libraries to build a custom information retrieval (IR) solution. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications. For Java programmers there is clearly a good range of options for text retrieval libraries, but options for PHP programmers are more limited. At first sight for a PHP programmer wishing to embed indexing and search facilities in their application, the choice seems obvious - the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP therefore easy for PHP programmers to work with and the license is commercially friendly. However, whilst ease of integration and support are key factors in choice of technology, performance can also be important; the performance of the PHP implementation of Lucene is poor compared to the Java implementation. In this talk I’ll explain the differences in performance between PHP implementation of Lucene and the Java implementation and examine the other options available to PHP programmers for whom performance is a critical factor.

TRANSCRIPT

Can you be dynamic and fast?

“Miss Marple and the case of the Missing MIPS”

Zoë Slattery

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Index and search

● Problem of finding relevant information is not new.– 3000 years BC [1]– Vannevar Bush, As We May Think, 1945.

● Today applications that search the Web must be able to provide instant access to > 10 billion documents

● Many applications need some form of search, eg searching your hard drive, email....

1. Lagoze, C. Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.

Options for information retrieval

● Search engines– Nutch, SearchBlox.....

● Information Retrieval libraries– Three with broadly similar features

Egothor

Xapian

Lucene

Implementationlanguage

Languagebindings

Languageports

License

Java None None BSD like

C++Perl, Python,

PHP, Java, TCLNone GPL

Java NoneC++, Perl, PHP, C#

Apache 2

Lucene [2]

DBWeb

Filesystem

Get user query Present search

results

Index

Indexdocuments

Searchindex

Gatherdata

Luce

neA

pplic

atio

n

User

2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.

.

Lucene indexing

Oh for a muse of fire that would

acsend thebrightest

heaven of invention.....

start

fire

ascend

...

Henry V, Scouting for boys...

Aerospace, Henry V...

Terms Documents

3. Inverted index

1. Documents

AnalysisIndex creation

end

[fire] [ascend] [bright] [heaven]

2. Token stream

Optimise

4. Optimised inverted index

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Indexing speed

Java + JIT

Java

PHP

4

32

167

Time to index/seconds

0.3

3

43

Time to optimise/seconds

4.3

35

210

Total time

Benchmark:●17.4 MB, 814 files of PHP source code●Linux/Thinkpad T60

Ouch! nearly 50 times as fast in Java

Why is the performance so bad?

First make sure we are comparing same thing:

➢ Compare indexes using Luke

➢ Limits on terms➢ Java stops looking at 10,000 terms

➢ Scoring➢ Java rounds down, PHP rounds to closest

➢ Analyser➢ Java Lucene has many analysers

Analysis - Java

Analyzing "A Quick Brown Fox jumped over the Lazy Dog" StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Analyzing "XY&Z Corporation - xyz@example.com" StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]

SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

Analysis - PHP

Analysing "A Quick Brown Fox jumped over the Lazy Dog" Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Analysing "XY&Z Corporation - xyz@example.com" Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]

Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]

Short words filter: [xy] [corporation] [xyz] [example] [com]

Compare indexes

Same 663 terms

java

php

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times– Part one– Part two

● Conclusions

Execution profiles

● Now that we are definitely comparing the same thing, look at execution profiles for Java and PHP implementations

● Profiling tools (all open source)

– Java● Eclipse TPTP

– PHP● Xdebug● KCachegrind

– System● Sysprof● vmstat, iostat

Java profile

Small problems with TPTP...

Java

Java + profile

2.3

687258

Time to index/seconds

0.3

673851

Time to optimise/seconds

88

50

% time in indexing

●Invasive and slow. Takes 600,000 times as long to execute●Some problems getting to run on Ubuntu (missing C++ libraries, ksh specific scripts)●Output file is machine readable only

But – it's free, open source and it works enough.

Benchmark data:● 39 files of PHP source code (php/Zend), 1.2 MB

PHP profile

No problems with this tool

PHP

PHP + profile

5

70

Time to index/seconds

3

55

Time to optimise/seconds

63

56

% time in indexing

●Not so invasive as the Java tool but still adds to time and distorts slightly●Results easy to display with KCachegrind●Output file is readable

Benchmark data:● 39 files of PHP source code (php/Zend), 1.2 MB

look at the normalize() code

public function normalize(Token $srcToken ){

$newToken = new Token(strtolower( $srcToken­>getTermText() ),

$srcToken­>getStartOffset(), $srcToken­>getEndOffset());

$newToken­>setPositionIncrement($srcToken­>getPositionIncrement());

     return $newToken; }

The normalize() function

Sum( ) = 2.92;

18.99 – 2.92 = 16.07

Micro benchmark

<?php         require_once "Token.php";         require_once "LowerCase.php"; 

        $token = new Token("GO", 105, 107);         $filter = new LowerCase(); 

        for ($i=0; $i < 10000000; $i++) {                 $norm_token = $filter­>normalize($token);         } ?> 

normalize() opcodes

compiled vars:  !0 = $srcToken, !1 = $newToken line     #  op                   ext  return   operands ­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­­ 11     0  RECV 1 13     1  ZEND_FETCH_CLASS :0 'Token'        2  NEW $1 :0        3  ZEND_INIT_METHOD_CALL !0, 'getTermText'        4  DO_FCALL_BY_NAME 0        5  SEND_VAR_NO_REF $3        6  DO_FCALL 1     'strtolower'        7  SEND_VAR_NO_REF $4 14     8  ZEND_INIT_METHOD_CALL !0, 'getStartOffset'        9  DO_FCALL_BY_NAME 0       10  SEND_VAR_NO_REF $6 15    11  ZEND_INIT_METHOD_CALL  !0, 'getEndOffset'       12  DO_FCALL_BY_NAME 0       13  SEND_VAR_NO_REF $8       14  DO_FCALL_BY_NAME 3       15  ASSIGN  !1, $1 16    ......

System profile

1. Convert to lower case2. Look up opcodes

How Xdebug worksS

crip

t exe

cutio

n

●Convert function name to lower case●Look up function in function table

Execute function

Call out to profiler – start time

Call out to profiler – end time

ZEND_INIT_METHOD_CALL

DO_FCALL_BY_NAME

The normalize() function

Sum( ) = 2.92;

18.99 – 2.92 = 16.07

Is consumed in setting up functions to be run

Why is function calling faster in Java?

● Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time.

● First time a function is called Java caches a reference to it in a virtual dispatch table. After that function calls are fast.

● In PHP, code can be added during execution, for example, create_function() and types are not known till code is executed. This makes keeping virtual dispatch tables much more difficult.

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times– Part one– Part two

● Conclusions

PHP profile

look at the call to normalize()

$token = $this­>normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken ){

$newToken = new Token(strtolower( $srcToken­>getTermText() ), $srcToken­>getStartOffset(), $srcToken­>getEndOffset());

$newToken­>setPositionIncrement($srcToken­>getPositionIncrement());

     return $newToken; }

look at the call to normalize()

$token = $this­>normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize (Token $srcToken) {$srcToken­>setTermText(strtolower($srcToken­>getTermtext()));return $srcToken;

}

normalize() recoded....

After fix

Performance improvement?

PHP + fix

PHP

151

167

Time to index/seconds

43

43

Time to optimise/seconds

Java 32 3 35

194

210

Total time

9.5 % improvement

Java + JIT 4 0.3 4.3

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times– Part one– Part two

● Conclusions

Conclusions

● Two reasons why the PHP implementation of Lucene is slow:– Function calling overhead in PHP– Inefficient code in the analyser [3]– These are the main two, there are others....

● Dynamic and fast?– Hard to get to the same execution speed as Java – but possible to get closer.– But development speed is much better [4]– what speed to you care about?– Better not to use Java coding style (lots of methods that do nothing)

● So which implementation of Lucene should I use?– it depends.....

3. http://framework.zend.com/issues/browse/ZF-36834. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.

Options for PHP

Do you care about

speed?

Use Zend Search Lucene

Only need basic features?

Can support Java environment?

Use a Web Service?

Use Lucene via a Java bridge

No Lucene solution today [5]

Use SOLR as web service

Y

Y

Y

NN N

N

Y

5. http://pecl.php.net/package/clucene

Other useful links

●http://www.egothor.org/●http://xapian.org/●http://lucene.apache.org/●http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html●http://www.derickrethans.nl/vld.php●http://lucene.apache.org/nutch/●http://www.searchblox.com/●http://www.xdebug.org/●http://www.eclipse.org/tptp/●http://www.getopt.org/luke/●http://www.projectzero.org●http://www.ibm.com/developerworks/ (Publication due 24/09/08)●http://php-java-bridge.sourceforge.net/doc/●http://www.zend.com/en/products/platform/product-comparison/java-bridge●http://lucene.apache.org/solr/●http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html

top related