search lucene

Can you be dynamic and fast? “Miss Marple and the case of the Missing MIPS” Zoë Slattery

Upload: jeremy-coates

Post on 04-Jul-2015


DESCRIPTION

Zoë Slattery's slides from PHPNW08: The ability to store large quantities of local data means that many applications require some form of text search and retrieval facility. The application developer has a number of choices to make. The first is whether to use a complete packaged solution or to build a custom information retrieval (IR) solution from one of the available IR libraries. In this talk I’ll look at the options for PHP programmers who choose to embed IR facilities within their applications. Java programmers have a good range of text retrieval libraries to choose from, but the options for PHP programmers are more limited. At first sight, for a PHP programmer wishing to embed indexing and search facilities in an application, the choice seems obvious: the PHP implementation of Lucene (Zend Search Lucene). There is no requirement to support another language, the code is PHP and therefore easy for PHP programmers to work with, and the license is commercially friendly. However, whilst ease of integration and support are key factors in the choice of technology, performance can also be important, and the performance of the PHP implementation of Lucene is poor compared to the Java implementation. In this talk I’ll explain the differences in performance between the PHP and Java implementations of Lucene and examine the other options available to PHP programmers for whom performance is a critical factor.

TRANSCRIPT

Page 1: Search Lucene

Can you be dynamic and fast?

“Miss Marple and the case of the Missing MIPS”

Zoë Slattery

Page 2: Search Lucene

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Page 3: Search Lucene

Index and search

● Problem of finding relevant information is not new.
– 3000 years BC [1]
– Vannevar Bush, As We May Think, 1945.

● Today, applications that search the Web must be able to provide instant access to more than 10 billion documents

● Many applications need some form of search, e.g. searching your hard drive, email....

1. Lagoze, C., Singhal, A. Information Discovery: Needles and Haystacks. IEEE Internet Computing. Volume 9(3), 16-18, 2005.

Page 4: Search Lucene

Options for information retrieval

● Search engines
– Nutch, SearchBlox.....

● Information Retrieval libraries
– Three with broadly similar features

| Library | Implementation language | Language bindings | Language ports | License |
|---|---|---|---|---|
| Egothor | Java | None | None | BSD like |
| Xapian | C++ | Perl, Python, PHP, Java, TCL | None | GPL |
| Lucene | Java | None | C++, Perl, PHP, C# | Apache 2 |

Page 5: Search Lucene

Lucene [2]

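[Figure: Lucene within an application. Labels: data sources (DB, Web, Filesystem); Gather data; Index documents; Index; Search index; Get user query; Present search results; User. The diagram marks which parts belong to Lucene and which to the application.]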

2. Gospodnetic, O., Hatcher, E. Lucene in Action. Manning Publications Co., Greenwich. 2005.

Page 6: Search Lucene

Lucene indexing

1. Documents
"Oh for a muse of fire that would ascend the brightest heaven of invention....."

Analysis

2. Token stream
[fire] [ascend] [bright] [heaven]

Index creation

3. Inverted index (Terms → Documents)
fire → Henry V, Scouting for boys...
ascend → Aerospace, Henry V...
...

Optimise

4. Optimised inverted index
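The four steps above can be sketched in a few lines of PHP. This is only an illustration under stated assumptions: plain arrays stand in for Lucene's index files, the stop-word list and the second document's text are made up, and there is no stemming (so "brightest" stays as it is rather than becoming [bright] as on the slide) and no optimise step.

```php
<?php
// Minimal sketch: plain arrays stand in for Lucene's on-disk index format.
// The stop-word list and the "Aerospace" text are illustrative assumptions.

/** Split text into lower-case word tokens and drop stop words. */
function analyse(string $text, array $stopWords): array
{
    preg_match_all('/[a-z]+/', strtolower($text), $matches);
    return array_values(array_diff($matches[0], $stopWords));
}

/** Map each term to the list of documents that contain it. */
function buildInvertedIndex(array $documents, array $stopWords): array
{
    $index = [];
    foreach ($documents as $title => $text) {
        // array_unique() so a document is listed once per term
        foreach (array_unique(analyse($text, $stopWords)) as $term) {
            $index[$term][] = $title;
        }
    }
    ksort($index);   // keep terms in alphabetical order
    return $index;
}

$stopWords = ['a', 'for', 'of', 'oh', 'that', 'the', 'would'];
$documents = [
    'Henry V'   => 'Oh for a muse of fire that would ascend the brightest heaven of invention',
    'Aerospace' => 'Rockets ascend on a column of fire',   // made-up text
];

$index = buildInvertedIndex($documents, $stopWords);
print_r($index['fire']);     // documents containing "fire"
print_r($index['ascend']);   // documents containing "ascend"
```

A search for a term is then a single array lookup, which is the point of inverting the document-to-terms relationship at indexing time.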

Page 7: Search Lucene

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times

● Conclusions

Page 8: Search Lucene

Indexing speed

| | Time to index/seconds | Time to optimise/seconds | Total time |
|---|---|---|---|
| Java + JIT | 4 | 0.3 | 4.3 |
| Java | 32 | 3 | 35 |
| PHP | 167 | 43 | 210 |

Benchmark:
● 17.4 MB, 814 files of PHP source code
● Linux/Thinkpad T60

Ouch! Nearly 50 times as fast in Java with JIT (4.3 seconds against 210 seconds)

Page 9: Search Lucene

Why is the performance so bad?

First make sure we are comparing the same thing:

➢ Compare indexes using Luke

➢ Limits on terms
    ➢ Java stops looking after 10,000 terms

➢ Scoring
    ➢ Java rounds down, PHP rounds to closest

➢ Analyser
    ➢ Java Lucene has many analysers

Page 10: Search Lucene

Analysis - Java

Analyzing "A Quick Brown Fox jumped over the Lazy Dog"
StandardAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
SimpleAnalyzer: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
StopAnalyzer: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]

Analyzing "XY&Z Corporation - xyz@example.com"
StandardAnalyzer: [xy&z] [corporation] [xyz@example.com]
SimpleAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]
StopAnalyzer: [xy] [z] [corporation] [xyz] [example] [com]

Page 11: Search Lucene

Analysis - PHP

Analysing "A Quick Brown Fox jumped over the Lazy Dog"
Default (lower case) filter: [a] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]
Stop words filter: [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
Short words filter: [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dog]

Analysing "XY&Z Corporation - xyz@example.com"
Default (lower case) filter: [xy] [z] [corporation] [xyz] [example] [com]
Stop words filter: [xy] [z] [corporation] [xyz] [example] [com]
Short words filter: [xy] [corporation] [xyz] [example] [com]
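The behaviour of the three PHP filters can be imitated with a few plain functions. This is a rough sketch, not the Zend_Search_Lucene classes: the function names, the letters-only tokeniser, the two-word stop list and the minimum length of 2 are all assumptions, chosen so that the output matches the token lists on this slide.

```php
<?php
// Illustrative stand-ins for the three filters; not the Zend_Search_Lucene classes.

/** Tokenise on runs of ASCII letters and lower-case each token. */
function lowerCaseTokens(string $text): array
{
    preg_match_all('/[a-zA-Z]+/', $text, $matches);
    return array_map('strtolower', $matches[0]);
}

/** Drop tokens that appear in the stop-word list. */
function stopWordsFilter(array $tokens, array $stopWords): array
{
    return array_values(array_diff($tokens, $stopWords));
}

/** Drop tokens shorter than $minLength characters. */
function shortWordsFilter(array $tokens, int $minLength): array
{
    return array_values(array_filter(
        $tokens,
        function (string $t) use ($minLength): bool {
            return strlen($t) >= $minLength;
        }
    ));
}

$tokens = lowerCaseTokens("XY&Z Corporation - xyz@example.com");
echo implode(' ', $tokens), "\n";                                  // lower case only
echo implode(' ', stopWordsFilter($tokens, ['a', 'the'])), "\n";   // assumed stop list
echo implode(' ', shortWordsFilter($tokens, 2)), "\n";             // assumed minimum length
```

Because the tokeniser only keeps runs of letters, "xy&z" and the e-mail address are split apart, which is the difference from the Java StandardAnalyzer that has to be removed before the two indexes can be compared.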

Page 12: Search Lucene

Compare indexes

[Screenshots: the Java index and the PHP index, compared. Both contain the same 663 terms.]

Page 13: Search Lucene

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
– Part one
– Part two

● Conclusions

Page 14: Search Lucene

Execution profiles

● Now that we are definitely comparing the same thing, look at execution profiles for the Java and PHP implementations

● Profiling tools (all open source)

– Java
    ● Eclipse TPTP

– PHP
    ● Xdebug
    ● KCachegrind

– System
    ● Sysprof
    ● vmstat, iostat

Page 15: Search Lucene

Java profile

Page 16: Search Lucene

Small problems with TPTP...

| | Time to index/seconds | Time to optimise/seconds | % time in indexing |
|---|---|---|---|
| Java | 2.3 | 0.3 | 88 |
| Java + profile | 687258 | 673851 | 50 |

● Invasive and slow. Takes 600,000 times as long to execute
● Some problems getting it to run on Ubuntu (missing C++ libraries, ksh-specific scripts)
● Output file is machine readable only

But – it's free, open source and it works well enough.

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

Page 17: Search Lucene

PHP profile

Page 18: Search Lucene

No problems with this tool

| | Time to index/seconds | Time to optimise/seconds | % time in indexing |
|---|---|---|---|
| PHP | 5 | 3 | 63 |
| PHP + profile | 70 | 55 | 56 |

● Not as invasive as the Java tool, but still adds to the time and distorts slightly
● Results easy to display with KCachegrind
● Output file is readable

Benchmark data:
● 39 files of PHP source code (php/Zend), 1.2 MB

Page 19: Search Lucene

look at the normalize() code

public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}

Page 20: Search Lucene

The normalize() function

Sum(time in the functions called by normalize()) = 2.92;

18.99 – 2.92 = 16.07

Page 21: Search Lucene

Micro benchmark

<?php
require_once "Token.php";
require_once "LowerCase.php";

$token = new Token("GO", 105, 107);
$filter = new LowerCase();

for ($i = 0; $i < 10000000; $i++) {
    $norm_token = $filter->normalize($token);
}
?>

Page 22: Search Lucene

normalize() opcodes

compiled vars:  !0 = $srcToken, !1 = $newToken
line     #  op                   ext  return   operands
-----------------------------------------------------------------
11     0  RECV 1
13     1  ZEND_FETCH_CLASS :0 'Token'
       2  NEW $1 :0
       3  ZEND_INIT_METHOD_CALL !0, 'getTermText'
       4  DO_FCALL_BY_NAME 0
       5  SEND_VAR_NO_REF $3
       6  DO_FCALL 1 'strtolower'
       7  SEND_VAR_NO_REF $4
14     8  ZEND_INIT_METHOD_CALL !0, 'getStartOffset'
       9  DO_FCALL_BY_NAME 0
      10  SEND_VAR_NO_REF $6
15    11  ZEND_INIT_METHOD_CALL !0, 'getEndOffset'
      12  DO_FCALL_BY_NAME 0
      13  SEND_VAR_NO_REF $8
      14  DO_FCALL_BY_NAME 3
      15  ASSIGN !1, $1
16    ......

Page 23: Search Lucene

System profile

1. Convert to lower case
2. Look up opcodes

Page 24: Search Lucene

How Xdebug works

Script execution:

ZEND_INIT_METHOD_CALL
● Convert function name to lower case
● Look up function in function table

DO_FCALL_BY_NAME
● Call out to profiler – start time
● Execute function
● Call out to profiler – end time

Page 25: Search Lucene

The normalize() function

Sum(time in the functions called by normalize()) = 2.92;

18.99 – 2.92 = 16.07

The remaining 16.07 is consumed in setting up functions to be run
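The cost of setting up a call can be looked at in isolation with a small timing loop. This is a sketch only: the `Box` class and the loop count are hypothetical, and the absolute timings depend heavily on the PHP version and machine, so no particular result is claimed here.

```php
<?php
// Hypothetical class for illustration only; not the Zend Framework Token.
class Box
{
    public $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function getText(): string
    {
        return $this->text;
    }
}

$box = new Box("GO");
$n = 1000000;

// Version 1: one userland method call per pass.
$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $viaMethod = strtolower($box->getText());
}
$withCall = microtime(true) - $start;

// Version 2: same work, but reading the property directly.
$start = microtime(true);
for ($i = 0; $i < $n; $i++) {
    $viaProperty = strtolower($box->text);
}
$withoutCall = microtime(true) - $start;

printf("getter: %.3fs, property: %.3fs\n", $withCall, $withoutCall);
```

Both loops produce the same string, so any difference between the two timings is the price of the getter call itself. Running the same script under a profiler would be expected to widen the gap, since each call then also triggers the profiler's start and end callouts.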

Page 26: Search Lucene

Why is function calling faster in Java?

● Java is a static language. VM structures are known at start up – can't add code on the fly, types are known at compile time.

● The first time a function is called, Java caches a reference to it in a virtual dispatch table. After that, function calls are fast.

● In PHP, code can be added during execution, for example, create_function() and types are not known till code is executed. This makes keeping virtual dispatch tables much more difficult.
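A short illustration of the dynamic behaviour described above. The class and function names are hypothetical, and `eval()` is used instead of `create_function()` because the latter was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Hypothetical illustration: none of these names come from the talk.
class Greeter
{
    public function hello(): string
    {
        return "hello";
    }
}

// The method to call is only known at run time, as a string.
$method = 'hel' . 'lo';
$greeter = new Greeter();
$viaString = $greeter->$method();

// A new function can be defined while the script is already running.
$before = function_exists('shout');
eval('function shout(string $s): string { return strtoupper($s); }');
$after = function_exists('shout');
$shouted = shout($viaString);

var_dump($before, $after);
echo $shouted, "\n";
```

Since neither the method name nor the existence of `shout()` is fixed before execution starts, the engine cannot simply resolve every call once up front in the way a statically compiled language can.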

Page 27: Search Lucene

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
– Part one
– Part two

● Conclusions

Page 28: Search Lucene

PHP profile

Page 29: Search Lucene

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $newToken = new Token(strtolower($srcToken->getTermText()),
                          $srcToken->getStartOffset(),
                          $srcToken->getEndOffset());
    $newToken->setPositionIncrement($srcToken->getPositionIncrement());
    return $newToken;
}

Page 30: Search Lucene

look at the call to normalize()

$token = $this->normalize(new Zend_Search_Lucene_Analysis_Token($str, $pos, $endpos));

public function normalize(Token $srcToken)
{
    $srcToken->setTermText(strtolower($srcToken->getTermText()));
    return $srcToken;
}

normalize() recoded....
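The saving from the recoded version can be counted rather than timed. The sketch below uses a hypothetical, cut-down `Token` class with a call counter added purely for illustration; it is only loosely modelled on the slide code and is not the Zend Framework class.

```php
<?php
// Hypothetical minimal Token, only loosely modelled on the slide code.
class Token
{
    public static $calls = 0;   // counts userland method calls, for illustration
    private $text;
    private $start;
    private $end;
    private $increment = 1;

    public function __construct(string $text, int $start, int $end)
    {
        self::$calls++;
        $this->text = $text;
        $this->start = $start;
        $this->end = $end;
    }
    public function getTermText(): string { self::$calls++; return $this->text; }
    public function setTermText(string $t): void { self::$calls++; $this->text = $t; }
    public function getStartOffset(): int { self::$calls++; return $this->start; }
    public function getEndOffset(): int { self::$calls++; return $this->end; }
    public function getPositionIncrement(): int { self::$calls++; return $this->increment; }
    public function setPositionIncrement(int $i): void { self::$calls++; $this->increment = $i; }
}

/** Original style: build a second Token from the first. */
function normalizeCopy(Token $src): Token
{
    $new = new Token(strtolower($src->getTermText()), $src->getStartOffset(), $src->getEndOffset());
    $new->setPositionIncrement($src->getPositionIncrement());
    return $new;
}

/** Recoded style: lower-case the text of the Token that was passed in. */
function normalizeInPlace(Token $src): Token
{
    $src->setTermText(strtolower($src->getTermText()));
    return $src;
}

$a = new Token("GO", 105, 107);
Token::$calls = 0;
$copy = normalizeCopy($a);
$copyCalls = Token::$calls;

$b = new Token("GO", 105, 107);
Token::$calls = 0;
$same = normalizeInPlace($b);
$inPlaceCalls = Token::$calls;

echo "copy: $copyCalls calls, in place: $inPlaceCalls calls\n";
// prints "copy: 6 calls, in place: 2 calls"
```

The in-place version changes the token it is given, which is safe in the call shown on the slide because the token is created in the same expression and nothing else holds a reference to it.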

Page 31: Search Lucene

After fix

Page 32: Search Lucene

Performance improvement?

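| | Time to index/seconds | Time to optimise/seconds | Total time |
|---|---|---|---|
| Java + JIT | 4 | 0.3 | 4.3 |
| Java | 32 | 3 | 35 |
| PHP | 167 | 43 | 210 |
| PHP + fix | 151 | 43 | 194 |

9.5 % improvement in indexing time (167 → 151 seconds)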
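The figures on this slide can be rechecked with a few lines of arithmetic. The numbers below are copied from the slide; the variable names are only for illustration.

```php
<?php
// Times in seconds, copied from the benchmark slide.
$times = [
    'Java + JIT' => ['index' => 4,   'optimise' => 0.3],
    'Java'       => ['index' => 32,  'optimise' => 3],
    'PHP'        => ['index' => 167, 'optimise' => 43],
    'PHP + fix'  => ['index' => 151, 'optimise' => 43],
];

/** Total time is indexing time plus optimise time. */
function total(array $t): float
{
    return $t['index'] + $t['optimise'];
}

// Improvement from the normalize() fix, as a percentage.
$indexGain = ($times['PHP']['index'] - $times['PHP + fix']['index'])
           / $times['PHP']['index'] * 100;
$totalGain = (total($times['PHP']) - total($times['PHP + fix']))
           / total($times['PHP']) * 100;

// How many times faster Java + JIT is than unmodified PHP, on total time.
$speedup = total($times['PHP']) / total($times['Java + JIT']);

printf("index time improvement: %.1f%%\n", $indexGain);
printf("total time improvement: %.1f%%\n", $totalGain);
printf("Java + JIT speed-up over PHP: %.1f\n", $speedup);
```

The indexing-time improvement works out at about 9.6 %, which corresponds to the 9.5 % quoted on the slide. Measured on total time the gain is about 7.6 %, because the 43 seconds of optimise time are unchanged by the fix.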

Page 33: Search Lucene

Agenda

● Index and search applications

● The problem for PHP programmers

● Understanding execution times
– Part one
– Part two

● Conclusions

Page 34: Search Lucene

Conclusions

● Two reasons why the PHP implementation of Lucene is slow:
– Function calling overhead in PHP
– Inefficient code in the analyser [3]
– These are the main two; there are others....

● Dynamic and fast?
– Hard to get to the same execution speed as Java, but possible to get closer.
– But development speed is much better [4]. What speed do you care about?
– Better not to use Java coding style (lots of methods that do nothing)

● So which implementation of Lucene should I use?
– It depends.....

3. http://framework.zend.com/issues/browse/ZF-3683

4. Prechelt, L. An empirical comparison of seven programming languages. Computer. Volume 33(10), 23-29, 2000.

Page 35: Search Lucene

Options for PHP

[Decision flowchart with yes/no branches]

Questions:
● Do you care about speed?
● Only need basic features?
● Can support Java environment?
● Use a Web Service?

Outcomes:
● Use Zend Search Lucene
● Use Lucene via a Java bridge
● Use SOLR as web service
● No Lucene solution today [5]

5. http://pecl.php.net/package/clucene

Page 36: Search Lucene

Other useful links

● http://www.egothor.org/
● http://xapian.org/
● http://lucene.apache.org/
● http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
● http://www.derickrethans.nl/vld.php
● http://lucene.apache.org/nutch/
● http://www.searchblox.com/
● http://www.xdebug.org/
● http://www.eclipse.org/tptp/
● http://www.getopt.org/luke/
● http://www.projectzero.org
● http://www.ibm.com/developerworks/ (Publication due 24/09/08)
● http://php-java-bridge.sourceforge.net/doc/
● http://www.zend.com/en/products/platform/product-comparison/java-bridge
● http://lucene.apache.org/solr/
● http://www.ibm.com/developerworks/websphere/library/techarticles/0809_phillips/0809_phillips.html