MontySolr: Embedding CPython in Solr
Roman Chyla, CERN, [email protected]
Thursday, May 26, 2011

DESCRIPTION

SPIRES is the biggest bibliographic database for High Energy Physics, arXiv is the biggest full-text repository for High Energy Physics papers, and INSPIRE is the biggest digital library that merges the two.

TRANSCRIPT

Page 1: Embedding CPython in Solr

MontySolr: Embedding CPython in Solr

Roman Chyla, [email protected], May 26, 2011

Page 2: Embedding CPython in Solr

Why should I care?

- Our challenge is to connect Python and Java
  - Without compromises
- We created the MontySolr extension
  - Robust, tested (will be used by our system)
  - But works for any Python application (e.g. Django)
  - And for any C/C++ app that Python understands!
  - Open source (GPL v2)
- Try it out!
  - https://github.com/romanchyla/montysolr

Page 3: Embedding CPython in Solr

Outline

‣ Context
- The Challenge
- Key components
  - Available technologies
  - Our approach
  - Problems solved
- Evaluation
- Wrap-up

Page 4: Embedding CPython in Solr

CERN

- European Organization for Nuclear Research
  - Geneva, Switzerland
- The largest laboratory for High Energy Physics
  - Home to the Large Hadron Collider
  - 40-50K HEP scientists worldwide

Page 12: Embedding CPython in Solr

SPIRES

- Stanford Linear Accelerator Center (SLAC)
  - High-Energy Physics Literature Database
  - Started December 1991
- The first web server outside Europe/CERN
  - The first database on the web

Page 16: Embedding CPython in Solr

Invenio

- Integrated digital library software behind INSPIRE
- Used by very large institutional repositories
  - http://repositories.webometrics.info/toprep_inst.asp
- Customizable virtual collections
- Flexible management of metadata
  - 3 000 authors per article
- Powerful search engine
  - Incl. citation map analysis
- Written in Python (since 2001)
  - 290 000 lines of code

Page 17: Embedding CPython in Solr

Outline

- Context
‣ The Challenge
- Key components
  - Available technologies
  - Our approach
  - Problems solved
- Evaluation
- Wrap-up

Page 18: Embedding CPython in Solr

The Challenge

- HEP scientific community
  - Searches are metadata-oriented
- However, full texts are changing the situation
  - And we want to provide even better service
- Bigger volumes of data
  - NLP processing
  - Semantic search

Page 29: Embedding CPython in Solr

The Challenge

Invenio

Query: supersymmetry AND author:ellis

fulltext:supersymmetry

IDs: 1;2;3;9....

1-6M IDs

1. Only IDs, no score = no ranking
2. Score merging difficult (if available)
3. Push IDs? (e.g. faceting)
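The first problem on this slide (only IDs, no score, so no ranking) can be made concrete with a small sketch. It is written in Python 3 and uses made-up record IDs and scores. The names `merge_ids_only`, `merge_keep_scores`, `invenio_ids` and `solr_hits` are hypothetical and chosen for illustration only; they are not part of Invenio or Solr.

```python
def merge_ids_only(invenio_ids, solr_hits):
    """Intersect Invenio's ID set with Solr's scored hits, keeping IDs only.

    solr_hits is a list of (record_id, score) pairs, best score first.
    The result is a sorted list of IDs, so the relevance order is lost.
    """
    solr_ids = {record_id for record_id, _score in solr_hits}
    return sorted(set(invenio_ids) & solr_ids)


def merge_keep_scores(invenio_ids, solr_hits):
    """Same intersection, done where the scores live, so ranking survives."""
    allowed = set(invenio_ids)
    return [(record_id, score) for record_id, score in solr_hits
            if record_id in allowed]


# Hypothetical data: one side answers author:ellis with bare IDs,
# the other answers fulltext:supersymmetry with scored hits.
invenio_ids = [1, 2, 3, 9]
solr_hits = [(9, 0.92), (4, 0.80), (2, 0.41), (1, 0.15)]

ids_only = merge_ids_only(invenio_ids, solr_hits)
ranked = merge_keep_scores(invenio_ids, solr_hits)
```

In the first function the caller receives the matching IDs in numeric order; in the second the order given by the scores is preserved, which is only possible when the merge happens on the side that holds the scores.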

Page 30: Embedding CPython in Solr

What is the “best” solution?

- We love Python...
  - ...and our applications are written in Python...
- But what if Solr is the master search engine?
  - Merge results inside Solr?
  - Typical size: 1-10 million IDs
  - Expected latency: 1-2 s
- What we want to achieve:
  - Fast transfer of hits from Invenio to Solr
  - Leverage the power of both (no compromises)
  - Developer-friendly integration, simplicity
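To give a feel for why the transfer format matters at this scale, here is a rough Python 3 sketch comparing a delimited text encoding of the hit list (the `1;2;3;9` style shown earlier) with a flat array of native integers. It is an illustration under assumptions, not a description of how MontySolr moves data; the helper names are hypothetical and the hit list is synthetic.

```python
from array import array


def ids_as_text(ids):
    """Serialize IDs as semicolon-delimited text, e.g. b'1;2;3;9'."""
    return ";".join(str(i) for i in ids).encode("ascii")


def ids_as_packed(ids):
    """Serialize IDs as a flat array of native C ints (no parsing needed)."""
    return array("i", ids).tobytes()


def ids_from_packed(data):
    """Rebuild the list of IDs from the packed bytes."""
    unpacked = array("i")
    unpacked.frombytes(data)
    return unpacked.tolist()


ids = list(range(1, 100001))  # synthetic hit list, for illustration only
text_size = len(ids_as_text(ids))
packed_size = len(ids_as_packed(ids))
```

The packed form has a fixed cost per ID and can be read back without parsing, which is the property one wants when millions of IDs must cross the language boundary within the latency budget.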

Page 31: Embedding CPython in Solr

Outline

- Context
- The Challenge
‣ Key components
  - Available technologies
  - Our approach
  - Evaluation
- Demonstration
- Wrap-up

Page 32: Embedding CPython in Solr

To embed Solr (in Java app)

- Your app simulates a Java web container?
  - use EmbeddedSolrServer
- It knows nothing about Java servlets?
  - use the DirectSolrConnection class
- Maybe we are too lazy?
  - Embed the web container (in my case Jetty)
  - Seemed strange (webserver inside webserver)
  - ... but it worked well

Page 37: Embedding CPython in Solr

To use Solr in non-Java app

- Solr is already usable via HTTP requests, but we need something else here...
- Remote objects/calls?
  - Pyro, execnet, CORBA, SOAP...
  - or simply pipes?
- Access Python from Java?
  - Jython
  - JEPP
- Access Java from Python?
  - JPype
  - JCC

Page 38: Embedding CPython in Solr

Jython?

- Implementation of Python in 100% Java
  - Both Java and Python code
  - Truly multithreaded
- C modules will not work
  - but see http://bit.ly/iTRYbb
- Slower than CPython

Page 41: Embedding CPython in Solr

JEPP - Java Embedded Python

- Python code runs inside the Python interpreter
- Embeds the CPython interpreter in Java via the Java Native Interface (JNI)
- http://jepp.sourceforge.net/
  - recently updated (27-Jan)
  - but JCC is more active

Page 43: Embedding CPython in Solr

JCC

- Embeds JVM in Python
- C++ code generator
- C++ object interface wraps a Java library
- C++ wrappers conform to Python's C type system
- result: complete Python extension module

Page 47: Embedding CPython in Solr

To use Solr in non-Java app

Compared options: Jython, JCC, JEPP

Criteria: Python C modules; Speed; No code changes; Access from Python; Access from Java

✓ ✓

✓ ?

✓ ✓

✓ ✓

✓ ... ✓

Page 48: Embedding CPython in Solr

The first try

[Diagram: Invenio; JCC; Solr]

Page 49: Embedding CPython in Solr

The devil is in the details...

Page 50: Embedding CPython in Solr

GIL - Global Interpreter Lock

Unfortunately, a Python webapp is not like Java...

Page 51: Embedding CPython in Solr

GIL - Global Interpreter Lock

We can have 200 threads, but only 4 will run at a time...

Page 53: Embedding CPython in Solr

Fortunately, a solution exists

- JCC can embed Python inside Java
  - Special thanks to Andi Vajda! (JCC creator)
- We write 'empty' classes in Java ...
  - ... and implement them in Python

[Diagram: Python w/ Java inside; Java w/ Python inside]

Page 54: Embedding CPython in Solr

The second try

[Diagram: Invenio frontend; XML; Solr w/ Invenio (backend); JCC]

Page 55: Embedding CPython in Solr

Implementing the bridge

- Special Java class
  - With method pythonExtension()
- Native method pythonDecRef()
  - JCC provides its implementation
- And a number of other native methods
  - These will be implemented using Python
- Like writing JNI Java/C code but without compilation...

Page 56: Embedding CPython in Solr

MontySolr extension

- JCC has great potential, but also added complexity...
- So the MontySolr project was born
  - Modules must be built in shared mode
  - JCC dynamic library loaded and started from the main thread
  - Simple mechanism of the Python bridge and message
  - Configurable handlers on the Python side
  - Secured dereferencing of the native objects
  - Threading on the Java side
  - Multiprocessing on the Python side
  - Easy ant targets (compilation) ...

Page 57: Embedding CPython in Solr

Hello World - Java part

public class MontySolrBridge extends BasicBridge implements PythonBridge {
    private long pythonObject;

    public void pythonExtension(long pythonObject) {
        this.pythonObject = pythonObject;
    }

    public long pythonExtension() {
        return this.pythonObject;
    }

    public void finalize() throws Throwable {
        pythonDecRef();
    }

    public native void pythonDecRef();

    public void sendMessage(PythonMessage message) {
        PythonVM vm = PythonVM.get();
        vm.acquireThreadState();
        receive_message(message);
        vm.releaseThreadState();
    }

    public native void receive_message(PythonMessage message);
}

Page 58: Embedding CPython in Solr

Hello World - Python part

from montysolr import MontySolrBridge

class SimpleBridge(MontySolrBridge):
    def __init__(self):
        super(SimpleBridge, self).__init__()

    def receive_message(self, message):
        query = message.getParam('query')
        message.setResults('Hello world!')
        print('Python received from Java: %s' % query)
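The round trip between the two Hello World parts can be tried without JCC or a JVM by replacing the Java side with plain Python stand-ins. The sketch below (Python 3) is only an approximation of the control flow: `FakeMessage` and `FakeBridgeBase` are hypothetical classes invented for this illustration and are not part of MontySolr; only the method names `getParam`, `setResults`, `getResults`, `sendMessage` and `receive_message` are taken from the slides.

```python
class FakeMessage(object):
    """Hypothetical stand-in for the Java PythonMessage object."""

    def __init__(self, **params):
        self._params = dict(params)
        self._results = None

    def getParam(self, name):
        return self._params.get(name)

    def setResults(self, results):
        self._results = results

    def getResults(self):
        return self._results


class FakeBridgeBase(object):
    """Hypothetical stand-in for the Java half of the bridge."""

    def sendMessage(self, message):
        # The Java version acquires the Python thread state around this call.
        self.receive_message(message)

    def receive_message(self, message):
        raise NotImplementedError("implemented by the Python subclass")


class SimpleBridge(FakeBridgeBase):
    def receive_message(self, message):
        query = message.getParam("query")
        message.setResults("Hello world! (query was %r)" % query)


message = FakeMessage(query="author:ellis")
SimpleBridge().sendMessage(message)
```

After `sendMessage` returns, the caller reads the answer from the same message object, which mirrors how the Java handler later calls `getResults()`.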

Page 59: Embedding CPython in Solr

Example - running MontySolr

- Java side
  - JRE (32/64 bit)
  - Standard Solr/Lucene jars
  - JCC dynamic library
- Python side
  - Python interpreter (32/64 bit)
  - 4 Python modules (jcc, solr, lucene, montysolr)
- In the main thread
  - First we load JCC
  - Then start Python interpreter ...
  - ... load Python handlers

Page 60: Embedding CPython in Solr

Solr as search service

[Diagram: Invenio frontend; XML; Solr w/ Invenio (backend); JCC]

Page 62: Embedding CPython in Solr

Example

[Diagram: Solr; MyCustomHandler; refersto:author:ellis]

Page 63: Embedding CPython in Solr

Example - Solr custom handler

// build the message for the Python side
PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");

// hand it over to Python and wait for the answer
MontySolrVM.INSTANCE.sendMessage(msg);

// the Python handler stores its answer in the same message
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) result;
}

Page 65: Embedding CPython in Solr

Example - JNI connection

[Diagram: Solr; MyCustomHandler; refersto:author:ellis; PythonBridge; Invenio wrappers]

Page 66: Embedding CPython in Solr

Example - Python side

# search time - called from Java
def perform_search(message):
    query = message.getParam("query")
    hits = call_real_search(query)
    # cast Python list into Java array
    message.setResults(JArray_ints(hits))

# handler is made 'visible' at startup
SolrpieTarget('Invenio:perform_search', perform_search)
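The registration key `'Invenio:perform_search'` matches the sender (`"Invenio"`) and message name (`"perform_search"`) set on the Java side, which suggests a lookup by a `sender:recipient` key. The Python 3 sketch below shows one way such a dispatch could work. It is an assumption made for illustration, not the MontySolr implementation: `HandlerRegistry`, `register` and `dispatch` are hypothetical names, and a plain dict stands in for the message object.

```python
class HandlerRegistry(object):
    """Maps 'sender:recipient' keys to Python callables (illustrative only)."""

    def __init__(self):
        self._targets = {}

    def register(self, key, func):
        self._targets[key] = func

    def dispatch(self, sender, recipient, message):
        key = "%s:%s" % (sender, recipient)
        try:
            handler = self._targets[key]
        except KeyError:
            raise LookupError("no handler registered for %r" % key)
        return handler(message)


def perform_search(message):
    # Stand-in for the real search: pretend every query matches these IDs.
    message["results"] = [1, 2, 3, 9]


registry = HandlerRegistry()
registry.register("Invenio:perform_search", perform_search)

message = {"query": "refersto:author:ellis"}
registry.dispatch("Invenio", "perform_search", message)
```

Keeping the handlers in a table like this is what makes them configurable: a new Python function can be exposed to Java by adding one registration line.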

Page 67: Embedding CPython in Solr

Example

[Diagram: Solr; MyCustomHandler; refersto:author:ellis; PythonBridge; Invenio wrappers; four Invenio instances]

Page 68: Embedding CPython in Solr

Example - Java side again

PythonMessage msg = MontySolrVM.INSTANCE
    .createMessage("perform_search")
    .setSender("Invenio")
    .setParam("query", "refersto:author:ellis");

MontySolrVM.INSTANCE.sendMessage(msg);

// the hits set by the Python handler are now available in Java
Object result = msg.getResults();
if (result != null) {
    int[] hits = (int[]) result;
}

Page 69: Embedding CPython in Solr

Solr as search service

[Diagram: Apache webserver; XML; Solr w/ Invenio (backend); JCC; Invenio; Invenio]

Page 70: Embedding CPython in Solr

Outline

- Context
- The Challenge
- Key components
  - Available technologies
  - Our approach
  - Problems solved
‣ Evaluation
- Wrap-up

Page 71: Embedding CPython in Solr

Memory and garbage collection

Page 72: Embedding CPython in Solr

Comparing speed and load...

Page 73: Embedding CPython in Solr

The effect of cache

Page 74: Embedding CPython in Solr

Robust?

- Extensive siege tests show very good performance and stability under high load
  - 100-200 users, complex searches
  - 50 concurrent users, citation analysis
  - JCC incurs small overhead
- We detected no memory leaks
  - The same as dbpedia.org
- But watch out for errors in C
  - An error in a C module brings down the whole JVM
  - (errors in pure Python modules can be handled)
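The last point, that errors in pure Python can be handled, can be sketched as a wrapper around a handler. This Python 3 example is a minimal illustration under assumptions: `safe_handler` is a hypothetical helper, the message is modelled as a dict, and storing the traceback under an `'error'` key is an invented convention. It can only catch Python exceptions; a crash inside a C extension would still take the process down.

```python
import traceback


def safe_handler(func):
    """Wrap a handler so a Python exception is reported, not propagated."""
    def wrapper(message):
        try:
            func(message)
        except Exception:
            # Hypothetical convention: keep the traceback for the caller.
            message["error"] = traceback.format_exc()
            message["results"] = None
    return wrapper


@safe_handler
def broken_search(message):
    raise ValueError("cannot parse query: %s" % message["query"])


message = {"query": "author:"}
broken_search(message)
```

With this pattern the calling side sees an empty result plus a diagnostic string instead of an exception crossing the language boundary.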

Page 75: Embedding CPython in Solr

Easy to develop/maintain?

- Added complexity
  - Java in the toolbox
  - Need to compile C++ extensions
  - Python/OS version dependencies
- For this we get
  - Easy integration with Invenio
  - The best of two applications
  - A lot of features for free
  - And we can control Solr from Python!

Page 76: Embedding CPython in Solr

Outline

- Context
- The Challenge
- Key components
  - Available technologies
  - Our approach
  - Problems solved
- Evaluation
‣ Wrap-up

Page 77: Embedding CPython in Solr

Wrap-up

- Our challenge was to connect two different languages/systems
  - And we wanted to get the best of the two...
  - So we had to plug Python into Solr
  - And now our Solr knows citation analysis!
- We created the MontySolr extension
  - Robust, tested (will be used by INSPIRE)
  - Works for any Python application (e.g. Django)
  - And for any C/C++ app that Python understands!
  - Free software license
- Try it out! Help us make it better!
  - https://github.com/romanchyla/montysolr

Page 78: Embedding CPython in Solr

Questions?

- MontySolr
  - https://github.com/romanchyla/montysolr
- Roman Chyla
  - Fellow, CERN Scientific Information Service
  - [email protected] @rchyla
  - https://svnweb.cern.ch/trac/rcarepo

Page 79: Embedding CPython in Solr

Additional information

Page 80: Embedding CPython in Solr

Links

- Invenio platform
  - http://invenio-software.org/
- INSPIRE Digital library
  - http://inspirebeta.net/
- Diagrams of JCC and JEPP
  - Andreas Schreiber: Mixing Java and Python
  - http://www.slideshare.net/onyame/mixing-python-and-java
- On Jython C Extension API
  - http://stackoverflow.com/questions/3097466/using-numpy-and-cpython-with-jython
- Demo of a running service:
  - http://insdev01.cern.ch

Page 81: Embedding CPython in Solr

#1 - How to embed Solr (standard)

- solr.client.solrj.embedded.EmbeddedSolrServer

Page 82: Embedding CPython in Solr

#2 - How to embed Solr (simplified)

- solr.servlet.DirectSolrConnection
  - like previous, but simpler
  - all the queries are sent as strings, everything is just a string
  - very flexible and probably suitable for quick integration
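Because everything sent through this interface is a string, the caller's main job is to assemble a request path with properly escaped parameters. The Python 3 sketch below illustrates only that string-building step. `build_request` is a hypothetical helper, not part of Solr, and the parameter values are examples; `/select` is the usual Solr search handler path.

```python
from urllib.parse import urlencode


def build_request(path, **params):
    """Return a 'path?key=value&...' string for a string-only interface."""
    # Sorting keeps the output stable, which helps when comparing requests.
    return "%s?%s" % (path, urlencode(sorted(params.items())))


request = build_request("/select", q="author:ellis", fl="id", rows=10)
```

The resulting string is what would be handed to the string-based connection in place of a typed query object.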

Page 84: Embedding CPython in Solr

#3 - Example of a Solr custom handler

Page 85: Embedding CPython in Solr

#4 - Example Python handler
