Integrate ManifoldCF with Solr
TRANSCRIPT
![Page 1: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/1.jpg)
October 13-16, 2015 • Austin, TX
![Page 2: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/2.jpg)
Properly integrate ManifoldCF with Solr
Aurélien MAZOYER
Search Expert, Co-founder, France Labs
![Page 3: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/3.jpg)
Apache ManifoldCF
o Agenda
• Overview of ManifoldCF
• Our scenario: find files on a file share
• In real life
![Page 4: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/4.jpg)
Apache ManifoldCF
o Overview
• Connector framework
• Incremental crawling
• Handles authorization
• Configuration via REST API and UI
![Page 5: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/5.jpg)
Apache ManifoldCF
o History
• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta appliance
• Donated to the Apache Software Foundation in 2009
• May 2012: out of incubation
• Current version: 2.2 (August 2015)
![Page 6: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/6.jpg)
Connectors gone wild
o Different connectors for:
• Content repositories
  • Web, Wiki, DB, Email, RSS, CMIS, Alfresco…
  • But also Windows Share, SharePoint, Dropbox…
• Authorities
  • LDAP, AD, CMIS…
• Output
  • Solr, Elasticsearch, OSS…
![Page 7: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/7.jpg)
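Big picture
[Architecture diagram: ManifoldCF]
• Repositories (Wiki, DB, … Repository N), each reached through a repository connector (Conn. 1 … Conn. N)
• Outputs (Solr, Elasticsearch, …), each reached through an output connector (Conn. 1, Conn. 2 … Conn. N)
• Authorities (OpenLDAP, … Authority N), each reached through an authority connector (Conn. 1, Conn. 2 … Conn. N)
• Daemon agent
• ManifoldCF authority service
• ManifoldCF UI
• ManifoldCF API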
![Page 8: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/8.jpg)
Roles of components
o Daemon agent
• Java process
• Runs repository and output connectors
• Runs data crawling jobs
![Page 9: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/9.jpg)
Roles of components
o Authority service
• Web application
• Runs authority connectors
• Gets security tokens for a specific user
![Page 10: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/10.jpg)
Component
o ManifoldCF UI
[Diagram: Repo Connection, Crawl Job and Output Connection, linked with multiplicities 1…1, 1…*, 1…*, 1…*]
That’s it.
![Page 11: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/11.jpg)
API Configuration
o API
![Page 12: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/12.jpg)
Test it!
o For testing purposes:
• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)
![Page 13: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/13.jpg)
Taking MCF to production
Multi-process deployment
o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service
o Daemon agent
o Database
• PostgreSQL
o Synchronization through the filesystem (local) or ZooKeeper (distributed)
![Page 14: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/14.jpg)
Search files with security: Solr + MCF
o Our scenario
• File share using Active Directory
• Search with Solr
• With security constraints
![Page 15: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/15.jpg)
Security model: Solr + MCF
o Authorization
• Early binding
• Index documents with ACLs
• Compute authorization at runtime
o Authentication
• Not handled by Solr/ManifoldCF
• The front-end application should authenticate the user
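The early-binding model above can be illustrated with a minimal sketch. It is not the MCF Solr plugin: the class, the function names and the sample tokens are hypothetical, and the rule used here (at least one allow token matches, no deny token matches) is a simplification of what a real ACL check does.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """A document as stored in the index, with the ACL tokens captured at crawl time."""
    uri: str
    allow_tokens: set = field(default_factory=set)
    deny_tokens: set = field(default_factory=set)


def is_authorized(doc, user_tokens):
    """Simplified rule: a matching deny token always wins, otherwise
    at least one allow token must match one of the user's tokens."""
    if doc.deny_tokens & user_tokens:
        return False
    return bool(doc.allow_tokens & user_tokens)


def filter_results(docs, user_tokens):
    """Return the URIs the user may see; the user tokens are resolved at search time."""
    return [d.uri for d in docs if is_authorized(d, user_tokens)]


# Hypothetical index content and user tokens (not real SIDs).
index = [
    IndexedDoc("file://share/public.txt", {"EVERYONE"}),
    IndexedDoc("file://share/hr.docx", {"GROUP-HR"}),
    IndexedDoc("file://share/secret.docx", {"GROUP-HR"}, {"USER-BOB"}),
]
alice = {"EVERYONE", "GROUP-HR"}
bob = {"EVERYONE", "GROUP-HR", "USER-BOB"}

print(filter_results(index, alice))
print(filter_results(index, bob))
```

The ACLs are fixed when the document is indexed, so a permission change on the file share only reaches the search results after the next crawl.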
![Page 16: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/16.jpg)
Search files with security: Solr + MCF
Phase 1: Indexing
[Diagram: ManifoldCF]
• Repositories: the daemon agent crawls documents with ACLs from the Windows share through the JCIFS connector
• Output connector: the Solr connector sends docs and ACLs to Solr (ExtractingHandler, MCF plugin)
• Authorities: ManifoldCF authority service, AD connector, AD
![Page 17: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/17.jpg)
Search files with security: Solr + MCF
Phase 2: Searching
[Diagram: ManifoldCF]
• The front end sends an authenticated search to Solr
• The MCF plugin in Solr gets the user access tokens from the ManifoldCF authority service (AD connector, AD)
• Solr filters docs based on ACLs and user info
• The front end receives the authorized results
• Same components as in phase 1: daemon agent, JCIFS connector, Solr connector, ExtractingHandler, Windows share
![Page 18: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/18.jpg)
Configure Solr + MCF
o ManifoldCF side
o 4 connections and 1 job
• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling Job
![Page 19: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/19.jpg)
Component
[Diagram: Authority Connection and Authority Group (multiplicities 0…1, 1…*, 1…1, 1…*), together with Repo Connection, Crawl Job and Output Connection (multiplicities 1…1, 1…*, 1…*, 1…*)]
![Page 20: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/20.jpg)
Component
[Diagram: the components of our scenario: AD Connection, AD Group, Windows Share Connection, Crawl Job, Solr Connection]
![Page 21: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/21.jpg)
Configure Solr + MCF
o Front-end side
o Authentication
• For Tomcat
  • JNDI Tomcat Realm
  • Tomcat SPNEGO
![Page 22: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/22.jpg)
Configure Solr + MCF
o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr Plugin (query parser)
o And don’t forget to protect the Solr instance :-P
![Page 23: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/23.jpg)
Configure Solr + MCF
o Leverage the Solr Extracting handler
• Based on Apache Tika
• MIME type detection
• Embedded parsing libraries
• Supported formats:
  • MS Office (OLE2 and OOXML)
  • OpenDocument
  • PDF
  • Audio/video/image files
• Now OCR too, thanks to Tika 1.7 (and Tesseract)
o Now, can be done directly in MCF!
![Page 24: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/24.jpg)
Component
[Diagram: Authority Connection and Authority Group (multiplicities 0…1, 1…*, 1…1, 1…*), Repo Connection, Crawl Job and Output Connection (multiplicities 1…1, 1…*, 1…*, 1…*), plus Transformation Connection (multiplicities 0…*, 1…*)]
![Page 25: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/25.jpg)
Crawling principle
o Crawling model
• Incremental model
• Continuous model
[Figure: Phase 1, Phase 2. Source: ManifoldCF in Action, Chapter 1 (Karl Wright)]
![Page 26: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/26.jpg)
Incremental crawling of file share
o Incremental crawling is not so easy with some repositories:
[Cartoon: the JCIFS connector talks to the Windows share]
JCIFS connector: "Uhuuu, file share, what's new since last time we met?"
Windows share: "Errkkk…"
![Page 27: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/27.jpg)
Incremental crawling of file share: Solr + MCF
o Phase 1: Discovery/Indexing (depth first)
For each start path entry on the Windows share:
• Fetch SMB file attributes
• If the file is a directory and matches the inclusion regex:
  • List files in the SMB directory
  • For each file, repeat from "Fetch SMB file attributes"
• If the file is a regular file and matches the inclusion regex:
  • Check the ingeststatus entry in the crawler DB
  • If there is no entry or the version attribute is different:
    • Fetch file content
    • Push file to Solr
    • Update ingeststatus entry in DB
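The phase 1 flow can be sketched as a small depth-first traversal. This is an illustration under stated assumptions, not ManifoldCF code: the file share, the ingeststatus table and Solr are replaced by plain dictionaries, and all names (`crawl`, `tree`, `files`, `ingeststatus`, `index`) are hypothetical.

```python
import re


def crawl(start_paths, tree, files, ingeststatus, index, include=r".*"):
    """Depth-first discovery and indexing.

    tree:         directory path -> list of child paths (stands in for SMB listing)
    files:        file path -> (version, content) (stands in for SMB attributes/content)
    ingeststatus: file path -> last indexed version (stands in for the crawler DB)
    index:        file path -> content (stands in for Solr)
    Returns the list of files that were (re)indexed during this run.
    """
    pattern = re.compile(include)
    indexed = []
    stack = list(reversed(start_paths))
    while stack:
        path = stack.pop()
        if not pattern.fullmatch(path):
            continue  # does not match the inclusion regex
        if path in tree:
            # Directory: list its children and visit them depth first.
            stack.extend(reversed(tree[path]))
        elif path in files:
            version, content = files[path]
            # Only fetch and push when there is no entry or the version changed.
            if ingeststatus.get(path) != version:
                index[path] = content
                ingeststatus[path] = version
                indexed.append(path)
    return indexed


tree = {"/share": ["/share/a.txt", "/share/sub"], "/share/sub": ["/share/sub/b.txt"]}
files = {"/share/a.txt": ("v1", "A"), "/share/sub/b.txt": ("v1", "B")}
ingeststatus, index = {}, {}

print(crawl(["/share"], tree, files, ingeststatus, index))  # first run: everything is new
print(crawl(["/share"], tree, files, ingeststatus, index))  # second run: nothing changed
```

The second run still walks the whole tree, because the share cannot say what changed; the version comparison only saves fetching and re-indexing the content.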
![Page 28: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/28.jpg)
Incremental crawling of file share
o What is an ingeststatus database entry?
o Simplified version:

| DOCURI | LAST_INGEST | LAST_VERSION |
|---|---|---|
| protocol://REPO_HOST/Doc1.docx | 10.09.2015 18:21:04 | Doc1_Version1 |
| protocol://REPO_HOST/Doc2.docx | 10.09.2015 19:21:04 | Doc2_Version1 |

o LastVersion?
• Here, computed from lastModified and ACLs on the file
• Example value:
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/Test-File.txt+1444462827664:16Y
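A rough sketch of the idea behind the version attribute: combine the ACLs and the last-modified timestamp into one string, so that a change to either one triggers a re-index. The format below is hypothetical and much simpler than the real value the connector stores; `version_string` and `needs_reindex` are illustrative names.

```python
def version_string(last_modified_ms, acl_sids):
    """Build a comparable version from the ACL SIDs and the lastModified timestamp.
    SIDs are sorted so that their listing order does not change the version."""
    return "+".join(sorted(acl_sids)) + "+" + str(last_modified_ms)


def needs_reindex(stored_version, last_modified_ms, acl_sids):
    """True when there is no stored version or when content or ACLs changed."""
    return stored_version != version_string(last_modified_ms, acl_sids)


stored = version_string(1444462827664, ["S-1-5-32-544", "S-1-5-18"])
print(stored)
print(needs_reindex(stored, 1444462827664, ["S-1-5-18", "S-1-5-32-544"]))  # same file, same ACLs
print(needs_reindex(stored, 1444462827664, ["S-1-5-18"]))                  # ACL removed
```

Including the ACLs in the version is what keeps the security tokens in the index up to date even when the file content itself has not changed.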
![Page 29: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/29.jpg)
Incremental crawling of file share: Solr + MCF
o Phase 2: Deleting unreachable documents
For each crawler DB entry:
• Send delete command to Solr
• Update crawler database
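Phase 2 can be sketched the same way. The assumption here is that the crawler remembers which paths it reached during phase 1 (`seen_paths`); every crawler DB entry outside that set is treated as unreachable. The dictionaries stand in for the crawler database and for Solr, and the function name is hypothetical.

```python
def delete_unreachable(ingeststatus, seen_paths, index):
    """Remove every document that has a crawler DB entry but was not reached
    during the last discovery phase. Returns the removed paths."""
    removed = []
    for path in list(ingeststatus):      # copy the keys: the dict is modified below
        if path not in seen_paths:
            index.pop(path, None)        # stands in for the delete command sent to Solr
            del ingeststatus[path]       # update the crawler database
            removed.append(path)
    return removed


ingeststatus = {"/share/a.txt": "v1", "/share/old.txt": "v1"}
index = {"/share/a.txt": "A", "/share/old.txt": "OLD"}

print(delete_unreachable(ingeststatus, {"/share/a.txt"}, index))
print(sorted(index))
```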
![Page 30: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/30.jpg)
How to see what happened
o Search History
o Monitoring
• Job Status
• Notification Connections
![Page 31: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/31.jpg)
How to see what happened
o Search History
o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status
![Page 32: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/32.jpg)
Performance issues
o Find the bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration
![Page 33: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/33.jpg)
Handle performance issues
o Connector-specific configuration
• Throttling
• Max JVM connections
o Can improve speed / limit impact on crawled repository
o Very specific to the repository
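As an illustration of what throttling means for the crawled repository, here is a minimal sketch that spaces fetches evenly under a maximum number of fetches per minute. The function and its parameters are hypothetical and do not reproduce ManifoldCF's own throttling logic.

```python
def seconds_to_wait(now, last_fetch, max_fetches_per_minute):
    """How long to wait before the next fetch so that the average rate
    stays at or below max_fetches_per_minute. Times are in seconds."""
    min_interval = 60.0 / max_fetches_per_minute
    elapsed = now - last_fetch
    return max(0.0, min_interval - elapsed)


# With a limit of 12 fetches per minute, fetches are at least 5 seconds apart.
print(seconds_to_wait(now=3.0, last_fetch=0.0, max_fetches_per_minute=12))
print(seconds_to_wait(now=8.0, last_fetch=0.0, max_fetches_per_minute=12))
```

A lower limit protects the repository at the cost of a longer crawl, which is why the right value depends on the repository.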
![Page 34: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/34.jpg)
Handle performance issues
o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from the crawl
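The two job-level filters can be sketched together. The size limit, the excluded extensions and the function name below are illustrative values chosen for this example, not ManifoldCF defaults.

```python
import re

# Hypothetical settings for this sketch.
MAX_SIZE_BYTES = 10 * 1024 * 1024
EXCLUDED = re.compile(r".*\.(iso|zip|exe|mp4)$", re.IGNORECASE)


def should_ingest(path, size_bytes, max_size=MAX_SIZE_BYTES, excluded=EXCLUDED):
    """Skip documents that are too big or whose extension is excluded."""
    if size_bytes > max_size:
        return False
    return excluded.match(path) is None


print(should_ingest("/share/report.docx", 200_000))
print(should_ingest("/share/backup.ZIP", 200_000))
print(should_ingest("/share/huge.pdf", 50 * 1024 * 1024))
```

Filtering before the content is fetched saves bandwidth on the share as well as extraction time on the Solr side.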
![Page 35: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/35.jpg)
Investigate errors
• Increase the connector’s log level
• Read the MCF simple history
• Thread dump
![Page 36: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/36.jpg)
Common errors in file crawling
o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout
![Page 37: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/37.jpg)
When to use ManifoldCF?
q = crawled_environment:heterogeneous OR scenario:intranet OR security:mandatory
![Page 38: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/38.jpg)
References
o ManifoldCF documentation: https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html
o ManifoldCF in Action (K. Wright): https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
o Securing Solr document with MCF (K. Wright): http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011
o France Labs blog posts:
• http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
• http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/
![Page 39: Integrate ManifoldCF with Solr](https://reader037.vdocument.in/reader037/viewer/2022102423/5871ae721a28abda6a8b61bb/html5/thumbnails/39.jpg)
Datafari
[Screenshots: Search, Admin]
o Intranet “ready to play” search solution
• Apache License
o Embeds:
• Solr
• ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…
www.datafari.com