Transcript
Page 1: Integrate ManifoldCF with Solr

OCTOBER 13-16, 2015 • AUSTIN, TX

Page 2: Integrate ManifoldCF with Solr

Properly integrate ManifoldCF with Solr

Aurélien MAZOYER

Search Expert, Co-founder, France Labs

Page 3: Integrate ManifoldCF with Solr

Apache ManifoldCF

o Agenda

• Overview of ManifoldCF
• Our scenario: find files on a file share
• In real life

Page 4: Integrate ManifoldCF with Solr

Apache ManifoldCF

o Overview

• Connector framework
• Incremental crawling
• Handles authorization
• Configuration via REST API and UI

Page 5: Integrate ManifoldCF with Solr

Apache ManifoldCF

o History

• Based on the "Connector Framework" developed by Karl Wright for the MetaCarta appliance
• Donated to the Apache Software Foundation in 2009
• May 2012: out of incubation
• Current version: 2.2 (August 2015)

Page 6: Integrate ManifoldCF with Solr

Connectors gone wild

o Different connectors for:
• Content repositories: Web, Wiki, DB, Email, RSS, CMIS, Alfresco… but also Windows Share, SharePoint, Dropbox…
• Authorities: LDAP, AD, CMIS…
• Output: Solr, Elasticsearch, OSS…

Page 7: Integrate ManifoldCF with Solr

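Big picture

[Diagram: ManifoldCF architecture]
o Daemon agent: hosts the repository connectors (Conn. 1 … Conn. N) and the output connectors (Conn. 1, Conn. 2 … Conn. N)
o ManifoldCF authority service: hosts the authority connectors (Conn. 1, Conn. 2 … Conn. N)
o Repositories: Wiki, DB, … Repository N
o Outputs: Solr, Elasticsearch, …
o Authorities: OpenLDAP, … Authority N
o ManifoldCF UI and ManifoldCF API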

Page 8: Integrate ManifoldCF with Solr

Roles of components

o Daemon agent

• Java process
• Runs repository and output connectors
• Runs data crawling jobs

Page 9: Integrate ManifoldCF with Solr

Roles of components

o Authority service

• Web application
• Runs authority connectors
• Gets security tokens for a specific user

Page 10: Integrate ManifoldCF with Solr

Component

[Diagram: Repo Connection, Output Connection, and Crawl Job boxes, linked with cardinalities 1…1, 1…*, 1…*, 1…*]

o ManifoldCF UI

That’s it.

Page 11: Integrate ManifoldCF with Solr

API Configuration

o API

Page 12: Integrate ManifoldCF with Solr

Test it!

o For testing purposes:

• java -jar start.jar
• All-in-one process
• Embedded database (HSQL)

Page 13: Integrate ManifoldCF with Solr

Taking MCF to production

Multi-process deployment

o 3 web applications in a servlet container
• mcf-crawler-ui
• mcf-authority-service
• mcf-api-service

o Daemon agent
o Database
• PostgreSQL

o Synchronization: local (filesystem) or distributed (ZooKeeper)

Page 14: Integrate ManifoldCF with Solr

Search files with security: Solr + MCF

o Our scenario

• File share using Active Directory
• Search with Solr
• With security constraints

Page 15: Integrate ManifoldCF with Solr

Security model: Solr + MCF

o Authorization
• Early binding
• Index documents with ACLs
• Compute authorization at runtime

o Authentication
• Not handled by Solr/ManifoldCF
• Front-end application should authenticate the user
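
A minimal sketch of the early-binding idea, not the actual MCF Solr plugin: each document is indexed with allow and deny tokens taken from its ACLs, and at query time the authenticated user's tokens decide which documents are returned. The names IndexedDoc, is_visible, and authorized_results are hypothetical, and the rule "deny wins, otherwise one allow token must match" is a simplifying assumption.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """A document as stored in the index, with ACL tokens attached at crawl time."""
    doc_id: str
    allow_tokens: set = field(default_factory=set)
    deny_tokens: set = field(default_factory=set)


def is_visible(doc: IndexedDoc, user_tokens: set) -> bool:
    """Assumed rule: any matching deny token hides the document;
    otherwise at least one allow token must match."""
    if doc.deny_tokens & user_tokens:
        return False
    return bool(doc.allow_tokens & user_tokens)


def authorized_results(docs, user_tokens):
    """Keep only the documents the user may see, in index order."""
    return [d.doc_id for d in docs if is_visible(d, user_tokens)]
```

The front end is still responsible for establishing who the user is; this filter only works once the user's tokens are known.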

Page 16: Integrate ManifoldCF with Solr

Search files with security: Solr + MCF

Phase 1: Indexing

[Diagram: indexing flow]
o Daemon agent, JCIFS connector (repository connector): crawls documents with ACLs from the Windows Share
o Daemon agent, Solr connector (output connector): sends docs and ACLs to Solr
o Solr: ExtractingHandler and MCF plugin
o ManifoldCF authority service: AD connector (authority connector), connected to AD

Page 17: Integrate ManifoldCF with Solr

Search files with security: Solr + MCF

Phase 2: Searching

[Diagram: search flow]
o Front end: sends an authenticated search to Solr
o Solr MCF plugin: gets the user access token from the ManifoldCF authority service (AD connector, connected to AD)
o Solr MCF plugin: filters docs based on ACLs and user info
o Front end: receives the authorized results
o Also shown: daemon agent with JCIFS connector and Solr connector, ExtractingHandler, Windows Share
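
The "get user access token" step can be sketched as follows. This assumes the authority service answers with a line-based text response in which each access token sits on a line starting with TOKEN:; that prefix, the sample response, and the function name parse_user_tokens are assumptions for illustration and are not taken from the slides.

```python
def parse_user_tokens(response_text: str) -> list:
    """Extract access tokens from an assumed line-based authority response.

    Lines that do not start with the assumed TOKEN: prefix (for example
    status lines) are ignored.
    """
    prefix = "TOKEN:"
    tokens = []
    for line in response_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            tokens.append(line[len(prefix):])
    return tokens


# Hypothetical response for one authenticated user.
sample_response = "AUTHORIZED:ad\nTOKEN:S-1-5-32-544\nTOKEN:S-1-5-18\n"
user_tokens = parse_user_tokens(sample_response)
```

The resulting token list is what the search side would then compare against the ACL tokens stored with each document.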

Page 18: Integrate ManifoldCF with Solr

Configure Solr + MCF

o ManifoldCF side
o 4 connections and 1 job

• Create Windows Share connection
• Create Solr connection
• Create Active Directory connection
• Create Authority Group connection
• Create a crawling job

Page 19: Integrate ManifoldCF with Solr

Component

[Diagram: Authority Group, Authority Connection, Repo Connection, Output Connection, and Crawl Job boxes, linked with cardinalities 0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*]

Page 20: Integrate ManifoldCF with Solr

Component

[Diagram: components of the scenario: AD Group, AD Connection, Windows Share Connection, Solr Connection, Crawl Job]

Page 21: Integrate ManifoldCF with Solr

Configure Solr + MCF

o Front-end side
o Authentication

• For Tomcat
• JNDI Tomcat Realm
• Tomcat SPNEGO

Page 22: Integrate ManifoldCF with Solr

Configure Solr + MCF

o Solr side
o Modify schema.xml
• Add fields for security tokens
o Modify solrconfig.xml
• Add MCF Solr plugin (query parser)

o And don’t forget to protect the Solr instance :-P

Page 23: Integrate ManifoldCF with Solr

Configure Solr + MCF

o Leverage the Solr extracting handler
• Based on Apache Tika
• MIME type detection
• Embeds parsing libraries
• Supported formats: MS Office (OLE2 and OOXML), OpenDocument, PDF, audio/video/image files
• Now OCR, thanks to Tika 1.7 (and Tesseract)
o Now, extraction can be done directly in MCF!

Page 24: Integrate ManifoldCF with Solr

Component

[Diagram: Authority Group, Authority Connection, Repo Connection, Output Connection, and Crawl Job boxes with their cardinalities (0…1, 1…*, 1…1, 1…*, 1…1, 1…*, 1…*, 1…*), plus a Transformation Connection (0…*, 1…*)]

Page 25: Integrate ManifoldCF with Solr

Crawling principle

o Crawling model

• Incremental model
• Continuous model

[Diagram: Phase 1, Phase 2]

Source: ManifoldCF in Action, Chapter 1 (Karl Wright)

Page 26: Integrate ManifoldCF with Solr

Incremental crawling of file share

o Incremental crawling is not so easy with some repositories:

[Cartoon: JCIFS connector talking to the Windows Share]
JCIFS connector: "Uhuuu, file share, what's new since last time we met?"
Windows Share: "Errkkk…"

Page 27: Integrate ManifoldCF with Solr

Incremental crawling of file share: Solr + MCF

o Phase 1: Discovery/Indexing (depth first)

[Flowchart]
• For each start path entry on the Windows Share:
• Fetch SMB file attributes
• If the file is a directory and matches the inclusion regex: list files in the SMB directory, then repeat for each file
• If the file is a regular file and matches the inclusion regex: check the ingeststatus entry in the crawler DB
• If there is no entry or the version attribute is different: fetch file content, push file to Solr, update the ingeststatus entry in DB
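
The discovery and indexing flow on this slide can be sketched as below. This is an illustration under stated assumptions, not ManifoldCF code: the share is modeled as a plain dictionary (fs), the crawler DB as a dictionary (ingest_status), and crawl and push_to_index are hypothetical names. For simplicity the inclusion regex is also applied to the start paths.

```python
import re


def crawl(start_paths, fs, ingest_status, include_regex, push_to_index):
    """Depth-first discovery; only new or changed files are fetched and pushed.

    fs: stand-in for the share, mapping path -> {'is_dir': bool, ...};
        directories carry 'children', files carry 'version' and 'content'.
    ingest_status: stand-in for the crawler DB, mapping path -> last version.
    Returns the set of file paths seen during this crawl.
    """
    pattern = re.compile(include_regex)
    seen = set()
    stack = list(reversed(start_paths))
    while stack:
        path = stack.pop()
        entry = fs[path]                       # "fetch SMB file attributes"
        if not pattern.search(path):           # inclusion regex check
            continue
        if entry["is_dir"]:
            # "list files in SMB directory", keeping listing order
            stack.extend(reversed(entry["children"]))
            continue
        seen.add(path)
        # Re-index only if there is no entry or the version differs.
        if ingest_status.get(path) != entry["version"]:
            push_to_index(path, entry["content"])
            ingest_status[path] = entry["version"]
    return seen
```

Running the same crawl twice without changes pushes nothing the second time, which is the point of the incremental model.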

Page 28: Integrate ManifoldCF with Solr

Incremental crawling of file share

o What is the ingeststatus database entry?
o Simplified version:

DOCURI                          LAST_INGEST          LAST_VERSION
protocol://REPO_HOST/Doc1.docx  10.09.2015 18:21:04  Doc1_Version1
protocol://REPO_HOST/Doc2.docx  10.09.2015 19:21:04  Doc2_Version1

o LastVersion?

• Here, computed from lastModified and ACLs on the file
• Example:
+S-1-5-18+S-1-5-21-3380247023-2036360560-1108467148-1118+S-1-5-21-3380247023-2036360560-1108467148-500+S-1-5-32-544+1+DEAD_AUTHORITY+-file://///52.30.17.184/ShareFolder/Test-File.txt+1444462827664:16Y
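
To illustrate why a version computed from lastModified and ACLs catches both content changes and permission changes, here is a small sketch. The string layout is hypothetical and deliberately simpler than the real version string shown on the slide; version_string and needs_reindex are illustrative names.

```python
def version_string(last_modified_ms, allow_sids, deny_sids=()):
    """Build a comparable version value from the ACLs and the modification time.

    SIDs are sorted so that the same ACL always yields the same string,
    whatever order the repository returns them in.
    """
    allow = "+".join(sorted(allow_sids))
    deny = "+".join(sorted(deny_sids))
    return f"allow={allow};deny={deny};mtime={last_modified_ms}"


def needs_reindex(ingest_status, doc_uri, current_version):
    """True when there is no stored entry or the stored version differs."""
    return ingest_status.get(doc_uri) != current_version
```

With this scheme, a change in access rights alone is enough to trigger re-indexing, so the ACLs stored in the index stay in step with the file share.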

Page 29: Integrate ManifoldCF with Solr

Incremental crawling of file share: Solr + MCF

o Phase 2: Deleting unreachable documents

[Flowchart]
• For each crawler DB entry that is no longer reachable:
• Send delete command to Solr
• Update crawler database
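
The cleanup phase can be sketched in the same spirit, assuming the crawler DB is a dictionary keyed by document path and that the set of paths seen during the last crawl is available. delete_unreachable and delete_from_index are hypothetical names, not ManifoldCF APIs.

```python
def delete_unreachable(ingest_status, seen, delete_from_index):
    """Remove every crawler DB entry whose document was not seen in the last crawl.

    For each such entry, a delete is sent to the index first, then the
    crawler DB entry is dropped. Returns the removed paths in sorted order.
    """
    removed = []
    for path in sorted(set(ingest_status) - set(seen)):
        delete_from_index(path)      # "send delete command to Solr"
        del ingest_status[path]      # "update crawler database"
        removed.append(path)
    return removed
```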

Page 30: Integrate ManifoldCF with Solr

How to see what happened

o Search History

o Monitoring

• Job Status
• Notification Connections

Page 31: Integrate ManifoldCF with Solr

How to see what happened

o Search History

o History
• Simple History
• Maximum Activity
• Maximum Bandwidth
• Result Histogram
o Status
• Document Status
• Queue Status

Page 32: Integrate ManifoldCF with Solr

Performance issue

o Find the bottleneck
• Crawled repository
• Network
• Solr
• MCF database
• MCF configuration

Page 33: Integrate ManifoldCF with Solr

Handle performance issue

o Connector-specific configuration

• Throttling
• Max JVM connections
o Can improve speed / limit impact on the crawled repository
o Very specific to the repository

Page 34: Integrate ManifoldCF with Solr

Handle performance issue

o Job settings
o Size limit of ingested documents
o Use regex to remove some extensions from the crawl
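
The two job-level filters above amount to a simple predicate. The sketch below only illustrates the logic; in ManifoldCF these limits are set in the job configuration rather than in code, and the extension list, the 10 MB limit, and the name should_ingest are hypothetical values chosen for the example.

```python
import re

# Hypothetical exclusion list and size limit, for illustration only.
EXCLUDED_EXTENSIONS = re.compile(r"\.(iso|zip|exe|mp4)$", re.IGNORECASE)
MAX_BYTES = 10 * 1024 * 1024


def should_ingest(path, size_bytes, excluded=EXCLUDED_EXTENSIONS, max_bytes=MAX_BYTES):
    """Skip files with an excluded extension or above the size limit."""
    if excluded.search(path):
        return False
    return size_bytes <= max_bytes
```

Filtering before the content is fetched saves both network transfer and extraction time on the Solr side.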

Page 35: Integrate ManifoldCF with Solr

Investigate errors

• Increase the connector’s log level
• Read the MCF simple history
• Thread dump

Page 36: Integrate ManifoldCF with Solr

Common errors in file crawling

o Crawler account rights
o Exotic files
o Very biiiiiiig files
o JCIFS errors
o Solr connector timeout

Page 37: Integrate ManifoldCF with Solr

When to use ManifoldCF?

q = crawled_environment:heterogeneous OR scenario:intranet OR security:mandatory

Page 38: Integrate ManifoldCF with Solr

References

o ManifoldCF documentation: https://manifoldcf.apache.org/release/trunk/en_US/end-user-documentation.html

o ManifoldCF in Action (K. Wright): https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs

o Securing Solr documents with MCF (K. Wright): http://fr.slideshare.net/lucenerevolution/wright-nokia-manifoldcfeurocon-2011

o France Labs blog posts:
http://www.francelabs.com/blog/tutorial-for-combining-manifoldcf-and-solr-for-files-search/
http://www.francelabs.com/blog/tutorial-on-authorizations-for-manifold-cf-and-solr/

Page 39: Integrate ManifoldCF with Solr

Datafari

[Screenshots: Search, Admin]

o Intranet “ready to play” search solution

• Apache License
o Embeds:
• Solr
• ManifoldCF
o And other cool stuff:
• Admin and responsive search UI
• User management
• Banana for user behavior analysis
• Tesseract OCR
• A funny zebra
• Etc…

www.datafari.com

Page 40: Integrate ManifoldCF with Solr

[email protected]

@francelabs
www.francelabs.com

