Super Size Your Search
DESCRIPTION
As organisations store more and more information in their Alfresco content hubs, search and discovery of content becomes increasingly important. Alfresco comes bundled with Apache Lucene and Apache Solr for search. Although these provide full-text capabilities, they do not have the scalability and functionality of newer cloud-scalable search software such as Apache Solr Cloud 4, Elastic Search and Amazon Cloud Search. Searching across multiple Alfresco instances, including Alfresco Cloud, is also quite a challenge, and none of the possible approaches is good enough to be production ready. This talk shows you how to index and search content stored in one or more Alfresco repositories, other CMIS repositories or file systems using Apache Solr Cloud 4, Elastic Search or Amazon Cloud Search, while still ensuring the confidentiality of the documents based on the permissions configured in Alfresco or any other repository.
TRANSCRIPT
#SummitNow
Super Size Your Search
6th November 2013
Piergiorgio Lucidi (Sourcesense)
Fran Alvarez (Zaizi)
Piergiorgio Lucidi
• Open Source ECM Specialist at Sourcesense
• Alfresco Certified Trainer / Engineer
• Alfresco Wiki Gardener / Community Star
• Alfresco forum supporter
• Global Moderator of the Italian forum
• Author and Technical Reviewer at Packt
• PMC Member and Mentor at ASF
• Project Leader in the JBoss Community
Overview
How to build and manage your search server:
1. Scenario
2. Introducing Apache ManifoldCF
3. Zaizi Integrated Search Solution
Scenario
An overview of the typical complex search architecture
Scenario - Alfresco limitations
Alfresco supports these search engines:
• Apache Lucene (embedded)
• Apache Solr (provided by Alfresco)
Development is needed if other repositories must be involved
Every other approach must be implemented (ScheduledActions, WebScripts, etc.)
Scenario – Embedded
Simple Search Architecture
Alfresco is the only repository involved in the architecture, using the embedded search engine:
• the repository must take care of the indexes, also managing index transactions
[Diagram: FrontEnd applications, Alfresco, Apache Lucene, Indexes]
Scenario – Embedded - Cluster
Embedded
Not easy to scale out with Lucene:
1. Every cluster node must have its own search indexes
2. The cluster must synchronize indexes
[Diagram: two Alfresco nodes, each with Apache Lucene and its own Indexes, connected through JGroups]
Scenario – Simple Architecture
Simple search architecture
Alfresco is the only repository involved in the architecture, with an external search server:
1. The search server can be used to publish contents in the front end architecture
2. The repository will stay in the logical backend
[Diagram: Alfresco, Search Engine, Indexes, FrontEnd applications]
Scenario – Publish with search
A search engine can be used for:
• advanced management of search indexes
• scaling out
• executing complex searches on contents
• publishing contents in the FE architecture
Scenario – Publish with search
Publish with search architecture
Alfresco is the only repository involved in the architecture, with an external search server:
1. The search server can be used for publishing contents in the front end architecture (HTML)
2. The repository will stay in the logical backend
[Diagram labels: BackEnd, FrontEnd, Alfresco, Lucene / Solr, Indexes, Search Engine, Indexes, FrontEnd applications]
Scenario – Complex Architecture
1. Alfresco is only one of the platforms that must be involved in your search architecture
2. You don’t want to increase the development effort
3. You just want something to configure
Scenario – Complex Architecture
Architecture with different ECM systems
Alfresco is one of the content platforms that must be involved in the indexing process
[Diagram: Alfresco, SharePoint, FileNet, CMIS, JIRA, Google Drive, DropBox, Search Engine, Indexes]
Introducing Apache ManifoldCF
Apache ManifoldCF - History
The ManifoldCF code base was granted by MetaCarta to the Apache Software Foundation in December 2009.
The MetaCarta effort represented more than five years of successful development and testing in multiple, challenging enterprise environments.
The project graduated as an Apache Top Level Project in July 2012.
Apache ManifoldCF – What is it?
Open Source crawler
• crawling model (add, change, delete)
• schedule jobs to create indexes
• get contents from repositories
• push contents to search servers
Apache ManifoldCF – What is it?
[Diagram: Repository 1 to 4, Apache ManifoldCF, Search Server 1 to 4]
Apache ManifoldCF – What is it?
Out of the box it is distributed as a webapp
• REST API
• Authority Service
• ACL indexes
• Crawler UI can be embedded in any Java application
Apache ManifoldCF – Why?
• Reliability
• Incremental
• Flexible
• Multi repositories
• Security model
• Monitoring
ManifoldCF – Why? - Reliability
Job scheduling and configuration are stored in the database to maintain the state of all the executions
[Diagram: Repository 1 to 4, Apache ManifoldCF (Pull Agent Daemon, Database), Search Server 1 to 4]
ManifoldCF – Why? - Incremental
Get content changesets obtained from the repository API
[Diagram: Apache ManifoldCF (Pull Agent Daemon, Database) queries Repository 1 and receives complete changesets]
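The incremental idea on this slide can be sketched in a few lines of Python. This is only an illustration, not ManifoldCF code: the `Change` and `JobState` types and the `run_incremental` function are hypothetical names, and the checkpoint that a real crawler would persist in its database is kept in memory here.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    doc_id: str
    kind: str  # "add", "change" or "delete"
    seq: int   # position of the change in the repository's change log


@dataclass
class JobState:
    # Hypothetical checkpoint; a real crawler would persist this in a database.
    last_seq: int = 0
    index: dict = field(default_factory=dict)


def run_incremental(change_log, state):
    """Apply only the changes recorded after the stored checkpoint."""
    applied = 0
    for change in sorted(change_log, key=lambda c: c.seq):
        if change.seq <= state.last_seq:
            continue  # already processed in an earlier run
        if change.kind == "delete":
            state.index.pop(change.doc_id, None)
        else:
            state.index[change.doc_id] = change.seq
        state.last_seq = change.seq
        applied += 1
    return applied
```

Running the job twice against the same change log applies nothing the second time, because the checkpoint already covers every entry.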
ManifoldCF – Why? - Flexible
If the repository can't supply all the changes, ManifoldCF can discover them through crawling
[Diagram: Apache ManifoldCF (Pull Agent Daemon, Database) queries the repository, receives incomplete changesets and performs change discovery]
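When a repository cannot report its own changes, one common way to discover them is to compare what the crawler saw last time with what it sees now. The sketch below assumes content fingerprints are used for the comparison; the slides do not say how ManifoldCF detects changes internally, and `fingerprint` and `discover_changes` are hypothetical names.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Stable digest used to tell whether a document's content changed."""
    return hashlib.sha256(content).hexdigest()


def discover_changes(previous, current):
    """Classify documents as added, changed or deleted.

    previous: dict of doc_id -> fingerprint stored after the last crawl
    current:  dict of doc_id -> raw content found by this crawl
    Returns the classification and the fingerprints to store for next time.
    """
    current_fp = {doc_id: fingerprint(body) for doc_id, body in current.items()}
    added = sorted(set(current_fp) - set(previous))
    deleted = sorted(set(previous) - set(current_fp))
    changed = sorted(
        doc_id
        for doc_id, fp in current_fp.items()
        if doc_id in previous and previous[doc_id] != fp
    )
    return {"add": added, "change": changed, "delete": deleted}, current_fp
```

The three result lists correspond to the add, change and delete operations of the crawling model mentioned earlier in the talk.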
ManifoldCF – Why? – Multi repo
Jobs can retrieve contents from the following repositories:
• Google Drive
• Dropbox
• HDFS
• CMIS-compliant
• Alfresco
• IBM FileNet
• EMC Documentum
• Microsoft SharePoint
• OpenText LiveLink
• Autonomy Meridio
• Memex Patriarch
• Windows Share/DFS
• Generic JDBC
• Generic Filesystem
• Generic RSS and Web
ManifoldCF – Why? – Multi repo
Jobs can ingest contents to the following search servers:
• Apache Solr
• ElasticSearch
• OpenSearchServer
• MetaCarta GTS
ManifoldCF – Why? - Security
Retrieve per-content ACLs
[Diagram: Repository 1 to 4, Apache ManifoldCF, Search Server 1 to 4, Authority Service with Authority 1 and Authority 2; labels: user access tokens, user specific search results]
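A minimal Python sketch of how access tokens can turn a shared index into user specific results. It assumes each indexed document carries sets of allow and deny tokens, and that every authority can be modelled as a function from a user name to a set of tokens; the field names `allow` and `deny` and all function names are hypothetical and are not taken from ManifoldCF.

```python
def user_tokens(user, authorities):
    """Collect the access tokens every configured authority grants to a user."""
    tokens = set()
    for lookup in authorities:
        tokens |= lookup(user)
    return tokens


def visible(doc, tokens):
    """A deny token always wins; otherwise at least one allow token must match."""
    if tokens & doc["deny"]:
        return False
    return bool(tokens & doc["allow"])


def secure_search(query, index, tokens):
    """Return ids of documents matching the query that the tokens may see."""
    q = query.lower()
    return [
        doc["id"]
        for doc in index
        if q in doc["text"].lower() and visible(doc, tokens)
    ]
```

Because the tokens are stored next to the content, the permission check happens inside the search itself rather than as a second pass over the results.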
ManifoldCF – Why? – Monitoring
The Crawler UI allows you to:
• configure jobs and connectors
• monitor jobs execution
• monitor contents ingestion
• status reports
  • document status
  • queue status
• history reports
  • simple history
  • maximum activity
  • maximum bandwidth
  • result histogram
ManifoldCF – Architecture
[Diagram: Repository, Job, Search Server, ACLs]
Repository Connector: query to retrieve contents
Output Connector: metadata mapping, content ingestion
Authority Connector: retrieve content ACEs
• verbal description
• crawling model
• scheduling
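The roles on this slide can be pictured as a small pipeline. The Python sketch below is a loose illustration under stated assumptions, not the ManifoldCF connector API: `run_job` and its four parameters are hypothetical stand-ins for the repository query, the ACL lookup, the metadata mapping and the ingestion step.

```python
def run_job(fetch, acl_for, field_map, ingest):
    """Move documents from a repository to a search server.

    fetch:     callable returning an iterable of documents (repository side)
    acl_for:   callable mapping a document id to its access tokens
    field_map: dict renaming repository fields to search-server fields
    ingest:    callable that sends one mapped document to the search server
    """
    count = 0
    for doc in fetch():
        # Metadata mapping: rename known fields, keep the others unchanged.
        mapped = {field_map.get(key, key): value for key, value in doc.items()}
        # Attach the ACL so the search server can filter results later.
        mapped["acl"] = sorted(acl_for(doc["id"]))
        ingest(mapped)
        count += 1
    return count
```

Keeping the four steps as separate callables mirrors the idea that repositories, search servers and authorities can be swapped independently of each other.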
Who is using ManifoldCF?
ManifoldCF - Resources
The project is available at http://manifoldcf.apache.org/
From this website you can access the mailing lists, documentation and download links for binaries and source.
ManifoldCF – Resources - Book
ManifoldCF in Action by Karl Wright, published by Manning
Karl is the original developer and the principal committer of Apache ManifoldCF
The book is available at http://www.manning.com/wright
Zaizi Integrated Search Solution
Fran Alvarez
• Director of Zaizi Iberia and Lead Architect
• Alfresco Certified Engineer
• Responsible for large Alfresco architectures
• Semantic Consultant for Sensefy
• Alfresco Meetups Organizer
Alfresco + Solr Approach
Quite a good architecture
• Performance issues are solved
• Different architectures depending on business requirements
However…
• It does not cover some use cases or scenarios
• It does not leverage Cloud benefits or latest technologies
• With huge data volume there are other approaches
How can we solve limitations and enhance benefits?
Alfresco + Solr Approach
• Decouples the Search solution from Alfresco
• Allows implementing different Search solutions
• Allows changing the Search solution without changing anything in Alfresco
  • Not even a property!
• Provides an API to integrate it with Alfresco as search engine
  • Even other repository vendors! E.g. Filesystem, SharePoint, Documentum, FileNet, Drupal…
• Preserves security permissions in the results
  • Alfresco permissions are indexed and used during search
It’s included in our Semantic solution: Sensefy!
What we’ve done in Manifold
Repository Connector:
• Alfresco Repository Connector: New implementation
  • Removing the dependency on the Alfresco Solr API
Output connectors:
• Cloud Search Output Connector: Design & Development
• Elastic Search Output Connector: Improvements
• Solr Cloud Output Connector: Configuration for Alfresco
Authority Connector:
• Alfresco Authority Connector: Design & Development
  • Similar approach to Alfresco Solr
  • ACL reads for Users and Groups in Alfresco
Scenarios
Let’s see some examples
I: Several Alfresco instances
Current Approach:
• Each Alfresco has its own Search subsystem
• They can’t share indexes
Implications:
• Federated search is not an option
• Results can’t be merged
  • If so, what resultset should be first?
Conclusion
Results could be presented to users in different tabs or “manually” merged.
Not the best approach
I: Several Alfresco instances
Zaizi Approach:
• Our solution acts as the search box
  • Which manages a single index
Implications:
• All documents are driven to the same index
• Users can select results from either all Alfresco instances or a subset
Conclusion
Search across Repositories
Could be based on Elastic Search, Solr Cloud, Amazon Cloud, etc.
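A short Python sketch of the single-index idea, assuming every indexed document records which Alfresco instance it came from. The `source` field, the instance names used with it and the `search` function are hypothetical; a real deployment would express the same filter in the query language of the chosen search engine.

```python
def search(index, query, sources=None):
    """Search one shared index, optionally restricted to some source instances.

    index:   list of dicts with "source", "id" and "text" keys
    sources: set of instance names to include, or None for all instances
    """
    q = query.lower()
    return [
        (doc["source"], doc["id"])
        for doc in index
        if q in doc["text"].lower()
        and (sources is None or doc["source"] in sources)
    ]
```

Since all documents live in one index, results from different instances are already in a single list and no merging of separate result sets is needed.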
II: Alfresco + Other data providers
Current Approach:
• Alfresco has its own Search subsystem
• Other repositories may (or may not) have their own Search subsystem
Implications:
• Different data providers mean different formats
  • E.g. Filesystem does not support CMIS
• Alfresco can’t reach external data
Conclusion
No way to merge results and present them uniformly to end users
II: Alfresco + Other data providers
Zaizi Approach:
• Both Alfresco and other repositories share the Search subsystem (Manifold)
Implications:
• Results from Alfresco and other providers will have the same format in our Solution
  • They will speak ‘our’ language
• Alfresco reaches external data when communicating with our solution
Conclusion
Results are present and accessible between data providers
III: Alfresco + O(TB) data
Current Approach:
• Alfresco has its own Search subsystem
• All data is in one (or several if cluster) Solr instance
Implications:
• Every Solr node manages the whole index
• No chance to apply scaling techniques for indexing:
  • Sharding, Replication…
Conclusion
Huge servers are required and performance might be compromised
III: Alfresco + O(TB) data
Zaizi Approach:
• Alfresco uses our solution
• Data is indexed in the search solution which suits best:
  • Amazon Cloud, Solr Cloud, Elastic Search…
Implications:
• The Cloud Search solution manages the index
• Indexing techniques can be applied according to use cases
  • Sharding, Replication
Conclusion
A search strategy can be adopted and easily implemented with the search solution which fits best
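Sharding, one of the techniques named above, can be illustrated with a toy Python index. This is a sketch under simple assumptions (hash routing on the document id, substring matching, no replication) and does not reflect how Solr Cloud, Elastic Search or Amazon Cloud Search implement it; `shard_for` and `ShardedIndex` are hypothetical names.

```python
import hashlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard using a stable hash of its id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


class ShardedIndex:
    """Toy index in which each shard holds only part of the documents."""

    def __init__(self, num_shards: int):
        self.shards = [dict() for _ in range(num_shards)]

    def add(self, doc_id: str, text: str) -> None:
        self.shards[shard_for(doc_id, len(self.shards))][doc_id] = text

    def search(self, term: str):
        term = term.lower()
        hits = []
        for shard in self.shards:  # scatter: ask every shard
            hits.extend(d for d, t in shard.items() if term in t.lower())
        return sorted(hits)        # gather: combine the partial results
```

No single shard has to hold the whole index, which is the contrast with the current approach where every Solr node manages all of it.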
Apache Manifold: Other benefits
Can extract, index and map information from any other sources
• Apache Stanbol, RedLink, any other data enricher
• Our solution will gather everything in one place
  • Documents, entities…
Permissions are checked just once
• Everything is in the same place, even user authorization capabilities
• Performance and scalability are improved
• Faceted search and other search capabilities are combined with this permission feature
Demo
Conclusions
Zaizi solution allows searching and indexing in the most popular Cloud Search solutions
• Other Search solutions can be integrated as well
Zaizi solution allows retrieving information from the most popular repositories
• Other Data providers can be integrated too
• It solves plenty of current issues related to search and indexing in Alfresco
• Can be used outside Alfresco or even with Alfresco and any other data repository
Zaizi solution manages permissions and security from the most popular repositories and the latest Cloud search technologies
Fully supported by us!
What’s coming
Powerful User Interface
• Admin functions
• Wide range of facets
• UI for Share
Benchmarking
New connectors
• Filesystem authority
• RedLink repository
• Stanbol repository
Alfresco Search Subsystem?