Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

APACHE SOLR CMS INTEGRATION Ingo Renner Software Engineer

Upload: ingo-renner

Post on 19-May-2015

DESCRIPTION

TYPO3 is an Open Source Content Management System that is very popular in Europe, especially in the German market, and is gaining traction in the U.S., too. TYPO3 is a good example of how to integrate Solr with a CMS, because the challenges we faced are typical of any CMS integration. We came up with solutions and ideas for these challenges, and we hope they will help other CMS integrations as well. They cover content indexing, file indexing, keeping track of content changes, handling multi-language sites, search and faceting, access restrictions, result presentation, and how to keep all of this flexible and reusable across many different sites. Along the way we used a couple of additional Apache projects, and we would like to show how we use them and how we contributed back to them while building our Solr integration.

TRANSCRIPT

Page 1: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

APACHE SOLR CMS INTEGRATION

Ingo Renner
Software Engineer

Page 2: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

we build smart.

ID INFIELD DESIGN

MAY.01.2013 LUCENE/SOLR REVOLUTION

TYPO3 CMS and Solr. How we did it.

APACHE SOLR CMS INTEGRATION

Page 3: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

ABOUT ID
What we do and who we do it for

• Strategy Planning

• Design

• UX

• Development & Integration

Page 4: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

WHO IS THIS GUY?

• Committer TYPO3 CMS

• Committer and PMC member Apache Tika

• Release Manager TYPO3 CMS 4.2

• New San Franciscan

• Snowboarding, mountain biking

• Software Engineer, Architect at Infield Design

- Caution - TYPO3 Evangelist

Page 5: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

TYPO3 CMS

Page 6: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

TYPO3 CMS

• Free and Open Source Enterprise CMS

• Estimated 500,000+ installations worldwide

• 6,000+ public extensions

• 6,000,000+ downloads

• Content Management Framework

• Multi-Site, Multi-Language, Versioning, Workflows, ...

• Stable, Secure, Scalable

Page 7: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

TYPO3 COMMUNITY

• Community driven development

• Conferences in North America, Europe, Asia

• Barcamps, Developer Days, Snowboard Tour

• 4 times Google Summer of Code participant

• Backed by TYPO3 Association

• Several other projects under the TYPO3 brand

Page 8: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

SOLR & CMS INTEGRATION

Page 9: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

PAGE RENDERING

• Different template engines

• (too) flexible page rendering engine

• Identify relevant content on websites

• Exclude navigation and common page elements

• Content generated by plugins
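
One common way to tackle the "identify relevant content" and "exclude navigation" challenges is to have templates wrap the relevant region in comment markers and let the indexer keep only what sits between them. The sketch below illustrates that idea only; the marker names `SEARCH_begin` and `SEARCH_end` and the function `extract_relevant` are hypothetical and not taken from the slides.

```python
import re

# Hypothetical marker names, assumed for illustration only.
BEGIN = "<!--SEARCH_begin-->"
END = "<!--SEARCH_end-->"


def extract_relevant(html: str) -> str:
    """Return the plain text found between begin/end markers."""
    pattern = re.escape(BEGIN) + r"(.*?)" + re.escape(END)
    # A page may contain several marked regions; collect all of them.
    regions = re.findall(pattern, html, flags=re.DOTALL)
    text = " ".join(regions)
    text = re.sub(r"<[^>]+>", " ", text)  # strip remaining HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace


page = (
    "<nav>Home | About</nav>"
    "<!--SEARCH_begin--><h1>Title</h1><p>Body text</p><!--SEARCH_end-->"
    "<footer>Imprint</footer>"
)
print(extract_relevant(page))
```

Navigation and footer markup never reach the index because they sit outside the marked region. A regular expression is enough for a sketch; a production indexer would more likely use a real HTML parser.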

Page 10: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

INDEX QUEUE

• Index Queue to track and index content

• Record Monitor to update Index Queue

• Crawl pages, index unstructured content marked relevant

• Exclude pages with plugin-generated content

• Index structured plugin data directly from DB
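
The interplay of Record Monitor and Index Queue can be pictured roughly as follows. This is a minimal in-memory sketch under assumed names (`IndexQueue`, `RecordMonitor`, `run_indexer`); the real extension keeps its queue in the CMS database and its classes will differ.

```python
from dataclasses import dataclass, field
import time


@dataclass
class IndexQueue:
    # Maps (record type, record id) -> timestamp of the latest change.
    items: dict = field(default_factory=dict)

    def mark_changed(self, record_type, record_id, changed_at=None):
        stamp = time.time() if changed_at is None else changed_at
        # Re-saving a record only updates its timestamp, so each record
        # appears in the queue at most once.
        self.items[(record_type, record_id)] = stamp

    def remove(self, record_type, record_id):
        self.items.pop((record_type, record_id), None)

    def next_batch(self, limit):
        # Oldest changes are processed first.
        ordered = sorted(self.items.items(), key=lambda entry: entry[1])
        return [key for key, _ in ordered[:limit]]


class RecordMonitor:
    """Hook the CMS would call whenever an editor saves or deletes a record."""

    def __init__(self, queue):
        self.queue = queue

    def on_save(self, record_type, record_id, changed_at=None):
        self.queue.mark_changed(record_type, record_id, changed_at)

    def on_delete(self, record_type, record_id):
        # A real integration would also delete the document from Solr here.
        self.queue.remove(record_type, record_id)


def run_indexer(queue, index_fn, limit=50):
    """Index one batch; an item leaves the queue only after index_fn returns."""
    indexed = 0
    for record_type, record_id in queue.next_batch(limit):
        index_fn(record_type, record_id)
        queue.remove(record_type, record_id)
        indexed += 1
    return indexed
```

Because an item is removed only after `index_fn` returns, a failed run leaves the record queued for the next attempt.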

Page 11: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

ACCESS RIGHTS

• Intranet, Extranet, ...

• Not everybody may see everything

• Flexible user groups and permissions

• Permissions extended to sub-pages

Page 12: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

SOLR ACCESS FILTER PLUGIN

• Custom Solr access filter plugin

• Query Parser and Filter

• User group IDs stored in documents

• Current user’s groups submitted with query

• Plugin matches document groups with user’s groups
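
The matching step can be illustrated in a few lines. The sketch below is a deliberately simplified model written in Python rather than as a Solr plugin: it assumes groups are stored as a comma-separated list of IDs in a hypothetical `access` field, that a document with no groups is public, and that belonging to any one listed group is enough. The actual plugin's field format and matching rules may differ.

```python
def parse_groups(value: str) -> frozenset:
    """Parse a comma-separated group list such as "0,3,7" into a set of ints."""
    return frozenset(int(part) for part in value.split(",") if part.strip())


def may_see(document_groups: str, user_groups: str) -> bool:
    """Return True if the user shares at least one group with the document."""
    required = parse_groups(document_groups)
    if not required:
        # Assumption: no groups on the document means it is public.
        return True
    return not required.isdisjoint(parse_groups(user_groups))


def filter_results(documents, user_groups: str):
    """Keep only the documents the current user is allowed to see."""
    return [doc for doc in documents if may_see(doc.get("access", ""), user_groups)]


docs = [
    {"id": "a", "access": ""},      # public
    {"id": "b", "access": "2,3"},   # groups 2 or 3
    {"id": "c", "access": "9"},     # group 9 only
]
visible = filter_results(docs, "3,4")
```

Doing this check inside Solr, as the slide describes, means restricted documents are excluded before facet counts and result totals are computed, which a post-filter in the CMS could not guarantee.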

Page 13: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

FILE INDEXING

• Finding file links in page content

• Core file links vs. plugin file links

• Track files for indexing

• Reading file content

• Separate tools for different file formats

Page 14: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

FILE INDEXING

• File Detectors & File Index Queue

• File system abstraction layer

• Apache Tika

• Knows 1,200+ file formats, reads about half of them

• Content & meta data extraction

• Language detection
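
A file detector in the sense used above could look roughly like this: scan rendered page content for links, keep those that point to indexable file types, and hand them to the file index queue. The names `FileLinkDetector`, `detect_file_links`, and the extension list are assumptions for illustration. Text extraction, which the slides assign to Apache Tika, is not shown.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse
import posixpath

# Assumed set of file types worth indexing; a real site would configure this.
INDEXABLE_EXTENSIONS = {".pdf", ".doc", ".docx", ".odt", ".txt"}


class FileLinkDetector(HTMLParser):
    """Collect href values of <a> tags that point to indexable files."""

    def __init__(self):
        super().__init__()
        self.file_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Ignore query strings and fragments when checking the extension.
        path = urlparse(href).path
        extension = posixpath.splitext(path)[1].lower()
        if extension in INDEXABLE_EXTENSIONS and href not in self.file_links:
            self.file_links.append(href)


def detect_file_links(html: str):
    detector = FileLinkDetector()
    detector.feed(html)
    detector.close()
    return detector.file_links


content = (
    '<p><a href="/fileadmin/report.PDF">Report</a> '
    '<a href="/about.html">About</a> '
    '<a href="/fileadmin/notes.txt?v=2">Notes</a> '
    '<a href="/fileadmin/report.PDF">Report again</a></p>'
)
links = detect_file_links(content)
```

Links produced by plugins may not appear as plain `<a>` tags in core content, which is why the slides distinguish core file links from plugin file links; a detector per link source is one way to handle that.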

Page 15: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

THE REST

• PHP people vs. Java technology

• Talking to Solr

• Learning from mistakes

Page 16: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

Integration Challenges & Solutions

THE REST

• Fully automated bash install script

• SolrPhpClient

• Separate your languages

Page 17: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

EXT:solr - Apache Solr for TYPO3

FEATURES

• Faceted Search

• File Indexing

• Multi-Language & Multi-Site Support

• Did you mean, More Like This

• Search Word Highlighting

• Auto Complete

• Access Rights Support

• Many More ...
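
To make "talking to Solr" and faceted search a little more concrete, here is a sketch of how a client might assemble a faceted query against Solr's standard `/select` handler. The request parameters (`q`, `fq`, `facet`, `facet.field`, `wt`) are standard Solr parameters; the core name `core_en`, the field name `type`, and the helper `build_search_url` are hypothetical. The extension itself talks to Solr through SolrPhpClient in PHP, so this Python version only shows the shape of the request.

```python
from urllib.parse import urlencode


def build_search_url(base_url, query, facet_fields=(), filters=()):
    """Build a Solr select URL with optional facets and filter queries."""
    params = [("q", query), ("wt", "json")]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field to facet on.
        params.extend(("facet.field", name) for name in facet_fields)
    # Each selected facet value becomes its own filter query.
    params.extend(("fq", value) for value in filters)
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = build_search_url(
    "http://localhost:8983/solr/core_en/",  # hypothetical per-language core
    "typo3",
    facet_fields=["type"],
    filters=['type:"pages"'],
)
print(url)
```

Using one core per language, as the hypothetical `core_en` name suggests, is one way to handle multi-language sites, since each core can then carry its own language-specific analysis configuration.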

Page 23: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

QUESTIONS?

Page 24: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

THANKS.

Page 25: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

T3CON North America
San Francisco, May 30-31
20% off regular ticket price, use: LUCENETYPO3

INFIELD DESIGN is hiring!

Page 26: Apache Solr CMS Integration @ Lucene/Solr Revolution San Diego 2013

CONFERENCE PARTY
The Tipsy Crow: 770 5th Ave
Starts after Stump The Chump
Your conference badge gets you in the door

TOMORROW
Breakfast starts at 7:30
Keynotes start at 8:30

CONTACT@[email protected], [email protected]