Search Features and Architecture in DNN 7.1
Post on 24-May-2015
@DNNCon @ashishprasad
Don’t forget to include #DNNCon in your tweets!
7.1 Search and Lucene.Net
Ash Prasad
Agenda
• History and New Objectives
• Architecture
• Lucene / Lucene.Net
• Crawlers, Entities, Controllers
• Ranking, Synonyms, Ignore Words, Stemming
• Security Trimming
• Module Integration, New Crawler
History of Search
• Platform Edition
  • SQL Server
  • ISearchable
• Commercial Edition
  • Lucene 2.9.2
  • URL and Files
[Diagram labels: Lucene, Scheduler, SQL, Scheduler Module, Module, ISearchable]
Objectives of New Search
• Handle diverse Content
  • CMS, Social, Localized, 3rd Party Modules
• Consistent User Experience
• Simple for Module Developers
• Uniform Architecture
  • Feature based differentiation
Architecture
What’s Lucene
• Java-based indexing and search technology
• Managed by Apache
• NoSQL database
• Near real-time, Spellchecking, Highlighting, Ranking, Synonyms
• Many companies use Lucene directly or customize it
• Facebook’s Graph Search uses a similar ‘Inverted Index’
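The "inverted index" mentioned above can be illustrated with a minimal Python sketch. This is not Lucene code; the function names and sample documents are made up for illustration. The idea is simply that each term maps to the set of documents containing it, so a query becomes a set intersection rather than a scan of every document.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, query):
    """Return ids of documents that contain every term in the query."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample content, for illustration only.
docs = {
    1: "lucene is a search library",
    2: "dnn search uses lucene",
    3: "sql server full text",
}
index = build_inverted_index(docs)
```

Calling `search_all_terms(index, "lucene search")` looks up two posting sets and intersects them, which is why lookups stay fast as the number of documents grows.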
What’s Lucene.Net
• Line-by-line port from Java to C#
• Maintains high-performance requirements
• A bit behind Java releases
• Who Uses Lucene.Net
  • Products: RavenDB, Orchard, Umbraco, SubText
  • Commercial Sites: BBC UK Top Gear, AutoDesk, Koders.Com
Lucene – A Document Store
• Flexible Schema
• Consists of Documents
  • Which are collections of Fields
• Documents can have different sets of Fields
  • Field("ID", "xxx-yyy-999"), Field("Title", "My best doc")
  • Field("Owner", "Ash"), Field("Locale", "en-US")
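As a rough illustration of the flexible schema described above, here is a small Python sketch of documents as free-form collections of fields. The `Document` class, the `find` helper and the second document's ID are hypothetical; this is not the Lucene.Net API.

```python
class Document:
    """A bag of named fields; no fixed schema is enforced."""

    def __init__(self, **fields):
        self.fields = dict(fields)

    def get(self, name, default=None):
        return self.fields.get(name, default)


# Two documents in the same store with different sets of fields.
store = [
    Document(ID="xxx-yyy-999", Title="My best doc"),
    Document(ID="xxx-yyy-1000", Owner="Ash", Locale="en-US"),  # made-up ID
]


def find(store, name, value):
    """Return documents whose field `name` equals `value`."""
    return [doc for doc in store if doc.get(name) == value]
```

A document that lacks a field is simply skipped by `find`; nothing forces every document to carry the same columns, unlike a relational table.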
Lucene – A Document Store (Contd.)
• Denormalized (No Referential Integrity)
• Deletion – Done through a flag
  • Compact reclaims deleted space
• Update is Delete + Insert
• Boost = Ranking
• Unicode compliant
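The delete-by-flag, update and compact behaviour listed above can be sketched in a few lines of Python. `TinyIndex` and its methods are invented for illustration and only mimic the idea, not how Lucene stores segments on disk.

```python
class TinyIndex:
    """Toy store: deletes only set a flag; compact() reclaims the space."""

    def __init__(self):
        self._slots = []  # each slot is [doc_id, fields, deleted_flag]

    def insert(self, doc_id, fields):
        self._slots.append([doc_id, dict(fields), False])

    def delete(self, doc_id):
        # Mark matching slots as deleted instead of removing them.
        for slot in self._slots:
            if slot[0] == doc_id:
                slot[2] = True

    def update(self, doc_id, fields):
        # An update is a delete followed by an insert.
        self.delete(doc_id)
        self.insert(doc_id, fields)

    def compact(self):
        # Physically drop the slots that were flagged as deleted.
        self._slots = [slot for slot in self._slots if not slot[2]]

    def live_docs(self):
        return {slot[0]: slot[1] for slot in self._slots if not slot[2]}

    def slot_count(self):
        return len(self._slots)
```

After one insert and one update of the same document the store holds two slots but only one live document; the stale slot disappears once `compact()` runs.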
Book consulted for Search
• Book on version 3.0
• ~ 500 pages
• Very useful
Search Phases
• Content Acquisition
  • Crawling
  • ISearchable
  • ModuleSearchBase
  • URL
  • Doc / PDF
• Content Indexing
  • Text Analysis
  • Ranking
  • Synonyms
  • Ignore Words
  • Stemming
• Content Search
  • Querying
  • Sorting
  • Security Trimming
  • Boolean Search
  • Highlighting
Crawlers
• Platform
  • Site Crawler
    • Module and Tab Metadata
    • Module Content (ModuleSearchBase/ISearchable)
• Commercial Edition
  • File Crawler
    • Uses IFilter to extract text from PDF/Office files
  • URL Crawler
    • Internal and External URLs
Search Entities
• SearchType
  • Distinguishes Crawlers
• SearchDocument
  • Properties of a content item
  • Stored in the Index
• SearchQuery
  • Parameters to execute a Query
• SearchResult
  • Derived from SearchDocument
Search Entities – Indexing vs. Querying
Controllers
• SearchController
  • For Querying
• InternalSearchController
  • For Adding / Updating / Deleting
• LuceneController
  • Interacts with Lucene
Ranking = Boosting
• Doc and/or Field can be boosted in Lucene
• DNN does Field boosts (Default: 10)
  • Title (50)
  • Tag (40)
  • Keyword (35)
  • Description (20)
  • Author (15)
• Configured manually by HostSettings
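To show how the field boosts listed above influence ordering, here is a deliberately simplified Python sketch. Real Lucene scoring combines boosts with term and document statistics; this sketch only multiplies raw term counts by the boost values from the slide. The function names and the `Body` field are assumptions made for illustration.

```python
DEFAULT_BOOST = 10
FIELD_BOOSTS = {
    "Title": 50,
    "Tag": 40,
    "Keyword": 35,
    "Description": 20,
    "Author": 15,
}


def score(doc, term):
    """Sum term occurrences per field, weighted by that field's boost."""
    term = term.lower()
    total = 0
    for field, text in doc.items():
        hits = text.lower().split().count(term)
        total += hits * FIELD_BOOSTS.get(field, DEFAULT_BOOST)
    return total


def rank(docs, term):
    """Return matching documents, highest score first."""
    scored = [(score(doc, term), doc) for doc in docs]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in hits]
```

With these weights, one match in a title (50) outranks two matches in an unboosted field (2 x 10 = 20), which is the practical effect of boosting the Title field.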
Synonyms and Ignore Words
• Synonyms are injected into Index
• Ignore Words are removed from Index
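The two index-time steps above can be sketched as a small analysis function in Python. The word lists and the `analyze` function are hypothetical examples, not DNN's actual lists or API: ignore words are dropped from the token stream and synonyms are added alongside the original term.

```python
# Hypothetical word lists, for illustration only.
IGNORE_WORDS = {"the", "a", "of"}
SYNONYMS = {"car": ["automobile"], "dnn": ["dotnetnuke"]}


def analyze(text):
    """Return index tokens: ignore words removed, synonyms injected."""
    tokens = []
    for word in text.lower().split():
        if word in IGNORE_WORDS:
            continue  # never reaches the index
        tokens.append(word)
        tokens.extend(SYNONYMS.get(word, []))  # indexed next to the original
    return tokens
```

Because the synonym is stored in the index itself, a later query for "automobile" can match a document that only ever said "car".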
Stemming
• Converts words to their root form
• PorterStemFilter is used
• Country and Countries = countri
• breathe, breathes, breathing, breathed = breath
• fishing, fished, fisher = fish
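The idea of reducing word variants to a shared root can be shown with a crude suffix stripper in Python. This is not the Porter algorithm and not PorterStemFilter; it is a handful of hand-picked rules (the `crude_stem` name and the length thresholds are assumptions) that happen to cover most of the examples above, and it does not handle the "-er" suffix.

```python
def crude_stem(word):
    """Strip a few common English suffixes. A toy, not the Porter stemmer."""
    w = word.lower()
    # Step 1: plural and verb endings.
    if w.endswith("ies"):
        w = w[:-3] + "i"
    elif w.endswith("ing") and len(w) > 5:
        w = w[:-3]
    elif w.endswith("ed") and len(w) > 4:
        w = w[:-2]
    elif w.endswith("es") and len(w) > 4:
        w = w[:-2]
    elif w.endswith("s") and not w.endswith("ss"):
        w = w[:-1]
    # Step 2: normalise a trailing "y" or "e".
    if w.endswith("y") and len(w) > 2:
        w = w[:-1] + "i"
    elif w.endswith("e") and len(w) > 4:
        w = w[:-1]
    return w
```

Both the indexed text and the query are passed through the same stemmer, so "Countries" in a page and "country" in a search box meet at the same stored token.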
Security Trimming
• Done through Collectors (Callback)
• Each Doc found is sent to Collector
• Collector rejects/accepts per Permission
• Site Crawler: Module / Tab Permission
• File Crawler: Folder Permission
• User Crawler: Profile Permission
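The collector callback pattern described above might look like the following Python sketch. The class, the `run_query` function and the role-based permission check are hypothetical; in DNN the check would depend on the crawler (module/tab, folder or profile permission).

```python
class PermissionCollector:
    """Receives every matching doc and keeps only those the user may view."""

    def __init__(self, has_view_permission):
        self._has_view_permission = has_view_permission
        self.accepted = []

    def collect(self, doc):
        # Called once per match; rejected docs are silently dropped.
        if self._has_view_permission(doc):
            self.accepted.append(doc)


def run_query(matching_docs, collector):
    """Stand-in for the search engine: hand each hit to the collector."""
    for doc in matching_docs:
        collector.collect(doc)
    return collector.accepted


def role_check(user_roles):
    """Build a permission check from the set of roles the user holds."""
    def check(doc):
        return bool(set(doc["allowed_roles"]) & set(user_roles))
    return check
```

Trimming at collection time means the result list never contains items the current user is not allowed to see, whatever the index itself holds.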
Module Integration
• ModuleSearchBase
  • New abstract class with just one method
  • Defined in BusinessControllerClass
  • GetModifiedSearchDocuments
  • Returns New, Changed and Deleted content
  • Delta based
  • Granular Permission, Localization, etc.
• ISearchable continues to work (no delta)
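The delta-based contract described above (one method that returns only what changed since the last indexing run) can be sketched in Python. This is a conceptual analogue, not the C# ModuleSearchBase class: the class names, the snake_case method name and the dictionary fields are all assumptions.

```python
from abc import ABC, abstractmethod
from datetime import datetime


class ModuleSearchSketch(ABC):
    """Hypothetical analogue of an abstract base class with one method."""

    @abstractmethod
    def get_modified_search_documents(self, since):
        """Return documents added, changed or deleted after `since`."""


class ArticleModule(ModuleSearchSketch):
    def __init__(self, articles):
        self.articles = articles

    def get_modified_search_documents(self, since):
        # Only the delta is returned; unchanged articles are not re-sent.
        return [
            {"key": a["id"], "title": a["title"], "is_deleted": a["deleted"]}
            for a in self.articles
            if a["modified"] > since
        ]


# Hypothetical module content.
articles = [
    {"id": 1, "title": "Old", "deleted": False, "modified": datetime(2013, 1, 1)},
    {"id": 2, "title": "Edited", "deleted": False, "modified": datetime(2013, 6, 1)},
    {"id": 3, "title": "Removed", "deleted": True, "modified": datetime(2013, 6, 2)},
]
module = ArticleModule(articles)
```

Passing the time of the previous run as `since` lets the indexer touch only new, changed and deleted items, which is the difference from an interface that returns everything on every run.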
New Crawler (How to)
• Define a new SearchType
  • Optionally use IsPrivate to hide from site search
• Implement BaseResultController (2 methods)
  • HasViewPermission
  • GetDocUrl
• Create Scheduled Task
  • Call AddSearchDocuments to inject content
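The three steps above can be outlined with a Python sketch of a made-up "ticket" crawler. None of these names are the DNN API: the search type string, the `TicketResultController` class, the ownership rule and the list used as an index are all stand-ins chosen for illustration.

```python
TICKET_SEARCH_TYPE = "ticket"  # step 1: a new, hypothetical search type


class TicketResultController:
    """Step 2: two methods, one for permission and one for the result URL."""

    def __init__(self, base_url):
        self.base_url = base_url

    def has_view_permission(self, doc, user):
        # Assumed rule: owners and the admin user may see a ticket.
        return user == "admin" or doc["owner"] == user

    def get_doc_url(self, doc):
        return f"{self.base_url}/tickets/{doc['key']}"


def run_scheduled_crawl(source_rows, index):
    """Step 3: body of a scheduled task that injects documents into the index."""
    docs = [
        {
            "search_type": TICKET_SEARCH_TYPE,
            "key": row["id"],
            "title": row["subject"],
            "owner": row["owner"],
        }
        for row in source_rows
    ]
    index.extend(docs)  # stand-in for the call that adds search documents
    return len(docs)
```

The crawler only pushes content in; the result controller is consulted later, at query time, to decide whether a hit is visible and where it should link.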
Demo
Recap
• New Search uses Lucene.Net
• Platform has Site Crawler
• Commercial has URL and File Crawlers
• Modules to implement ModuleSearchBase
• New Crawler implements BaseResultController
THANKS TO ALL OF OUR GENEROUS SPONSORS!