architecture overview content ingestion content enrichment advanced enrichment
Post on 31-Mar-2015
TRANSCRIPT
Search Content Enrichment and Extensibility in SharePoint 2013
SPC414
Brent Groom, Senior PFE, Microsoft
Sreedhar Mallangi, Senior Consultant, Microsoft
Session Objectives
Identify content extensibility points
Learn about custom connectors
Learn the basics of content enrichment
Advanced content enrichment
Learn about two community toolkits
Almost all “on-prem”
Agenda
Architecture overview
Content ingestion
Content enrichment
Advanced enrichment
SharePoint 2013 Search Architecture
[Diagram: search components Crawl, Content Processing, Analytics Processing, Index, Query Processing and Search Admin; databases Crawl, Link, Analytics Reporting and Search Admin; FAST Search Index; Content on the crawl side, UX, WFE and API on the query side. Legend: Public API, unit of scale/role boundary. Callouts: Extensibility Points, Query Features.]
SharePoint 2013 Search Architecture: Extensibility Points
[Diagram: the same architecture with two extensibility points marked: Custom Connectors and the Content Enrichment Web Service. Legend: Public API, unit of scale/role boundary.]
Crawl Component
OOB connectors
Extensible through BCS
Local disk cache
Crawled items tracked in Crawl database
Configurations stored in Admin database
Crawl modes: Full Crawl, Incremental Crawl, Continuous Crawl
[Diagram: content sources (HTTP, File Shares, SharePoint, User Profiles, Exchange, Lotus Notes, Documentum, Custom, ...) feed the Crawl component (mssearch.exe), which hands items to Content Processing and the Index; Crawl and Search Admin databases alongside.]
Content Processing Component
Extending content processing: Web Service Callout
You can customize the search experience through the extensibility points in the content processing flow.
[Diagram: the Crawler feeds the content processing flow, which feeds the Index. Operations: Insert, Update, Delete. Stages shown include: Parse Documents (IFilter sandbox), Detect language, Document summary, Security Descriptors, Links, Register crawled properties, Map to managed properties, Web Service Callout (to the Web Service), Custom Entity Extraction, Phonetic name variations, Word breaking, Metadata Extraction, Analytics.]
Agenda
Architecture overview
Content ingestion
Content enrichment
Advanced enrichment
Why?
Enterprises have many different data sources
We are building Enterprise Search Platforms
Allow users to find the content they are looking for: all sources in one place
Increase productivity
No Search Content API anymore
FAST ESP had a push-based content API
OK! What do we have?
Connector types: Protocol Handlers, BCS Connector Framework
Default solutions: Lotus Notes, Exchange public folder, Documentum, File share, SharePoint, Website, People Profile
Custom solutions: BCS
What is Business Connectivity Services?
Connects external data sources to SharePoint
Can be used as a search source
Has several flavors
No-Code: OData, SQL
Code: WCF, .NET Assembly
B311 @ TechEd 2013
Search Indexing Toolkit - SIT
A generic implementation of a Custom SharePoint Indexing Connector
Generic Data Model File
Implements all the complexities of:
Batching – for scalability
Crawling – Full and Incremental
Security Trimming – both Active Directory security and Custom Claims security
Hides all of that behind one single interface
What’s in the package?
Search Indexing Toolkit
SIT Core Library
SITModel.xml
XML Files Indexing Connector
AdventureWorks Product DB Indexing Connector
Implementing the ISearchConnector interface, with a detailed How-To Guide
SIT XML file connector
Index any XML file: the connector can split items on a configurable XML element
Flexible: all sub-elements are submitted as crawled properties, no need to configure
High performance: testing has shown 100 DPS (documents per second) even on a laptop
Scalable: crawl millions of XML files
Demo: Indexing Wikipedia Abstracts
Search Indexing Toolkit
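The splitting behaviour described for the XML file connector can be illustrated with a short Python sketch. This is not SIT code: the function name `split_items`, the sample feed and the `doc` element name are assumptions made for illustration. It only shows the idea of cutting one XML file into items on a configurable element and turning every sub-element into a property without per-field configuration.

```python
import xml.etree.ElementTree as ET

def split_items(xml_text, item_tag):
    """Yield one dict of crawled properties per <item_tag> element."""
    root = ET.fromstring(xml_text)
    for element in root.iter(item_tag):
        # Every direct sub-element becomes a crawled property,
        # so no per-field configuration is needed.
        yield {child.tag: (child.text or "").strip() for child in element}

# Made-up sample data in the spirit of an abstracts dump.
SAMPLE = """<feed>
  <doc><title>Oslo</title><url>http://example.org/Oslo</url><abstract>Oslo is a city.</abstract></doc>
  <doc><title>Bergen</title><url>http://example.org/Bergen</url><abstract>Bergen is a city.</abstract></doc>
</feed>"""

# "doc" plays the role of the configurable split element.
items = list(split_items(SAMPLE, "doc"))
for item in items:
    print(item["title"], sorted(item))
```

Changing the second argument to another element name changes what counts as one indexed item, which is the configurable part.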
SIT ISearchConnector interface
[Sequence diagram: SIT Core calls Your Connector, which reads from the Content Source.]
Initialize
GetAllItems returns [id1, id2, id3, ...]
GetSpecificItem(id1) returns [id1’s properties]
GetSpecificItemData(id1) returns id1’s data
GetSecurityDescriptorForSpecificItem returns id1’s security descriptor
Parameters shown: offset, crawlType, changeToken, changeTokenUpdate, itemId, aclmeta, usesPluggableAuth
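To make the call sequence on this slide concrete, here is a minimal Python analogue. The real ISearchConnector is a .NET interface whose exact signatures are not given in the deck, so every method signature, the `InMemoryConnector` class and the `full_crawl` driver below are hypothetical. The sketch only mirrors the order of calls: initialize, enumerate ids, then fetch properties, data and security descriptor per item.

```python
from abc import ABC, abstractmethod

class SearchConnector(ABC):
    """Hypothetical Python stand-in for the single connector interface."""
    @abstractmethod
    def initialize(self, config): ...
    @abstractmethod
    def get_all_items(self): ...
    @abstractmethod
    def get_specific_item(self, item_id): ...
    @abstractmethod
    def get_specific_item_data(self, item_id): ...
    @abstractmethod
    def get_security_descriptor(self, item_id): ...

class InMemoryConnector(SearchConnector):
    """Toy connector whose 'content source' is a dict passed in the config."""
    def initialize(self, config):
        self.rows = dict(config["rows"])
    def get_all_items(self):
        return sorted(self.rows)
    def get_specific_item(self, item_id):
        return {"title": self.rows[item_id]["title"]}
    def get_specific_item_data(self, item_id):
        return self.rows[item_id]["body"].encode("utf-8")
    def get_security_descriptor(self, item_id):
        return self.rows[item_id]["acl"]

def full_crawl(connector, config):
    """Plays the role of the core library: enumerate ids, then fetch each item."""
    connector.initialize(config)
    crawled = []
    for item_id in connector.get_all_items():
        crawled.append({
            "id": item_id,
            "properties": connector.get_specific_item(item_id),
            "data": connector.get_specific_item_data(item_id),
            "acl": connector.get_security_descriptor(item_id),
        })
    return crawled

CONFIG = {"rows": {
    "id1": {"title": "First", "body": "hello", "acl": ["DOMAIN\\alice"]},
    "id2": {"title": "Second", "body": "world", "acl": ["DOMAIN\\bob"]},
}}
result = full_crawl(InMemoryConnector(), CONFIG)
```

Batching, incremental crawls and change tokens are left out here; in the toolkit those are the parts the core library is said to handle on the connector's behalf.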
Content source supports NTLM?
Pass through the security descriptor
Item-level security: tag each document with an NTLM security descriptor
Otherwise…
Need to map to NTLM and create security descriptors
If no NTLM is available, use custom claims
Implement a custom claims provider or security trimmer
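The decision flow on this slide can be written down as a small function. This is only a restatement of the slide in Python; the function name, the two boolean parameters and the returned labels are assumptions for illustration, not part of SIT.

```python
def choose_security_strategy(source_has_ntlm_descriptors, identities_map_to_ntlm):
    """Pick how item-level security is produced for a content source."""
    if source_has_ntlm_descriptors:
        # The source already speaks NTLM: hand its descriptor straight through.
        return "pass-through"
    if identities_map_to_ntlm:
        # Source users can be mapped to NTLM accounts: build descriptors ourselves.
        return "map-to-ntlm"
    # No NTLM at all: fall back to a custom claims provider or security trimmer.
    return "custom-claims"

print(choose_security_strategy(True, False))
print(choose_security_strategy(False, True))
print(choose_security_strategy(False, False))
```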
Live Use Cases
Crawling XML files generated from 3rd-party sources
SQL Server with security trimming
SQL Server with related BLOB on file share
SIT Takeaways
SIT reduces the complexity of creating SharePoint Search connectors
Enhance the Search experience
SIT back and relax!
Agenda
Architecture overview
Content ingestion
Content enrichment
Advanced enrichment
Business Use Cases
Add DB or ERP meta-data into search results
Clean-up or reformat existing properties to facilitate search
Label documents that contain known patterns
Tag documents that violate corporate policy
Copy data from one managed property to another (including a type change)
What are your customers trying to do?
What would your customers like to do?
Content Enrichment Web Service (CEWS)
Web service hosted outside of SharePoint
Replaces the SharePoint 2010 Pipeline Extensibility executable
Optimized for performance (no need to read/write XML files, start a new process, etc.)
Input/output: managed properties
[Diagram: Crawler -> Content Processing -> Index; Content Processing calls the Web Service with ProcessItem(Item) and receives a ProcessedItem.]
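The request/response shape of the callout (managed properties in, managed properties out) can be sketched in Python. The real service is a web service called by content processing; the plain function below, the `CASE-nnnn` pattern and the property names `Description` and `CaseNumbers` are all assumptions for illustration. It covers two of the business use cases listed earlier: cleaning up an existing property and labelling documents that contain a known pattern.

```python
import re

# Hypothetical "known pattern" to label documents with.
CASE_NUMBER = re.compile(r"\bCASE-\d{4}\b")

def process_item(item):
    """Take the input managed properties, return only the properties to write back."""
    processed = {}
    title = item.get("Title")
    if title is not None:
        # Clean-up use case: collapse runs of whitespace in an existing property.
        processed["Title"] = " ".join(title.split())
    found = sorted(set(CASE_NUMBER.findall(item.get("Description", ""))))
    if found:
        # Labelling use case: expose the matches as a new multi-valued property.
        processed["CaseNumbers"] = found
    return processed

sample = {"Title": "  Quarterly   report ",
          "Description": "See CASE-0042 and CASE-0007, again CASE-0042"}
print(process_item(sample))
```

Returning only the changed properties keeps the response small, which matters because the call sits in the path of every item that passes the trigger.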
CEWS Configuration
Endpoint: URL of web service
Input properties: managed properties passed in
Output properties: managed properties that can be returned
Include raw data?: optionally include raw data (read only)
Debug mode: sends all input properties, ignores all output properties
Error mode: Warning or Error. In Error mode, failing items are dropped
Trigger: test to determine if enrichment should be called (per document)
Register with Search Service Application via PowerShell
Average number of milliseconds spent on content enrichment
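How these settings interact is easier to see in a small simulation. The sketch below is not how SharePoint implements the callout: the `EnrichmentConfig` class, the `enrich` function, the localhost URL and the use of a Python callable as the trigger (the real trigger is an expression configured at registration) are assumptions. It only models the behaviour stated on the slide: the trigger decides per document, debug mode sends everything and ignores what comes back, and Error mode drops failing items while Warning mode lets them through unenriched.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class EnrichmentConfig:
    endpoint: str
    input_properties: list
    output_properties: list
    debug_mode: bool = False
    failure_mode: str = "Warning"  # "Warning" or "Error"
    trigger: Optional[Callable[[dict], bool]] = None

def enrich(item, config, call_service):
    """Return the item after the callout, or None if the item is dropped."""
    if config.trigger is not None and not config.trigger(item):
        return item  # trigger says no: the service is never called
    if config.debug_mode:
        payload = dict(item)  # debug: send every property
    else:
        payload = {k: item[k] for k in config.input_properties if k in item}
    try:
        returned = call_service(payload)
    except Exception:
        return None if config.failure_mode == "Error" else item
    if config.debug_mode:
        return item  # debug: ignore everything the service returned
    updates = {k: v for k, v in returned.items() if k in config.output_properties}
    return {**item, **updates}

def upper_title(payload):
    return {"Title": payload["Title"].upper(), "Author": "not registered as output"}

def broken(payload):
    raise TimeoutError("service did not answer")

URL = "http://localhost:8080/enrich"  # hypothetical endpoint
item = {"Title": "budget", "Author": "alice", "FileExtension": "docx"}
config = EnrichmentConfig(URL, ["Title"], ["Title"])
enriched = enrich(item, config, upper_title)
kept = enrich(item, config, broken)
dropped = enrich(item, EnrichmentConfig(URL, ["Title"], ["Title"], failure_mode="Error"), broken)
pdf_only = EnrichmentConfig(URL, ["Title"], ["Title"], trigger=lambda i: i["FileExtension"] == "pdf")
skipped = enrich(item, pdf_only, upper_title)
debugged = enrich(item, EnrichmentConfig(URL, ["Title"], ["Title"], debug_mode=True), upper_title)
```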
6 more things you need to know about CEWS
1. Properties must exist when you register
2. Property names are case sensitive
3. Cannot use property aliases
4. Some standard properties can be confusing: DisplayAuthors vs Author
5. Some properties are read-only (body!)
6. Single web service per Search Application
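Several of these rules can be checked before registration with a few lines of code. The helper below is a hypothetical pre-flight check, not something SharePoint provides: the `schema` dictionary stands in for the list of managed properties you would read from your own Search Service Application, and the property names in the example are illustrative.

```python
def validate_registration(schema, input_properties, output_properties):
    """Return a list of problems; an empty list means the names look registrable.

    schema maps exact managed property names to {"read_only": bool}.
    Aliases are deliberately absent from it, so an alias is reported as unknown.
    """
    problems = []
    by_lower = {name.lower(): name for name in schema}
    for name in list(input_properties) + list(output_properties):
        if name in schema:
            continue
        exact = by_lower.get(name.lower())
        if exact:
            problems.append(f"{name}: case mismatch, did you mean {exact}?")
        else:
            problems.append(f"{name}: not a managed property (aliases are not accepted)")
    for name in output_properties:
        if schema.get(name, {}).get("read_only"):
            problems.append(f"{name}: read-only, cannot be an output property")
    return problems

SCHEMA = {
    "Title": {"read_only": False},
    "Author": {"read_only": False},
    "body": {"read_only": True},
}
report = validate_registration(SCHEMA, ["title", "body"], ["Author", "body", "Writer"])
for line in report:
    print(line)
```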
Agenda
Architecture overview
Content ingestion
Content enrichment
Advanced enrichment: Challenges and techniques
Doing it in production: the challenges
Scale-out: increase capacity to match the farm; large topology ≈ 144 flow instances
Fault tolerance: survive hardware failures without loss of functionality
Service aggregation: multiple enrichment tasks to support disparate content sources
Doing it in production: techniques
WCF Routing:
Introduced in .NET 4.0
100% declarative, configured in Web.config XML
Applies XPath filters against the request to determine the destination endpoint
Supports backup destination endpoints to achieve fault tolerance
Load Balancing:
Hide multiple endpoints behind a load balancer to provide scale and fault tolerance
“Localhost”:
Register the web service on localhost and run an instance on each content processing node
Scales with content processing
Provides fault tolerance with that content processing node
http://aka.ms/Pqkjjj
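The routing technique (a filter picks the destination, backup endpoints take over on failure) can be sketched independently of WCF. In WCF this is declared in Web.config with XPath filters over the message; the Python below replaces those with plain predicates over a dict, and the endpoint URLs, `ROUTES` table and `send` function are made up for the example.

```python
def route(request, routes, send):
    """Send the request to the first matching route, trying backups in order."""
    for matches, endpoints in routes:
        if not matches(request):
            continue
        last_error = None
        for endpoint in endpoints:
            try:
                return endpoint, send(endpoint, request)
            except ConnectionError as error:
                last_error = error  # primary failed: fall through to the backup
        raise RuntimeError("all endpoints for the matching route failed") from last_error
    raise LookupError("no route matched the request")

# Pretend the primary wiki enrichment endpoint is down.
DOWN = {"http://enrich-a.example:8080"}

def send(endpoint, request):
    if endpoint in DOWN:
        raise ConnectionError(endpoint + " is unreachable")
    return {"handled_by": endpoint, "id": request["id"]}

ROUTES = [
    # Filter on the content source: service aggregation behind one registered URL.
    (lambda r: r["ContentSource"] == "wiki",
     ["http://enrich-a.example:8080", "http://enrich-b.example:8080"]),
    # Catch-all route for everything else.
    (lambda r: True, ["http://enrich-default.example:8080"]),
]

wiki_endpoint, wiki_reply = route({"id": 1, "ContentSource": "wiki"}, ROUTES, send)
other_endpoint, other_reply = route({"id": 2, "ContentSource": "fileshare"}, ROUTES, send)
```

Because only one web service can be registered per Search Application, a router like this is one way to put several enrichment services behind that single endpoint.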
Agenda
Architecture overview
Content ingestion
Content enrichment
Advanced enrichment: CEWS Pipeline Toolkit
CEWS Pipeline Toolkit
Enhance search index: document markup, entity extraction
Architecture: WCF, XML config
Hides the complexities of: scalability, service aggregation, conditional processing
Powerful framework for content enrichment
CEWS Pipeline Toolkit – What does it do?
Extract entities: string matching, regular expressions, dictionary-based
Normalize: manipulate strings
Access external repositories
Framework for document analysis
Solves the majority of customer business use cases
Packaged with over 55 pipeline stages
Configurable document routing
CEWS Pipeline Toolkit – What’s in the package?
Platform support: SharePoint 2013 Enterprise Search, FAST Search for SharePoint 2010, stand-alone
Easy to install, easy to customize: Visual Studio 2012 & .NET 4.5 Framework; inherit from the AbstractDocumentProcessor class
Detailed documentation on TechNet Wiki – Help the community
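The idea of small pipeline stages that share one base class, combined with conditional processing, can be shown in Python. The toolkit itself is .NET and the members of AbstractDocumentProcessor are not listed in the deck, so the `DocumentProcessor` base class, its `process` method, the three stage classes and the sample document are all hypothetical. The sketch covers regular-expression extraction, dictionary-based extraction and string normalization.

```python
import re
from abc import ABC, abstractmethod

class DocumentProcessor(ABC):
    """Hypothetical Python stand-in for a pipeline stage base class."""
    @abstractmethod
    def process(self, doc): ...

class RegexExtractor(DocumentProcessor):
    def __init__(self, source, target, pattern):
        self.source, self.target = source, target
        self.pattern = re.compile(pattern)
    def process(self, doc):
        doc[self.target] = self.pattern.findall(doc.get(self.source, ""))
        return doc

class DictionaryExtractor(DocumentProcessor):
    def __init__(self, source, target, terms):
        self.source, self.target, self.terms = source, target, terms
    def process(self, doc):
        text = doc.get(self.source, "").lower()
        doc[self.target] = [t for t in self.terms if t.lower() in text]
        return doc

class Normalizer(DocumentProcessor):
    def __init__(self, field):
        self.field = field
    def process(self, doc):
        if self.field in doc:
            # Collapse whitespace, then title-case the value.
            doc[self.field] = " ".join(doc[self.field].split()).title()
        return doc

def run_pipeline(doc, stages):
    """Run each stage whose condition holds for the document."""
    for condition, stage in stages:
        if condition(doc):
            doc = stage.process(doc)
    return doc

always = lambda d: True
from_wiki = lambda d: d.get("ContentSource") == "wiki"

STAGES = [
    (always, Normalizer("Title")),
    (from_wiki, RegexExtractor("Text", "Numbers", r"\b\d{4,}\b")),
    (always, DictionaryExtractor("Text", "Countries", ["Norway", "Sweden"])),
]

# Made-up sample document.
doc = {"ContentSource": "wiki", "Title": "  oslo   city ",
       "Text": "The city in Norway has 123456 inhabitants."}
out = run_pipeline(dict(doc), STAGES)
```

Pairing each stage with a condition is one simple way to get the conditional processing the slides mention: a document from another content source skips the regular-expression stage entirely.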
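CEWS Pipeline Toolkit architecture
[Diagram: Crawler -> Content Processing -> Index; Content Processing calls the Web Service with ProcessItem(Item) and receives a ProcessedItem. The Web Service hosts the CEWS Pipeline Toolkit, which is initialized from the pipeline config XML.]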
Demo: Wikipedia category, total population
CEWS Pipeline Toolkit
Future – Community Effort
Data: Wikipedia, Fileshare, DB (Adventure Works), Web Services
Display: Custom Search Center, Search App
Deploy: Demo, POC, Dev, QA, Production
CEWS and SIT – Join the community effort
Canned prototypes for search POCs
Several sample scenarios to leverage in your project
Simple to deploy and use
Production ready
How to get these tools
MCS Contact
Premier Contact
Public Available Date: going through the legal process. Will be made available publicly once approved.
In Review: Session Objectives
Identified extensibility points in content acquisition
Saw how to customize the content processing pipeline via code callout
Learned how to use SIT
Dove into advanced content enrichment topics (CEWS Pipeline Toolkit)
See you at the Search booths & Search tables at Ask the Experts, Wed @ 6:15!
Search Related Sessions
Session | Code | Room | Time
Develop Advanced Search-Driven SharePoint 2013 Apps | SPC402 | Palazzo I, J | Tue 1:45pm
Best practices for Hybrid Search deployments | SPC306 | Veronese 2401 | Tue 5:00pm
SharePoint 2013 Search Analytics | SPC340 | Palazzo M, N | Wed 9:00am
How to manage and troubleshoot Search: A practical guide | SPC375 | Veronese 2401 | Wed 10:45am
6 Proven Steps to Get the Best Out of Search in SharePoint 2013 | SPC265 | Delphino 4001 | Wed 1:45pm
Best practices for Information Architecture and Enterprise Search | SPC207 | Veronese 2401 | Wed 1:45pm
Search content enrichment and extensibility in SharePoint 2013 | SPC414 | Palazzo K, L | Wed 1:45pm
Customizing Search experiences with Azure Hosted Data and Bing Maps | SPC321 | Veronese 2401 | Wed 3:15pm
Futuristic Search applications using Kinect and Yammer! | SPC405 | Palazzo M, N | Wed 3:15pm
Search architecture and sizing in SharePoint 2013 | SPC336 | Titian 2201 | Wed 5:00pm
Effective Search deployment and operations in SharePoint 2013 | SPC360 | Veronese 2401 | Thu 9:00am
SharePoint 2013 Search display templates and query rules | SPC322 | Palazzo M, N | Thu 9:00am
Managing Search Relevance in SharePoint 2013 and O365 | SPC382 | Veronese 2401 | Thu 12:00pm
Evaluate sessions on MySPC using your laptop or mobile device: myspc.sharepointconference.com
© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.