Aspire Document Processing
1. Aspire Document Processing
2. Document Processing – “Aspire”
• Very High Performance
• Structured Document Processing Architecture
• Dynamic configuration and deployment
• Based on Open Source Technologies
• Well Supported (wiki, javadoc)
• Administration interface built-in
• Vendor Neutral (CMS and search engine)
3. Top-Level Overview
Aspire
Data Sources → Feeders → Document Processing Pipelines → Indexing → Index
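The flow on this slide (feeders hand documents to a processing pipeline, and the result goes to an index) can be sketched in a few lines of plain Java. This is only an illustration of the idea: the class name `PipelineSketch`, the map-based document, and the two example stages are assumptions made for the sketch and are not Aspire's actual API.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from Aspire itself.
class PipelineSketch {

    /** Sends every document from the feeder through each stage in order, then collects it as "indexed". */
    static List<Map<String, String>> run(List<Map<String, String>> feeder,
                                         List<UnaryOperator<Map<String, String>>> stages) {
        List<Map<String, String>> index = new ArrayList<>();
        for (Map<String, String> doc : feeder) {
            // Work on a mutable copy so the feeder's document is left untouched.
            Map<String, String> current = new LinkedHashMap<>(doc);
            for (UnaryOperator<Map<String, String>> stage : stages) {
                current = stage.apply(current);
            }
            index.add(current);
        }
        return index;
    }

    public static void main(String[] args) {
        List<Map<String, String>> feeder =
                List.of(Map.of("url", "http://example.com/a", "title", " hello "));

        // Two illustrative stages: trim the title, then derive the host from the URL.
        UnaryOperator<Map<String, String>> trimTitle = doc -> {
            doc.put("title", doc.get("title").trim());
            return doc;
        };
        UnaryOperator<Map<String, String>> addHost = doc -> {
            doc.put("host", URI.create(doc.get("url")).getHost());
            return doc;
        };

        System.out.println(run(feeder, List.of(trimTitle, addHost)));
    }
}
```

Each stage only sees a document and returns a document, which is what lets stages be reordered or reused across pipelines.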
4. Components In Aspire (today)
Aspire
Common Resources: Content Control DB
SubJob Extractors: Unload ARC Files, Unload CSV
Component Manager, Pipeline Manager
Metadata Manipulation: Text Extraction, Date Chooser, Split Multi-valued data, Host to Domain, Groovy Scripting
JDBC Connection
Feeders: RSS, Hot Folder, Single Page, RDB
Enhancers: Get CCD Metadata, RDB Enhancer
Output: Push XML to REST, Error Job Handler, Debug Output
JMS, RDB Unloader, Feed One, Fetch URL, Category Tagger, Content Boost
5. Functions Handled by Aspire
• Threading
• Collection Deployment
• Error handling and notification
    • Including individual sub-job notifications
• Collection Configuration
• Component Scripting
• Job Processing
• Admin I/F, performance, live system status
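To make the threading and per-sub-job error notification items concrete, here is a minimal sketch using only the JDK's `ExecutorService`. It is not Aspire's implementation: `JobRunnerSketch`, `processAll`, `process`, and the `onError` callback are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

// Hypothetical sketch: the framework, not the component author, owns threads and error reporting.
class JobRunnerSketch {

    /** Runs each sub-job on a thread pool, reports every failure individually, and returns the success count. */
    static int processAll(List<String> subJobs, int threads, BiConsumer<String, Throwable> onError) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (String job : subJobs) {
                Callable<String> task = () -> process(job);
                results.add(pool.submit(task));
            }
            int succeeded = 0;
            for (int i = 0; i < results.size(); i++) {
                try {
                    results.get(i).get();          // wait for this sub-job to finish
                    succeeded++;
                } catch (ExecutionException e) {
                    // One failed sub-job is reported on its own; the others keep running.
                    onError.accept(subJobs.get(i), e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for sub-jobs", e);
                }
            }
            return succeeded;
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for real document processing: rejects empty input, otherwise upper-cases it. */
    static String process(String job) {
        if (job.isEmpty()) {
            throw new IllegalArgumentException("empty sub-job");
        }
        return job.toUpperCase();
    }

    public static void main(String[] args) {
        int ok = processAll(List.of("a", "", "b"), 2,
                (job, error) -> System.err.println("sub-job '" + job + "' failed: " + error.getMessage()));
        System.out.println(ok + " sub-jobs succeeded");
    }
}
```

The processing code itself contains no thread handling or error routing, which is the division of labor the slide describes.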
6. Benefits
• Much lower lifecycle cost
• File processing no longer an ad-hoc collection of Java objects and methods
• Encourages re-use of components
• New collections with no programming
    • Just re-configure existing components
• Flexibility: deploy collections individually
• Much better visibility into the file processing internals, performance, and queuing
7. Typical Installation Structure
Machine #1: Crawler, Aspire (other feeders and doc processing)
Machine #2: Search Engine
8. Aspire Architecture and Components Details
9. Top-Level Component Architecture
10. Aspire and OSGi Components
Aspire Component: manufactured by an Aspire Component Factory
Aspire Component Factory: is an OSGi Bundle
OSGi Bundle: is a Java Jar File
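The "manufactured by" relationship on this slide is the factory pattern: a factory, packaged as a bundle, hands out configured component instances on request. The sketch below shows that pattern in plain Java with no OSGi dependency. The interfaces `Component` and `ComponentFactory`, and the implementation names "upper" and "prefix", are invented for illustration and should not be read as Aspire's real interfaces.

```java
import java.util.Map;

// Hypothetical sketch of a component factory; all names are assumptions.
class FactorySketch {

    /** A processing component: takes text in, returns text out. */
    interface Component {
        String process(String text);
    }

    /** A factory that manufactures components by implementation name plus configuration. */
    interface ComponentFactory {
        Component create(String implementation, Map<String, String> config);
    }

    /** One factory can offer several component implementations, much as one jar can hold several classes. */
    static final class TextComponentFactory implements ComponentFactory {
        @Override
        public Component create(String implementation, Map<String, String> config) {
            switch (implementation) {
                case "upper":
                    return text -> text.toUpperCase();
                case "prefix": {
                    String prefix = config.getOrDefault("prefix", "");
                    return text -> prefix + text;
                }
                default:
                    throw new IllegalArgumentException("unknown component: " + implementation);
            }
        }
    }

    public static void main(String[] args) {
        ComponentFactory factory = new TextComponentFactory();
        Component tagger = factory.create("prefix", Map.of("prefix", "doc-"));
        System.out.println(tagger.process("42"));
    }
}
```

Because callers ask the factory by name and pass configuration as data, a new component instance needs a configuration change rather than new code.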
11. The Contents of a Bundle/Component Factory
12. Component and Factory Details
14. Aspire Sample Configurations
15. Web Site Crawler / Search
16. Processing CSV Files
17. RSS Feeds, Single Pages
18. Aspire Deployment
19. Deployment
• Architected to the latest deployment standards
    • Distribution Archetypes
    • Component Repositories
• Redeploy collections independently
    • In a live running system
• Redeploy and update components
    • In a live running system
• Ready for the cloud
20. Deployment Structure
Aspire: Resources, Feeders & Pipelines, and multiple Collection Configs
Administrator: Configuration Control (load/reload configuration)
Component Repository: re-useable components
21. Deployment Implications
• Collections are configured independently
• Collections use standard components
• Can be dynamically and remotely deployed
Remote System: Aspire (always running) with its Collection Config; load remote configurations; remote admin control
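The deployment slides describe loading and reloading one collection's configuration while the system keeps running. A minimal sketch of that idea, assuming a configuration is just an ordered list of component names, is shown here. `ReloadSketch`, `load`, and `pipelineFor` are hypothetical names, and the real system's configuration format is not shown in these slides.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: each collection's configuration is replaced independently of the others.
class ReloadSketch {

    // Collection name -> ordered component names. ConcurrentHashMap lets readers continue during a reload.
    private final Map<String, List<String>> collections = new ConcurrentHashMap<>();

    /** Loads a collection's configuration, or replaces it if the collection is already loaded. */
    void load(String collection, List<String> componentNames) {
        collections.put(collection, List.copyOf(componentNames));
    }

    /** Returns the current pipeline for a collection, or an empty list if it has not been loaded. */
    List<String> pipelineFor(String collection) {
        return collections.getOrDefault(collection, List.of());
    }

    public static void main(String[] args) {
        ReloadSketch aspire = new ReloadSketch();
        aspire.load("news", List.of("fetch-url", "text-extraction"));
        aspire.load("files", List.of("text-extraction"));

        // Reload only the "news" collection; "files" is not touched.
        aspire.load("news", List.of("fetch-url", "text-extraction", "category-tagger"));

        System.out.println("news:  " + aspire.pipelineFor("news"));
        System.out.println("files: " + aspire.pipelineFor("files"));
    }
}
```

Each reload swaps in a complete immutable list, so a reader sees either the old pipeline or the new one, never a half-updated mix.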