Using the Cloud to Process Unstructured Big Data, by Jason Cornez
TRANSCRIPT
May 21, 2016
Using the Cloud to Process Unstructured Big Data
J on the Beach, Malaga, Spain
RavenPack: Mapping the World’s Big Data for Financial Applications
Jason Cornez ‒ [email protected]
ravenpack.com | [email protected] | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90
• RavenPack delivers big data analytics to financial professionals
• Top hedge funds and investment banks use RavenPack for trading and risk management
• Patented, proprietary technology and award-winning research
• Archive of more than 300 million documents, spanning the past 20 years
RavenPack processes hundreds of thousands of documents each day.
We produce machine readable analytics for each document in real time.
Expected processing time for a typical document is less than 250ms.
RavenPack at a Glance
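As a rough sanity check on these figures, the arithmetic below estimates how many documents one strictly sequential pipeline could handle per day if every document took the full 250 ms. The daily volume of 300,000 is an assumed stand-in for "hundreds of thousands", not a number from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400 seconds
LATENCY_S = 0.250                # per-document bound taken from the slide

# Documents a single sequential pipeline could process in one day.
capacity_per_pipeline = int(SECONDS_PER_DAY / LATENCY_S)

# Assumed daily volume (hypothetical stand-in for "hundreds of thousands").
assumed_docs_per_day = 300_000
average_arrival_rate = assumed_docs_per_day / SECONDS_PER_DAY  # documents per second

print(capacity_per_pipeline)
print(round(average_arrival_rate, 2))
```

Under these assumptions the average load would fit within one pipeline, but averages say nothing about bursts, and the talk gives no peak rates.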
• Classification Overview
• Realtime Classification: Classic vs Cloud
• Historical Classification: Classic vs Cloud
• New Challenges: Spot Instances and The Weather
• New Opportunities
Contents
Extract meaning from Unstructured Text
• Tokenization
• Entity Detection
• Attribute Tagging
• Event Detection
• Consolidation
A stream-based Classification Framework allows us to add new classifiers into a stream of documents. As much as possible, classifiers use separate threads to run in parallel.
Classification Overview
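The idea of plugging classifiers into a document stream and running them on separate threads can be sketched as follows. This is a minimal illustration, not RavenPack's framework: the function names, the document shape, and the two toy classifiers are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator, List

Document = Dict[str, object]
Classifier = Callable[[Document], Dict[str, object]]


def tokenize(doc: Document) -> Dict[str, object]:
    # Toy tokenizer: split on whitespace.
    return {"tokens": str(doc["text"]).split()}


def count_capitalized(doc: Document) -> Dict[str, object]:
    # Crude placeholder for entity detection: count words starting uppercase.
    words = str(doc["text"]).split()
    return {"capitalized": sum(1 for w in words if w[:1].isupper())}


def classify_stream(
    docs: Iterable[Document], classifiers: List[Classifier]
) -> Iterator[Document]:
    """Run every classifier on each document, one thread per classifier."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        for doc in docs:
            futures = [pool.submit(c, doc) for c in classifiers]
            annotated = dict(doc)
            for future in futures:
                annotated.update(future.result())  # merge each classifier's output
            yield annotated


stream = [{"id": 1, "text": "Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store"}]
results = list(classify_stream(stream, [tokenize, count_capitalized]))
print(results[0]["capitalized"], len(results[0]["tokens"]))
```

Adding a classifier means appending one function to the list. The sketch only parallelizes classifiers that are independent of each other; stages that depend on earlier output (for example, event detection on detected entities) would need to be ordered. In CPython, threads mainly help when classifiers wait on I/O.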
• Dictionary of nearly 400,000 entities
• Point-in-time aware
• Rules per entity type
• Extensive entity relationship modeling
• Supports metadata and other hints
• Equivalent terms and stop words
We support: company (Oracle Corp.), organization (European Union), geo-political place (Spain), currency (US Dollar), nationality (Spanish), people (Barack Obama), commodity (Crude Oil), position (CEO, President), team (Real Madrid), product (iPhone 6S), and more.
Entity Detection
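A "point-in-time aware" dictionary can be pictured as a lookup that takes the document date into account, so that a name resolves to whichever entity used it at that time. The sketch below assumes a simple alias table with validity ranges; the entity ids, names, and dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Alias:
    term: str                         # normalized (lowercase) surface form
    entity_id: str                    # hypothetical identifier
    entity_type: str
    valid_from: date
    valid_to: Optional[date] = None   # None means still valid

    def active_on(self, day: date) -> bool:
        return self.valid_from <= day and (self.valid_to is None or day <= self.valid_to)


# Illustrative entries only; none of this comes from the real dictionary.
ALIASES = [
    Alias("acme corp", "ENT-001", "company", date(1990, 1, 1), date(2009, 12, 31)),
    Alias("acme global", "ENT-001", "company", date(2010, 1, 1)),
    Alias("acme corp", "ENT-002", "company", date(2012, 6, 1)),
]


def lookup(term: str, as_of: date) -> List[Alias]:
    """Return the aliases matching `term` that were valid on `as_of`."""
    key = term.strip().lower()
    return [a for a in ALIASES if a.term == key and a.active_on(as_of)]


print([a.entity_id for a in lookup("Acme Corp", date(2005, 3, 1))])
print([a.entity_id for a in lookup("Acme Corp", date(2015, 3, 1))])
```

The same headline text therefore maps to different entities depending on the publication date, which matters when classifying a 20-year archive.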
Example: People Detection
• Many people share the same or similar names
• Many people hold various positions at employers across time
• People have one or more nationalities
• People are related to other people
Melanie Griffith files for divorce from Banderas
Mai And Banderas Star In The New The King Of Fighters XIV Trailer
After year out, Tim Cook joins competitive Oregon State running back battle
Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store
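One simple way to separate the two people named Tim Cook in the headlines above is to score each candidate by how many of its context hints appear in the text. This is an illustrative sketch only; the candidate ids and hint sets are assumptions derived from the example headlines, and a real system would use the richer relationships listed on the slide (positions, employers, nationalities).

```python
import re
from typing import Dict, Optional, Set

# Hypothetical candidates and hint words, keyed by lowercase name.
CANDIDATES: Dict[str, Dict[str, Set[str]]] = {
    "tim cook": {
        "PERSON-A (Apple CEO)": {"apple", "ceo", "ipad", "iphone"},
        "PERSON-B (Oregon State running back)": {"oregon", "state", "running", "back"},
    },
}


def disambiguate(name: str, headline: str) -> Optional[str]:
    """Pick the candidate whose hints overlap most with the headline words."""
    words = set(re.findall(r"[a-z0-9]+", headline.lower()))
    scores = {
        pid: len(hints & words)
        for pid, hints in CANDIDATES.get(name.lower(), {}).items()
    }
    if not scores:
        return None
    best = max(scores, key=lambda pid: scores[pid])
    # Refuse to guess when there is no evidence or the top score is tied.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best


print(disambiguate("Tim Cook", "After year out, Tim Cook joins competitive Oregon State running back battle"))
print(disambiguate("Tim Cook", "Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store"))
```

Returning `None` on a tie keeps an ambiguous mention from being attributed to the wrong person.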
Classic Model
• 6 Servers, 19 KVM virtual machines
• Limited Storage - Expensive to Upgrade
• Multiple Points of Failure
Use Case: Realtime Classification
[Diagram: Collectors, RT Feed, Snapshots, Classifier, RDBMS, Files]
Cloud Model using AWS
• CloudFormation to model the Stack
• Unlimited, Distributed Storage
• Easy redundancy, failover and backup
Use Case: Realtime Classification
[Diagram: Collectors, Classifiers, RT Feed, Snapshots; Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis]
• Lose central RDBMS → Lose transactions
• S3 great for documents, but no index
• DynamoDB great for index, but...
  • Must manage throughput
  • No foreign keys or integrity constraints
  • Eventual consistency
• Redshift amazing for OLAP, but not OLTP
  • So use Kinesis to stream and then batch
• Schema-free is a myth
Applications are more flexible and scalable, but also more complex.
Cloud Migration Challenges
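The "stream and then batch" pattern can be sketched as a small buffer that collects streamed records and hands them to a bulk loader in groups, since an OLAP store prefers few large loads over many single-row writes. The sketch is in-memory only and makes no AWS calls: `MicroBatcher`, `load_batch`, and the batch size are hypothetical stand-ins for a stream consumer and a bulk load.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class MicroBatcher:
    """Buffers streamed records and passes them to a bulk loader in batches."""

    def __init__(self, load_batch: Callable[[List[Record]], None], max_records: int) -> None:
        self._load_batch = load_batch
        self._max_records = max_records
        self._buffer: List[Record] = []

    def add(self, record: Record) -> None:
        self._buffer.append(record)
        if len(self._buffer) >= self._max_records:
            self.flush()

    def flush(self) -> None:
        # Swap the buffer out before loading so a failing load cannot
        # silently mix old and new records.
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._load_batch(batch)


loaded: List[List[Record]] = []          # stands in for the warehouse
batcher = MicroBatcher(load_batch=loaded.append, max_records=3)
for i in range(7):
    batcher.add({"doc_id": i})
batcher.flush()                           # load whatever is left over
print([len(batch) for batch in loaded])
```

A production version would presumably also flush on a timer so that a quiet stream does not hold records back indefinitely; that detail is left out here.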
Classic Model
• Same Limited Set of Servers, Same RDBMS
• Can affect Realtime System, Backups
• Full archive, 4-6 Classifiers → 6 weeks!
Use Case: History Classification
[Diagram: RDBMS, Files, Classifiers]
Cloud Model using AWS
• Servers on Demand, Distributed Storage
• Independent of Realtime System
• Full archive, 100 Classifiers → 3 days!
Use Case: History Classification
[Diagram: Coordinator and Classifiers spread across multiple Availability Zones; Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon Redshift]
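The two history-classification figures (6 weeks with 4-6 classifiers, 3 days with 100 classifiers) can be compared with some quick arithmetic. The calculation assumes a classic classifier and a cloud classifier are comparable units of work, which the slides do not state.

```python
classic_days = 6 * 7          # "6 weeks"
cloud_days = 3

# Wall-clock speedup of the cloud run over the classic run.
speedup = classic_days / cloud_days

# Total effort in classifier-days for each setup.
classic_effort_low = 4 * classic_days
classic_effort_high = 6 * classic_days
cloud_effort = 100 * cloud_days

print(speedup)
print(classic_effort_low, classic_effort_high, cloud_effort)
```

Read this way, the cloud run finishes 14 times sooner while spending at least as many classifier-days (300 versus 168 to 252), so the gain comes from parallelism rather than from doing less work.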
Classic Model - Clear skies!
• Well-known resources
• Predictable workload
• Predictable behavior
• Stable Behavior
We have full control over the resources.
We expect a service to be started seldom and to run for a long time without interruption.
The Weather
Cloud Model - Spot Instances
• Bid for unused capacity
• Save money, control costs
• Great for jobs with no specific deadline
• Possible to bid above on-demand rates
Typically pay 1/2 to 1/10 the “on-demand” rates.
We use spot instances for our historical classification runs.
The Weather
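To make the "1/2 to 1/10" range concrete, the sketch below prices a run the size of the historical classification (100 classifiers for 3 days). The on-demand price of 10 cents per instance-hour and the one-instance-per-classifier mapping are both hypothetical; only the price ratios come from the slide. Amounts are kept in integer cents to avoid floating-point rounding.

```python
instances = 100                      # assumed: one instance per classifier
hours = 3 * 24                       # a 3-day run
instance_hours = instances * hours

ON_DEMAND_CENTS_PER_HOUR = 10        # hypothetical price, in US cents

on_demand_cents = instance_hours * ON_DEMAND_CENTS_PER_HOUR
spot_cents_at_half = on_demand_cents // 2     # paying 1/2 of on-demand
spot_cents_at_tenth = on_demand_cents // 10   # paying 1/10 of on-demand

print(instance_hours)
print(on_demand_cents / 100, spot_cents_at_half / 100, spot_cents_at_tenth / 100)
```

Whatever the real hourly price, the ratio is what carries over: the same 7,200 instance-hours cost between a half and a tenth of the on-demand bill.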
Cloud Model - Warning! Uncertain Conditions
• Someone else’s resources
• Unpredictable behavior
• Easy to move the spot market
We have no control over the resources or who else might be using them. We expect a server can be killed with little notice.
The Weather
Cloud Model - Warning! Uncertain Conditions
• Do work in multiple zones
• Optimize image startup
• Group work into well-defined chunks
• Use on-demand instances for co-ordination
Expect inclement weather and be prepared for it!
Dealing with Bad Weather
Availability Zone
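Grouping work into well-defined chunks and keeping coordination on an on-demand instance suggests a design like the one sketched here: a coordinator leases chunks to spot workers and puts a chunk back in the queue when its worker disappears. The class, the method names, and the month-sized chunks are assumptions for illustration, not the actual RavenPack coordinator.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set


class Coordinator:
    """Tracks chunk state; intended to run where it will not be reclaimed."""

    def __init__(self, chunks: List[str]) -> None:
        self.pending: Deque[str] = deque(chunks)
        self.leased: Dict[str, str] = {}     # chunk -> worker holding it
        self.done: Set[str] = set()

    def lease(self, worker: str) -> Optional[str]:
        """Hand the next pending chunk to a worker, or None if none remain."""
        if not self.pending:
            return None
        chunk = self.pending.popleft()
        self.leased[chunk] = worker
        return chunk

    def complete(self, chunk: str) -> None:
        self.leased.pop(chunk, None)
        self.done.add(chunk)

    def worker_lost(self, worker: str) -> None:
        """Re-queue every chunk held by a worker that was killed."""
        for chunk in [c for c, w in self.leased.items() if w == worker]:
            del self.leased[chunk]
            self.pending.append(chunk)


coord = Coordinator(["2016-01", "2016-02", "2016-03"])
first = coord.lease("spot-a")
second = coord.lease("spot-b")
if first is not None:
    coord.complete(first)
coord.worker_lost("spot-b")              # spot-b is reclaimed mid-chunk

chunk = coord.lease("spot-c")            # a replacement worker drains the queue
while chunk is not None:
    coord.complete(chunk)
    chunk = coord.lease("spot-c")
print(sorted(coord.done))
```

Because a killed worker only costs the chunk it was holding, smaller chunks mean less repeated work, at the price of more coordination traffic.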
Download a Custom “Slice” of Analytics Data
• Provide a Web-API and Web Service
• Let client specify parameters
  • Data Set and Time Range
  • Entities and Events
  • Filters
• Leverage Amazon Redshift and S3
• Compression and Multiple Output Formats
Opportunity: Self-Service Data
[Diagram: Amazon API Gateway, Amazon EC2, Amazon Redshift, Amazon S3]
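The client-specified parameters listed above could be modeled as a validated request object that is turned into a parameterized query. Everything named here is hypothetical: the `SliceRequest` fields, the `analytics` table and its columns, and the allowed formats. The `%s` placeholder style follows common PostgreSQL drivers and is an assumption about how the warehouse would be queried.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

ALLOWED_FORMATS = {"csv", "json"}        # hypothetical output formats
ALLOWED_COMPRESSION = {"none", "gzip"}   # hypothetical compression options


@dataclass
class SliceRequest:
    dataset: str
    start: date
    end: date
    entity_ids: List[str] = field(default_factory=list)
    event_types: List[str] = field(default_factory=list)
    output_format: str = "csv"
    compression: str = "gzip"

    def validate(self) -> None:
        if self.start > self.end:
            raise ValueError("start must not be after end")
        if self.output_format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported format: {self.output_format}")
        if self.compression not in ALLOWED_COMPRESSION:
            raise ValueError(f"unsupported compression: {self.compression}")

    def to_query(self) -> Tuple[str, List[object]]:
        """Build SQL with placeholders; values travel separately as parameters."""
        self.validate()
        sql = "SELECT * FROM analytics WHERE dataset = %s AND event_date BETWEEN %s AND %s"
        params: List[object] = [self.dataset, self.start, self.end]
        for column, values in (("entity_id", self.entity_ids), ("event_type", self.event_types)):
            if values:
                placeholders = ", ".join(["%s"] * len(values))
                sql += f" AND {column} IN ({placeholders})"
                params.extend(values)
        return sql, params


request = SliceRequest("news-analytics", date(2016, 1, 1), date(2016, 3, 31),
                       entity_ids=["ENT-001", "ENT-002"])
sql, params = request.to_query()
print(sql)
```

Keeping client values out of the SQL string matters for a self-service endpoint, since the parameters come straight from external callers.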
• Let Clients upload Proprietary Content to a Private and Secure VPC
• Provision Computing and Storage Resources on a Per Project Basis
• View Private Analytics in Isolation or Alongside Standard RavenPack Analytic DataSets
• Everything Goes Away when Project Completes
Opportunity: The RavenPack Cloud
[Diagram: Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon Redshift, Amazon EC2, AWS CloudFormation, Amazon CloudSearch]
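The per-project lifecycle, where resources are provisioned for a project and everything goes away when it completes, maps naturally onto a scoped setup and teardown. The sketch below simulates this in memory with a context manager; no cloud resources are created, and `project_environment` and the resource names are hypothetical. One plausible realization, given the services shown, would be one CloudFormation stack per project, since deleting a stack removes the resources it created unless a retention policy says otherwise.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List

# In-memory stand-in for a cloud account; nothing real is provisioned.
ACTIVE_RESOURCES: Dict[str, List[str]] = {}


@contextmanager
def project_environment(project_id: str, resources: List[str]) -> Iterator[List[str]]:
    """Provision resources for one project and remove them all on exit."""
    ACTIVE_RESOURCES[project_id] = [f"{project_id}/{r}" for r in resources]
    try:
        yield ACTIVE_RESOURCES[project_id]
    finally:
        # Teardown runs even if the project code raises.
        del ACTIVE_RESOURCES[project_id]


with project_environment("client-42", ["vpc", "storage", "compute"]) as env:
    during = list(env)               # resources visible while the project runs
after = dict(ACTIVE_RESOURCES)       # nothing is left once it completes
print(during, after)
```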
May 21, 2016
Using the Cloud to Process Unstructured Big Data
J on the Beach, Malaga, Spain
Thank you! Gracias!
Jason Cornez ‒ [email protected]