Using the Cloud to Process Unstructured Big Data by Jason Cornez

May 21, 2016 Using the Cloud to Process Unstructured Big Data J on the Beach, Malaga, Spain RavenPack: Mapping the World’s Big Data for Financial Applications


Page 1: Using the cloud to process unstructured big data by Jason Cornez

May 21, 2016

Using the Cloud to Process Unstructured Big Data

J on the Beach, Malaga, Spain

RavenPack: Mapping the World’s Big Data for Financial Applications

Jason Cornez ‒ [email protected]

Page 2: Using the cloud to process unstructured big data by Jason Cornez

ravenpack.com | [email protected] | AMERICAS Tel: (646) 277-7339 | EMEA-APAC Tel: +34 952 90 73 90

• RavenPack delivers big data analytics to financial professionals

• Top hedge funds and investment banks use RavenPack for trading and risk management

• Patented, proprietary technology and award-winning research

• Archive of more than 300 million documents, spanning the past 20 years

RavenPack processes hundreds of thousands of documents each day.

We produce machine readable analytics for each document in real time.

Expected processing time for a typical document is less than 250ms.

RavenPack at a Glance

Page 3: Using the cloud to process unstructured big data by Jason Cornez

• Classification Overview

• Realtime Classification: Classic vs Cloud

• Historical Classification: Classic vs Cloud

• New Challenges: Spot Instances and The Weather

• New Opportunities

Contents

Page 4: Using the cloud to process unstructured big data by Jason Cornez

Extract meaning from Unstructured Text

• Tokenization

• Entity Detection

• Attribute Tagging

• Event Detection

• Consolidation

A stream-based Classification Framework allows us to add new classifiers into a stream of documents. As much as possible, classifiers use separate threads to run in parallel.

Classification Overview
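
As a rough illustration of the stream-based framework described on this slide, the sketch below runs each registered classifier on its own thread for every document and then consolidates the results. The function names, the toy classifiers, and the dictionary-based registry are assumptions made for this example, not RavenPack's implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Iterator

# A classifier takes the raw text and returns its own annotations.
Classifier = Callable[[str], object]


def tokenize(text: str) -> list:
    # Naive whitespace tokenizer, for illustration only.
    return text.split()


def detect_entities(text: str) -> list:
    # Toy detector: treats capitalized tokens as entity candidates.
    return [t for t in text.split() if t[:1].isupper()]


def classify_stream(docs: Iterable[str],
                    classifiers: Dict[str, Classifier]) -> Iterator[dict]:
    """Run every classifier on each document, one task per classifier."""
    with ThreadPoolExecutor(max_workers=len(classifiers)) as pool:
        for doc in docs:
            futures = {name: pool.submit(fn, doc)
                       for name, fn in classifiers.items()}
            # Consolidation: gather all annotations for the document.
            yield {name: f.result() for name, f in futures.items()}


results = list(classify_stream(
    ["Apple CEO Tim Cook attends launch"],
    {"tokens": tokenize, "entities": detect_entities},
))
```

Adding a classifier here means adding one entry to the dictionary. In CPython, threads help mainly when classifiers wait on I/O or call code that releases the interpreter lock, so a production system may well be built differently.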

Page 5: Using the cloud to process unstructured big data by Jason Cornez

• Dictionary of nearly 400,000 entities

• Point-in-time aware

• Rules per entity type

• Extensive entity relationship modeling

• Supports metadata and other hints

• Equivalent terms and stop words

We support: company (Oracle Corp.), organization (European Union), geo-political place (Spain), currency (US Dollar), nationality (Spanish), people (Barack Obama), commodity (Crude Oil), position (CEO, President), team (Real Madrid), product (iPhone 6S), and more.

Entity Detection
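
One way to picture a "point-in-time aware" dictionary is to attach a validity range to every name, so that a lookup depends on the document date. The sketch below is a minimal version of that idea; the `Alias` type, the identifiers, and the dates are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class Alias:
    name: str                         # surface form that appears in text
    entity_id: str                    # hypothetical identifier
    valid_from: date
    valid_to: Optional[date] = None   # None means still valid


# Illustrative entries only; identifiers and dates are made up.
DICTIONARY = (
    Alias("Acme Corp", "ENT-001", date(2000, 1, 1), date(2009, 12, 31)),
    Alias("Acme Corp", "ENT-002", date(2010, 1, 1)),
)


def lookup(name: str, as_of: date,
           dictionary: Sequence[Alias] = DICTIONARY) -> List[str]:
    """Return the entity ids that a name referred to on the given date."""
    return [
        a.entity_id for a in dictionary
        if a.name == name
        and a.valid_from <= as_of
        and (a.valid_to is None or as_of <= a.valid_to)
    ]
```

With these entries, the same name resolves to `ENT-001` for a 2005 document and to `ENT-002` for a 2016 document, which matters when reclassifying a 20-year archive.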

Page 6: Using the cloud to process unstructured big data by Jason Cornez

Example: People Detection

• Many people share the same or similar names

• Many people hold various positions at employers across time

• People have one or more nationalities

• People are related to other people

Melanie Griffith files for divorce from Banderas

Mai And Banderas Star In The New The King Of Fighters XIV Trailer

After year out, Tim Cook joins competitive Oregon State running back battle

Apple CEO Tim Cook Attends iPad Pro 9.7 inch Launch at Palo Alto Store
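
The two "Tim Cook" headlines above suggest how surrounding words can separate people who share a name. The sketch below scores each candidate by how many of its hint words appear in the headline. The candidate identifiers and hint sets are hypothetical, and simple word overlap stands in for whatever rules a real system applies.

```python
from typing import Dict, Optional, Set

# Hypothetical candidate profiles: hint words associated with each person.
CANDIDATES: Dict[str, Dict[str, Set[str]]] = {
    "Tim Cook": {
        "tim-cook-executive": {"apple", "ceo", "ipad", "iphone"},
        "tim-cook-athlete": {"oregon", "state", "running", "back"},
    },
}


def disambiguate(name: str, headline: str) -> Optional[str]:
    """Pick the candidate whose hint words overlap most with the headline."""
    words = {w.strip(",.").lower() for w in headline.split()}
    scores = {cid: len(hints & words)
              for cid, hints in CANDIDATES.get(name, {}).items()}
    if not scores or max(scores.values()) == 0:
        return None  # no supporting context: leave the mention unresolved
    return max(scores, key=scores.get)
```

Returning `None` when no hint matches is a deliberate choice in this sketch: an unresolved mention is easier to handle downstream than a confident wrong match.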

Page 7: Using the cloud to process unstructured big data by Jason Cornez

Classic Model

• 6 Servers, 19 KVM virtual machines

• Limited Storage - Expensive to Upgrade

• Multiple Points of Failure

Use Case: Realtime Classification

[Diagram labels: Collectors, Classifier, RDBMS, Files, RT Feed, Snapshots]

Page 8: Using the cloud to process unstructured big data by Jason Cornez

Cloud Model using AWS

• CloudFormation to model the Stack

• Unlimited, Distributed Storage

• Easy redundancy, failover and backup

Use Case: Realtime Classification

[Diagram labels: Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon CloudSearch, Amazon Redshift, Amazon Kinesis, Collectors, Classifiers, RT Feed, Snapshots]

Comment (Gonzalo Bahut): Using CloudFormation, we can replicate the Stack in the same or in different geographical regions, which gives high availability and client-oriented performance.
Page 9: Using the cloud to process unstructured big data by Jason Cornez

• Lose central RDBMS → Lose transactions

• S3 great for documents, but no index

• DynamoDB great for index, but...

  - Must manage throughput

  - No foreign keys or integrity constraints

  - Eventual consistency

• Redshift amazing for OLAP, but not OLTP

  - So use Kinesis to stream and then batch

• Schema-free is a myth

Applications are more flexible and scalable, but also more complex.

Cloud Migration Challenges
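
The "stream and then batch" idea from the Redshift bullet can be sketched without any AWS dependency: records arrive one at a time and are handed to a bulk loader in groups. The `BatchBuffer` class, the `sink` callback, and the batch size below are assumptions made for this example; in the architecture described here the stream would be Kinesis and the sink a bulk load into Redshift.

```python
from typing import Callable, List


class BatchBuffer:
    """Collect streamed records and hand them to a sink in batches."""

    def __init__(self, sink: Callable[[List[dict]], None],
                 max_records: int = 500):
        self.sink = sink                  # e.g. a function that bulk-loads a batch
        self.max_records = max_records
        self._pending: List[dict] = []

    def add(self, record: dict) -> None:
        self._pending.append(record)
        if len(self._pending) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            batch, self._pending = self._pending, []
            self.sink(batch)


batches: List[List[dict]] = []
buf = BatchBuffer(batches.append, max_records=3)
for i in range(7):
    buf.add({"doc_id": i})
buf.flush()  # push the final partial batch
```

Seven records with a batch size of three produce two full batches and one partial batch. A real pipeline would probably also flush on a timer so that a quiet stream does not hold records back indefinitely.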

Comment (Gonzalo Bahut): Reconsider "Eventual consistency". DynamoDB read consistency is configurable. We can make our reads consistent at the price of lower performance.
Comment (Gonzalo Bahut): I think the term "losing" is too severe. Nothing prevents us from running a big Oracle RDBMS in the cloud as well.
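
The configurable read consistency mentioned in the comment above corresponds to the `ConsistentRead` parameter of DynamoDB's `GetItem` operation. The sketch below wraps that call; it assumes `table` behaves like a boto3 DynamoDB `Table` resource, and the key name `doc_id` is hypothetical.

```python
from typing import Any, Optional


def get_index_entry(table: Any, doc_id: str,
                    consistent: bool = False) -> Optional[dict]:
    """Fetch one index entry from a DynamoDB-like table.

    `table` is expected to expose get_item() the way a boto3 Table does.
    """
    response = table.get_item(
        Key={"doc_id": doc_id},      # "doc_id" is a hypothetical key name
        ConsistentRead=consistent,   # True requests a strongly consistent read
    )
    return response.get("Item")      # absent when no item matches the key
```

Strongly consistent reads consume more read capacity than eventually consistent ones, so a caller might request them only where a stale index entry would cause real harm.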
Page 10: Using the cloud to process unstructured big data by Jason Cornez

Classic Model

• Same Limited Set of Servers, Same RDBMS

• Can affect Realtime System, Backups

• Full archive, 4-6 Classifiers → 6 weeks!

Use Case: History Classification

[Diagram labels: RDBMS, Files, Classifiers]

Page 11: Using the cloud to process unstructured big data by Jason Cornez

Cloud Model using AWS

• Servers on Demand, Distributed Storage

• Independent of Realtime System

• Full archive, 100 Classifiers → 3 days!

Use Case: History Classification

[Diagram labels: Amazon EC2, AWS CloudFormation, Amazon DynamoDB, Amazon S3, Amazon RDS, Amazon Redshift, Availability Zones, Classifiers, Coordinator]
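
The figures on the two History Classification slides (6 weeks with 4-6 classifiers versus 3 days with 100 classifiers) can be compared with a little arithmetic. The sketch below uses only those numbers; "classifier-days" is a unit introduced here for the comparison.

```python
DAYS_PER_WEEK = 7

classic_days = 6 * DAYS_PER_WEEK     # 6 weeks on the classic setup
cloud_days = 3                       # 3 days on the cloud setup
speedup = classic_days / cloud_days

# Work expressed in classifier-days (classifiers x elapsed days).
classic_work_low = 4 * classic_days
classic_work_high = 6 * classic_days
cloud_work = 100 * cloud_days

print(f"wall-clock speedup: {speedup:.0f}x")
print(f"classic: {classic_work_low}-{classic_work_high} classifier-days")
print(f"cloud:   {cloud_work} classifier-days")
```

The wall-clock time drops by a factor of 14, while the total classifier-days go up (300 versus 168 to 252). The comparison is only rough, since the slides do not say whether classic and cloud classifiers run on comparable hardware.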

Page 12: Using the cloud to process unstructured big data by Jason Cornez

Classic Model - Clear skies!

• Well-known resources

• Predictable workload

• Predictable behavior

• Stable Behavior

We have full control over the resources.

We expect a service to be started seldom and to run for a long time without interruption.

The Weather

Page 13: Using the cloud to process unstructured big data by Jason Cornez

Cloud Model - Spot Instances

• Bid for unused capacity

• Save money, control costs

• Great for jobs with no specific deadline

• Possible to bid above on-demand rates

Typically pay 1/2 to 1/10 the “on-demand” rates.

We use spot instances for our historical classification runs.

The Weather
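
To make the bidding trade-off concrete, the sketch below simulates one spot instance against a series of hourly market prices. It assumes a simplified model in which the instance is charged the market price (not the bid) for each hour it runs and is reclaimed as soon as the price rises above the bid. The price series is invented.

```python
from typing import List, Tuple


def simulate_spot(hourly_prices: List[float], bid: float) -> Tuple[int, float]:
    """Return (hours run, total cost) until the price first exceeds the bid.

    Simplified assumption: each hour is charged at the market price, and
    the instance is terminated once the price rises above the bid.
    """
    hours, cost = 0, 0.0
    for price in hourly_prices:
        if price > bid:
            break   # outbid: the instance is reclaimed
        hours += 1
        cost += price
    return hours, cost


# Hypothetical price series, in dollars per hour.
prices = [0.10, 0.12, 0.11, 0.45, 0.10]
low_bid = simulate_spot(prices, bid=0.20)
high_bid = simulate_spot(prices, bid=0.50)
```

With the lower bid the instance is lost at the price spike in the fourth hour. With the higher bid it survives the spike but pays the spike price for that hour, which is the trade-off behind bidding above the usual rate.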

Page 14: Using the cloud to process unstructured big data by Jason Cornez

Cloud Model - Warning! Uncertain Conditions

• Someone else’s resources

• Unpredictable behavior

• Easy to move the spot market

We have no control over the resources or who else might be using them. We expect a server can be killed with little notice.

The Weather

Page 15: Using the cloud to process unstructured big data by Jason Cornez

Cloud Model - Warning! Uncertain Conditions

• Do work in multiple zones

• Optimize image startup

• Group work into well-defined chunks

• Use on-demand instances for co-ordination

Expect inclement weather and be prepared for it!

Dealing with Bad Weather

[Diagram label: Availability Zone]
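
Two of the points on this slide, grouping work into well-defined chunks and keeping coordination on a stable instance, can be sketched as a small in-memory coordinator that re-queues the chunks of any worker that disappears. The class, the worker names, and the month-sized chunk labels are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class Coordinator:
    """Hand out chunks of work and re-queue chunks whose worker was lost."""

    def __init__(self, chunks: List[str]):
        self._todo: Deque[str] = deque(chunks)
        self._in_progress: Dict[str, str] = {}   # chunk -> worker id
        self.done: List[str] = []

    def claim(self, worker: str) -> Optional[str]:
        if not self._todo:
            return None
        chunk = self._todo.popleft()
        self._in_progress[chunk] = worker
        return chunk

    def complete(self, chunk: str) -> None:
        del self._in_progress[chunk]
        self.done.append(chunk)

    def worker_lost(self, worker: str) -> None:
        # A spot instance was reclaimed: put its unfinished chunks back.
        for chunk, owner in list(self._in_progress.items()):
            if owner == worker:
                del self._in_progress[chunk]
                self._todo.append(chunk)


coord = Coordinator(["2016-01", "2016-02", "2016-03"])
a = coord.claim("spot-a")
b = coord.claim("spot-b")
coord.complete(a)
coord.worker_lost("spot-b")      # the chunk held by spot-b is queued again
remaining = [coord.claim("spot-c"), coord.claim("spot-c")]
```

Because a lost chunk is simply processed again, each chunk should be safe to repeat. Smaller chunks waste less work per interruption but add more coordination overhead.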

Page 16: Using the cloud to process unstructured big data by Jason Cornez

Download a Custom “Slice” of Analytics Data

• Provide a Web-API and Web Service

• Let client specify parameters

  - Data Set and Time Range

  - Entities and Events

  - Filters

• Leverage Amazon Redshift and S3

• Compression and Multiple Output Formats

Opportunity: Self-Service Data

[Diagram labels: Amazon S3, Amazon Redshift, Amazon EC2, Amazon API Gateway]
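
A self-service request of this kind might be turned into a parameterized query along the following lines. The `SliceRequest` fields, the table names, and the column names are hypothetical; the `%s` placeholders follow the style used by common PostgreSQL drivers, which are often used to talk to Redshift.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple

# Hypothetical table names; checked against a whitelist because
# identifiers cannot be bound as query parameters.
ALLOWED_DATASETS = {"equities", "macro"}


@dataclass
class SliceRequest:
    dataset: str                      # must be one of ALLOWED_DATASETS
    start: date
    end: date
    entity_ids: List[str] = field(default_factory=list)


def build_query(req: SliceRequest) -> Tuple[str, list]:
    """Return SQL text with %s placeholders plus the values to bind."""
    if req.dataset not in ALLOWED_DATASETS:
        raise ValueError(f"unknown dataset: {req.dataset}")
    sql = f"SELECT * FROM {req.dataset} WHERE event_date BETWEEN %s AND %s"
    params: list = [req.start, req.end]
    if req.entity_ids:
        placeholders = ", ".join(["%s"] * len(req.entity_ids))
        sql += f" AND entity_id IN ({placeholders})"
        params.extend(req.entity_ids)
    return sql, params
```

Keeping client-supplied values out of the SQL text matters for a public web API. The compression and output-format options from the slide would apply when the result is written to S3, which this sketch does not cover.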

Page 17: Using the cloud to process unstructured big data by Jason Cornez

• Let Clients upload Proprietary Content to a Private and Secure VPC

• Provision Computing and Storage Resources on a Per Project Basis

• View Private Analytics in Isolation or Alongside Standard RavenPack Analytic DataSets

• Everything Goes Away when Project Completes

Opportunity: The RavenPack Cloud

[Diagram labels: Amazon DynamoDB, Amazon RDS, Amazon S3, Amazon Redshift, Amazon EC2, AWS CloudFormation, Amazon CloudSearch]
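
The "provision per project, everything goes away when the project completes" lifecycle maps naturally onto creating and deleting a CloudFormation stack. The sketch below wraps that in a context manager. It assumes `cfn` behaves like a boto3 CloudFormation client (`create_stack` and `delete_stack`), and the stack naming scheme is hypothetical.

```python
from contextlib import contextmanager
from typing import Any, Iterator


@contextmanager
def project_stack(cfn: Any, project: str, template_body: str) -> Iterator[str]:
    """Create a per-project stack and delete it when the block exits.

    `cfn` is expected to behave like a boto3 CloudFormation client.
    """
    stack_name = f"project-{project}"   # hypothetical naming scheme
    cfn.create_stack(StackName=stack_name, TemplateBody=template_body)
    try:
        yield stack_name
    finally:
        # Resources provisioned by the stack are removed along with it.
        cfn.delete_stack(StackName=stack_name)
```

Stack creation is asynchronous, so a real version would wait for the stack to finish creating before using it and would decide explicitly what happens to client data at teardown.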

Page 18: Using the cloud to process unstructured big data by Jason Cornez

May 21, 2016

Using the Cloud to Process Unstructured Big Data

J on the Beach, Malaga, Spain

Thank you! Gracias!

Jason Cornez ‒ [email protected]