Bizosys at Fifth Elephant

Post on 12-Jul-2015

©2013 BIZOSYS TECHNOLOGIES PRIVATE LIMITED

15 Billion computations in 187 milliseconds with a Big Join in Hadoop

Business Drivers

1. Support 6 months of data as opposed to 2 days

2. Near real-time calculation with optimal infrastructure

The Use-case: Assessing Market Risk of an Investment Portfolio

Acc   Equity   Qty
A1    MSFT     100
A1    ORCL     500
A2    CISCO    400

X

Equity   Model1   Model2
MSFT     $78.00   $77.12
ORCL     $33.78   $31.09
CISCO    $32.12   $16.00

What is the total portfolio value for Model1?
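
As a worked answer to the question above, here is a minimal sketch of the join on this slide's three positions and Model1 prices. The function name `portfolio_value` and the use of `Decimal` are illustrative choices, not something the slides specify.

```python
from decimal import Decimal

# Positions and Model1 prices copied from the two tables on this slide.
positions = [("A1", "MSFT", 100), ("A1", "ORCL", 500), ("A2", "CISCO", 400)]
model1_price = {
    "MSFT": Decimal("78.00"),
    "ORCL": Decimal("33.78"),
    "CISCO": Decimal("32.12"),
}

def portfolio_value(positions, prices):
    """Join each position to its price by equity and sum qty * price per account."""
    totals = {}
    for account, equity, qty in positions:
        totals[account] = totals.get(account, Decimal("0")) + qty * prices[equity]
    return totals

per_account = portfolio_value(positions, model1_price)
total = sum(per_account.values(), Decimal("0"))
print(per_account)  # A1: 24690.00, A2: 12848.00
print(total)        # 37538.00
```

The same join is repeated for every price model, which is what makes it expensive at scale.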

Problem with The Big Join :

3M positions:

Acc   Equity   Qty
A1    MSFT     100
A1    ORCL     500
A2    CISCO    400

X

2M products * 5000 Models/Day:

Equity   Model1   Model2
MSFT     $78.00   $77.12
ORCL     $45.12   $49.77
CISCO    $32.12   $16.00

15 Billion Calculations
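
The 15 billion figure can be reproduced with a short calculation, assuming it counts one price lookup and multiplication per position per model (the slides do not spell out the formula, so this reading is an assumption).

```python
positions = 3_000_000    # 3M positions, from the slide
models_per_day = 5_000   # 5000 price models per day, from the slide

# Assumption: every position is valued once under every price model.
calculations = positions * models_per_day
print(f"{calculations:,}")  # 15,000,000,000
```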

Schema Design…

Price Model   DAY1                                                              DAY N
Model1        Product 1 - Price, Product 2 - Price, …, Product 2000000 - Price
…             …                                                                 …
Model 5000    …                                                                 …

Date          All Positions
XX-XXX-XXXX   Acc Id 1 – ProductId 1 - 23 stocks, …, Acc Id 22000 – ProductId 200000 - 111 stocks

Why 1 price model is packed in 1 HBase Cell?

[Chart: GBs required for Product-Price model data, comparing "2M Products in 1 Cell" with "2M Products in 2M Cells"; axis from 0 to 600]

Eventual Consistency Overhead

Get rid of “HBase Cell meta-data” payload
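
A rough sketch of the packing idea: all prices of one model go into a single byte string, so the cell metadata is paid once instead of once per product. The 8 bytes per price comes from the slides; the helper names and the 40-byte per-cell overhead are illustrative assumptions, not measured HBase figures.

```python
import struct
from array import array

BYTES_PER_PRICE = 8  # the slides size each price at 8 bytes

def pack_prices(prices):
    """Pack all prices of one model into a single byte string (one cell value)."""
    return array("d", prices).tobytes()

def price_at(cell, product_index):
    """Read one product's price straight from its byte offset in the packed cell."""
    return struct.unpack_from("d", cell, product_index * BYTES_PER_PRICE)[0]

def packed_bytes(products):
    """Payload of one packed cell: prices only, no per-product cell metadata."""
    return products * BYTES_PER_PRICE

def unpacked_bytes(products, per_cell_overhead):
    """Payload when every price is its own cell and repeats the cell metadata."""
    return products * (BYTES_PER_PRICE + per_cell_overhead)

cell = pack_prices([78.0, 33.78, 32.12])
print(price_at(cell, 1))                                # 33.78
print(packed_bytes(2_000_000))                          # 16000000 bytes per model
print(unpacked_bytes(2_000_000, per_cell_overhead=40))  # 40 is illustrative only
```

In this layout a product's price is found by position, so product ids would need a stable mapping to array indexes.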

Why Region Server is set at 16*64 MB?

1 Thread per Price Model

64 Price Models/Machine

78 64-core machines** @ 78 Region Servers

Enable Parallel Computing

**This is based on the scalability factor from performance testing (150 ms per price model with parallel computing).
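
The sizing on this slide appears to follow from simple arithmetic. The sketch below assumes the 16 in "16*64 MB" is the size in MB of one packed price model (2 million prices at 8 bytes each) and the 64 is the number of models per machine; variable names are illustrative.

```python
price_models = 5_000       # models per day, from the slides
models_per_machine = 64    # 1 thread per price model on a 64-core machine
cell_mb = 16               # one packed price model: 2M prices * 8 bytes

# Region sized to hold the 64 models one machine processes in parallel.
region_mb = cell_mb * models_per_machine
print(region_mb)           # 1024

# Machines needed if each one holds 64 models; the slides use 78.
machines = price_models / models_per_machine
print(machines)            # 78.125
```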

Why HBase Coprocessors are used?

[Diagram: an HBase Coprocessor runs inside each region, from Region 1, Machine 1 and Region 2, Machine 1 through Region 78, Machine 78]

1 Cell = 1st Price Model = 2 Million product prices = 8 * 2 = 16M

1 Cell = 2nd Price Model = 2 Million product prices = 8 * 2 = 16M

1 Cell = 5000th Price Model = 2 Million product prices = 8 * 2 = 16M

Mapper, Mapper, Mapper -> Final output of models -> Reducer -> Value @ Risk output for 1 Day

Map-Reduce does not Jam Network.
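
The idea behind the diagram can be simulated in plain Python: each region values all positions against the price models it stores locally and ships only one number per model to the reducer. This is a toy simulation with made-up prices, not the HBase coprocessor API, and the function names are hypothetical.

```python
def region_compute(local_models, positions):
    """Runs next to the data: value every position under each locally stored model."""
    return {
        model_id: sum(qty * prices[product] for product, qty in positions)
        for model_id, prices in local_models.items()
    }

def reduce_results(partials):
    """Merge the small per-region results into one result per price model."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged

# Positions as (product index, quantity); prices are indexed by product.
positions = [(0, 100), (1, 500)]
region_1 = {"model-1": [10.0, 20.0]}
region_2 = {"model-2": [11.0, 19.0]}

results = reduce_results(
    [region_compute(region_1, positions), region_compute(region_2, positions)]
)
print(results)  # {'model-1': 11000.0, 'model-2': 10600.0}
```

Only the per-model totals cross the network, while the 16M price arrays stay on the machine that stores them.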

Why is price-model-id stored as row-key?

Reading sequentially (HBase Scanner) is a lot faster than random row reads.
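
HBase keeps rows sorted by the bytes of the row key, so keys that sort in model order let one scan read a contiguous batch of price models. The sketch below imitates that with a sorted Python list; the zero-padded key format is an assumption, since the slides do not show the actual key encoding.

```python
from bisect import bisect_left

def row_key(model_id):
    """Hypothetical key format: zero-padded so byte order matches numeric order."""
    return f"{model_id:05d}".encode()

# A sorted list stands in for a table whose rows are ordered by row key.
rows = sorted(row_key(m) for m in range(1, 5001))

def scan(rows, start, stop):
    """Return the contiguous slice of keys in [start, stop), like a range scan."""
    return rows[bisect_left(rows, start):bisect_left(rows, stop)]

batch = scan(rows, row_key(1), row_key(65))
print(len(batch), batch[0], batch[-1])  # 64 b'00001' b'00064'

# Without padding, byte order breaks numeric order.
print(sorted([b"2", b"10"]))            # [b'10', b'2']
```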

The Final Building Blocks

[Diagram: layered stack with two paths, Batch Mode Indexing and Real-Time computation]

Hadoop Distributed File System

Hadoop Map-Reduce

Hadoop HBase

HSearch Indexer

HSearch Coprocessor

MR Indexing Job with Lucene Analyzers

VAR RealTime MR Plug-In

HSearch Adapter

VAR Computation Application

Why We Like HBase

• Scalable
• Real-Time
• Apache Licensed

Why We Built HSearch

• Search and Analysis inside Hadoop
• Real-time Map-Reduce
• Extreme Parallelization

HOW

• Distribute index with auto-sharding and auto-replication - Handle Big Data

• Parallelize Indexing, Searching, Grouping - in milliseconds

• Binary serde, compress, (optionally encrypt) at storage and transmission - Securely

• Cache everything - Serving thousands of users

• Make everything redundant - With very limited support engineers

WHY

• Index, Search and Analyze multi-structure big data in milliseconds.

• Search/Analyze as events unfold - For any additions or changes at sources.

• Plug in custom algorithms/code with runtime data grouping and computing.

Available on

Apache Licensed

hadoopsearch.net

For more information regarding Bizosys business, please write to sunil@bizosys.com

http://www.bizosys.com
