Soft-Shake 2013: Enabling Realtime Queries to End Users

Enabling Real-time Queries to End Users. Benoit Perroud, SoftShake, Geneva, October 24, 2013

Upload: benoit-perroud

Post on 06-May-2015


DESCRIPTION

Since it became an Apache Top Level Project in early 2008, Hadoop has established itself as the de facto industry standard for batch processing. The two layers composing its core, HDFS and MapReduce, are strong building blocks for data processing. Running data analysis and crunching petabytes of data is no longer fiction. But the MapReduce framework does have two major drawbacks: query latency and data freshness.

At the same time, businesses have started to exchange more and more data through REST APIs, leveraging HTTP verbs (GET, POST, PUT, DELETE) and URIs (for instance http://company/api/v2/domain/identifier), pushing the need to read data in a random-access style, from simple key/value lookups to complex queries.

Enhancing the Big Data stack with real-time search capabilities is the next natural step for the Hadoop ecosystem, because the MapReduce framework was not designed with synchronous processing in mind. There is a lot of traction in this area today, and this talk will try to answer the question of how to fill this gap with specific open-source components, ultimately building a dedicated platform that enables real-time queries on Internet-scale data sets.

After discussing how deployments of common Hadoop platforms have evolved, a hybrid approach called lambda architecture will be proposed. It will be demonstrated with concrete examples, discussing which technologies could be a good match and how they would interact.

TRANSCRIPT

Page 1: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Enabling Real-time Queries to End Users

Benoit Perroud

SoftShake, Geneva, October 24, 2013

Page 2: Soft-Shake 2013 : Enabling Realtime Queries to End Users

2 Verisign Public

About Me

• Benoit Perroud
• Software Engineer @ Verisign
• Leading Hadoop Infrastructure Team
• Apache Committer
• @killerwhile

Page 3: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Agenda

• What’s going on
• Data lifecycle
• Batch and Realtime
• Hadoop Deployments
• Next Steps

Page 4: Soft-Shake 2013 : Enabling Realtime Queries to End Users

What’s going on

• Mainframes are obsolete, replaced by clusters of commodity hardware
• TenG (10 Gb/s) links are the new standard
• RESTful APIs are everywhere
• Everybody wants to visit Paxos Island
• Firehoses do not only carry water
• Asynchronous non-blocking functional programming is taught at primary school
• NoSQL is the new way to store data at scale
• API management startups are rising (and raising)
• Hadoop keywords boost your LinkedIn profile by 2000%
• Public clouds are responsible for more than 50% of the global Internet traffic
• … and counting …

Page 5: Soft-Shake 2013 : Enabling Realtime Queries to End Users

A Possible Deployment

Source: http://dev.datasift.com/blog/high-scalability

Note: the diagram dates from 2009; it is probably partially or even completely outdated today.

Page 6: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Data Lifecycle

Page 7: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Data Lifecycle

[Diagram: Producers → Data Ingestion → Data Storage → Data Processing → Data Retrieval → Consumers]

Page 8: Soft-Shake 2013 : Enabling Realtime Queries to End Users

• Copying internal and external sources of data into the cluster
• Pre-processing: data cleanup, proper format, …
• Time vs. block-size tradeoff
• Targeted property: Availability

[Diagram: Source of Data → Ingesting the flow → Local buffering → Uploading to HDFS → HDFS]
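The time vs. block-size tradeoff in the local buffering step can be sketched as follows. This is a minimal illustration, not the pipeline described in the talk: the `IngestBuffer` class, its thresholds and the `flush` callback are all hypothetical names, and the upload itself is left to whatever HDFS client is in use.

```python
import time
from typing import Callable, List, Optional


class IngestBuffer:
    """Buffer records locally; flush when the buffer is big enough or old enough."""

    def __init__(self, flush: Callable[[bytes], None], max_bytes: int,
                 max_age_s: float, clock: Callable[[], float] = time.monotonic):
        self._flush = flush          # hypothetical uploader, e.g. a wrapper around an HDFS client
        self._max_bytes = max_bytes  # size threshold, ideally close to the HDFS block size
        self._max_age_s = max_age_s  # upper bound on how long a record waits locally
        self._clock = clock          # injectable clock, which makes the class easy to test
        self._chunks: List[bytes] = []
        self._size = 0
        self._first_at: Optional[float] = None  # arrival time of the oldest buffered record

    def append(self, record: bytes) -> None:
        if self._first_at is None:
            self._first_at = self._clock()
        self._chunks.append(record)
        self._size += len(record)
        self.maybe_flush()

    def maybe_flush(self) -> bool:
        """Flush if either threshold is reached; return True if a flush happened."""
        if not self._chunks:
            return False
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._first_at >= self._max_age_s
        if too_big or too_old:
            self._flush(b"".join(self._chunks))
            self._chunks, self._size, self._first_at = [], 0, None
            return True
        return False
```

A larger `max_bytes` produces fewer, bigger files, while a smaller `max_age_s` makes data visible sooner at the cost of more small files. In a real ingester, `maybe_flush` would also be called periodically so that a quiet source still gets flushed.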

Page 9: Soft-Shake 2013 : Enabling Realtime Queries to End Users

• Hadoop HDFS is a well-established distributed file system

• File system is the central component of every data-driven approach

• Space vs. network tradeoff

• Targeted property: Reliability

[Diagram: File1 → Upload to HDFS → DataNode1, DataNode2, DataNode3, DataNode4]

Page 10: Soft-Shake 2013 : Enabling Realtime Queries to End Users

• Hadoop MapReduce
• Higher level tools (Hive, Pig, Impala) help
• Data catalog needs to be maintained

Targeted property: parallelism
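To make the parallelism property concrete, here is a single-process sketch of the MapReduce programming model (a word count). It does not use the Hadoop API; the function names are illustrative only. The point is that each `map_phase` call depends on one input line and each `reduce_phase` call on one key, so a framework is free to run them on different nodes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word of one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

For example, `word_count(["a b a", "b c"])` counts two occurrences each of `a` and `b` and one of `c`.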

Page 11: Soft-Shake 2013 : Enabling Realtime Queries to End Users

• Only way to make use of the data
• Business driven need
• At scale, data needs to be stored the way it is queried
• DPI: Data Programmable Interfaces

Targeted property: user friendliness, reliability

Page 12: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Batch and Realtime

Page 13: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Batch Processing

[Timeline diagram, along a Time axis with marks t1 to t5]

• Batch 1 starts processing, then Batch 1 is ready to be served (t1, t2)
• Batch 2 starts processing, then Batch 2 is ready to be served (t3, t4)
• Batch 3 starts processing (t5)
• Queries read data from batch 1, then from batch 2
• Data gap (marked twice on the timeline)

Page 14: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Batch Processing in Detail

[Timeline diagram, along a Time axis]

• Batch with data from yesterday
• New batch granularity period
• Let some time for data to finish uploading
• Processing time
• Load results in a data store
• Notify the retrieval system that a new batch is ready to be served
• Query data from the day before yesterday?
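The question on this slide can be worked through with a little arithmetic. The sketch below assumes a simplified model that is not stated in the slides: batch k covers the interval [k × period, (k + 1) × period), and it becomes queryable once the upload grace time, the processing time and the load time have all elapsed after the end of that interval. All names and the hour-based units are illustrative.

```python
import math
from typing import Optional


def latest_served_batch(now_h: float, period_h: float, upload_grace_h: float,
                        processing_h: float, load_h: float) -> Optional[int]:
    """Index of the most recent batch that is fully loaded at time now_h.

    Batch k covers [k * period_h, (k + 1) * period_h) and is ready at
    (k + 1) * period_h + upload_grace_h + processing_h + load_h.
    Returns None if no batch is ready yet.
    """
    delay_h = upload_grace_h + processing_h + load_h
    k = math.floor((now_h - delay_h) / period_h) - 1
    return k if k >= 0 else None


def staleness_h(now_h: float, period_h: float, upload_grace_h: float,
                processing_h: float, load_h: float) -> Optional[float]:
    """Age, in hours, of the freshest record a query can see at time now_h."""
    k = latest_served_batch(now_h, period_h, upload_grace_h, processing_h, load_h)
    if k is None:
        return None
    return now_h - (k + 1) * period_h
```

With a daily batch (`period_h=24`) and a total delay of five hours (one for upload, three for processing, one for loading), a query at 02:00 on day 2 (`now_h=50`) is still answered from batch 0, that is, from the day before yesterday. Only from 05:00 onwards (`now_h=53`) does batch 1, yesterday's data, take over.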

Page 15: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Realtime Query

• Interactive query
• REST-like request/response queries
• With SLA

And

• Query the latest version of the data
• Latest means n seconds ago, with n predictable
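One way to read "n seconds ago with n predictable" is that every response can state how old the served data is and whether that age is within the promised bound. The sketch below is an assumption-laden illustration, not something shown in the talk: `FreshnessBoundedStore`, its watermark and the in-order update assumption are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class QueryResult:
    value: Any
    age_s: float         # how far behind "now" the served data is
    within_bound: bool   # True if age_s does not exceed the promised n seconds


class FreshnessBoundedStore:
    """In-memory key/value store that reports the age of what it serves."""

    def __init__(self, max_age_s: float):
        self._max_age_s = max_age_s     # the "n" of the slide
        self._data: Dict[str, Any] = {}
        self._watermark = 0.0           # assumes updates arrive in event-time order

    def apply(self, key: str, value: Any, event_time: float) -> None:
        """Apply one update and advance the watermark to its event time."""
        self._data[key] = value
        self._watermark = max(self._watermark, event_time)

    def get(self, key: str, now: float) -> QueryResult:
        """Return the value (None if absent) together with its freshness information."""
        age = now - self._watermark
        return QueryResult(self._data.get(key), age, age <= self._max_age_s)
```

A caller, for example a REST handler, could use `within_bound` to decide whether to answer normally or to signal that the freshness promise is currently not met.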

Page 16: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Hadoop Deployments

Page 17: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Naïve Hadoop Deployment

[Diagram: a Gateway runs hdfs dfs -put, mapred job …jar and hdfs dfs -get against a single Processing cluster made of a NameNode, a JobTracker and ten DataNodes]

Page 18: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Industry Hadoop Deployment

[Diagram: several Hadoop clusters, each with its own NameNode, JobTracker and DataNodes, labeled Processing and Research, Data Science; around them sit a Gateway, a Data In GW, a Data Out GW, a Metadata Store and Monitoring]

Page 19: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Realtime Hadoop Deployment

[Diagram: Data In GW, Gateway, a Processing cluster (NameNode, JobTracker, eight DataNodes), a second NameNode and JobTracker, RT processing and an RT Data Out GW]

Page 20: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Hybrid Approach

[Timeline diagram, along a Time axis with marks t1 to t4]

• Batch 1 starts processing, then Batch 1 is ready to be served (t1, t2)
• Batch 2 starts processing, then Batch 2 is ready to be served (t3, t4)
• Complementary data for batch 1
• Complementary data for batch 2
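The hybrid idea, serving a batch result plus complementary data for whatever the batch has not covered yet, can be sketched with simple counters. This is a toy model under stated assumptions: events are (timestamp, key) pairs, the batch view is a precomputed count per key up to a cutoff, and the function names are hypothetical rather than taken from any framework.

```python
from collections import Counter
from typing import Iterable, Mapping, Tuple

Event = Tuple[float, str]  # (timestamp, key)


def batch_view(events: Iterable[Event], batch_cutoff: float) -> Counter:
    """Counts computed by the batch layer: everything strictly before the cutoff."""
    return Counter(key for ts, key in events if ts < batch_cutoff)


def realtime_view(events: Iterable[Event], batch_cutoff: float) -> Counter:
    """Complementary counts: only the events the batch view has not seen yet."""
    return Counter(key for ts, key in events if ts >= batch_cutoff)


def merged_count(key: str, batch: Mapping[str, int], realtime: Mapping[str, int]) -> int:
    """Answer a query by adding the batch result and its complementary data."""
    return batch.get(key, 0) + realtime.get(key, 0)
```

When the next batch becomes ready, the cutoff moves forward: the new batch view absorbs the older events and the realtime view shrinks back to the most recent ones, so no event is counted twice.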

Page 21: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Realtime Search with Hadoop

[Diagram: Data In GW, Gateway, a Hadoop cluster (NameNode, JobTracker, eight DataNodes) labeled Generate Indexes, a second NameNode and JobTracker, an Update indexes flow, a Coordinator and an RT Data Out GW]
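The split between generating indexes in batch and updating them in real time can be illustrated with a tiny inverted index. The sketch assumes whitespace tokenization and a hypothetical `SearchNode` class; it is not how SolrCloud, ElasticSearch or any other search engine implements it, and the coordination step is reduced to a single `swap_base` call.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Index = Dict[str, Set[str]]  # term -> ids of the documents containing it


def build_index(docs: Iterable[Tuple[str, str]]) -> Index:
    """Batch step: build a full inverted index from (doc_id, text) pairs."""
    index: Index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


class SearchNode:
    """Serves queries from a batch-generated base index plus incremental updates."""

    def __init__(self) -> None:
        self._base: Index = {}
        self._delta: Index = defaultdict(set)

    def swap_base(self, new_base: Index) -> None:
        """Install a new batch index; assumes it already covers every document in the delta."""
        self._base = new_base
        self._delta = defaultdict(set)

    def update(self, doc_id: str, text: str) -> None:
        """Realtime step: index one new document without waiting for the next batch."""
        for term in text.lower().split():
            self._delta[term].add(doc_id)

    def search(self, term: str) -> Set[str]:
        term = term.lower()
        return set(self._base.get(term, set())) | set(self._delta.get(term, set()))
```

Between two batch runs, new documents are searchable through the delta; once a fresh base index is swapped in, the delta is dropped.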

Page 22: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Next Steps

Page 23: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Hadoop Ecosystem

… is moving … really fast

• Interactive Queries: Cloudera Impala, Apache Drill, Tez, …
• Search: SolrCloud, ElasticSearch, Cloudera Search
• Hybrid layer: Twitter Summingbird
• … and counting…

Page 24: Soft-Shake 2013 : Enabling Realtime Queries to End Users

Thanks for the attention!

Follow @killerwhile

“Copyright © 2013 VeriSign, Inc.  All rights reserved.  The VERISIGN word mark, the Verisign logo, and other Verisign trademarks, service marks, and designs that may appear herein are registered or unregistered trademarks or service marks of VeriSign, Inc., and its subsidiaries in the United States and foreign countries.  All other trademarks, service marks, and designs are property of their respective owners.  Verisign has made efforts to ensure the accuracy and completeness of the information in this document.  However, Verisign makes no warranties of any kind (whether express, implied or statutory) with respect to the information contained herein. Verisign assumes no liability to any party for any loss or damage (whether direct or indirect) caused by any errors, omissions, or statements of any kind contained in this document.  Further, Verisign assumes no liability arising from the application or use of the products, services, or materials described or referenced herein and specifically disclaims any representation that any such products, services, or materials do not infringe upon any existing or future intellectual property rights.”