穆黎森:interactive batch query at scale


Post on 11-May-2015


DESCRIPTION

BDTC 2013 Beijing China

TRANSCRIPT

Page 1: 穆黎森:Interactive batch query at scale

Interactive Batch Query At Scale

Adhoc query system for game analytics based on Drill


Page 2: 穆黎森:Interactive batch query at scale

Related Topics

• Java Programming

• Relational Algebra

• Distributed Database

• Hadoop Ecosystem


Page 3: 穆黎森:Interactive batch query at scale

About Us

• Elex-tech

• Game Development, Game Publishing

• SNS Games, Web Games, Mobile Games, Apps

• Global Market


Page 4: 穆黎森:Interactive batch query at scale

• The Problem

• Brief on Drill

• Design Considerations

• Enhancement from Xingcloud

• Now & Future


Page 5: 穆黎森:Interactive batch query at scale

The Problem


Page 6: 穆黎森:Interactive batch query at scale

The Problem

• How many logins today?

• How many individual users this week?

• Total income today?

• Number of paying users this month?

• …


Page 7: 穆黎森:Interactive batch query at scale

The Problem: Facts

• How many X during time period of Y

• Fact Table

user id    event   amount   timestamp
user_001   login   -        1383729081
user_002   login   -        1383729082
user_001   pay     4.99     1383729084
user_003   login   -        1383729090

Page 8: 穆黎森:Interactive batch query at scale

The Problem: Facts

• How many logins today?

• How many individual users this week?

• Total income today?

• Number of paying users this month?

• …


Page 9: 穆黎森:Interactive batch query at scale

The Problem: Facts

• How many logins today?

• select count(*) from fact where event='login' and date(timestamp)='2013-12-06';


Page 10: 穆黎森:Interactive batch query at scale

The Problem: Facts

• How many individual users this week?

• select count(distinct uid) from fact where event='login' and timestamp>='?' and timestamp<'?';


Page 11: 穆黎森:Interactive batch query at scale

The Problem: Facts

• Total income today?

• select sum(amount) from fact where event='pay' and timestamp>='?' and timestamp<'?';


Page 12: 穆黎森:Interactive batch query at scale

The Problem: Facts

• Number of paying users this month?

• select count(distinct uid) from fact where event='pay' and timestamp>='?' and timestamp<'?';
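The four fact queries can be tried end to end on the sample rows. The following is a minimal sketch using Python's built-in sqlite3 module rather than Drill; the in-memory table, the `scalar` helper and the chosen time window (the UTC day containing the sample timestamps) are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the fact table from the slides (not a Drill table).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER)"
)
conn.executemany(
    "INSERT INTO fact VALUES (?, ?, ?, ?)",
    [
        ("user_001", "login", None, 1383729081),
        ("user_002", "login", None, 1383729082),
        ("user_001", "pay", 4.99, 1383729084),
        ("user_003", "login", None, 1383729090),
    ],
)

# Assumed window: 2013-11-06 00:00 UTC up to (not including) the next day.
DAY_START, DAY_END = 1383696000, 1383782400
WINDOW = "timestamp >= ? AND timestamp < ?"


def scalar(sql):
    """Run an aggregate query over the window and return its single value."""
    return conn.execute(sql, (DAY_START, DAY_END)).fetchone()[0]


logins = scalar(f"SELECT count(*) FROM fact WHERE event = 'login' AND {WINDOW}")
unique_users = scalar(
    f"SELECT count(DISTINCT uid) FROM fact WHERE event = 'login' AND {WINDOW}"
)
income = scalar(f"SELECT sum(amount) FROM fact WHERE event = 'pay' AND {WINDOW}")
payers = scalar(
    f"SELECT count(DISTINCT uid) FROM fact WHERE event = 'pay' AND {WINDOW}"
)

print(logins, unique_users, income, payers)  # 3 3 4.99 1
```

Every query has the same shape: one aggregate over the fact table, filtered by event type and a time range.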


Page 13: 穆黎森:Interactive batch query at scale

The Problem: Dimensions

• How many logins today from China?

• How many individual users of each server this week?

• Total income today from new users?

• Number of paying users this month from AdWords?

• …


Page 14: 穆黎森:Interactive batch query at scale

The Problem: Dimensions

• The user X's property Y is of value Z

• Dimension Table

user id    reg_time   language   refer      …
user_001   20100612   en         adwords
user_002   20110927   cn         facebook
user_003   20121010   fr         admob
user_004   20130522   it         tapjoy

Page 15: 穆黎森:Interactive batch query at scale

Fact & Dimension

• Aggregation on Join

Dimension table:

user id    reg_time   language   refer      …
user_001   20100612   en         adwords
user_002   20110927   cn         facebook
user_003   20121010   fr         admob
user_004   20130522   it         tapjoy

Fact table:

user id    event   amount   timestamp
user_001   login   -        1383729081
user_002   login   -        1383729082
user_001   pay     4.99     1383729084
user_003   login   -        1383729090

Page 16: 穆黎森:Interactive batch query at scale

Fact & Dimension

• How many logins today from China?

• How many individual users of each server this week?

• Total income today from new users?

• Number of paying users this month from AdWords?

• …


Page 17: 穆黎森:Interactive batch query at scale

Fact & Dimension

SELECT COUNT DISTINCT (on uid)

JOIN (1 fact, n dimension, on uid)

WHERE (filter by value of dimensions/facts)

GROUP BY (value of dimension)
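The query shape above (aggregate, join on uid, filter, group by a dimension value) can be written out as one concrete statement. This is a minimal sketch on sqlite3, not on Drill; the table name `dim`, the choice of `language` as the grouping dimension and the time window are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE fact (uid TEXT, event TEXT, amount REAL, timestamp INTEGER);
    CREATE TABLE dim (uid TEXT PRIMARY KEY, reg_time TEXT, language TEXT, refer TEXT);
    INSERT INTO fact VALUES
        ('user_001', 'login', NULL, 1383729081),
        ('user_002', 'login', NULL, 1383729082),
        ('user_001', 'pay',   4.99, 1383729084),
        ('user_003', 'login', NULL, 1383729090);
    INSERT INTO dim VALUES
        ('user_001', '20100612', 'en', 'adwords'),
        ('user_002', '20110927', 'cn', 'facebook'),
        ('user_003', '20121010', 'fr', 'admob'),
        ('user_004', '20130522', 'it', 'tapjoy');
    """
)

# Unique users who logged in during the window, per language:
# COUNT DISTINCT on uid, JOIN fact to dimension, WHERE, GROUP BY.
rows = conn.execute(
    """
    SELECT d.language, count(DISTINCT f.uid)
    FROM fact AS f
    JOIN dim AS d ON f.uid = d.uid
    WHERE f.event = 'login' AND f.timestamp >= ? AND f.timestamp < ?
    GROUP BY d.language
    ORDER BY d.language
    """,
    (1383696000, 1383782400),  # assumed window: the UTC day of the sample rows
).fetchall()

print(rows)  # [('cn', 1), ('en', 1), ('fr', 1)]
```

user_004 never appears in the fact table, so the inner join drops it from the result.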


Page 18: 穆黎森:Interactive batch query at scale

Fact & Dimension

• SQL

• -> Syntax tree

• -> Logical Plan

• -> Physical Plan

[Figure: example plan. scan Fact -> filter and scan Dimension -> filter meet in a Join; the result is joined with a second scan Dimension -> filter and fed to agg.]

Page 19: 穆黎森:Interactive batch query at scale

pre-aggregation?



Page 21: 穆黎森:Interactive batch query at scale

Combinatorial Explosion!
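A back-of-the-envelope count shows why pre-aggregating every combination does not scale. The cardinalities, metric count and retention below are hypothetical numbers picked only for illustration; none of them come from the talk.

```python
from math import prod

# Hypothetical number of distinct values per dimension (illustration only).
dimension_cardinalities = {
    "language": 50,
    "refer": 200,
    "server": 100,
    "reg_month": 48,
}
metrics = 4  # e.g. logins, unique users, income, paying users
days = 30    # assumed retention of daily results

# One pre-computed cell per combination of dimension values, metric and day.
cells = prod(dimension_cardinalities.values()) * metrics * days

# Each subset of dimensions is also a possible GROUP BY.
groupings = 2 ** len(dimension_cardinalities)

print(f"{cells:,} cells, {groupings} groupings")
# 5,760,000,000 cells, 16 groupings
```

Adding one more dimension multiplies the cell count by its cardinality, which is the explosion the slide refers to.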

Page 22: 穆黎森:Interactive batch query at scale

Access Pattern

        Facts            Dimensions
Write   Append           Insert, update
Read    by date, event   user id, prop, value; full table

Page 23: 穆黎森:Interactive batch query at scale

Volume

• 200GB new Facts

• 50GB Dimension updates


Page 24: 穆黎森:Interactive batch query at scale

Architecture

[Figure: architecture diagram with the components Query, Drill, MySQL StorageEngine, HBase StorageEngine, Storage (MySQL, HBase) and Data Loader.]

Page 25: 穆黎森:Interactive batch query at scale

• The Problem

• Brief on Drill

• Design Considerations

• Our work

• Now & Future


Page 26: 穆黎森:Interactive batch query at scale

http://www.slideshare.net/MapRTechnologies/technical-overview-of-apache-drill-by-jac

Page 27: 穆黎森:Interactive batch query at scale

http://www.slideshare.net/jasonfrantz/drill-architecture-20120913

Page 28: 穆黎森:Interactive batch query at scale

• The Problem

• Brief on Drill

• Design Considerations

• Our work

• Now & Future


Page 29: 穆黎森:Interactive batch query at scale

http://www.slideshare.net/jasonfrantz/drill-architecture-20120913

Page 30: 穆黎森:Interactive batch query at scale

Data Model

{
  name: "icecream",
  price: {
    basic: 4.99,
    coupon: true
  }
}

• Various types

• Nested values

• price.basic

• Schema-free
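Nested values are addressed by dotted paths such as price.basic. A minimal sketch of that addressing scheme, assuming a hypothetical `flatten` helper that is not part of Drill:

```python
def flatten(record, prefix=""):
    """Map a nested record to {dotted.path: leaf value}."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into nested objects, extending the path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


record = {"name": "icecream", "price": {"basic": 4.99, "coupon": True}}
flat = flatten(record)

print(flat)
# {'name': 'icecream', 'price.basic': 4.99, 'price.coupon': True}
```

No schema is declared up front: the set of paths and their types is discovered from the record itself.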


Page 31: 穆黎森:Interactive batch query at scale

Design Considerations

• As fast as possible

• Space efficient

• Time efficient


Page 32: 穆黎森:Interactive batch query at scale

About Space Efficiency

• Compact data representation

• Java object overhead: high

• JVM friendly (GC)

• Simpler object graph

• Less tenured space, less full GC


Page 33: 穆黎森:Interactive batch query at scale

About Time Efficiency

• Cache friendly

• Data access locality

• Superscalar: pipeline friendly

• The inner loop problem

• SIMD friendly

• Opportunity to operate on a vector of values

• JVM friendly (JNI)

Page 34: 穆黎森:Interactive batch query at scale

ValueVector & RecordBatch

[Figure: ValueVector]

Page 35: 穆黎森:Interactive batch query at scale

ValueVector & RecordBatch

• ValueVector

• small memory overhead

• backed by DirectByteBuffer

• further encoding

• continuous access/random access
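The idea of storing each field as one contiguous vector can be illustrated outside the JVM. The following is a rough Python analogy, not Drill's Java ValueVector API: `VarCharVector` is a hypothetical class, and the second record ("waffle") is invented so the batch has more than one row.

```python
from array import array

records = [
    {"name": "icecream", "price.basic": 4.99, "price.coupon": True},
    {"name": "waffle", "price.basic": 2.50, "price.coupon": False},  # invented row
]


class VarCharVector:
    """Variable-width strings: one byte buffer plus an offsets array."""

    def __init__(self, values):
        self.data = bytearray()
        self.offsets = array("i", [0])
        for value in values:
            self.data += value.encode("utf-8")
            self.offsets.append(len(self.data))

    def __len__(self):
        return len(self.offsets) - 1

    def get(self, i):
        # Random access: slice the shared buffer between two offsets.
        return self.data[self.offsets[i]:self.offsets[i + 1]].decode("utf-8")


# One vector per field; together they form a batch of records.
batch = {
    "name": VarCharVector([r["name"] for r in records]),
    "price.basic": array("d", [r["price.basic"] for r in records]),
    "price.coupon": bytearray(int(r["price.coupon"]) for r in records),
}

# Sequential access over one column touches one contiguous buffer.
total_price = sum(batch["price.basic"])
second_name = batch["name"].get(1)
print(second_name, len(batch["name"]))  # waffle 2
```

Each column is a handful of flat buffers instead of one object per value, which is where the small memory overhead and the simpler object graph come from.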


Page 36: 穆黎森:Interactive batch query at scale

ValueVector & RecordBatch

[Figure: the record { name: "icecream", price: { basic: 4.99, coupon: true } } stored as a RecordBatch of three ValueVectors: name:VarChar (icecream…), price.basic:float (4.99…), price.coupon:boolean (T…).]

Page 37: 穆黎森:Interactive batch query at scale

ValueVector & RecordBatch

[Figure: plan with scan Fact -> filter and scan Dimension -> filter feeding Join, then agg.]

• Data passed in RecordBatch

• Inner loop: one next() call per record vs a for loop over a batch
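The difference between a per-record next() and a loop over a batch can be sketched with two tiny operator pipelines. This is an illustrative Python sketch, not Drill code; the function names and the batch size are assumptions.

```python
def scan_records(values):
    """Record-at-a-time: the consumer pays one next() call per value."""
    for value in values:
        yield value


def filter_records(source, predicate):
    for value in source:
        if predicate(value):
            yield value


def scan_batches(values, batch_size):
    """Batch-at-a-time: the consumer pays one next() call per batch."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


def filter_batches(source, predicate):
    for batch in source:
        # The inner loop runs over a whole batch without crossing
        # an operator boundary for every value.
        yield [value for value in batch if predicate(value)]


def is_even(value):
    return value % 2 == 0


amounts = list(range(10))
per_record_total = sum(filter_records(scan_records(amounts), is_even))
per_batch_total = sum(
    sum(batch) for batch in filter_batches(scan_batches(amounts, 4), is_even)
)

print(per_record_total, per_batch_total)  # 20 20
```

Both pipelines compute the same result; the batched one makes far fewer calls between operators and keeps the hot loop simple.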

Page 38: 穆黎森:Interactive batch query at scale

Review the Considerations

• Cache friendly

• Superscalar: pipeline friendly

• SIMD friendly

• Compact data representation

• JVM friendly (GC)

• JVM friendly (JNI)

[Figure: RecordBatch with ValueVectors name:VarChar (icecream…), price.basic:float (4.99…), price.coupon:boolean (T…).]

Page 39: 穆黎森:Interactive batch query at scale

• The Problem

• Brief on Drill

• Design Considerations

• Our work

• Now & Future


Page 40: 穆黎森:Interactive batch query at scale

Our work, primarily

• Adhoc batch query


Page 41: 穆黎森:Interactive batch query at scale

Reports: 2-dimensional tables generally


Page 42: 穆黎森:Interactive batch query at scale

Adhoc batch query

DailyActiveUser   2013-07-26   2013-07-27
en                576          491
cn                361          945

Page 43: 穆黎森:Interactive batch query at scale

Adhoc batch query

Fact

user id   event   time
user_13   login   2013-07-26
user_13   login   2013-07-26
user_76   pay     2013-07-27

Dimension

user id   nation
user_13   cn
user_76   en

DAU   2013-07-26   2013-07-27
en    576          491
cn    361          945

Page 44: 穆黎森:Interactive batch query at scale

Adhoc batch query

DAU   2013-07-26   2013-07-27
en    576          491
cn    361          945

Page 45: 穆黎森:Interactive batch query at scale

Adhoc batch query

DAU   2013-07-26   2013-07-27
en    576          491
cn    361          945

[Figure: one plan per report cell, four in total (date='2013-07-26' or '2013-07-27', nation='en' or 'cn'). Each plan is scan Fact -> filter on date and scan Dimension -> filter on nation, then Join, then agg.]

Page 46: 穆黎森:Interactive batch query at scale

[Figure: the same four per-cell plans, each with its own scan Fact, scan Dimension, two filters, Join and agg.]

Page 47: 穆黎森:Interactive batch query at scale

[Figure: merged plan. One scan of Fact and one scan of Dimension are shared by four filter/filter/Join branches (date='2013-07-26' or '2013-07-27', nation='en' or 'cn'), each followed by its own agg.]

Page 48: 穆黎森:Interactive batch query at scale

Adhoc batch query

• Benefits

• Run identical Scans only once

• Merge similar Scans

• Why it is possible

• SQL usually parses into a tree, while

• LogicalPlan in Drill is a DAG
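Merging a batch of per-cell plan trees into one DAG amounts to sharing structurally identical operators. The sketch below models operators as plain tuples and deduplicates them through a dictionary; the tuple layout and the names `share` and `dau_plan` are assumptions for illustration, not Drill's LogicalPlan classes.

```python
def share(node, pool):
    """Return the pooled instance of `node`, merging identical subtrees."""
    op, arg, children = node
    shared_children = tuple(share(child, pool) for child in children)
    key = (op, arg, shared_children)
    # The first structurally equal node seen becomes the shared instance.
    return pool.setdefault(key, key)


def dau_plan(date, nation):
    """One plan tree for a single report cell."""
    fact = ("filter", f"date='{date}'", (("scan", "Fact", ()),))
    dim = ("filter", f"nation='{nation}'", (("scan", "Dimension", ()),))
    return ("agg", "count distinct uid", (("join", "uid", (fact, dim)),))


def count_operators(plan):
    """Number of operators when the plan is executed as a tree."""
    return 1 + sum(count_operators(child) for child in plan[2])


plans = [
    dau_plan(date, nation)
    for nation in ("en", "cn")
    for date in ("2013-07-26", "2013-07-27")
]

tree_operators = sum(count_operators(plan) for plan in plans)

pool = {}
roots = [share(plan, pool) for plan in plans]
dag_operators = len(pool)

print(tree_operators, dag_operators)  # 24 14
```

The four trees need 24 operators; the shared DAG needs 2 scans, 4 filters, 4 joins and 4 aggs. A tree cannot express this sharing because every node has exactly one parent.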

Page 49: 穆黎森:Interactive batch query at scale

More Benefits: Intermediate result reuse

Page 50: 穆黎森:Interactive batch query at scale

Adhoc batch query

[Figure: merged plan. One scan of Fact and one scan of Dimension are shared by four filter/filter/Join branches (date='2013-07-26' or '2013-07-27', nation='en' or 'cn'), each followed by its own agg.]

Page 51: 穆黎森:Interactive batch query at scale

Adhoc batch query

[Figure: the two nation filters on Dimension (nation='en', nation='cn') are each shared by two Joins; the four date filters on Fact are still separate. Four Joins feed four aggs.]

Page 52: 穆黎森:Interactive batch query at scale

Adhoc batch query

[Figure: fully merged plan. Two date filters on Fact (date='2013-07-26', date='2013-07-27') and two nation filters on Dimension (nation='en', nation='cn') feed four Joins, each followed by its own agg.]
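Taken to its limit, sharing intermediate results means every cell of the report is filled from a single pass. A minimal sketch on the sample rows, assuming that any event counts as activity for DAU and using plain Python containers instead of Drill operators:

```python
from collections import defaultdict

# Sample rows from the Fact and Dimension tables of the example report.
fact = [
    ("user_13", "login", "2013-07-26"),
    ("user_13", "login", "2013-07-26"),
    ("user_76", "pay", "2013-07-27"),
]
dimension = {"user_13": "cn", "user_76": "en"}

# One scan of Fact and one lookup-join against Dimension,
# shared by every (nation, date) cell of the report.
users_per_cell = defaultdict(set)
for uid, _event, date in fact:
    nation = dimension.get(uid)
    if nation is not None:
        users_per_cell[(nation, date)].add(uid)

# Per-cell aggregation: count distinct users.
dau = {cell: len(users) for cell, users in users_per_cell.items()}

print(dau)
# {('cn', '2013-07-26'): 1, ('en', '2013-07-27'): 1}
```

The scan and the join run once regardless of how many cells the report has; only the final aggregation is per cell.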

Page 53: 穆黎森:Interactive batch query at scale

More Benefits: More Batched, More Offline

Page 54: 穆黎森:Interactive batch query at scale

Single Query


Page 55: 穆黎森:Interactive batch query at scale

Batched 3 Queries


Page 56: 穆黎森:Interactive batch query at scale

Batched Query, from a report


Page 57: 穆黎森:Interactive batch query at scale

Batched Query, from tens of reports, with 1k+ operators


Page 58: 穆黎森:Interactive batch query at scale

Jobs vs Predictions

• Offline jobs

• become predictions of what data users may be interested in

• by merging more queries together

• daily predictions & hourly predictions

Page 59: 穆黎森:Interactive batch query at scale

More Benefits: Utilising multi-core


Page 60: 穆黎森:Interactive batch query at scale

Utilising Multi-core

• Original:

• Pull data from root

• Downwards recursively

[Figure: plan with scan Fact -> filter (date='2013-07-26') and scan Dimension -> filter (nation='en') feeding Join, then agg.]

Page 61: 穆黎森:Interactive batch query at scale

Utilising Multi-core

• Now:

• Push data from Leaf

• Data driven upwards

• Pooled execution

[Figure: plan with scan Fact -> filter (date='2013-07-26') and scan Dimension -> filter (nation='en') feeding Join, then agg.]
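A push-based pipeline can be sketched as operators that expose a push method, with leaf scans submitted to a thread pool. This is an illustrative sketch only: the `Filter` and `SumSink` classes, the partitioning and the batch size are assumptions, not the Xingcloud executor.

```python
from concurrent.futures import ThreadPoolExecutor


class SumSink:
    """Terminal operator: accumulates whatever is pushed into it."""

    def __init__(self):
        self.total = 0

    def push(self, batch):
        self.total += sum(batch)


class Filter:
    """Keeps matching values and pushes them on to the next operator."""

    def __init__(self, predicate, downstream):
        self.predicate = predicate
        self.downstream = downstream

    def push(self, batch):
        kept = [value for value in batch if self.predicate(value)]
        if kept:
            self.downstream.push(kept)


def run_partition(partition, batch_size=3):
    """Leaf scan: drives data upwards through its own operator chain."""
    sink = SumSink()
    pipeline = Filter(lambda value: value % 2 == 0, sink)
    for start in range(0, len(partition), batch_size):
        pipeline.push(partition[start:start + batch_size])
    return sink.total


partitions = [list(range(0, 10)), list(range(10, 20))]

# Pooled execution: each leaf scan is an independent task.
with ThreadPoolExecutor(max_workers=2) as pool:
    total = sum(pool.map(run_partition, partitions))

print(total)  # 90
```

In the pull model a single thread walks down from the root and waits on each child in turn; here the leaves decide when data moves, so independent scans can be handed to different workers.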

Page 62: 穆黎森:Interactive batch query at scale

Adhoc batch query

• Benefits

• Run identical Scans only once

• Merge similar Scans

• Merge intermediate operators

• Unified processing for adhoc & batch queries

• Multi-core processing of a single Plan

Page 63: 穆黎森:Interactive batch query at scale

• The Problem

• Brief on Drill

• Design Considerations

• Our work

• Now & Future


Page 64: 穆黎森:Interactive batch query at scale

About Xingcloud

• Now

• http://a.xingcloud.com

• 2 billion inserts/updates daily

• 200k+ aggregation results per day, 6k sec in total

• query response time: <1 sec to 100 sec, 10 sec on average

• Future

• Plan Merge

• Unified processing for batch, adhoc & stream queries, SQL oriented

• SQL(t): Plan with time window

Page 65: 穆黎森:Interactive batch query at scale

About Drill

• Now

• Distributed Join

• on Parquet/ORCFile on HDFS

• Write interface of storage engines

• Future

• 1.0 M2: December 2013

• 1.0 GA: Early 2014

• more detail on https://issues.apache.org/jira/browse/DRILL


Page 66: 穆黎森:Interactive batch query at scale

References

• http://incubator.apache.org/drill/index.html#resources

• http://www.slideshare.net/jasonfrantz/drill-architecture-20120913

• http://prezi.com/j43vb1umlgqv/timothy-chen/

• http://www.cs.virginia.edu/kim/publicity/pldi09tutorials/memory-efficient-java-tutorial.pdf

• http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf


Page 67: 穆黎森:Interactive batch query at scale

Q & A
