fifth elephant-grill

Post on 24-Jun-2015

339 Views

Category:

Technology

1 Downloads

Preview:

Click to see full reader

TRANSCRIPT

GrillUnified analytics platform

Amareshwari Sriramadasu

Amareshwari

Sriramadasu

•Architect in Data platform team at Inmobi

•Working in Hadoop and eco systems since 2007

•Apache Hadoop – PMC

•Apache Hive– Committer

•Worked with Yahoo! earlier

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Digital advertising at Inmobi

Courtesy: http://www.liesdamnedlies.com/

Owns & Sells Real estate on digital inventory

Has reach to users

Wants to target Users

Brings money

Market place

Consumer

Analytics Use cases

• Understanding Trends & Inference

• Forecasting and Anamoly detection

Data scientists

• Feedback to improve Ad Relevance in Real Time

Engineering systems

• Troubleshooting of issues

Developers

• Publisher/Advertiser specific analytics(dashboards)

Advertisers and publishers

• Tracking specific accounts

Account managers

• Inventory sizing and estimation

Business/Product analysts

• Canned/Dashboard queries

• Adhoc queries

• Interactive/Batch queries

• Scheduled queries

• Infer insights through ML algorithms

Analytics use cases

Analytics Warehouses at Inmobi and Scale

• Billions of Ad Requests/Impressions per day

• 170 TB Hadoop Warehouse• 5 TB SQL Columnar Datawarehouse• 70 TB Hbase cluster

Why both Hadoop and SQL warehouse?

Canned

Adhoc

Response Times

IO (Input Records)

Adhoc

Canned Adhoc

Query Engine

Query

Dashboard queries are mostly canned and Interactive

Adhoc queries can be Interactive or batch depending on the data volumes and query complexity

•Disparate user experience

•Disparate data storage and execution engines

•Schema management across storages

•Not leveraging ‘SQL on Hadoop’ community

Problems

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

GRILL

Analytics As Service

Unify the Catalog and Query layer for Adhoc/CannedBatch/Interactive

Reports on single Interface

Grill Architecture

Data Layout – Fact data

Aggr Factk : measures (mak <= ma(k-1)), dim-cuts

(dak < da(k-1))

…..Aggr Fact2 : measures (ma2 <=

ma1), dim-cuts(da2 < da1)

Aggr Fact1: measures (ma1 <= mr), dim-cuts(da1 < dr)

Raw fact(mr), dim-cuts(dr)

TimeDim-cuts, Measures

Cost

Data Layout – Dimension data

Subsetm (am < am-1)

Subset2 (a2 < a1)

Subset1 (a1 < ax)

All attributes (ax)

Cost

Data Layout – Snowflake

Aggr Factk : measures (mak <= ma(k-1)), dims (dak

< da(k-1))

…..

Aggr Fact2 : measures (ma2 <= ma1), dimensions (da2 < da1)

Aggr Fact1 : measures (ma1 <= mr), dimensions (da1 < dr)

Raw Fact: measures (mr), dimensions(dr)

Dim2_1

Dim3

Dim2

Dim4_1

Dim4

Dim1

Associates structure to dataProvides Metastore and catalog service – HcatalogProvides pluggable storage interfaceAccepts SQL like queriesHQL is widely adopted language by systems like Shark, ImpalaHas strong apache community

Data warehouse features like cubes, facts, dimensionsLogical table associated with multiple physical storagesPluggable execution engineQuery lifecycle managementQuery quota managementScheduling queries

Wh

at

does

Hiv

e

pro

vid

e

Wh

at is m

issing in

H

ive

Apache Hive to the rescue

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Data Model

Cube

Fact Table• Physical Fact

tables

Dimension

Dimension Table• Physical

Dimension tables

Storage

Data Model - Cube

Expressions• Any expression with

reachable fields

Dim-attributes• Simple dim-attribute• Referenced dim-attribute• Hierarchical dim-

attribute• Timed dim-attribute

Measure• Column Measure

Cube

MeasuresDim-

attributesExpression

s

Note : Some of the concepts are borrowed from http://community.pentaho.com/projects/mondrian/

Data Model - Dimension

Expressions• Any expression with

reachable fields

Attributes• Simple dim-attribute• Referenced dim-

attribute• Hierarchical dim-

attribute

Dimension

Attributes Expressions

Data Model - Relationships

Cube

Dimension

Dimension

Fact table

Cube

Fact table

Storage

Dimension Table

Dimension

Dimension table

Storage

Data Model

Aggrk Fact Table

…..

Aggr2 Fact Table

Aggr1 Fact Table

Raw FactTable

Dim2_table1

Dim2_table2

Dim3_table1

Dim2_table2

Dimtable1

Dimtable2

CUBE

Dimension

Dimension

Dimension

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

CUBE SELECT [DISTINCT] select_expr, select_expr, ...FROM cube_table_referenceWHERE [where_condition AND] TIME_RANGE_IN(colName , from, to)[GROUP BY col_list][HAVING having_expr][ORDER BY colList][LIMIT number]

Queries on OLAP cubes

• SELECT ( city.name ), ( city.stateid ) FROM c2_citytable city LIMIT 100 • SELECT ( city.name ), ( city.stateid ) FROM c1_citytable city WHERE

(city.dt = 'latest') LIMIT 100

cube select name, stateid from city limit 100

Example query

Example query

• SELECT (citytable.name), sum((testcube.msr2)) FROM c2_testfact testcube INNER JOIN c1_citytable city ON ((testcube.cityid)= (city.id)) WHERE (( testcube.dt='2014-03-10-03') OR (testcube.dt='2014-03-10-04') OR (testcube.dt='2014-03-10-05') OR (testcube.dt='2014-03-10-06') OR (testcube.dt='2014-03-10-07') OR (testcube.dt='2014-03-10-08') OR (testcube.dt='2014-03-10-09') OR (testcube.dt='2014-03-10-10') OR (testcube.dt='2014-03-10-11') OR (testcube.dt='2014-03-10-12') OR (testcube.dt='2014-03-10-13') OR (testcube.dt='2014-03-10-14') OR (testcube.dt='2014-03-10-15') OR (testcube.dt='2014-03-10-16') OR (testcube.dt='2014-03-10-17') OR (testcube.dt='2014-03-10-18') OR (testcube.dt='2014-03-10-19') OR (testcube.dt='2014-03-10-20') OR (testcube.dt='2014-03-10-21') OR (testcube.dt='2014-03-10-22') OR (testcube.dt='2014-03-10-23') OR (testcube.dt='2014-03-11') OR (testcube.dt='2014-03-12-00') OR (testcube.dt='2014-03-12 -01') OR (testcube.dt='2014-03-12-02') )AND (city.dt = 'latest')

• GROUP BY(city.name)

cube select city.name, msr2 from testcube where timerange_in(dt, '2014-03-10-03’, '2014-03-12-03’)

Analytics at Inmobi - Problem areas and Motivation

Introduction to Grill

OLAP Model

Query examples

Grill design

Agenda

Implements an interface• explain• execute• executeAsynchronously• fetchResults• Specify all storages it can support

Pluggable execution engine

OLAP Cube QL query

Rewrite query for available execution engine’s supported storages

Get cost of the rewritten query from each execution engine

Pick up execution engine with least cost and fire the query

Cube query with multiple execution engines

Grill Server

Grill – current state

Server

Query ServiceMetastore ServiceMetricsQuery statistics(In progress)Scheduled queries(In progress)Query caching(In progress)

Client

Java ClientCLIJDBC Client

Execution EngineHive DriverJDBC Driver

• Normalize query cost• Load balancing across execution engines• Alter meta hooks in StorageHandler• Authentication and authorization• Machine learning through Grill• Query quota management

Grill roadmap

Github source for grill• https://github.com/InMobi/grill

Github source for Hive• https://github.com/InMobi/hive

Documentation• http://inmobi.github.io/grill

Mailing lists• grill-users@googlegroups.com• grill-dev@googlegroups.com

References

Thank you!• amareshwari@apache.

org

Backup

Data Model – Storage

Sto

rag

e • Name• End point• Properties• Ex : ProdCluster, StagingCluster,

Postgres1, HBase1, HBase2

Data Model – Fact Table

Fact table

Cube

Fact table

Storage

Fact

Ta

ble• Columns

• Cube that it belongs

• Storages on which it is present and the associated update periods

Data Model – Dimension tableD

imen

sion

Ta

ble • Columns

• Dimension to which it belongs

• Storages on which it is present and associated snapshot dump period, if any.

Dimension Table

Dimension

Dimension table

Storage

Data Model – Storage tables and partitionsS

tora

ge

tab

le• Belongs to fact/dimension• Associated storage

descriptor• Partitioned by columns• Naming convention –

storage name followed by fact/dimension name

• Partition can override its storage descriptor

• Fact storage table

Fact table

• Dimension storage table

Dimension table

Resolve candidate tables and storages

Automatically resolve joins, aggregations

Allows SQL over Cube QL

Queries can span multiple storages

Accepts multiple time ranges in query

All Hive QL features

Query features

Adhoc querying system Internal Dashboards

Customer facing Dashboards and Reporting

Analytics systems at Inmobi

top related