why is data independence (still) so important? optiq and apache drill

17
Why is data independence (still) so important? Julian Hyde @julianhyde http://github.com/julianhyde/optiq http://github.com/julianhyde/optiq- splunk Apache Drill Meeting 2012/9/13

Upload: julian-hyde

Post on 25-Jun-2015

2.409 views

Category:

Technology


3 download

DESCRIPTION

Presentation to the Apache Drill Meetup in Sunnyvale, CA on 2012/9/13. Framing the debate about Drill's goals in terms of a "typical" modern DBMS architecture; and also introducing the Optiq extensible query optimizer.

TRANSCRIPT

Page 1: Why is data independence (still) so important? Optiq and Apache Drill

Why is data independence(still) so important?

Julian Hyde @julianhyde

http://github.com/julianhyde/optiqhttp://github.com/julianhyde/optiq-splunk

Apache Drill Meeting2012/9/13

Page 2: Why is data independence (still) so important? Optiq and Apache Drill

Data independenceThis is my opinion about data management systems in general. I don't

claim that it is the right answer for Apache Drill.

I claim that a logical/physical separation can make a data management system more widely applicable, therefore more widely adopted, therefore better.

What “data independence” means in today's “big data” world.

Page 3: Why is data independence (still) so important? Optiq and Apache Drill

About meJulian Hyde

Database hacker (Oracle, Broadbase, SQLstream, LucidDB)

Open source hacker (Mondrian, olap4j, LucidDB, Optiq)

@julianhyde

http://github.com/julianhyde

Page 4: Why is data independence (still) so important? Optiq and Apache Drill

http://www.flickr.com/photos/torkildr/3462606643

Page 5: Why is data independence (still) so important? Optiq and Apache Drill

http://www.flickr.com/photos/sylvar/31436961/

Page 6: Why is data independence (still) so important? Optiq and Apache Drill

“Big Data”Right data, right time

Diverse data sources / Performance / Suitable format

Volume / Velocity / Variety

Volume – solved :)

Velocity – not one of Drill's goals (?)

Variety – ?

Page 7: Why is data independence (still) so important? Optiq and Apache Drill

Variety

Variety of source formats (csv, avro, json, weblogs)

Variety of storage structures (indexes, projections, sort order, materialized views) now or in future

Variety of query languages (DrQL, SQL)

Combine with other data (join, union)

Embed within other systems, e.g. Hive

Source for other systems, e.g. Drill | Cascading > Teradata

Tools generate SQL

Page 8: Why is data independence (still) so important? Optiq and Apache Drill

Use case: Optiq* at Splunk

SQL interface on NoSQL system

“Smart” JDBC driver – pushes processing down to Splunk

* Truth in advertising: I am the author of Optiq.

Page 9: Why is data independence (still) so important? Optiq and Apache Drill

MySQL

Splunk

Expression tree SELECT p.“product_name”, COUNT(*) AS cFROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id”WHERE s.“action” = 'purchase'GROUP BY p.”product_name”ORDER BY c DESC

join

Key: product_id

group

Key: product_nameAgg: count

filter

Condition:action =

'purchase'

sort

Key: c DESC

scan

scan

Table: splunk

Table: products

Page 10: Why is data independence (still) so important? Optiq and Apache Drill

Splunk

Expression tree(optimized)

SELECT p.“product_name”, COUNT(*) AS cFROM “splunk”.”splunk” AS s JOIN “mysql”.”products” AS p ON s.”product_id” = p.”product_id”WHERE s.“action” = 'purchase'GROUP BY p.”product_name”ORDER BY c DESC

join

Key: product_id

group

Key: product_nameAgg: count

filter

Condition:action =

'purchase'

sort

Key: c DESC

scan

Table: splunk

MySQL

scan

Table: products

Page 11: Why is data independence (still) so important? Optiq and Apache Drill

Conventional DBMS architecture

JDBC server

SQL parser /validatorQuery

optimizer

Metadata

DataData

Data-flowoperators

JDBC client

Page 12: Why is data independence (still) so important? Optiq and Apache Drill

Drill architecture

DrQL parser /validator

Metadata

DataData

Data-flowoperators

DrQL client

?

Page 13: Why is data independence (still) so important? Optiq and Apache Drill

Optiq architecture

JDBC server

SQL parser /validatorQuery

optimizer

3rd partydata

3rd partydata

JDBC client

3rd

partyops

3rd

partyops

Optional

Pluggable

Core

MetadataSPI

Pluggablerules

Page 14: Why is data independence (still) so important? Optiq and Apache Drill

Analogy: Compiler architecture

C++ C Fortran

x86 ARM Fortran

front end

middle end

back end

Optimizations

Page 15: Why is data independence (still) so important? Optiq and Apache Drill

Conclusions

Clear logical / physical separation allows a data management system to handle a wider variety of data, query languages, and packaging.

Also provides a clear interface between the sub-teams working on query language and operators.

A query optimizer allows new operators, and alternative algorithms and data structures, to be easily added to the system.

Page 16: Why is data independence (still) so important? Optiq and Apache Drill

Extra material follows...

Page 17: Why is data independence (still) so important? Optiq and Apache Drill

Writing an adapterDriver – if you want a vanity URL like “jdbc:drill:”

Schema – describes what tables exist

Table – what are the columns, and how to get the data.

Operators (optional) – non-relational operators, if any

Rules (optional, but recommended) – improve efficiency by changing the question

Parser (optional) – additional source languages