drill bay area hug 2012-09-19



Page 1: Drill Bay Area HUG 2012-09-19

Apache Drill
Interactive Analysis of Large-Scale Datasets

Jason Frantz, Architect, MapR

Page 2: Drill Bay Area HUG 2012-09-19

My Background

• Caltech
• Clustrix
• MapR
• Founding member of Apache Drill

Page 3: Drill Bay Area HUG 2012-09-19

MapR Technologies

• The open enterprise-grade distribution for Hadoop
  – Easy, dependable and fast
  – Open source with standards-based extensions
• MapR is deployed at thousands of companies
  – From small Internet startups to the world’s largest enterprises
• MapR customers analyze massive amounts of data:
  – Hundreds of billions of events daily
  – 90% of the world’s Internet population monthly
  – $1 trillion in retail purchases annually
• MapR has partnered with Google to provide Hadoop on Google Compute Engine

Page 4: Drill Bay Area HUG 2012-09-19

Latency Matters

• Ad-hoc analysis with interactive tools

• Real-time dashboards

• Event/trend detection and analysis
  – Network intrusions
  – Fraud
  – Failures

Page 5: Drill Bay Area HUG 2012-09-19

Big Data Processing

                    | Batch processing | Interactive analysis    | Stream processing
Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
Programming model   | MapReduce        | Queries                 | DAG
Users               | Developers       | Analysts and developers | Developers
Google project      | MapReduce        | Dremel                  |
Open source project | Hadoop MapReduce |                         | Storm and S4

Introducing Apache Drill…

Page 6: Drill Bay Area HUG 2012-09-19

GOOGLE DREMEL

Page 7: Drill Bay Area HUG 2012-09-19

Google Dremel

• Interactive analysis of large-scale datasets
  – Trillion records at interactive speeds
  – Complementary to MapReduce
  – Used by thousands of Google employees
  – Paper published at VLDB 2010
    • Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
• Model
  – Nested data model with schema
    • Most data at Google is stored/transferred in Protocol Buffers
    • Normalization (to relational) is prohibitive
  – SQL-like query language with nested data support
• Implementation
  – Column-based storage and processing
  – In-situ data access (GFS and Bigtable)
  – Tree architecture as in Web search (and databases)

Page 8: Drill Bay Area HUG 2012-09-19

Google BigQuery

• Hosted Dremel (Dremel as a Service)
• CLI (bq) and Web UI
• Import data from Google Cloud Storage or local files
  – Files must be in CSV format
    • Nested data not supported [yet] except built-in datasets
  – Schema definition required

Page 9: Drill Bay Area HUG 2012-09-19

APACHE DRILL

Page 10: Drill Bay Area HUG 2012-09-19

Architecture

• Only the execution engine knows the physical attributes of the cluster
  – # nodes, hardware, file locations, …
• Public interfaces enable extensibility
  – Developers can build parsers for new query languages
  – Developers can provide an execution plan directly
• Each level of the plan has a human-readable representation
  – Facilitates debugging and unit testing

Page 11: Drill Bay Area HUG 2012-09-19

Architecture (2)

Page 12: Drill Bay Area HUG 2012-09-19

Execution Engine Layers

• Drill execution engine has two layers
  – Operator layer is serialization-aware
    • Processes individual records
  – Execution layer is not serialization-aware
    • Processes batches of records (blobs)
    • Responsible for communication, dependencies and fault tolerance
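The split between the two layers can be illustrated with a small Python sketch. Everything here is hypothetical (the function names, the use of newline-delimited JSON as the blob encoding); it only shows the idea that one layer decodes records while the other moves opaque batches around, and it is not Drill's actual interface.

```python
import json
from typing import Callable, Iterable, Iterator

# Hypothetical names throughout; none of these come from Drill itself.

def filter_operator(predicate: Callable[[dict], bool]) -> Callable[[bytes], bytes]:
    """Operator layer: serialization-aware, works record by record."""
    def run(blob: bytes) -> bytes:
        # Decode the blob (assumed here to be newline-delimited JSON).
        records = [json.loads(line) for line in blob.splitlines() if line]
        kept = [r for r in records if predicate(r)]
        # Re-encode the surviving records into a new blob.
        return "\n".join(json.dumps(r) for r in kept).encode()
    return run

def execute(batches: Iterable[bytes], operator: Callable[[bytes], bytes]) -> Iterator[bytes]:
    """Execution layer: passes opaque blobs along and never looks inside them."""
    for blob in batches:
        yield operator(blob)

batches = [b'{"x": 1}\n{"x": 5}', b'{"x": 7}']
result = list(execute(batches, filter_operator(lambda r: r["x"] > 3)))
```

In a real engine the execution layer would also schedule work across nodes and retry failed batches; the sketch leaves that out and keeps only the boundary between the layers.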

Page 13: Drill Bay Area HUG 2012-09-19

Data Flow

Page 14: Drill Bay Area HUG 2012-09-19

Nested Query Languages

• DrQL
  – SQL-like query language for nested data
  – Compatible with Google BigQuery/Dremel
    • BigQuery applications should work with Drill
  – Designed to support efficient column-based processing
    • No record assembly during query processing
• Mongo Query Language
  – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
• Other languages/programming models can plug in
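To make the Mongo-style query on this slide concrete, here is a minimal Python sketch that evaluates it over an in-memory list of documents. The function name `run_mongo_query` and the sample documents are assumptions for illustration; only equality matching and ascending/descending sort are handled, which is far less than the real query language supports.

```python
def run_mongo_query(docs, spec):
    """Evaluate a {$query, $orderby} spec over a list of dicts (equality only)."""
    query = spec.get("$query", {})
    orderby = spec.get("$orderby", {})
    # Keep documents whose fields equal every value named in $query.
    matched = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    # Sort by the last key first; Python's sort is stable, so the first
    # $orderby key ends up as the most significant one.
    for field, direction in reversed(list(orderby.items())):
        matched.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matched

# Hypothetical sample documents.
docs = [
    {"x": 3, "y": "abc", "n": 2},
    {"x": 4, "y": "abc", "n": 9},
    {"x": 3, "y": "abc", "n": 1},
    {"x": 3, "y": "zzz", "n": 0},
]
spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_mongo_query(docs, spec)
```

A pluggable front end would translate such a spec into the same logical plan that a DrQL query produces, rather than interpreting it directly as done here.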

Page 15: Drill Bay Area HUG 2012-09-19

Nested Data Model

• The data model in Dremel is Protocol Buffers
  – Nested
  – Schema
• Apache Drill is designed to support multiple data models
  – Schema: Protocol Buffers, Apache Avro, …
  – Schema-less: JSON, BSON, …
• Flat records are supported as a special case of nested data
  – CSV, TSV, …

JSON:
  { "name": "Srivas", "gender": "Male", "followers": 100 }
  { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }

Avro IDL:
  enum Gender { MALE, FEMALE }
  record User { string name; Gender gender; long followers; }
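The claim that flat records are a special case of nested data can be sketched in Python: one field accessor that follows dotted paths works on CSV rows, on the schema-less JSON records above, and on deeper records alike. The helper `get_path` and the record with an `address` field are hypothetical, added only for illustration.

```python
import csv
import io
import json

def get_path(record, path):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    node = record
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

# Schema-less JSON records from the slide: the second one has an extra field.
json_records = [
    json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'),
]

# A flat CSV row is just a record of depth one (values arrive as strings).
csv_text = "name,gender,followers\nSrivas,Male,100\n"
csv_records = list(csv.DictReader(io.StringIO(csv_text)))

# Hypothetical nested record, to show the same accessor going deeper.
nested = {"name": "Raina", "address": {"zip": "94305"}}
```

Note that the CSV reader yields `"100"` as a string while the JSON parser yields the integer `100`; a schema (for example the Avro `User` record) is what would let an engine reconcile the two.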

Page 16: Drill Bay Area HUG 2012-09-19

DrQL Example

SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;

* Example from the Dremel paper
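A rough Python rendering of what this query computes may help. The two records are modeled on the sample documents in the Dremel paper, but treat the exact values as illustrative. The function `run_query` is hypothetical and simplifies the semantics: it collects the `Str` values as a list under each `Name` instead of reproducing the paper's nested output layout, and it assembles full records, which a column-based engine would avoid.

```python
import re

# Sample records in the shape of the Dremel paper's example (illustrative).
r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}

def run_query(records):
    out = []
    for rec in records:
        if not rec["DocId"] < 20:                 # WHERE DocId < 20
            continue
        names = []
        for name in rec.get("Name", []):
            url = name.get("Url")
            # WHERE REGEXP(Name.Url, '^http'): Names without a Url are pruned.
            if url is None or not re.search(r"^http", url):
                continue
            codes = [lang["Code"] for lang in name.get("Language", [])
                     if "Code" in lang]
            names.append({
                "Cnt": len(codes),                       # COUNT(...) WITHIN Name
                "Str": [url + "," + c for c in codes],   # Name.Url + ',' + Code
            })
        out.append({"Id": rec["DocId"], "Name": names})
    return out

result = run_query([r1, r2])
```

Here `WITHIN Name` shows up as the aggregation being scoped to each `Name` entry rather than to the whole record: the first `Name` counts two language codes and the second counts zero.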

Page 17: Drill Bay Area HUG 2012-09-19

Query Components

• Query components:
  – SELECT
  – FROM
  – WHERE
  – GROUP BY
  – HAVING
  – (JOIN)
• Key logical operators:
  – Scan
  – Filter
  – Aggregate
  – (Join)
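The mapping from query components to logical operators can be sketched as a chain of small Python functions. The operator names follow the slide, but their signatures and the sample rows (including the third user) are assumptions made for illustration. The chain corresponds roughly to `SELECT gender, SUM(followers) FROM users WHERE followers >= 100 GROUP BY gender`.

```python
from collections import defaultdict

def scan(rows):
    """Scan: produce records from a source (here, an in-memory list)."""
    yield from rows

def filter_rows(rows, predicate):
    """Filter: keep records that satisfy the WHERE predicate."""
    return (r for r in rows if predicate(r))

def aggregate(rows, key, value):
    """Aggregate: GROUP BY `key`, summing `value` within each group."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)

# Hypothetical input; the first two rows echo the JSON example in this deck.
users = [
    {"name": "Srivas", "gender": "Male", "followers": 100},
    {"name": "Raina", "gender": "Female", "followers": 200},
    {"name": "Alex", "gender": "Male", "followers": 50},
]

result = aggregate(
    filter_rows(scan(users), lambda r: r["followers"] >= 100),
    key="gender",
    value="followers",
)
```

Because each operator consumes and produces a stream of records, a HAVING clause would simply be another filter applied to the aggregate's output.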

Page 18: Drill Bay Area HUG 2012-09-19

Extensibility

• Nested query languages
  – Pluggable model
  – DrQL
  – Mongo Query Language
  – Cascading
• Distributed execution engine
  – Extensible model (e.g., Dryad)
  – Low-latency
  – Fault tolerant
• Nested data formats
  – Pluggable model
  – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
  – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
• Scalable data sources
  – Pluggable model
  – Hadoop
  – HBase

Page 19: Drill Bay Area HUG 2012-09-19

Scan Operators

                       | Scan with schema                        | Scan without schema
Operator output        | Protocol Buffers                        | JSON-like (MessagePack)
Supported data formats | ColumnIO (column-based protobuf/Dremel) | JSON
                       | RecordIO (row-based protobuf)           | HBase
                       | CSV                                     |
SELECT … FROM …        | ColumnIO(proto URI, data URI)           | Json(data URI)
                       | RecordIO(proto URI, data URI)           | HBase(table name)

• Drill supports multiple data formats by having per-format scan operators
• Queries involving multiple data formats/sources are supported
• Fields and predicates can be pushed down into the scan operator
• Scan operators may have adaptive side-effects (database cracking)
  – Produce ColumnIO from RecordIO
  – Google PowerDrill stores materialized expressions with the data
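Per-format scan operators with pushdown can be sketched in Python as follows. The functions `scan_json` and `scan_csv`, their `fields` and `predicate` parameters, and the sample data are all hypothetical; they stand in for the `Json(data URI)` and CSV scans named above and read from strings instead of URIs.

```python
import csv
import io
import json
from itertools import chain

def _project(record, fields):
    """Keep only the requested fields (all fields when `fields` is None)."""
    return record if fields is None else {k: record.get(k) for k in fields}

def scan_json(text, fields=None, predicate=None):
    """Scan newline-delimited JSON, applying pushed-down predicate and fields."""
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if predicate is None or predicate(record):
            yield _project(record, fields)

def scan_csv(text, fields=None, predicate=None):
    """Scan CSV with a header row; values arrive as strings."""
    for record in csv.DictReader(io.StringIO(text)):
        if predicate is None or predicate(record):
            yield _project(record, fields)

# Hypothetical data in two formats.
JSON_DATA = (
    '{"name": "Srivas", "followers": 100}\n'
    '{"name": "Raina", "followers": 200, "zip": "94305"}\n'
)
CSV_DATA = "name,followers\nAlex,50\nKim,300\n"

# One query over both sources: each scan filters and projects on its own.
rows = list(chain(
    scan_json(JSON_DATA, fields=["name"],
              predicate=lambda r: r["followers"] >= 100),
    scan_csv(CSV_DATA, fields=["name"],
             predicate=lambda r: int(r["followers"]) >= 100),
))
```

The predicate runs before projection, so a scan can filter on a column that the query never returns. A columnar scan could go further and skip reading unrequested columns entirely, which this row-based sketch cannot show.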

Page 20: Drill Bay Area HUG 2012-09-19

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  – Column-based and row-based
  – Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  – Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Page 21: Drill Bay Area HUG 2012-09-19

Hadoop Integration

• Hadoop data sources
  – Hadoop FileSystem API (HDFS/MapR-FS)
  – HBase
• Hadoop data formats
  – Apache Avro
  – RCFile
• MapReduce-based tools to create column-based formats
• Table registry in HCatalog
• Run long-running services in YARN

Page 22: Drill Bay Area HUG 2012-09-19

Get Involved!

• Download these slides
  – http://www.mapr.com/company/events/bay-area-hug/9-19-2012
• Join the mailing list
  – [email protected]
• Join MapR
  – [email protected]