Impala: A Modern, Open-Source SQL Engine for Hadoop
…
[Diagram: Impala architecture. A SQL App connects over ODBC and sends a SQL request. Each of three nodes runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore & Catalog.]
[Diagram: Impala architecture with a SQL App, ODBC, Hive Metastore, HDFS NN, Statestore & Catalog, and three nodes each running a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase.]
The planner turns the request into collections of plan fragments.
The coordinator initiates execution on remote nodes.
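The planner and coordinator roles can be pictured with a small sketch. Everything here is hypothetical and chosen only for illustration: the `PlanFragment` struct, the `PlanQuery` and `RemoteFragments` functions, and the assumed plan shape (one scan fragment per data node plus one merge fragment on the coordinator). It is not Impala's actual planner.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical type for illustration; not one of Impala's classes.
struct PlanFragment {
  int id;
  std::string work;  // what the fragment does, e.g. "scan" or "merge"
  std::string node;  // host chosen to run this fragment
};

// Planner step (assumed plan shape): one scan fragment per data node,
// plus a single merge fragment that runs on the coordinator.
std::vector<PlanFragment> PlanQuery(const std::vector<std::string>& data_nodes,
                                    const std::string& coordinator) {
  std::vector<PlanFragment> fragments;
  int next_id = 0;
  for (const std::string& node : data_nodes) {
    fragments.push_back({next_id++, "scan", node});
  }
  fragments.push_back({next_id, "merge", coordinator});
  return fragments;
}

// Coordinator step: select the fragments that must be started on other
// nodes. A real system would send each of these over the network.
std::vector<PlanFragment> RemoteFragments(
    const std::vector<PlanFragment>& fragments,
    const std::string& coordinator) {
  std::vector<PlanFragment> remote;
  for (const PlanFragment& f : fragments) {
    if (f.node != coordinator) remote.push_back(f);
  }
  return remote;
}
```

With three data nodes, one of which also acts as coordinator, `PlanQuery` yields four fragments and `RemoteFragments` returns the two scan fragments that run elsewhere.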
[Diagram: Impala architecture with a SQL App, ODBC, Hive Metastore, HDFS NN, Statestore & Catalog, and three nodes each running a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Query results flow back to the SQL App.]
Intermediate results are streamed between nodes.
Operation permitted; query results are streamed back to the client.
// Generic version: loops over every slot and switches on its type.
void MaterializeTuple(char* tuple) {
  for (int i = 0; i < num_slots_; ++i) {
    char* slot = tuple + offsets_[i];
    switch (types_[i]) {
      case BOOLEAN:
        *slot = ParseBoolean();
        break;
      case INT:
        *slot = ParseInt();
        break;
      case FLOAT: …
      case STRING: …
      // etc.
    }
  }
}

// Specialized version for a known tuple layout: no loop, no switch.
void MaterializeTuple(char* tuple) {
  // i = 0
  *(tuple + 0) = ParseInt();
  // i = 1
  *(tuple + 4) = ParseBoolean();
  // i = 2
  *(tuple + 5) = ParseInt();
}
Hot code path, called per row
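To make the contrast concrete, here is a minimal runnable sketch of both variants. It assumes the three-slot layout from the slide's comments (INT at offset 0, BOOLEAN at 4, INT at 5). The names `ParseIntField`, `ParseBooleanField`, `MaterializeGeneric`, and `MaterializeSpecialized` are hypothetical stand-ins, and the specialized function is written by hand here, whereas a query engine would generate the equivalent at query time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

enum SlotType { BOOLEAN, INT };

// Hypothetical field parsers standing in for ParseBoolean()/ParseInt().
inline bool ParseBooleanField(const std::string& s) { return s == "true"; }
inline int32_t ParseIntField(const std::string& s) {
  return static_cast<int32_t>(std::strtol(s.c_str(), nullptr, 10));
}

// Generic version: branches on the slot type for every slot of every row.
void MaterializeGeneric(const std::vector<std::string>& fields,
                        const std::vector<SlotType>& types,
                        const std::vector<int>& offsets, char* tuple) {
  for (std::size_t i = 0; i < types.size(); ++i) {
    char* slot = tuple + offsets[i];
    switch (types[i]) {
      case BOOLEAN: {
        bool v = ParseBooleanField(fields[i]);
        std::memcpy(slot, &v, sizeof(v));
        break;
      }
      case INT: {
        int32_t v = ParseIntField(fields[i]);
        std::memcpy(slot, &v, sizeof(v));
        break;
      }
    }
  }
}

// Specialized version for the assumed layout (INT @0, BOOLEAN @4, INT @5):
// the loop is unrolled, types and offsets are constants, no switch remains.
void MaterializeSpecialized(const std::vector<std::string>& fields,
                            char* tuple) {
  int32_t a = ParseIntField(fields[0]);
  std::memcpy(tuple + 0, &a, sizeof(a));
  bool b = ParseBooleanField(fields[1]);
  std::memcpy(tuple + 4, &b, sizeof(b));
  int32_t c = ParseIntField(fields[2]);
  std::memcpy(tuple + 5, &c, sizeof(c));
}
```

Both functions write the same bytes into a 9-byte tuple buffer; the specialized one simply has nothing left to decide per row, which is why it matters on a path executed once per row.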
[Diagram: an Impala Daemon in which three Query Fragments share one IO Manager; the IO Manager reads from five disks using five threads (Thread 0 to Thread 4).]
container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
From Twitter’s “Dremel Made Simple” blog
The most efficient IO is the one that never happens at all.
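One common way a columnar layout avoids IO is by keeping small statistics next to each chunk of a column, so a scan can rule out whole chunks without reading them. The sketch below assumes per-chunk min/max values; `ColumnChunk`, `ScanResult`, and `CountEqual` are hypothetical names, and the in-memory vector merely stands in for bytes on disk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-memory stand-in for one column chunk in a columnar file.
struct ColumnChunk {
  int64_t min_value;            // statistics stored alongside the data
  int64_t max_value;
  std::vector<int64_t> values;  // stands in for the bytes on disk
};

struct ScanResult {
  int64_t matching_rows = 0;
  std::size_t chunks_read = 0;
  std::size_t chunks_skipped = 0;
};

// Counts values equal to `target`. A chunk whose [min, max] range cannot
// contain `target` is skipped without touching its values.
ScanResult CountEqual(const std::vector<ColumnChunk>& chunks, int64_t target) {
  ScanResult result;
  for (const ColumnChunk& chunk : chunks) {
    if (target < chunk.min_value || target > chunk.max_value) {
      ++result.chunks_skipped;  // the IO that never happens
      continue;
    }
    ++result.chunks_read;
    for (int64_t v : chunk.values) {
      if (v == target) ++result.matching_rows;
    }
  }
  return result;
}
```

For three chunks covering 1 to 10, 11 to 20, and 21 to 30, a lookup of 15 reads only the middle chunk. Reading just the columns a query names is the other half of the same idea.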
• OVER PARTITION, RANK, LEAD, LAG, NTILE, ...
• VARCHAR, CHAR
• ROLLUP, CUBE, GROUPING SET
• SET MINUS INTERSECT
SELECT question FROM audience WHERE has_question = true;