Analytical Query Processing

Marco Serafini

COMPSCI 532 Lecture 7

Announcement
• Midterm date and location confirmed
  • October 22 at 7-9pm in ILC S331

MapReduce vs. DBMSs

Advantages of DBMSs
• Abstract data representation
  • Relational model
  • Data storage is delegated to the DBMS
• Functional query language (SQL)
  • Queries specified as simple relational operators
  • Actual query execution delegated to the DBMS…
  • … including parallelism, distribution, pipelining, etc.
• Support for indexing

Disadvantages of DBMSs
• SQL is a limited interface for complex analytics
  • E.g. image analysis, creating maps
• Need to define a schema for data a priori
• High cost of loading data and indexing
  • Can be amortized only if same data and schema reused
• Too complex for “one shot” analytics

Advantages of MapReduce
• Support for arbitrary UDFs
• Support for a variety of arbitrary data formats
• Simple API
• Scalability
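
As a rough illustration of the "simple API" and "arbitrary UDFs" points, here is a single-process Python sketch of the map/shuffle/reduce pattern. The names `map_reduce`, `tokenize`, and `count` are made up for this example and are not part of Hadoop or any real framework; a real system would also distribute the work across machines.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Run a toy, single-process map/shuffle/reduce over an iterable of records."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)  # "shuffle": group values by key
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def tokenize(line):
    # Map UDF (hypothetical): emit (word, 1) for every word in a line of text.
    for word in line.lower().split():
        yield word, 1

def count(word, ones):
    # Reduce UDF (hypothetical): sum the partial counts for one word.
    return sum(ones)

lines = ["a rose is a rose", "a daisy is a daisy"]
word_counts = map_reduce(lines, tokenize, count)
```

The two UDFs are ordinary functions, so any logic and any input format can be plugged in; nothing equivalent to a schema or a query optimizer is involved.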

Disadvantages of MapReduce
• Many of the optimizations of DBMSs must be reimplemented, for example
  • Indices
  • Query execution plans (logical + physical)
  • Column-based storage
  • Data format specifications (ProtoBuf)
  • Support for updates
• Several efforts towards closing the gap for analytics

Data Analytics

In Situ Analytics
• Data dumped on GFS/HDFS (data lake)
• Some of this data is relational
• Several systems to execute relational queries on HDFS data
  • SQL-like language
  • Query optimization
  • Columnar data representation
• Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto)
• We will discuss both classes

Analytical Queries
• Long-running, complex queries
• Often aggregates
• Run on read-only data or snapshots of dynamic data
• Data characteristics
  • Tuples (rows) have many possible attributes (columns)
  • A row will have only a subset of attributes set

Star Schema: Facts and Dimensions
• Popular schema for analytics/data warehousing
  • Many others exist!
• At the center is a large fact table
• Foreign-key references to small dimension tables
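
A star schema can be sketched with Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are invented for illustration; the point is only the shape: one fact table with foreign keys into small dimension tables, queried with a join plus an aggregate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Small dimension tables (hypothetical example data).
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
    -- Fact table: one row per sale, with foreign keys to the dimensions.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        amount     REAL
    );
    INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
    INSERT INTO dim_store   VALUES (10, 'Amherst'), (20, 'Boston');
    INSERT INTO fact_sales  VALUES
        (1, 10, 12.0), (1, 20, 8.0), (2, 10, 30.0), (2, 10, 20.0);
""")

# Typical analytical query: join facts with a dimension, then aggregate.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

In a real warehouse the fact table would be orders of magnitude larger than the dimensions, which is why joins against the dimensions are cheap relative to scanning the facts.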

Dremel
• In-situ analytics
• Independent query executor (not on top of MR)
• Uses columnar store

Data Model: Column Families
• Also called column groups, nested columns
• Common to many systems, e.g. Cassandra, HBase
Figure labels: New record; Nested in repeated Language; Nested in repeated Name

Assembling a Row
• Finite state machine

Pros and Cons of Columnar Model
• Pros
  • Compression: columns have uniform values
  • Less data to scan on projections (which are common)
• Cons
  • Additional CPU load to decompress columns and rebuild rows
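
The trade-off can be seen in a small Python sketch. Run-length encoding is used here as one example of a compression scheme that benefits from uniform column values; it is an assumption for illustration, not a claim about what Dremel uses. The helper names (`to_columns`, `rle_encode`, `rle_decode`) are hypothetical.

```python
rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "US", "clicks": 1},
    {"country": "IT", "clicks": 7},
]

def to_columns(rows):
    # Column layout: one list per attribute instead of one dict per row.
    return {name: [row[name] for row in rows] for name in rows[0]}

def rle_encode(values):
    # Run-length encoding: store [value, run_length] pairs.
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

def rle_decode(runs):
    # The "con": extra CPU work to expand runs before rows can be rebuilt.
    return [v for v, n in runs for _ in range(n)]

columns = to_columns(rows)
compressed_country = rle_encode(columns["country"])  # 4 values stored as 2 runs
total_clicks = sum(columns["clicks"])                # projection reads one column only

# Rebuilding rows needs every column decoded and zipped back together.
rebuilt = [
    {"country": c, "clicks": k}
    for c, k in zip(rle_decode(compressed_country), columns["clicks"])
]
```

The aggregate over `clicks` never touches the `country` data, while reconstructing full rows pays for decoding and re-zipping every column.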

Query Execution

SparkSQL: Spark + DBMS
• Extend Spark with
  • Simple, high-level SQL-like operators
  • Query optimization
• No need to transfer data across systems
  • ETL, query processing, complex analytics in one system

Architecture

DataFrames
• Collection of rows with homogeneous schema
  • Like a table in a DBMS
  • Can be manipulated like an RDD
• DataFrame operations
  • Similar to Python Pandas or R data frames
  • Evaluated lazily (query planning is postponed)
  • Can optimize across multiple queries
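
Lazy evaluation can be sketched in plain Python. The `LazyFrame` class below is a toy written for this note, not the Spark DataFrame API: each operation only appends a step to a plan, and nothing runs until `collect()` is called, at which point the whole plan is visible and could in principle be optimized as a unit.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: records operations, runs them on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows   # list of dicts sharing the same keys (the schema)
        self._plan = plan   # tuple of pending (op_name, argument) steps

    def where(self, predicate):
        # Nothing is evaluated here; the step is only appended to the plan.
        return LazyFrame(self._rows, self._plan + (("where", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    def explain(self):
        # Names of the pending steps, in order.
        return [op for op, _ in self._plan]

    def collect(self):
        # Execution happens only now, with the whole plan available.
        out = self._rows
        for op, arg in self._plan:
            if op == "where":
                out = [r for r in out if arg(r)]
            else:  # "select"
                out = [{c: r[c] for c in arg} for r in out]
        return out

users = LazyFrame([
    {"name": "ann", "age": 31},
    {"name": "bob", "age": 17},
])
adults = users.where(lambda r: r["age"] >= 18).select("name")
pending = adults.explain()   # plan exists, no rows processed yet
result = adults.collect()    # plan is executed here
```

Because `adults` is only a description of work, a planner sitting between `explain()` and `collect()` would be free to reorder or merge steps before touching any data.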

SparkSQL Query Execution

Advantages
• Relational structure enables query optimization
• In-memory caching using columnar representation
  • Better compression
• Mix SQL-like operators and arbitrary code
  • More flexible than UDFs in DBMSs
  • Can optimize across multiple SQL operations

Catalyst
• Query optimizer of SparkSQL
• Rule-based optimization
  • Rule: find pattern and transform
  • Used for both logical and physical plans
  • Can customize rules
• Code generation
  • Directly outputs bytecode (as opposed to interpreting a plan)
  • Much more CPU efficient
• Flexible data sources
  • Can change the physical representation of DataFrames
  • Still use the optimizer

Catalyst: Rule-Based Optimization
• Apply rules to subtree until fixed point
Figure labels: Execution tree; Transformation rules
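
The "find pattern and transform, repeat until fixed point" idea can be sketched in Python on a tiny expression tree. Everything here is an assumption made for illustration: trees are nested tuples such as `("add", left, right)`, `("lit", 3)`, and `("col", "x")`, and the two rules are simple arithmetic rewrites. Catalyst itself is written in Scala and works on its own tree classes.

```python
def fold_constants(node):
    # Rule: add(lit a, lit b) -> lit (a + b)
    if node[0] == "add" and node[1][0] == "lit" and node[2][0] == "lit":
        return ("lit", node[1][1] + node[2][1])
    return node

def drop_add_zero(node):
    # Rule: add(x, lit 0) -> x
    if node[0] == "add" and node[2] == ("lit", 0):
        return node[1]
    return node

RULES = [fold_constants, drop_add_zero]

def transform(node, rules):
    """One bottom-up pass: rewrite children first, then try each rule on the node."""
    if node[0] in ("add", "mul"):
        node = (node[0], transform(node[1], rules), transform(node[2], rules))
    for rule in rules:
        node = rule(node)
    return node

def optimize(tree, rules=RULES):
    # Repeat whole-tree passes until a pass changes nothing (fixed point).
    while True:
        new_tree = transform(tree, rules)
        if new_tree == tree:
            return tree
        tree = new_tree

# x + (1 + (2 + -3)) simplifies to x
expr = ("add", ("col", "x"), ("add", ("lit", 1), ("add", ("lit", 2), ("lit", -3))))
optimized = optimize(expr)
```

Each rule is a small pattern match that returns either a rewritten node or the node unchanged, so adding a custom rule amounts to appending one more function to `RULES`.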

Catalyst: Code Generation
• Faster than interpreting a physical plan
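
The difference between interpreting a plan and generating code for it can be shown by analogy in Python. This is only an analogy under stated assumptions: the tuple-based expression tree and the functions `interpret`, `to_source`, and `generate` are invented for this sketch, and the "generated code" is a Python lambda rather than JVM bytecode.

```python
def interpret(node, row):
    # Tree-walking evaluation: the plan is re-inspected for every row.
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "col":
        return row[node[1]]
    left, right = interpret(node[1], row), interpret(node[2], row)
    return left + right if kind == "add" else left * right

def to_source(node):
    # Turn the tree into a Python expression string.
    kind = node[0]
    if kind == "lit":
        return repr(node[1])
    if kind == "col":
        return f"row[{node[1]!r}]"
    op = "+" if kind == "add" else "*"
    return f"({to_source(node[1])} {op} {to_source(node[2])})"

def generate(node):
    # Compile once; the returned function no longer looks at the tree.
    source = f"lambda row: {to_source(node)}"
    return eval(compile(source, "<generated>", "eval")), source

# price * qty + 5
expr = ("add", ("mul", ("col", "price"), ("col", "qty")), ("lit", 5))
compiled, source = generate(expr)
row = {"price": 3, "qty": 4}
same_result = interpret(expr, row) == compiled(row)
```

The interpreter pays for branching on node kinds and recursive calls on every row, while the generated function evaluates a flat expression; that per-row overhead is the cost code generation aims to remove.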