analytical query processing - marco serafini
TRANSCRIPT
Analytical Query Processing
Marco Serafini
COMPSCI 532Lecture 7
22
Announcement• Midterm date and location confirmed
• October 22 at 7-9pm in ILC S331
3
MapReduce vs. DBMSs
44
Advantages of DBMSs• Abstract data representation
• Relational model• Data storage is delegated to the DBMS
• Functional query language (SQL)• Queries specified as simple relational operators• Actual query execution delegated to the DBMS…• … including parallelism, distribution, pipelining etc.
• Support for indexing
55
Disadvantages of DBMSs• SQL is a limited interface for complex analytics
• E.g. image analysis, creating maps• Need to define a schema for data a priori• High cost of loading data and indexing
• Can be amortized only if same data and schema reused• Too complex for “one shot” analytics
66
Advantages of MapReduce• Support for arbitrary UDFs• Support for a variety of arbitrary data formats• Simple API• Scalability
77
Disadvantages• Many of the optimizations of DBMS must be reimplemented, for example
• Indices• Query execution plans (logical + physical)• Column-based storage• Data format specifications (ProtoBuf)• Support for updates
• Several efforts towards closing the gap for analytics
8
Data Analytics
9 9
In Situ Analytics• Data dumped on GFS/HDFS (data lake)• Some of this data is relational• Several systems to execute relational queries on HDFS data
• SQL-like language• Query optimization• Columnar data representation
• Can build on top of MR/Spark (e.g. Hive, SparkSQL) or not (e.g. Dremel, Impala, Presto)
• We will discuss both classes
1010
Analytical Queries• Long-running, complex queries• Often aggregates• Run on read-only data or snapshots of dynamic data• Data characteristics
• Tuples (rows) have many possible attributes (columns)• A row will have only a subset of attributes set
1111
Star Schema: Facts and Dimensions• Popular schema for analytics/data warehousing
• Many others exist!• At the center is a large fact table• Foreign-key references to small dimension tables
12
13 13
Dremel• In-situ analytics• Independent query executor (not on top of MR)• Uses columnar store
1414
Data Model: Column Families• Also called column groups, nested columns• Common to many systems, e.g. Cassandra, HBase
New record Nested in repeated Language
Nested in repeated Name
1515
Assembling a Row• Finite state machine
1616
Pros and Cons of Columnar Model• Pros
• Compression: columns have uniform values• Less data to scan on projections (which are common)
• Cons• Additional CPU load to decompress columns and rebuild rows
1717
Query Execution
19 19
SparkSQL: Spark + DBMS• Extend Spark with
• Simple, high-level SQL-like operators• Query optimization
• No need to transfer data across systems• ETL, query processing, complex analytics in one system
2020
Architecture
2121
DataFrames• Collection of rows with homogeneous schema
• Like a table in a DBMS• Can be manipulated like an RDD
• DataFrame operations• Similar to Python Pandas or R data frames• Evaluated lazily (query planning is postponed)• Can optimize across multiple queries
2222
SparkSQL Query Execution
2323
Advantages• Relational structure enables query optimization• In-memory caching using columnar representation
• Better compression• Mix SQL-like operators and arbitrary code
• More flexible than UDFs in DBMSs• Can optimize across multiple SQL operations
2424
Catalyst• Query optimizer of SparkSQL• Rule-based optimization
• Rule: find pattern and transform• Used for both logical and physical plans• Can customize rules
• Code generation• Directly outputs bytecode (as opposed to interpreting a plan)• Much more CPU efficient
• Flexible data sources• Can change the physical representation of DataFrames• Still use the optimizer
2525
Catalyst: Rule-Based Optimization• Apply rules to subtree until fixed point
Execution tree Transformation rules
2626
Catalyst: Code Generation• Faster than interpreting a physical plan