
A NEW PLATFORM FOR A NEW ERA


Capgemini / Pivotal Alliance Confidential – Do Not Distribute. July 2014 New Hire Immersion Training. © Copyright 2014 Pivotal. All rights reserved.

Pivotal GPDB Immersion v1.2 Dana Brennemen


Welcome!


Immersion Description

•  This course is 3 days in length

•  It is designed to rapidly enable partners with a technical working knowledge of GPDB

•  The course organization and flow presents content in the order work is done in a GPDB system

–  In the manner an FE would initiate tasks in a POC

•  The course content emphasizes database performance and optimization rather than administration, monitoring and maintenance tasks

–  All performance tuning and optimization techniques and guidelines are presented and discussed within the related topic module


Immersion Labs

•  Labs require the lab tar files and VM to be loaded and installed on your laptop before arriving at class

•  All labs (except the last lab) are scripted and are intended to be used as a toolkit of examples and reusable scripts that you can later leverage in the field

–  Scripted labs demonstrate how to use and implement features in GPDB

–  Review the scripts to understand what, why and how

•  The last lab is not scripted; it is completely hands-on, and no solution is presented

–  It is a culmination of applying all that was learned throughout this immersion


Immersion Pre-Requisites

•  To successfully complete this immersion, the following pre-requisites are required:

–  General knowledge of RDBMSs

–  Experience with SQL ▪  http://www.quackit.com/sql/tutorial/ ▪  Or another online tutorial

–  Working knowledge of Linux / UNIX: shell scripting, ls, vi, more, etc.

•  This immersion is a pre-requisite for the PHD/HAWQ immersion if the student does not have prior experience with GPDB


Etiquette

•  All cell phones must be turned off or set to Do Not Disturb

•  Only laptops that will be used for labs are allowed in the classroom

–  Laptops must have the GPDB VM installed

•  Please no checking email or surfing the internet during class

–  Email, voice mail and text messages can be checked during breaks and lunch

•  Please limit side conversations as they are disruptive and distracting to other participants


Start, Stop, Breaks and Lunch

•  This immersion is 3 days in length

•  Your instructor will advise you of the daily start and stop times

–  Please arrive on time

–  Please inform your instructor of all absences

•  20 minute morning break

•  Lunch break

•  20 minute afternoon break


Pivotal GPDB Introduction and Positioning


Pivotal Database Enterprise Platform

PRODUCT FEATURES

•  CLIENT ACCESS & TOOLS

–  Client access: ODBC, JDBC, OLEDB, etc.

–  3rd party tools: BI tools, ETL tools, data mining, etc.

–  Admin tools: GP Performance Monitor, pgAdmin3 for GPDB

•  CORE MPP ARCHITECTURE

–  Parallel Dataflow Engine, gNet™ Software Interconnect

–  MPP Scatter/Gather Streaming™

–  Multi-Level Fault Tolerance

–  Shared-Nothing MPP, Parallel Query Optimizer

–  Polymorphic Data Storage™

•  GPDB ADAPTIVE SERVICES

–  Online System Expansion, Workload Management

•  LOADING & EXTERNAL ACCESS

–  Petabyte-Scale Loading, Trickle Micro-Batching, Anywhere Data Access

•  STORAGE & DATA ACCESS

–  Hybrid Storage & Execution (Row- & Column-Oriented)

–  In-Database Compression, Multi-Level Partitioning

–  Indexes: B-tree, Bitmap, etc.

•  LANGUAGE SUPPORT

–  Comprehensive SQL, Native MapReduce

–  SQL 2003 OLAP Extensions, Programmable Analytics

Pivotal Database: Platform for Big Analytics

•  Descriptive BI

•  Predictive Modeling

•  Machine Learning

•  Fast and Scalable

Big data predictive analytics with any tool

Pivotal Database: The Essentials

•  SQL Based:

–  Load and query like any SQL database

•  MPP Shared-Nothing Parallelization:

–  Automatic data distribution without tuning

•  Linear Scalability:

–  Linear scaling of capacity, loading, users and concurrency

•  Analytics Optimized:

–  Analytics-oriented query optimization, write locking, storage management, data compression, etc.

•  Extensible for Analytics:

–  “Plug-in” analytical algorithm libraries

•  Flexible Deployment Models:

–  Appliance or software deployments

Architecture: No-Forklift Scalability

•  Advantages:

–  Scale existing systems

–  No forklifting

–  Immediate capacity increase

•  Simple Process:

–  Connect new hardware

–  Simple restart

–  Schedule redistribution of existing data

[Diagram: new segment servers joining the array; the master continues query planning & dispatch]

Performance Through Parallelism

•  Scale-out architecture on standard commodity hardware

•  Automatic parallelization

–  Load and query like any database

–  Tables are automatically distributed across nodes

–  No need for manual partitioning or tuning

•  Extremely scalable MPP shared-nothing architecture

–  All nodes can scan and process in parallel

–  Linear scalability by adding nodes

–  On-line expansion when adding nodes

[Diagram: loading, interconnect, database, storage and compute layers scaling across nodes]

Performance: Parallel Query Optimizer

•  Cost-based optimization looks for the most efficient plan

•  The physical plan contains scans, joins, sorts, aggregations, etc.

•  Global planning avoids sub-optimal ‘SQL pushing’ to segments

•  Directly inserts ‘motion’ nodes for inter-segment communication

[Diagram: example physical execution plan from SQL or MapReduce, with Gather Motion 4:1 (Slice 3) over Sort, HashAggregate and HashJoins, plus Redistribute Motion 4:4 (Slice 1) and Broadcast Motion 4:4 (Slice 2) feeding Seq Scans on customer, lineitem and orders]

Performance: Dynamic Pipelining

•  A supercomputing-based “soft-switch” (the software interconnect) responsible for:

–  Efficiently pumping streams of data between motion nodes during query-plan execution

–  Delivering messages, moving data, collecting results, and coordinating work among the segments in the system

Loading: Industry’s Fastest

•  Industry leading performance at 10+ TB per-hour per-rack

•  Scatter-Gather Streaming™ provides true linear scaling

•  Support for both large-batch and continuous real-time loading strategies

•  Enables complex data transformations “in-flight”

•  Transparent interfaces to loading via files, applications, and services

[Chart: single rack load-rate comparison of Greenplum, Oracle Exadata, Netezza and Teradata. Greenplum load rates scale linearly with the number of racks, others do not. For example, two racks = >20 TB/h]

Loading: Massively-Parallel Ingest

•  Fast Parallel Load & Unload

–  No Master Node bottleneck

–  10+ TB/hour per rack

–  Linear scalability

•  Low Latency

–  Data immediately available

–  No intermediate stores

–  No data “reorganization”

•  Load/Unload To & From:

–  File systems

–  ETL products

–  Hadoop distributions

Extreme speed and immediate usability from files, ETL & Hadoop

[Diagram: external sources (SQL, ETL, file systems) loading and streaming through the gNet network interconnect; master servers handle query planning & dispatch, segment servers handle query processing & data storage]
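As a concrete sketch of file-based parallel ingest, the gpfdist file server and external tables are the usual route; segments pull from gpfdist in parallel, bypassing the master. The database name, host name, paths and port below are assumptions for illustration:

```shell
# Serve staged files from the ETL host in parallel (directory/port assumed)
gpfdist -d /data/staging -p 8081 &

# Define a readable external table over those files, then load from it
psql demo -c "CREATE EXTERNAL TABLE orders_stage (LIKE orders)
              LOCATION ('gpfdist://etlhost:8081/orders*.txt')
              FORMAT 'TEXT' (DELIMITER '|');"
psql demo -c "INSERT INTO orders SELECT * FROM orders_stage;"
```

Because every segment connects to gpfdist directly, adding segment hosts increases aggregate load throughput, which is the linear-scaling claim above.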

Storage: Polymorphic Table Storage™

•  Provides the choice of processing model for any table or any individual partition

–  Enables Information Lifecycle Management (ILM)

•  Storage types can be mixed within a table or database

–  Four table types: heap, row-oriented AO, column-oriented, external

–  Block compression: gzip (levels 1-9), QuickLZ

–  Columnar compression: RLE

[Diagram: TABLE ‘CUSTOMER’ partitioned by month (Mar ’11 to Nov ’11), with recent partitions row-oriented for HOT DATA and older partitions column-oriented for COLD DATA]
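A minimal DDL sketch of mixing storage types, using the GPDB append-optimized storage options (the database and table names are hypothetical):

```shell
# Default heap table for hot, frequently updated data
psql demo -c "CREATE TABLE customer_hot (id int, name text)
              DISTRIBUTED BY (id);"

# Append-optimized, column-oriented, compressed table for cold data
psql demo -c "CREATE TABLE customer_cold (id int, name text)
              WITH (appendonly=true, orientation=column,
                    compresstype=quicklz, compresslevel=1)
              DISTRIBUTED BY (id);"
```

The same WITH options can be attached per partition, which is how a single table carries row-oriented hot partitions and column-oriented cold ones.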

Storage: Multi-Level Partitioning

•  Hash distribution to evenly spread data across all segment instances

•  Range partitioning within a segment instance to minimize scan work

[Diagram: a data set hash-distributed across segments 1A through 3D on nodes 1 to 3, with each segment range-partitioned by month (Jan 2007 to Dec 2007)]
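The two levels combine in a single CREATE TABLE: DISTRIBUTED BY picks the hash-distribution key across segments, PARTITION BY RANGE defines the partitions within each segment. Table and column names here are hypothetical:

```shell
psql demo -c "CREATE TABLE sales (
                txn_id   bigint,
                cust_id  int,
                txn_date date,
                amount   numeric
              )
              DISTRIBUTED BY (cust_id)        -- hash distribution across segments
              PARTITION BY RANGE (txn_date)   -- range partitions inside each segment
              ( START (date '2007-01-01') INCLUSIVE
                END   (date '2008-01-01') EXCLUSIVE
                EVERY (INTERVAL '1 month') );"
```

A query filtered on txn_date then scans only the matching monthly partitions on every segment, which is the scan-work reduction the slide describes.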

Schema Agnostic

Schema Support

•  Third-Normal Form

•  Star Schema

•  Snowflake Schema

•  Hybrid Schemas

•  Denormalized Tables

Philosophies

•  Kimball

•  Inmon

•  Non-traditional

•  Co-Processing with Hadoop


Extensible for Analytics: In-Database Analytical Algorithms

•  Bringing the power of parallelism to commonly-used modeling and analytics functions

•  In-database analytics

–  SAS – HPA, Access, and Scoring Accelerator

–  MADlib – an open-source library of advanced analytics functions

–  Analytics extensions supported, including:

▪  PostGIS – geospatial support; PL/R – statistical computing; PL/Java, PL/Perl, etc.

High Availability

Segment Server Data Protection

•  Mirrored segments for server failures

•  Optional RAID protection for drive failures

Upon server failure:

•  Mirrored segments take over with no loss of service

•  Fast online differential recovery

Master Server Data Protection

•  Replicated transaction logs for server failure

•  Optional RAID protection for drive failures

Upon server failure:

•  Standby server activated

•  Administrator alerted

•  Orchestrated failover

[Diagram: master and standby master above mirrored segment servers]

Simple to Manage

•  Greenplum Command Center

–  Complete platform management and control

•  Greenplum Package Manager

–  Automates install, uninstall, update, and query for analytics extensions

–  Supports package migration during upgrade, segment recovery, expansion, and standby initialization

Pivotal Greenplum Database Delivers

•  Massively Parallel Analytics Performance

•  In-Database Analytical Extensions

•  Industry-Leading Load Speed

•  Rich SQL with Schema Agnosticism

•  Industry-Leading Workload Mgmt.

•  SAS Acceleration Options

•  Parallel Co-Processing with Hadoop

•  No-Forklift Scalability

•  Multi-Level Redundancy

•  Rich, Easy-to-Use Administration Tools

•  Big Data Backup

•  Comprehensive Security

GPDB Architecture Overview


MPP Shared Nothing Architecture

•  Master Host and Standby Master Host – the master coordinates work with the Segment Hosts

•  Segment Host with one or more Segment Instances – Segment Instances process queries in parallel

•  Segment Hosts have their own CPU, disk and memory (shared nothing)

•  High speed interconnect for continuous pipelining of data processing

•  Flexible framework for processing large datasets

[Diagram: SQL arrives at the Master Host (with Standby Master); the Interconnect connects Segment Hosts node1 through nodeN, each running multiple Segment Instances]

Master Host

•  Client: accepts client connections and incoming user requests, and performs authentication

•  Parser: enforces syntax and semantics and produces a parse tree

[Diagram: the Master Segment on the Master Host, containing the Parser, Query Optimizer, Dispatch, Query Executor, Distributed TM and Catalog]

Query Optimizer

•  Consumes the parse tree and produces the query plan

•  The query plan describes how the query is executed (e.g. hash join versus merge join)

[Diagram: Master Host (Parser, Query Optimizer, Dispatcher, Query Executor, Distributed TM, Catalog, Local Storage) connected through the Interconnect to Segment Hosts, each running Segment Instances with their own Local TM, Query Executor, Catalog and Local Storage]

Query Dispatcher

•  Responsible for communicating the query plan to segments

•  Allocates the cluster resources required to perform the job, and accumulates and presents the final results

[Diagram: the master/segment architecture, highlighting the Dispatcher on the Master Host]

Query Executor

•  Responsible for executing the steps in the plan (e.g. open file, iterate over tuples)

•  Communicates its intermediate results to other executor processes

[Diagram: the master/segment architecture, highlighting the Query Executors on the Segment Instances]

Interconnect

•  Responsible for serving tuples from one segment to another to perform joins, etc.

•  Uses UDP for optimal performance and scalability

[Diagram: the master/segment architecture, highlighting the Interconnect between the Master Host and Segment Hosts]

System Catalog

•  Stores and manages metadata for databases, tables, columns, etc.

•  The master keeps a copy of the metadata, coordinated on every segment host

[Diagram: the master/segment architecture, highlighting the Catalog on the master and on each Segment Instance]

Distributed Transaction Management

•  Segments have their own commit and replay logs and decide when to commit or abort their own transactions

•  The DTM resides on the master and coordinates the commit and abort actions of segments

[Diagram: the master/segment architecture, highlighting the Distributed TM on the master and the Local TM on each Segment Instance]

GPDB High Availability

•  Master Host mirroring

–  Warm Standby Master Host ▪  Replica of Master Host system catalogs

–  Eliminates single point of failure

–  Synchronization process between Master Host and Standby Master Host ▪  Uses replication logs

•  Segment mirroring

–  Creates a mirror segment for every primary segment ▪  Uses a file block replication process

–  If a primary segment becomes unavailable, automatic failover to the mirror

Master Mirroring

•  Warm Standby Master enabled at initialization or on an active system using gpinitstandby

•  If the Master Host becomes unavailable, the replication process is stopped

–  Replication logs are used to reconstruct the state of the master at the time of failure

–  The Standby Master Host can be activated to start at the last successful transaction completed by the Master Host ▪  Use gpactivatestandby
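A command-line sketch of the two utilities named above (the standby hostname smdw is an assumption):

```shell
# Add a warm standby master to a running system
gpinitstandby -s smdw

# After a master failure, promote the standby
# (run on the standby host as the GPDB administrator)
gpactivatestandby -d $MASTER_DATA_DIRECTORY
```

After activation the standby becomes the new master, starting from the last transaction replicated before the failure.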

Segment Mirroring

•  Enabled at initialization or on an active system using gpaddmirrors

•  Can be configured on the same array of hosts or on a system outside of the array

•  If a primary segment becomes unavailable, automatic failover to the mirror
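Adding mirrors to an active system can be sketched as follows (the port offset is an assumption; gpaddmirrors prompts for mirror data directories):

```shell
# Add mirror segments, offsetting mirror ports from the primaries by 10000
gpaddmirrors -p 10000

# Verify mirror status afterwards
gpstate -m
```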

Fault Detection and Recovery

•  The ftsprobe fault detection process monitors and scans segments and database processes at configurable intervals

•  Use the gpstate utility to verify the status of primary and mirror segments

•  Query the gp_segment_configuration catalog table for detailed information about a failed segment

$ psql -c "SELECT * FROM gp_segment_configuration WHERE status='d';"

•  When ftsprobe cannot connect to a segment, it marks it as down

–  It will remain down until an administrator manually recovers the failed segment using the gprecoverseg utility

•  Automatic failover to the mirror segment

–  Subsequent connection requests are switched to the mirror segment
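The detection and recovery cycle above can be sketched end to end (all commands run as the GPDB administrator on the master host):

```shell
# Spot segments with issues, then list any marked down ('d')
gpstate -e
psql -c "SELECT * FROM gp_segment_configuration WHERE status='d';"

# Recover the failed segments without prompting
gprecoverseg -a

# Optionally return segments to their preferred primary/mirror
# roles once resynchronization is complete
gprecoverseg -r
```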

Getting Started with GPDB


Required GPDB Environment Variables

$ GPHOME=/usr/local/greenplum-db-4.1.x.x
$ export GPHOME

$ PATH=$GPHOME/bin:$PATH
$ export PATH

$ LD_LIBRARY_PATH=$GPHOME/lib
$ export LD_LIBRARY_PATH

$ MASTER_DATA_DIRECTORY=/data/master/gpseg-1
$ export MASTER_DATA_DIRECTORY
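In practice these exports are usually persisted in the administrator's shell profile so every session is configured; a one-line-per-variable sketch (the install path is copied from the slide and will differ per release):

```shell
# Typically appended to ~/.bashrc for the gpadmin user
export GPHOME=/usr/local/greenplum-db-4.1.x.x
export PATH=$GPHOME/bin:$PATH
export LD_LIBRARY_PATH=$GPHOME/lib
export MASTER_DATA_DIRECTORY=/data/master/gpseg-1
```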

GPDB Server Configuration File

•  The master and segments each have their own postgresql.conf file

•  Some parameters are local; each segment looks to its own file

–  Must be set on the master and every segment host

•  Some parameters are master parameters; set at the master host

•  Some require a database restart

•  Parameters apply at the system, database, role and session level

•  At the session level, use the SET command, for example:

=# SET statement_mem TO '200MB';
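Cluster-wide changes to postgresql.conf are usually made with the gpconfig utility rather than by editing each host's file; a sketch using the same parameter as above:

```shell
# Set statement_mem in postgresql.conf on the master and all segments
gpconfig -c statement_mem -v '200MB'

# Show the value configured across the cluster
gpconfig -s statement_mem

# Reload configuration files without a full restart
gpstop -u
```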

GUCs

•  Parameters that affect the behavior of GPDB

•  Depending on the type of GUC, it can be set at the session level, user level, or server level (in postgresql.conf)

•  Depending on the type of GUC, it is set locally on only the master, or globally on the master and all segments

•  Depending on the type of GUC, a restart or reload may or may not be required

•  Certain GUCs can be set only by a superuser

gpstart

$ gpstart -a

•  Watch for WARN and FATAL lines

–  FATAL lines indicate that the database did not start

–  WARN lines indicate a failure of one or more segment databases to start

•  Recover failed segment startups with gprecoverseg

gpstop

•  -u

–  Reload configuration files (pg_hba.conf, postgresql.conf); used after making changes

•  -r

–  Stop and restart the database

•  -a

–  Don’t prompt

•  -M fast | immediate

$ gpstop -M fast -r

gpstate

•  Provides the status of individual components in a GPDB system, including primary segments, mirror segments, the master host, and the standby master host

•  Examples

–  gpstate -s   Show detailed status information of a Greenplum Database system

–  gpstate -Q   Quick check for down segments in the master host system catalog

–  gpstate -m   Show information about mirror segments

–  gpstate -f   Show information about the standby master configuration

–  gpstate -i   Display the Greenplum software version information

Standard Database Processes

•  Since GPDB is a collection of databases, you will see many processes running on the segment servers

•  The main GPDB/postgres process can be identified with:

$ ps -ef | grep 'postgres -D'

•  Each process will be running on a specific port

•  On segment servers you will see one of these processes for each primary and mirror database

Introduction to psql


psql

•  Command line terminal interface to GPDB

–  Interactive or file

–  Meta-commands and shell-like features to facilitate scripting

•  Connect to the master

•  -d database_name, or set the PGDATABASE env variable

•  -h hostname, or set the PGHOST env variable

•  -p port_number, or set the PGPORT env variable

•  -U user_name, or set the PGUSER env variable

Common psql commands

$ psql mydatabase

$ psql mydatabase -f /home/lab/createdb.sql

At the psql prompt:

\h

\?

\l

\dt

\q

Common psql Meta Commands \? (help on psql meta-commands)

\h (help on SQL command syntax)

\dt (show tables)

\dtS (show system tables)

\dg or \du (show roles)

\l (show databases)

\c db_name (connect to this database)

\q (quit psql) (ctrl-d also works)


Using psql: Issuing SQL Statements

•  Interactive mode

$ psql mydatabase
mydatabase=# SELECT * FROM foo;

•  Non-interactive mode (single command)

$ psql mydatabase -ac "SELECT * FROM foo;"

•  Non-interactive mode (multiple commands)

$ psql mydatabase -af /home/lab1/sql/createdb.sql

Lab


Lab Getting Started

•  Read the README file

•  Set your environment variables

•  Start the Pivotal DB

–  Also start Command Center

•  Create the database

Introduction to Creating Tables in GPDB


Piv

otal

DB

Inst

ance

Databases, Schemas and Objects �  Database

–  Multiple databases per GPDB instance –  Data is not shared between databases

�  Schema –  Logically organize data within a database

�  Object –  Tables, Indexes, functions, etc.

�  search_path defines what order is schemas are searched for the object

�  Qualified name: –  ASCII standard: database.schema.object –  PDB best practice: schema.object

(Diagram: one Pivotal DB instance containing two databases)

Database: dev
   Schema: finance
      Tables: ar_history, customer, ap_history
      View: customer_v
   Schema: support
      Tables: case_history, products
      View: customer_v

Database: qa
   Schema: finance
      Tables: ar_history, customer, ap_history
      View: customer_v
   Schema: support
      Tables: case_history, products
      View: customer_v
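The search_path and qualified-name rules above can be sketched against the diagrammed schemas (schema and table names taken from the diagram; adjust to your own database):

```sql
-- Unqualified names are resolved through search_path, in order:
SET search_path TO finance, support, public;

SELECT * FROM customer;          -- resolves to finance.customer
SELECT * FROM support.products;  -- schema-qualified; search_path not consulted
```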

Lab Database and Schemas

(Diagram: a Pivotal DB instance containing the dca_demo database)

Database: dca_demo

Schema: retail_demo

Table: customer_addresses_dim

Table: customers_dim

Table: categories_dim

Table: date_dim

Table: order_lineitems

Table: orders

Table: email_addresses_dim

Table: payment_methods

Table: products_dim

Schema: ext

Table: customer_addresses_dim_ext

Table: customers_dim_ext

Table: categories_dim_ext

Table: date_dim_ext

Table: order_lineitems_ext

Table: orders_ext

Table: email_addresses_dim_ext

Table: payment_methods_ext

Table: products_dim_ext

Schema: err

Table: customer_addresses_dim_err

Table: customers_dim_err

Table: categories_dim_err

Table: date_dim_err

Table: order_lineitems_err

Table: orders_err

Table: email_addresses_dim_err

Table: payment_methods_err

Table: products_dim_err

Lab Tables: retail_demo Schema

(Diagram: star schema linking the following tables)

payment_methods, orders, order_lineitems, date_dim, products_dim,
categories_dim, customers_dim, email_addresses_dim, customer_addresses_dim

Database, Table, Row, Index, PK and FK

•  A database is the set of physical files in which all the objects and
   database metadata are stored

•  A table is a set of columns that can contain data

•  A row is a set of column values from a table reflecting a record, or tuple

•  An index is an object that allows for fast retrieval of table rows

•  A primary key is one or more columns in a table that make a record unique

•  A foreign key is a column common between two tables that defines the
   relationship between those two tables

Template Databases

•  At initialization, 3 databases are created:
   –  template0
   –  template1
   –  postgres

•  CREATE DATABASE works by copying an existing database
   –  The default database that is copied is template1
   –  A new database created from this template contains all objects
      within template1
   –  For example: CREATE DATABASE new_dbname;

•  template0 should not be altered; it can be used to recreate template1.
   The postgres database is for general use
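A minimal sketch of the template behavior (database names are illustrative):

```sql
CREATE DATABASE new_dbname;                   -- copies template1 by default
CREATE DATABASE clean_db TEMPLATE template0;  -- start from a pristine copy
```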

Define Columns and Data Types (CREATE TABLE)

•  Choose data types that use the least amount of space
   –  Use TEXT or VARCHAR rather than CHAR
   –  Do not use BIGINT for data that fits in INT or SMALLINT

•  Use the same data type for columns used in table joins
   –  When the data types differ, the database must convert one so the
      data values can be compared correctly

Define Constraints (CREATE TABLE)

•  CHECK constraints

•  NOT NULL constraints

•  UNIQUE constraints

•  PRIMARY KEY constraints

•  FOREIGN KEY constraints
   –  Allowed but not enforced in GPDB
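The constraint types above can be combined in one DDL statement; a sketch with an illustrative table (in GPDB, UNIQUE and PRIMARY KEY columns must include the distribution key):

```sql
CREATE TABLE accounts (
    acct_id   INT PRIMARY KEY,             -- implies NOT NULL and UNIQUE
    email     TEXT NOT NULL,
    balance   NUMERIC CHECK (balance >= 0),
    branch_id INT REFERENCES branches(id)  -- accepted but not enforced in GPDB
) DISTRIBUTED BY (acct_id);
```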

Define the Storage Model (CREATE TABLE)

•  Heap storage

•  Append-only storage

•  Row-oriented storage

•  Column-oriented storage

•  Compression
   –  Table-level compression applied to the entire table
   –  Column-level compression applied to a specific column
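The storage choices above map onto CREATE TABLE clauses; a sketch (table names illustrative, syntax per GPDB append-only tables):

```sql
-- Heap, row-oriented (the default):
CREATE TABLE events_heap (id INT, payload TEXT) DISTRIBUTED BY (id);

-- Append-only, column-oriented, table-level zlib compression:
CREATE TABLE events_ao (id INT, payload TEXT)
WITH (appendonly=true, orientation=column,
      compresstype=zlib, compresslevel=5)
DISTRIBUTED BY (id);
```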

Column-Oriented Considerations

•  Consider the use cases for column-oriented storage
   –  More efficient I/O and storage
   –  Not optimized for write operations
   –  A physical file is created for each column (and each partition, for
      partitioned columnar tables)

•  Benefits of columnar storage
   –  Speed
   –  Compression ratios

Storage Optimization

•  Use row-based tables when
   –  Updates are required
   –  Frequent inserts are performed
   –  Selects against the table are wide (approximately 30+ columns)

•  Use column-based tables when
   –  Selects are narrow
   –  Higher compression rates are required
   –  Optimal query performance is desired

Append-Only Tables: Compression Considerations

   Table Orientation | Compression Type | Algorithms
   ------------------+------------------+----------------------------
   Row               | Table            | ZLIB and QUICKLZ
   Column            | Column and Table | RLE_TYPE, ZLIB and QUICKLZ

•  Compression ratio and disk size

•  Compression speed

•  Decompression speed and scan rate

Compression costs CPU. Use it only when space is a high priority, or when
the table compresses well enough that the I/O saved outweighs the added
CPU cost.
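Column-level compression can be sketched with per-column ENCODING clauses (table and column names illustrative; ENCODING applies to append-only, column-oriented tables):

```sql
CREATE TABLE sales_ao (
    sale_id   BIGINT  ENCODING (compresstype=quicklz),
    sale_date DATE    ENCODING (compresstype=RLE_TYPE),  -- long runs of repeats
    amount    NUMERIC ENCODING (compresstype=zlib, compresslevel=5)
)
WITH (appendonly=true, orientation=column)
DISTRIBUTED BY (sale_id);
```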

Table Distributions and Partitioning

Define Data Distributions (CREATE TABLE)

•  Every table has a distribution method

•  DISTRIBUTED BY (column)
   –  Uses a hash distribution

•  DISTRIBUTED RANDOMLY
   –  Uses a random distribution, which is not guaranteed to be
      perfectly even

=> CREATE TABLE products (name varchar(40), prod_id integer,
       supplier_id integer)
   DISTRIBUTED BY (prod_id);

Data Distribution: The Key to Parallelism

The primary strategy and goal is to spread data evenly across as many
nodes (and disks) as possible.

(Diagram: the Greenplum Database high-speed loader spreads order rows
(Order #, Order Date, Customer ID) evenly across segments)

   43 | Oct 20 2005 |  12        50 | Oct 20 2005 |  34
   64 | Oct 20 2005 | 111        56 | Oct 20 2005 | 213
   45 | Oct 20 2005 |  42        63 | Oct 20 2005 |  15
   46 | Oct 20 2005 |  64        44 | Oct 20 2005 | 102
   77 | Oct 20 2005 |  32        53 | Oct 20 2005 |  82
   48 | Oct 20 2005 |  12        55 | Oct 20 2005 |  55

Parallel Data Scans

SELECT COUNT(*) FROM orders
WHERE order_date >= 'Oct 20 2007' AND order_date < 'Oct 27 2007';
-- result: 4,423,323

(Diagram: the master develops the query plan and sends it to the
segments; each segment scans its data simultaneously and returns its
results to the master, which returns the final result)

DISTRIBUTED RANDOMLY

•  Uses a random algorithm
   –  Distributes data across all segments
   –  Minimal data skew, but not guaranteed to be perfectly even

•  Any query that joins to a randomly distributed table requires a
   motion operation
   –  Redistribute motion
   –  Broadcast motion
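A sketch of a randomly distributed table (table name illustrative); this can be useful for staging data when no column makes a good hash key:

```sql
CREATE TABLE stage_raw_events (
    event_ts TIMESTAMP,
    payload  TEXT
) DISTRIBUTED RANDOMLY;
```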

DISTRIBUTED BY (column_name)

•  For large tables, significant performance gains can be obtained with
   local (co-located) joins
   –  Distribute commonly joined tables on the same column
      ▪  The column used in the WHERE/join clause

•  The join is performed within the segment
   –  Each segment operates independently of the other segments

•  Eliminates or minimizes motion operations
   –  Broadcast motion
   –  Redistribute motion

Use the Same Distribution Key for Commonly Joined Tables

Distribute on the same key used in the join (WHERE clause) to obtain
local joins:

   Segment 1A:  customer (c_customer_id) = freq_shopper (f_customer_id)
   Segment 2A:  customer (c_customer_id) = freq_shopper (f_customer_id)

Redistribution Motion

WHERE customer.c_customer_id = freq_shopper.f_customer_id

(Diagram: freq_shopper is distributed on f_trans_number, so it is
dynamically redistributed on f_customer_id; e.g. the freq_shopper row
for customer_id=102 on Segment 2A moves to Segment 1A, which holds
customer c_customer_id=102, and the row for customer_id=745 on
Segment 3A moves to Segment 2A)

Broadcast Motion

WHERE customer.c_statekey = state.s_statekey

(Diagram: the small state table (s_statekey: AK, AL, AZ, CA, ...) is
dynamically broadcast to all segments, so each segment holds a full copy
to join against its local customer rows)

Commonly Joined Tables: Use the Same Data Type for Distribution Keys

   customer (c_customer_id)       745::int
   freq_shopper (f_customer_id)   745::varchar(10)

•  Values might appear the same, but they are stored differently at the
   disk level

•  Values might appear the same, but they HASH to different values
   –  Resulting in like rows being stored on different segments
   –  Requiring a redistribution before the tables can be joined
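One sketch of a fix, assuming the varchar key really holds integers (table and column names from the example above): alter the column type so both keys hash identically. Note that GPDB rewrites the table to do this.

```sql
-- Align the join-key types so 745 hashes the same in both tables:
ALTER TABLE freq_shopper
    ALTER COLUMN f_customer_id TYPE INT USING f_customer_id::INT;
```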

Hash Distributions: Data Skew and Computational Skew

•  Select a distribution key with unique values and high cardinality
   that will not result in data skew
   –  Do not distribute on boolean keys or keys with low cardinality
      ▪  The system distributes rows with the same hash value to the
         same segment instance, so a low-cardinality key concentrates
         the data on only a few segments

•  Select a distribution key that will not result in computational skew
   (in flight, while a query is executing)
   –  Caused by operations on columns that have low cardinality or a
      non-uniform distribution

Always Check for Data Skew

•  SELECT COUNT(*), gp_segment_id
   FROM <table-name>
   GROUP BY gp_segment_id;

•  SELECT 'facts' AS "Table Name",
          max(c) AS "Max Seg Rows",
          min(c) AS "Min Seg Rows",
          (max(c)-min(c))*100.0/max(c)
              AS "Percentage Difference Between Max & Min"
   FROM (SELECT count(*) c, gp_segment_id FROM facts GROUP BY 2) AS a;

Check for Skew: Simple Example

test=# SELECT COUNT(*), gp_segment_id FROM sales GROUP BY gp_segment_id;
 count | gp_segment_id
-------+---------------
     1 |             1
     2 |             0
(2 rows)

test=# SELECT 'sales' AS "Table Name", max(c) AS "Max Seg Rows",
       min(c) AS "Min Seg Rows", (max(c)-min(c))*100.0/max(c)
       AS "Percentage Difference Between Max & Min"
       FROM (SELECT count(*) c, gp_segment_id FROM sales GROUP BY 2) AS a;
 Table Name | Max Seg Rows | Min Seg Rows | Percentage Difference Between Max & Min
------------+--------------+--------------+-----------------------------------------
 sales      |            2 |            1 |                     50.0000000000000000
(1 row)

Define Partitioning (CREATE TABLE)

•  Reduces the amount of data to be scanned by reading only the relevant
   data needed to satisfy a query
   –  The goal is to achieve partition elimination

•  Supports range partitioning and list partitioning

•  Uses table inheritance and constraints
   –  Persistent relationship between parent and child tables
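A sketch of monthly range partitioning (table name illustrative; GPDB partition syntax):

```sql
CREATE TABLE orders (
    order_id   BIGINT,
    order_date DATE,
    amount     NUMERIC
)
DISTRIBUTED BY (order_id)
PARTITION BY RANGE (order_date)
(
    START (date '2007-01-01') INCLUSIVE
    END   (date '2008-01-01') EXCLUSIVE
    EVERY (INTERVAL '1 month')
);
```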

Multi-Level Partitioning

•  Use hash distribution to evenly spread data across all segment
   instances

•  Use range partitioning within an instance to minimize scan work

(Diagram: each of segments 1A-3D holds the monthly partitions Jan 2007
through Dec 2007)

...Further Improve Scan Times

SELECT COUNT(*) FROM orders
WHERE order_date >= 'Oct 20 2007' AND order_date < 'Oct 27 2007';

(Diagram: hash distribution vs. multi-level partitioning. With hash
distribution alone, every segment scans its entire slice of orders; with
multi-level partitioning, each segment scans only the partitions covering
the requested date range)

Partitioning Guidelines

•  Use table partitioning on large tables to improve query performance
   –  Table partitioning is not a substitute for distributions

•  Use it if the table can be divided into roughly equal parts based on
   a defining criterion
   –  For example, range partitioning on date
   –  No overlapping ranges or duplicate values

•  And the defining partitioning criterion matches the access pattern
   used in query predicates
   –  WHERE date = '1/30/2012'

Loading Partitioned Tables

•  Top-level parent tables are empty

•  Data is loaded into child partitions

•  COPY or INSERT automatically routes data to the correct partition

•  Or load a staging table and swap it in place of an existing partition
   –  ALTER TABLE ... EXCHANGE PARTITION
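The staging-table swap can be sketched as follows (table, file and partition names illustrative):

```sql
-- 1. Build and load a staging table with the same columns as the target:
CREATE TABLE orders_oct07_stage (LIKE orders) DISTRIBUTED BY (order_id);
COPY orders_oct07_stage FROM '/data/orders_oct07.dat' WITH DELIMITER '|';

-- 2. Swap it in place of the existing partition:
ALTER TABLE orders
    EXCHANGE PARTITION FOR (date '2007-10-01')
    WITH TABLE orders_oct07_stage;
```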

Lab

Lab Create Tables

•  Read the README file

•  Review the DDL

•  Using psql, run the DDL scripts

•  Using psql, review the results
   –  What was created?

Indexes

Indexes

•  Most data warehouse environments operate on large volumes of data
   –  Low selectivity
   –  A sequential scan is the preferred method to read the data

•  For queries with high selectivity, indexes may improve performance
   –  Avoid indexes on frequently updated columns
   –  Avoid overlapping indexes
   –  Drop indexes before loading data and recreate them after the load

Optimize Indexes

•  Consider indexes on compressed tables

•  Create selective B-tree indexes
   –  Single-row lookups are good candidates

•  Use bitmap indexes for low-cardinality/low-selectivity columns

•  Index columns used in joins

•  Index columns frequently used in predicates

•  Consider a clustered index

Bitmap Indexes

•  When to use bitmap indexes
   –  Better suited for querying than for updating
   –  Perform best when the column has a low cardinality: roughly 100
      to 100,000 distinct values

•  When not to use bitmap indexes
   –  Do not use for unique columns
   –  Do not use for high-cardinality data
      ▪  For example, customer names and phone numbers
   –  Do not use for very low cardinality
      ▪  For example, gender
   –  Do not use for OLTP workloads

Creating and Managing Indexes

•  B-tree:
   CREATE INDEX gender_idx ON employee (gender);

•  Bitmap:
   CREATE INDEX title_bmp_idx ON films USING bitmap (title);

•  To rebuild all indexes on a table:
   REINDEX TABLE my_table;

•  To rebuild a particular index:
   REINDEX INDEX my_index;

•  To drop an index:
   DROP INDEX name_idx;

GUCs for Index Selection

•  random_page_cost (master/session/reload) Default value: 100
   –  Sets the planner's estimate of the cost of a nonsequentially
      fetched disk page
   –  A lower value increases the chances of an index scan being picked

•  enable_indexscan (master/session/reload) Default value: on
   –  Enables or disables the query planner's use of index-scan plan types

•  enable_nestloop (master/session/reload) Default value: off
   –  Enables or disables the query planner's use of nested-loop join plans
   –  Should be enabled for use of an index in nested-loop joins

•  enable_bitmapscan (master/session/reload) Default value: on
   –  Enables or disables the query planner's use of bitmap-scan plan types
   –  Bitmap scans generally provide faster access, but you can try
      disabling them if you are getting very few rows out of the index

•  enable_seqscan (master/session/reload) Default value: on
   –  Disabling enable_seqscan forces use of an index
   –  Use this parameter very carefully, only as a last resort

Setting GUCs to Influence Index Usage

•  Iterative tuning steps to favor index usage
   –  Start by setting or confirming the following GUCs:
      •  enable_indexscan to on
      •  For joins via index lookup, enable_nestloop to on
   –  Then lower random_page_cost
      •  Set it to 20
      •  If the index is still not used, set it to 10
   –  If the index is still not used, increase seq_page_cost
      •  Set it to 10
      •  If the index is still not used, set it to 15
   –  If the index is still not used, set enable_seqscan to off
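The tuning ladder above, sketched as session-level SET statements (apply one rung at a time and re-check the plan with EXPLAIN; the final query is hypothetical):

```sql
SET enable_indexscan = on;
SET enable_nestloop  = on;    -- for joins via index lookup

SET random_page_cost = 20;    -- then 10 if the index is still not used
-- SET seq_page_cost = 10;    -- next rung: 10, then 15
-- SET enable_seqscan = off;  -- last resort

EXPLAIN SELECT * FROM facts WHERE id = 42;  -- re-check the chosen plan
```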

Introduction to External Tables and Loading Data

GPDB Data Loading Options

•  INSERT
   –  Common uses: operational workloads; ODBC/JDBC interfaces
   –  Example: INSERT INTO performers (name, specialty)
               VALUES ('Sinatra', 'Singer');

•  COPY
   –  Common uses: quick and easy data in; legacy Postgres applications;
      outputting sample results from SQL statements
   –  Example: COPY performers FROM '/tmp/comedians.dat' WITH DELIMITER '|';

•  External tables
   –  Common uses: high-speed bulk loads; parallel loading using the
      gpfdist protocol; local file, remote file, executable or HTTP-based
      sources
   –  Example: INSERT INTO craps_bets
               SELECT g.bet_type, g.bet_dttm, g.bt_amt
               FROM x_allbets b
               JOIN games g ON (g.id = b.game_id)
               WHERE g.name = 'CRAPS';

•  GPLOAD
   –  Common uses: simplifies the external table method (YAML wrapper);
      supports INSERT, MERGE and UPDATE
   –  Example: gpload -f blackjack_bets.yml

Example Load Architectures

(Diagram: a master host plus segment hosts. A singleton INSERT statement
and a COPY statement flow through the master instance; an INSERT via an
external table or gpload flows in parallel from gpfdist processes (two
per ETL host, each serving multiple data files) directly to the segment
instances)

External Tables

•  Access external files as if they were regular database tables

•  Used with gpfdist, provide full parallelism to load or unload data

•  Query using SQL

•  Create views over external tables

•  Readable external tables are for loading data
   –  Perform common ETL tasks

•  Writable external tables are for unloading data
   –  Select data from a database table to insert into the writable
      external table
   –  Send data to a data stream
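A sketch of a writable external table for unloading (table, column and host names illustrative):

```sql
CREATE WRITABLE EXTERNAL TABLE unload_expenses (LIKE expenses)
LOCATION ('gpfdist://etlhost:8081/expenses.out')
FORMAT 'TEXT' (DELIMITER '|')
DISTRIBUTED BY (exp_id);

-- Unload by selecting from the regular table into the external one:
INSERT INTO unload_expenses SELECT * FROM expenses;
```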

File-Based and Web-Based External Tables

•  File-based external tables
   –  Access static flat files and are rescannable
   –  Use the file:// or gpfdist:// protocols

•  Web-based external tables
   –  Access dynamic data sources and are not rescannable
   –  On a web server, using the http:// protocol
   –  Or by executing OS commands or scripts
      ▪  EXECUTE clause

File-Based External Tables

•  Specify the format of the input files
   –  FORMAT clause

•  Specify the location of the external data sources (URIs)

•  Specify the protocol to access the external data sources
   –  gpfdist
      ▪  Provides the best performance
      ▪  Segments access external files in parallel, up to the value of
         gp_external_max_segs (default 64)
   –  gpfdists
      ▪  Secure version of gpfdist
   –  file://
      ▪  Segments access external files in parallel based on the number
         of URIs

Web-Based External Tables

•  Command-based
   –  The output of a shell command or script defines the web table data
   –  EXECUTE clause

•  URL-based
   –  Accesses data on a web server using the HTTP protocol
   –  Web data files must reside on a web server that the segment hosts
      can access

Load Using Regular External Tables

•  File-based (flat files)
   –  gpfdist provides the best performance

=# CREATE EXTERNAL TABLE ext_expenses
       (name text, date date, amount float4, category text,
        description text)
   LOCATION ('gpfdist://etlhost:8081/*.txt',
             'gpfdist://etlhost:8082/*.txt')
   FORMAT 'TEXT' (DELIMITER '|');

$ gpfdist -d /var/load_files1/expenses -p 8081 -l /home/gpadmin/log1 &
$ gpfdist -d /var/load_files2/expenses -p 8082 -l /home/gpadmin/log2 &

Create the External Table

-- etl.x_bets_dy
-- DDL to drop and create an external table definition
DROP EXTERNAL TABLE IF EXISTS etl.x_bets_dy CASCADE;
CREATE EXTERNAL TABLE etl.x_bets_dy
( bet_ts    TIMESTAMP
, player_id INT
, game_id   SMALLINT
, bet_amt   SMALLINT
)
LOCATION
( 'gpfdist://sdw1-10:8081/xbet_dy_091113_*.dat'
, 'gpfdist://sdw2-10:8081/xbet_dy_091113_*.dat'
, 'gpfdist://sdw3-10:8081/xbet_dy_091113_*.dat'
, 'gpfdist://sdw4-10:8081/xbet_dy_091113_*.dat'
, 'gpfdist://sdw1-10:8082/xbet_dy_091113_*.dat'
, 'gpfdist://sdw2-10:8082/xbet_dy_091113_*.dat'
, 'gpfdist://sdw3-10:8082/xbet_dy_091113_*.dat'
, 'gpfdist://sdw4-10:8082/xbet_dy_091113_*.dat'
, 'gpfdist://sdw1-10:8083/xbet_dy_091113_*.dat'
, 'gpfdist://sdw2-10:8083/xbet_dy_091113_*.dat'
, 'gpfdist://sdw3-10:8083/xbet_dy_091113_*.dat'
, 'gpfdist://sdw4-10:8083/xbet_dy_091113_*.dat'
, 'gpfdist://sdw1-10:8084/xbet_dy_091113_*.dat'
, 'gpfdist://sdw2-10:8084/xbet_dy_091113_*.dat'
, 'gpfdist://sdw3-10:8084/xbet_dy_091113_*.dat'
, 'gpfdist://sdw4-10:8084/xbet_dy_091113_*.dat'
)
FORMAT 'TEXT' ( DELIMITER '|' )
LOG ERRORS INTO etl.err_bets_dy SEGMENT REJECT LIMIT 10;

Load Using the External Table

-- BulkLoad.sql
-- SQL to load the daily, weekly and monthly bets
-- Will DROP and CTAS each time
SET gp_external_max_segs = 18;

-- Truncate the error log table (not required)
TRUNCATE TABLE etl.err_bets_dy;

-- The daily bets
DROP TABLE IF EXISTS stage.bets_dy CASCADE;
CREATE TABLE stage.bets_dy WITH ( appendonly=true ) AS
SELECT * FROM etl.x_bets_dy
DISTRIBUTED BY (player_id);
SELECT count(*) FROM stage.bets_dy;

-- The monthly bets
DROP TABLE IF EXISTS stage.bets_mn CASCADE;
CREATE TABLE stage.bets_mn WITH ( appendonly=true ) AS
SELECT * FROM etl.x_bets_mn
DISTRIBUTED BY (player_id);
SELECT count(*) FROM stage.bets_mn;

Load Using External Web Tables

•  Shell command or script based:
   =# CREATE EXTERNAL WEB TABLE log_output (linenum int, message text)
      EXECUTE '/var/load_scripts/get_log_data.sh' ON HOST
      FORMAT 'TEXT' (DELIMITER '|');

•  URL based:
   =# CREATE EXTERNAL WEB TABLE ext_expenses
         (name text, date date, amount float4, category text,
          description text)
      LOCATION ('http://intranet.company.com/expenses/sales/file.csv')
      FORMAT 'CSV' (HEADER);

Create the External Web Table

-- x_cpu_stat:
--
-- An example external web executable table that reads the current CPU
-- utilization stats via the CPUSTAT script.
--
DROP EXTERNAL WEB TABLE IF EXISTS admin.x_cpu_stat CASCADE;
CREATE EXTERNAL WEB TABLE admin.x_cpu_stat
( seg_host   VARCHAR
, user_cpu   NUMERIC
, system_cpu NUMERIC
, wait_cpu   NUMERIC
, idle_cpu   NUMERIC
)
EXECUTE '/home/load/VEGAS/scripts/CPUSTAT' ON ALL
FORMAT 'TEXT' ( DELIMITER '|' );

Optimizing gpfdist for Performance

•  In general, maximize parallelism as the number of segments increases

•  Spread the data evenly across as many nodes as possible

•  Spread the data evenly across as many file systems as possible
   –  Run two gpfdist processes per file system

•  Run gpfdist on as many network interfaces (NICs) as possible

•  Keep the work even across ALL of these resources
   –  In an MPP shared-nothing environment, loading is only as fast as
      the slowest node

gp_external_max_segs Optimization

•  Controls the maximum number of segments each gpfdist serves

•  Keep gp_external_max_segs and the number of gpfdist processes an even
   factor
   –  gp_external_max_segs / (# of gpfdist processes) should have a
      remainder of 0

•  Default is 64

COPY

•  Quick and easy

•  Recommended for small loads
   –  Not recommended for bulk loads

•  Loads from a file or standard input

•  Not parallel: uses a single process on the master
   –  Performance can be improved by running multiple COPY commands
      concurrently
   –  The data must be divided across all concurrent processes

•  The source file must be accessible by the master

GPLOAD

•  Interface to readable external tables
   –  Invokes gpfdist for parallel loading

•  Creates an external table based on the source data defined

•  Uses a load specification defined in a YAML-formatted control file
   –  INPUT
      ▪  Hosts, ports, file structure
   –  OUTPUT
      ▪  Target table
      ▪  MODES: INSERT, UPDATE, MERGE
      ▪  BEFORE and AFTER SQL statements

GPLOAD YAML Control File Example

# GPDB connection information
VERSION: 1.0.0.1
DATABASE: vegas
USER: load
HOST: mdw1-10
PORT: 5434
GPLOAD:
  INPUT:
    # gpfdist configuration information
    - SOURCE:
        LOCAL_HOSTNAME:
          - mdw1-10
        PORT: 8081
        FILE:
          - /data/GPD-LOAD/EXT/bet_091101_1*
          - /data/GPD-LOAD/EXT/bet_091101_2*
          - /data/GPD-LOAD/EXT/bet_091101_3*
          - /data/GPD-LOAD/EXT/bet_091101_4*
    # external table options
    - COLUMNS:
        - bet_ts: TIMESTAMP
        - player_id: INTEGER
        - game_id: SMALLINT
        - bet_amt: SMALLINT
    # data format
    - FORMAT: text
    - DELIMITER: '|'
  OUTPUT:
    - TABLE: stage.bets_wk
    - MODE: INSERT
  SQL:
    - BEFORE: "TRUNCATE stage.bets_wk;
               CREATE TEMPORARY TABLE foo AS
               SELECT current_timestamp AS start_ts;"
    - AFTER: "INSERT INTO admin.etl_log VALUES (
               'weekly bets load',
               (SELECT start_ts FROM foo),
               current_timestamp,
               (SELECT count(*) FROM stage.bets_wk))"

Error Handling

•  Single row error handling
   –  Supported in COPY, GPLOAD and external tables
   –  Define a table to catch the ‘unloadable’ rows
   –  The load continues and does not fail until the reject limit is reached
•  Use LOG ERRORS INTO to declare an error table to write error rows to
•  Capping the number of rejects
   –  Once the limit is met, the load statement fails
   –  The limit can be an actual number (count) or a percent of total rows (1-100)
   –  Rejects are evaluated at the segment level

. . . LOG ERRORS INTO err_expenses SEGMENT REJECT LIMIT 100 ROWS;
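Putting the pieces together, a full external table definition with single-row error handling might look like the following sketch. The column, host, and file names are illustrative only, not taken from the course labs:

```sql
-- Readable external table with single-row error handling.
-- Rows that fail to parse are written to err_expenses; the load
-- aborts only if more than 100 rows are rejected on any one segment.
CREATE EXTERNAL TABLE ext_expenses
    ( exp_id int, exp_dt date, amount numeric(9,2) )
LOCATION ('gpfdist://etlhost-1:8081/expenses*.dat')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS INTO err_expenses
SEGMENT REJECT LIMIT 100 ROWS;
```

Querying err_expenses after the load shows which raw lines were rejected and why.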


Load Optimization and Maintenance

•  Drop indexes before loading into existing tables
•  Create indexes after loading
•  Run ANALYZE after loading
   –  If the load significantly alters the table data, run VACUUM ANALYZE
   –  Disable automatic statistics collection during loading by setting the GUC gp_autostats_mode to none
•  Run VACUUM after load errors
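A minimal load sequence following these guidelines might look like this sketch; the table and index names are hypothetical:

```sql
SET gp_autostats_mode = none;             -- disable auto stats during the load
DROP INDEX IF EXISTS idx_bets_player;     -- drop indexes before loading
INSERT INTO stage.bets_wk SELECT * FROM ext_bets;
CREATE INDEX idx_bets_player ON stage.bets_wk (player_id);  -- recreate after
ANALYZE stage.bets_wk;                    -- refresh statistics
```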


Loading Summary: always after loading

1.  VACUUM
2.  ANALYZE
3.  Check for skew:

SELECT 'facts' AS "Table Name",
       max(c) AS "Max Seg Rows",
       min(c) AS "Min Seg Rows",
       (max(c) - min(c)) * 100.0 / max(c) AS "Percentage Difference Between Max & Min"
FROM (SELECT count(*) c, gp_segment_id FROM facts GROUP BY 2) AS a;


Lab


Lab Load

•  Read the README file
•  Review the load scripts
•  Load using an external table
•  Load using COPY
•  Load using GPLOAD


Unloading Data


GPDB Data Unloading Options

COPY TO
•  Quick and easy data out
•  Output sample results from SQL statements
   Example: COPY ( SELECT * FROM performers) TO '/tmp/performers.dat';

Writable External Tables
•  High-speed bulk unloads
•  File based or web based
•  Parallel unloading using the gpfdist protocol
•  Local files, named pipes, applications
   Example: INSERT INTO craps_bets
            SELECT g.bet_type, g.bet_dttm, g.bt_amt
            FROM x_allbets b JOIN games g ON ( g.id = b.game_id )
            WHERE g.name = 'CRAPS';

gp_dump
•  Dumps a database into SQL script files
•  Restore, recreate a database
•  INSERT or COPY
   Example: gp_dump mydatabase

psql with SELECT


Unload Using COPY TO

•  Quick and easy
•  Recommended for small unloads
   –  Not recommended for bulk unloads
•  Can filter output using SELECT
•  Unload to a file or to standard output
•  Not parallel; uses a single process on the master
•  The output file must be accessible by the master
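For example, a filtered unload to a file on the master; the table, column, and path are illustrative:

```sql
-- COPY with a SELECT lets you unload a subset of rows
COPY (SELECT * FROM performers WHERE city = 'Las Vegas')
TO '/tmp/performers_lv.dat'
WITH DELIMITER '|';
```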


Unload Using a File-Based Writable External Table

•  Uses gpfdist
•  Allows only INSERT operations

=# CREATE WRITABLE EXTERNAL TABLE unload_expenses
   ( LIKE expenses )
   LOCATION ('gpfdist://etlhost-1:8081/expenses1.out',
             'gpfdist://etlhost-2:8081/expenses2.out')
   FORMAT 'TEXT' (DELIMITER ',')
   DISTRIBUTED BY (exp_id);

psql -c 'INSERT INTO unload_expenses SELECT * FROM expenses;'


Unload Using a Writable External Web Table

•  Command based
   –  Use EXECUTE to specify a shell command, script, or application

=# CREATE WRITABLE EXTERNAL WEB TABLE output (output text)
   EXECUTE 'export PATH=$PATH:/home/gpadmin/programs; myprogram.sh'
   FORMAT 'TEXT'
   DISTRIBUTED RANDOMLY;


gp_dump

•  Dumps the contents of a database into a script file in the master data directory
   –  Global objects, users, groups, permissions, etc.
•  Launches a gp_dump_agent for each segment to create a log file and data file in the segment's data directory
•  Restore/rebuild the database using gp_restore


gp_dump options:

   -a | --data-only
   -s | --schema-only
   -c | --clean
   -d | --inserts
   -D | --column-inserts
   -n schema | --schema=schema
   -t table | --table=table
   . . .


pg_dump

•  pg_dump can be used to dump only the DDL for a schema or database
•  pg_dump can also be used to create a single backup file of the database/schema on the master
   –  Sufficient space must be available
•  -n schema_name
•  -s  dump DDL only
•  Use pg_dump --help to see command options


Lab


Lab UNLOAD

•  Read the README file
•  Review the 4 unload shell scripts
•  Unload using COPY
•  Unload using an external table
•  Unload using gp_dump
•  Unload using psql


ANALYZE


ANALYZE and Database Statistics

•  Up-to-date statistics are critical for the query planner to generate optimal query plans
   –  When a table is analyzed, information about the data is stored in system catalog tables
•  Always run ANALYZE after loading data
•  Always run ANALYZE after CREATE INDEX operations
•  Run ANALYZE after INSERT, UPDATE and DELETE operations that significantly change the underlying data
•  The gp_autostats_on_change_threshold GUC can be used in conjunction with gp_autostats_mode to automatically analyze during these operations


default_statistics_target GUC

•  The system generates statistics by sampling data
•  Increases the sampling for statistics collected for ALL columns
•  Range from 1 to 1000 (default 25)
•  Increasing the target value may improve query planner estimates
•  The higher the value, the longer statistics collection will take


gp_analyze_relative_error GUC

•  Affects the sampling rate during statistics collection used to determine cardinality in a column
   –  For example, a value of .5 is equivalent to an acceptable error of 50%
•  Default .25
•  Decreasing the relative error fraction (accepting less error) tells the system to sample more rows


ANALYZE [table [ (column [, ...] ) ]]

•  For very large tables it may not be feasible to run ANALYZE on the entire table
•  ANALYZE may be performed for specific columns
•  Run ANALYZE for:
   –  Columns used in a JOIN condition
   –  Columns used in a WHERE clause
   –  Columns used in a SORT clause
   –  Columns used in a GROUP BY or HAVING clause
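For a large fact table, the join and predicate columns can be analyzed on their own; the table and column names here are hypothetical:

```sql
-- Analyze only the columns that appear in joins and filters
ANALYZE facts (player_id, bet_ts);
```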


EXPLAIN and EXPLAIN ANALYZE


EXPLAIN

•  Displays the query execution plan for a query
•  Query plans are a tree of plan nodes
   –  Each node in the plan represents a single operator, such as a table scan, join, aggregation or sort
•  A valuable tool for understanding the execution of an underperforming query and identifying the nodes or operators that consume the most resources and/or take the most time


EXPLAIN Estimated Costs

•  Cost
   –  Measured in units of disk page fetches
•  Rows
   –  The number of rows output by the plan node
•  Width
   –  Total bytes of all the rows output by the plan node


EXPLAIN and EXPLAIN ANALYZE

•  EXPLAIN displays the estimated costs for a query plan
•  EXPLAIN ANALYZE runs the statement in addition to displaying the plan
•  Run EXPLAIN ANALYZE on queries to identify opportunities to improve query performance
•  To use it on a DML statement without affecting the data, run EXPLAIN ANALYZE in a transaction
   –  BEGIN; EXPLAIN ANALYZE ...; ROLLBACK;
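For example, to profile a DELETE without losing any rows (the table and predicate are illustrative):

```sql
BEGIN;
EXPLAIN ANALYZE
    DELETE FROM stage.bets_wk WHERE bet_amt = 0;
ROLLBACK;   -- the delete is undone; only the plan and timings are kept
```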


EXPLAIN ANALYZE

•  EXPLAIN ANALYZE runs the statement in addition to displaying the plan, with additional information:
   –  Total elapsed time (in milliseconds) to run the query
   –  Number of workers (segments) involved in a plan node operation
   –  Maximum number of rows returned by the segment (and its segment ID) that produced the most rows for an operation
   –  If applicable, the memory used by the operation
   –  Time (in milliseconds) it took to retrieve the first row from the segment that produced the most rows, and the total time taken to retrieve all rows from that segment


Reading EXPLAIN ANALYZE Output

•  Query plans are a right-to-left plan tree of nodes read from the bottom up
•  Each node feeds its results to the node directly above
•  There is one line for each node in the plan tree
•  Each node represents a single operation
   –  Sequential Scan, Hash Join, Hash Aggregation, etc.
•  The plan will also include motion nodes responsible for moving rows between segment instances
   –  Redistribute Motion, Broadcast Motion, Gather Motion


EXPLAIN Plan Simple Example

mydb=# EXPLAIN SELECT gp_segment_id, count(*) FROM product GROUP BY gp_segment_id;

QUERY PLAN

----------------------------------------------------------------------------------------------

Gather Motion 2:1 (slice2; segments: 2) (cost=1.05..1.06 rows=1 width=4)

-> HashAggregate (cost=1.05..1.06 rows=1 width=4)

Group By: product.gp_segment_id

-> Redistribute Motion 2:2 (slice1; segments: 2) (cost=1.01..1.03 rows=1 width=4)

Hash Key: product.gp_segment_id

-> HashAggregate (cost=1.01..1.01 rows=1 width=4)

Group By: product.gp_segment_id

-> Seq Scan on product (cost=0.00..1.01 rows=1 width=4)

(8 rows)

Time: 181.299 ms


What to Look for in Explain Plans

•  Number of slices
   –  The number of slices in a query is an indicator of motion (broadcasts/redistributes)
      ▪  A high number of slices = a lot of data movement
•  Estimated row counts
•  Partition pruning
•  Costs
   –  Can be misleading, but should still be reviewed
•  Good things:
   –  Hash Joins
   –  Hash Aggs
•  Not-so-good things (but possibly unavoidable):
   –  Nested Loop Joins
   –  Merge Joins
   –  Sorts
•  Things that may be good or bad:
   –  Broadcasts
   –  Redistributes


Lab


Lab Queries-Rnd1

•  Read the README file
•  Review the SQL for the six queries
   –  What do they do?
•  Run the queries using the run_queries.sh script


Analyzing Query Plans


Join Types

•  Inner Join
•  Left Outer Join
•  Right Outer Join
•  Full Outer Join
•  Cross Join


Query Plan Operators

•  Fast operators
   –  Sequential Scan
   –  Hash Join
   –  Hash Agg
   –  Redistribute Motion
•  Slow operators
   –  Nested Loop Join *
   –  Merge Join
   –  Sort
   –  Broadcast Motion *

(Callouts: join iterators; aggregation and sort iterators; motion iterators.)

* Sometimes broadcast and NLJ are fast and cannot be categorically defined as "slow":
•  A broadcast of a relatively small table can be very efficient. Don't always assume that a Broadcast in the explain plan equates to slow.
•  A NLJ can be fast for certain queries (when combined with an index). NLJs usually require an index for performance.


Row Elimination

Greenplum often has to use disk space for temporary storage of data during join execution. The optimizer minimizes the amount of memory and disk space required by:

•  Projecting (copying) only those columns that the query requires
•  Doing single-table set selections first (qualifying rows)
•  Putting only the smaller table into the spool whenever possible

Note: Non-equality join operators produce a (partial) Cartesian product. Join operators should always be equality conditions.


Row Redistribution / Motion

•  During join operations, one of the major considerations of the optimizer in deciding which rows to move is the distribution key.
•  Three general scenarios may occur when two tables are to be merge joined (more on the Merge Join later):
   –  The join column is the distribution key for both tables
   –  The join column is the distribution key of one of the tables
   –  The join column is the distribution key of neither table


Row Redistribution (1 of 3)

The join column is the distribution key of both tables.

•  Joinable rows are already on the same target data segment, since equal distribution key values always hash to the same data segment.
•  No movement of data to other segments is required.
•  The rows are already sorted in hash sequence because of the way in which they are stored by the file system.
•  With no need to sort or move data, the join can take place immediately.

(Diagram: tables T1 and T2, each with columns A, B, C and distribution key A.)
JOIN columns are from the same domain. No redistribution needed.
SELECT ... FROM T1, T2 WHERE T1.A = T2.A


Row Redistribution (2 of 3)

The join column is the distribution key of one of the tables.

•  In this case, one table has its rows on the target data segments and one does not.
•  The rows of the second table must be redistributed to their target data segments by the hash code of the join column.
•  If the table is "small", the optimizer may decide to simply duplicate the entire table on all data segments instead of hash redistributing.
•  In either case, the rows of one table will be copied to their target data segments.

(Diagram: tables T3 and T4, each with columns A, B, C and distribution key A; T4 rows are spooled.)
Redistribute T4 rows on column B.
SELECT ... FROM T3, T4 WHERE T3.A = T4.B


Row Redistribution (3 of 3)

The join column is the distribution key of neither table.

•  If the join columns of neither table are distribution key columns, then both tables may require preparation for joining.
•  This may involve various combinations of hash distributions or table duplications.
•  This approach involves the maximum amount of data movement.

(Diagram: tables T5 and T6, each with columns A, B, C and distribution key A; both are spooled.)
Redistribute T5 rows on B. Redistribute T6 rows on C.
SELECT ... FROM T5, T6 WHERE T5.B = T6.C


Join Methods

Join METHODS are the means that the Greenplum database employs to join two tables:

•  Sort Merge Join
•  Hash Join
•  Nested Loop Join
•  Product Join


Sort Merge Join

Sort merge joins are commonly done when the join condition is based on equality. Sort merge join processing consists of the following steps:

•  Identify the smaller table.
•  If necessary: put the qualifying data from one or both tables into temp space.
•  If necessary: move, or "co-locate", the spool rows to the data segments based on the join column hash.
•  If necessary: sort the remaining rows by join column row hash value.
•  Compare the rows with matching join column row hash values.


Sort Merge Join - Illustration

Merge Join process:
•  Identify the smaller table.
•  If necessary:
   –  Put qualifying rows of one or both tables into temporary space or spill files.
   –  Move ("co-locate") the rows to data segments based on the join column row hash value.
   –  Sort the temporary table rows into join column hash sequence.
•  Compare the rows with matching join column row hash values.

(Diagram: rows of Table A and Table B, each tagged with a join row hash value such as A3, B7 or C4; rows with matching hash values are compared.)


Hash Join

•  A merge join requires that both sides of the join have their qualifying rows in join column hash sequence. In addition, if the join columns are not the distribution key columns, some redistribution or duplication of rows precedes the sort.
•  In a hash join, the smaller table is sorted into join column hash sequence and then duplicated on all data segments. The larger table is then processed one row at a time. For the rows that qualify for joining, the row hash value of the join columns is used to do a binary search through the smaller table for a match.
•  The optimizer can choose this join plan when the qualifying rows of the smaller table can be held memory resident on each segment - the whole row set held in each segment's memory.


Hash Join - Illustration

Hash Join process:
•  Identify the smaller table (A).
•  Duplicate it on every data segment.
•  Sort it into join column hash sequence.
•  Hold the rows in memory.
•  Use the join column hash value of the larger table to search memory for a match.

A performance benefit of this join can result from the fact that there is no need to sort the larger join table before performing the join.

(Diagram: Table A and Table B rows tagged with join row hash values; the smaller table is held in memory and probed for matches.)


Nested Loop Join

A nested loop join can be one of the most efficient types of joins. For a nested loop join to be done between Table 1 and Table 2, the following two conditions must occur:

1.  An equality value for the unique index of Table 1
2.  A join on a column between Table 1 and an indexed column on Table 2

Nested Loop Join process:
•  The system will retrieve rows from Table 1 based on the index value, then determine the hash value of the join column to access matching rows in Table 2.
•  Nested loop joins are the only type of join that does not always use all of the data segments.


Nested Loop Join - Illustration


Product Join

•  A product join is a form of nested join. In a product join, every qualifying row of one table is compared to every qualifying row in the second table. Rows that match on WHERE conditions are saved.
•  Product joins get their name from the fact that the required number of comparisons is the "product" of the number of qualifying rows from both tables. A product join between a table of 1,000 rows and a table of 100 rows would require 100,000 comparisons and a potential result set of 100,000 rows.
•  Because all rows of one side must be compared with all rows of the other, the smaller table is always duplicated on all data segments. Its rows are then compared to the data segment's local rows of the larger table.


Product Joins (continued)

Product joins may be caused by any of the following:

•  A missing WHERE clause
•  A join condition not based on equality (NOT =, LESS THAN, GREATER THAN)
•  Join conditions connected using OR
•  Too few join conditions in the WHERE clause
•  A referenced table not named in any join condition
•  The optimizer determining that it is less expensive than any other join type
•  The optimizer determining that it is more expensive than other join types, but will lead to less expensive joins later in the plan such that the overall plan is less expensive; for example, in planning joins for a star/snowflake schema


Product Join - Illustration

Product Join process:
•  Identify the smaller table.
•  Duplicate it in temporary space or memory on all data segments.
•  Join each row of the smaller table to every row of the larger table.
   –  Number of comparisons = # qualified rows in table A * # qualified rows in table B
   –  The internal comparisons become very costly when there are more rows than can be held in segment memory at one time.

(Diagram: every row of Table A is compared with every row of Table B.)


Scan Operators

•  Seq Scan on <table>
   –  Heap table
•  Append-only Scan on <table>
   –  Row-oriented AO table
•  Append-only Columnar Scan on <table>
   –  Column-oriented AO table
•  Index Scan using <index> on <table>
•  Bitmap Append-Only Row-Oriented Scan on <table>


Join Operators

•  Hash Join
   –  Load the smaller dataset into a hash table and scan the larger dataset
   –  Fastest method of joining a large dataset to a large dataset
•  Nested Loop
   –  For each row in the larger dataset, scan the smaller dataset
   –  Fastest join for small datasets or ones limited by index use
   –  Also used for Cartesian joins and range joins
•  Merge Join
   –  Sort both datasets and merge them together
   –  Fast for already-ordered data
   –  Very rare in the real world


Motion Operators

•  Broadcast
   –  Every segment sends its rows to all other segments
   –  Results in every segment having a copy of the entire table
•  Redistribute
   –  Rows needed from one table are dynamically redistributed to other segment instances to perform a join
•  Gather
   –  Segments send the resulting rows to the master host to be returned to the client


EXPLAIN and Indexes

•  Review the query plan for:
   –  Index Scan
   –  Bitmap Heap Scan
   –  Bitmap Index Scan
   –  BitmapAnd or BitmapOr


Example Query Plan


Reading a Query Plan

-  Scan categories_dim
-  Broadcast motion


Reading a Query Plan

-  Scan two partitions and filter on order_day
-  Append those two data sets together
-  Hash on order_id to prepare for the hash join


Reading a Query Plan

-  At the same time, also scan two lineitems partitions and filter on order_day

-  Append those two data sets together


Reading a Query Plan

Perform the hash join


Reading a Query Plan

-  Redistribute motion on product_id
-  Hash on product_id for the hash join


Reading a Query Plan

-  Scan products_dim
-  Hash join products to lineitems


Reading a Query Plan

-  Hash join categories to products/orders/lineitems


Reading a Query Plan

-  Hash aggregate the data on order_day
-  Does not sort the data
-  Operation occurs on all segments (local)


Reading a Query Plan

-  Redistribute data on order_day
-  Second (and final) GROUP BY hash aggregate


Components of a Query Plan

-  Gather motion
-  Funnel all data to a single stream to return to the user


Optimizing Query Plans

•  Identify plan nodes where the estimated costs are very high
   –  Do the estimated number of rows, and the cost relative to the number of rows for the operation, seem reasonable?
•  Ensure partition elimination is achieved if using partitioning
•  Review the execution join order
   –  Build on the smaller tables or hash join results and probe with the larger tables
   –  Optimally, the largest table is used for the final join or probe to reduce the number of rows being passed up the tree to the topmost plan nodes
•  Ensure statistics are up to date!


Optimizing Joins

•  enable_nestloop
   –  Validate the SQL and ensure that the results are what is intended
   –  To favor hash joins over nested loop joins, ensure enable_nestloop is set to off
•  enable_mergejoin
   –  Merge join is based on the idea of sorting the left- and right-hand tables into order and then scanning them in parallel
   –  To favor merge joins over hash joins, set to on
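A typical session-level experiment, sketched with table names borrowed from the earlier unload example; the settings revert on RESET or at session end:

```sql
SET enable_nestloop = off;   -- steer the planner toward hash joins
EXPLAIN SELECT g.name, count(*)
FROM   x_allbets b JOIN games g ON ( g.id = b.game_id )
GROUP  BY g.name;
RESET enable_nestloop;
```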


Optimizing Sort/Aggregation Operations

•  Replace large sort-and-aggregate operations with HashAggregate
•  enable_groupagg
   –  Default on
   –  Enables or disables the query planner's use of group aggregation plan types
•  To favor a HashAggregate operation over a sort-and-aggregate (group aggregation) operation, set enable_groupagg to off


Optimizing Motion

•  Eliminate large-table broadcast motions in favor of redistribute motions
•  gp_segments_for_planner (default value: 0)
   –  If 0, the value used is the actual number of primary segments
   –  If the wrong table is being motioned, before modifying gp_segments_for_planner:
      ▪  Ensure that the join keys have been defined with the same datatype (especially for join columns that are also distribution keys)
   –  To influence the optimizer to redistribute (not broadcast), set gp_segments_for_planner to a high number
      ▪  For example, SET gp_segments_for_planner=1000000;
   –  To influence the optimizer to broadcast a table (not redistribute it), set gp_segments_for_planner to a low number
      ▪  SET gp_segments_for_planner=2;
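A sketch of the experiment described above, reusing the hypothetical query from the join examples; compare the motion nodes in the two plans before settling on a value:

```sql
SET gp_segments_for_planner = 1000000;   -- discourage broadcasting
EXPLAIN SELECT g.name, count(*)
FROM   x_allbets b JOIN games g ON ( g.id = b.game_id )
GROUP  BY g.name;
RESET gp_segments_for_planner;           -- return to the actual segment count
```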


Optimizing Query Memory and Spill Files

•  statement_mem
–  Default value: 125MB
–  Replaces work_mem when gp_resqueue_memory_policy=auto
–  Allocates segment host memory per query
–  Increasing it can improve the performance of a query

•  gp_workfile_compress_algorithm
–  Default value: none
–  When a hash aggregation or hash join operation spills to disk, specifies the compression algorithm to use on the spill files
–  If using zlib, it must be in your ${PATH} on all segments for user gpadmin
–  Compressing spill files is recommended if the system is I/O bound
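A session-level sketch of these two settings (the values are illustrative, not tuned recommendations):

```sql
-- Allocate more per-query segment memory for a large hash join
SET statement_mem = '250MB';

-- Compress spill files on an I/O-bound system
SET gp_workfile_compress_algorithm = zlib;
```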


Lab

180 Capgemin / Pivotal Alliance Confidential–Do Not Distribute July 2014 New Hire Immersion Training © Copyright 2014 Pivotal. All rights reserved.

Lab: EXPLAIN

�  Read the README file

�  Run EXPLAIN ANALYZE for all six queries

�  Review and analyze the query plans
–  Understanding query plans is the most important skill when determining query performance and identifying opportunities for improving or optimizing query performance


VACUUM


VACUUM

� VACUUM reclaims physical space on disk from deleted or updated rows and aborted load/insert operations

� VACUUM collects table-level statistics such as the number of rows and pages

� Run VACUUM after
–  Large DELETE operations
–  Large UPDATE operations
–  Failed load operations


Free Space Map

� Expired rows are tracked in the free space map

� The free space map must be large enough to hold all expired rows

� VACUUM cannot reclaim space occupied by expired rows that overflow the free space map
–  VACUUM FULL reclaims all expired row space
▪  It is an expensive operation
▪  It can take an exceptionally long time to finish

� max_fsm_pages

� max_fsm_relations


VACUUM ANALYZE

•  Use VACUUM ANALYZE to vacuum the database and generate database statistics
–  May specify table and column names: VACUUM ANALYZE [table [(column [, ...] )]]
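A minimal sketch of the syntax above, using a hypothetical sales table:

```sql
-- Vacuum one table and refresh statistics on two of its columns
VACUUM ANALYZE sales (region, amount);

-- Vacuum and analyze the entire database
VACUUM ANALYZE;
```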


Lab


Lab: Tuning

�  Tune the six queries to achieve optimal query run times and rerun the queries
–  Distributions
–  Partitioning
–  Indexes
–  GUCs
–  Etc…

�  Participants post their best query times for all six queries on the board
–  Class discussion on how the fastest query times were achieved


Introduction to Monitoring


gpstate and gpcheckperf

� gpstate
–  Identify failed segments
–  Configuration information for master and segment hosts

� gpcheckperf
–  Assesses performance of a cluster
▪  Use to profile a system over time
–  Identify segment hosts with hardware issues
▪  Disk I/O test
▪  Memory bandwidth test
▪  Network performance test


pg_log directory

� Log messages are written to the pg_log directory in the master and segment data directories
–  $ cd $MASTER_DATA_DIRECTORY/pg_log

� Check the master log file first
–  Search for WARNING, ERROR, FATAL, PANIC log level messages

� Log files roll over daily

� Use gplogfilter to search log files on segment hosts
–  Uses gpssh
–  $ gplogfilter -t


gp_toolkit

� Use to query system catalogs and log files

� Contains views that can be accessed with SQL

� gp_bloat_diag view
–  Tables that have bloat
–  Require a VACUUM or VACUUM FULL

� gp_stats_missing view
–  Tables that do not have statistics
–  Require ANALYZE
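A sketch of querying the two views named above from psql:

```sql
-- List bloated tables that need a VACUUM or VACUUM FULL
SELECT * FROM gp_toolkit.gp_bloat_diag;

-- List tables missing statistics that need an ANALYZE
SELECT * FROM gp_toolkit.gp_stats_missing;
```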


Pivotal Command Center 1.2

�  Interactive graphical web application
–  Monitor system performance metrics
–  Monitor system health
–  Perform management tasks
▪  Start, stop, recover segments, etc.

� $ gpcmdr --start [instance_name]

� $ gpcmdr --stop [instance_name]

� Configuration parameters for Command Center agents
–  $MASTER_DATA_DIRECTORY/gpperfmon/conf/gpperfmon.conf


Command Center Database (gpperfmon)

� Data collection agents run on the master and every segment host
–  Data and metrics are stored in a dedicated Command Center database

� Connect to the database using psql, JDBC, or ODBC
–  $ psql gpperfmon

� Three types of tables
–  Now, History, and Tail



Questions and Support (GPDB Internal, Pivotal Use Only)


GPDB Support for Field Engineers

� Advanced Field Engineering Web Site
–  https://sites.google.com/a/gopivotal.com/advanced-field-engineering/home

� Pivotal Socialcast
–  https://gopivotal-com.socialcast.com/login
–  Register for groups
▪  GPDB Champions
▪  Field Engineering
▪  Etc…
–  Post questions, get notifications


Appendix


UDFs


User Defined Functions

� Three types of user defined functions (UDFs) are supported
–  Query language functions written in SQL
–  Procedural functions written in procedural languages like PL/pgSQL, PL/Perl
–  C language functions

� The CREATE FUNCTION statement defines a function in the database

� The CREATE LANGUAGE statement registers a new procedural language in the database


SQL Functions

� SQL functions execute an arbitrary list of SQL statements, returning the result of the last query in the list

CREATE FUNCTION double_salary(emp) RETURNS numeric AS $$
    SELECT $1.salary * 2 AS salary;
$$ LANGUAGE SQL;
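A usage sketch for the function above, assuming an emp table with name and salary columns (the column list is hypothetical):

```sql
-- Call the function with a whole-row argument
SELECT name, double_salary(emp) AS proposed_salary
FROM emp;
```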


Procedural Functions

� PDB supports various procedural languages (PL)
–  PL/pgSQL is the most common (aka the stored procedure language)

� Adds control structures to the SQL language

� Can perform complex computations

�  Inherits all user-defined types, functions, and operators

� See the online Postgres documentation for details on procedural language use and syntax
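A minimal PL/pgSQL sketch of the control structures mentioned above (the function name and thresholds are hypothetical), assuming the plpgsql language is already registered:

```sql
CREATE FUNCTION tax_bracket(salary numeric) RETURNS text AS $$
BEGIN
    -- simple control structure: IF/ELSIF/ELSE over the input value
    IF salary > 100000 THEN
        RETURN 'high';
    ELSIF salary > 50000 THEN
        RETURN 'mid';
    ELSE
        RETURN 'low';
    END IF;
END;
$$ LANGUAGE plpgsql;
```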


C Language Functions

�  User-defined functions can be written in C (or a language that can be made compatible with C, such as C++)

�  C UDFs are compiled into dynamically loadable objects (also called shared libraries) and are loaded by the server on demand

�  The loadable object must be deployed on all nodes in the PDB cluster

CREATE FUNCTION concat_text(text, text) RETURNS text
    AS 'DIRECTORY/funcs', 'concat_text'
    LANGUAGE C STRICT;


Function Volatility

� Every function has a volatility classification
–  VOLATILE, STABLE, or IMMUTABLE
–  VOLATILE is the default if the CREATE FUNCTION command does not specify a category

� The volatility category is a promise to the optimizer about the behavior of the function
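A sketch of declaring a volatility category (the function itself is hypothetical):

```sql
-- IMMUTABLE promises that the same arguments always give the same
-- result, so the planner may pre-evaluate the call, e.g. in a filter
CREATE FUNCTION add_cents(amount numeric, cents int) RETURNS numeric AS $$
    SELECT $1 + $2 / 100.0;
$$ LANGUAGE SQL IMMUTABLE;
```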


GPDB Locks


Lock modes, the SQL commands that acquire them, and the modes they conflict with:

� ACCESS SHARE
–  Commands: SELECT
–  Conflicts with: ACCESS EXCLUSIVE

� ROW SHARE
–  Commands: SELECT FOR UPDATE, SELECT FOR SHARE
–  Conflicts with: EXCLUSIVE, ACCESS EXCLUSIVE

� ROW EXCLUSIVE
–  Commands: INSERT, COPY
–  Conflicts with: SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE

� SHARE UPDATE EXCLUSIVE
–  Commands: VACUUM (without FULL), ANALYZE
–  Conflicts with: SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE

� SHARE
–  Commands: CREATE INDEX
–  Conflicts with: ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE

� SHARE ROW EXCLUSIVE
–  Commands: not acquired automatically by any command
–  Conflicts with: ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE

� EXCLUSIVE
–  Commands: DELETE, UPDATE (1)
–  Conflicts with: ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE

� ACCESS EXCLUSIVE
–  Commands: ALTER TABLE, DROP TABLE, TRUNCATE, REINDEX, CLUSTER, VACUUM FULL
–  Conflicts with: ACCESS SHARE, ROW SHARE, ROW EXCLUSIVE, SHARE UPDATE EXCLUSIVE, SHARE, SHARE ROW EXCLUSIVE, EXCLUSIVE, ACCESS EXCLUSIVE


Lock Considerations

•  UPDATE and DELETE statements in GPDB acquire a more restrictive lock than in some other databases
•  EXCLUSIVE rather than ROW EXCLUSIVE

•  During an insert, selects on heap tables take longer because the visible data must be reconstructed from rollback information

•  During a long-running select, a truncate is blocked until the select finishes

•  Table partition operations are ALTER TABLE operations and require an ACCESS EXCLUSIVE lock

•  VACUUM FULL requires an ACCESS EXCLUSIVE lock

•  Check with the pg_locks and pg_stat_activity system views
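A hedged sketch of checking for blocked sessions with those two views (the join uses the procpid and current_query columns of these GPDB releases; the column list is trimmed for readability):

```sql
-- Show sessions waiting on locks, joined to their query text
SELECT l.locktype, l.mode, l.granted, a.usename, a.current_query
FROM   pg_locks l
JOIN   pg_stat_activity a ON a.procpid = l.pid
WHERE  NOT l.granted;      -- sessions currently blocked
```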


Other Performance Considerations


GPDB Optimization and Performance Tuning

� Review the optimization, tuning, and best practices provided in the applicable topic modules within this immersion
–  Distributions
–  Partitioning
–  Storage Orientation
–  Compression
–  Indexes
–  Loading
–  ANALYZE
–  Query Plans
–  VACUUM


Data Types and Byte Alignment

� Lay out columns in heap tables as follows
–  8-byte types first (bigint, timestamp)
–  4-byte types next (int, date)
–  2-byte types last (smallint)

� Put distribution and partition columns up front
–  Two 4-byte columns align like an 8-byte column

�  For example:
Int, Bigint, Timestamp, Bigint, Timestamp, Int (distribution key), Date (partition key), Bigint, Smallint
--> Int (distribution key), Date (partition key), Bigint, Bigint, Timestamp, Bigint, Timestamp, Int, Smallint
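A hypothetical DDL sketch of the reordered layout (table and column names are illustrative):

```sql
-- Columns ordered: distribution key and partition key up front (the two
-- 4-byte columns pair into an 8-byte slot), then 8-byte, 4-byte, 2-byte types
CREATE TABLE fact_events (
    cust_id    int,        -- distribution key (4-byte)
    event_date date,       -- partition key (4-byte)
    amount     bigint,
    qty        bigint,
    created_ts timestamp,
    batch_id   bigint,
    updated_ts timestamp,
    store_id   int,
    flag       smallint
)
DISTRIBUTED BY (cust_id);
```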


Use OLAP Grouping Extensions

� Pivotal DB supports a number of built-in aggregate functions

� All are extensions to the standard GROUP BY clause
–  GROUP BY ROLLUP ( col1, col2, col3 )
–  GROUP BY CUBE ( col1, col2, col3 )
–  GROUP BY GROUPING SETS( (c1, c2), (c1, c3) )
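A sketch of ROLLUP against a hypothetical sales table:

```sql
-- ROLLUP produces subtotals at each level plus a grand total:
-- grouping by (region, city), then (region), then ()
SELECT region, city, sum(amount) AS total
FROM   sales
GROUP BY ROLLUP (region, city)
ORDER BY region, city;
```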


Window Functions

�  Pivotal DB supports window functions

�  These are used to apply an aggregation over partitions of the result set
–  For example, sum( population ) over ( partition by city )

�  They can also be used to produce row numbers, which can be useful
–  For example, row_number() over ( order by id )

�  These are powerful and perform all of the work ‘in database’, providing speed advantages over front-end tools
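A sketch combining the two inline examples above into one query (the districts table is hypothetical):

```sql
-- City-wide total alongside each row, plus an overall row number
SELECT city, district, population,
       sum(population) OVER (PARTITION BY city)        AS city_population,
       row_number()    OVER (ORDER BY population DESC) AS overall_rank
FROM   districts;
```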


Set Based versus Row Based

� PL/SQL and other procedural languages utilize cursors to operate on one record at a time
–  Typically, these programs are written by programmers, not database experts
–  Looping over a set of records returned by a query results in an additional query plan per record

� Pivotal DB performs better when dealing with operations over a set of records
–  One query plan for the whole set of records versus one main query plan plus one per row


PL/SQL Conversion

� Things to note when migrating from PL/SQL to PL/pgSQL
–  Pivotal HD doesn’t support procedures – only functions
–  Pivotal HD doesn’t support packages
–  Pivotal HD supports cursors, but they should not be used

� POCs that require converting a large amount of PL/SQL should be tightly scoped to limit our exposure and our resource commitment


PL/SQL Example

DECLARE
    CURSOR product_cur IS
        SELECT * FROM product FOR UPDATE OF product_price;
BEGIN
    FOR product_rec IN product_cur LOOP
        UPDATE product
        SET product_price = (product_rec.product_price * 0.97)
        WHERE CURRENT OF product_cur;
    END LOOP;
END;

Pivotal HD

UPDATE product SET product_price = product_price * 0.97;


Query Profiling and Tuning

• Statistics are very important (ANALYZE)

• Know your data
•  Selectivity of columns
•  Number of rows
•  Size of the table or partition (vacuumed or not)

• Are DB design elements helping performance?
•  Data type selection
•  Table partitioning
•  Indexes

• Top query tuning configuration parameters
–  enable_* (temporarily disable a plan operation type)
–  default_statistics_target, gp_analyze_relative_error (stats collected by ANALYZE)


Computational Skew Example


Causes of Computational Skew

� Operations on columns that have low cardinality and non-uniform distribution
–  For example, if a row_number function is used on the state column of a customer table, more data flows to the segment that receives the rows for ‘CA’, resulting in computational skew

�  The optimizer picks one-stage aggregation instead of two-stage aggregation for columns with low cardinality

�  If lower stages of the plan have computational skew and no further redistribute/broadcast of the data is needed
–  Upper operators inherit the skew from the lower-level operators in the plan


How to Detect Using Command Center

� Performance Monitor
–  Visible through the ‘System Metrics’ → ‘Realtime (By Server)’ tab of Performance Monitor
▪  Start the query
▪  If the query runs for a long time on one node compared to the other nodes, there is computational skew
–  Also through the ‘Query Monitor’ → ‘Query Plan’ tab of Performance Monitor
▪  Search for the query in question through the ‘Query Monitor’ tab
▪  Open the ‘Query Plan’ tab
▪  Each operator has a ‘CPU Skew’ attribute that can help in detecting the computational skew


How to Detect in a Query Plan

� Each operator outputs the max rows processed by any one segment and the average rows processed across all segments for that operator

�  If max rows is greater than average rows processed, one of the segments has done more work than the others and there may be skew at that operator level

� For complex query plans it is easier to egrep on ‘Max|first row|Avg’
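A sketch of that egrep filter against a saved plan file; plan.out and its contents are hypothetical sample data mimicking GPDB "Rows out" lines:

```shell
# Save the EXPLAIN ANALYZE output to a file, then filter it down to the
# per-operator row counts; a Max far above Avg at an operator indicates skew.
cat > plan.out <<'EOF'
-> Sort (cost=9282847395.67..9284369187.77 rows=6340801 width=618)
     Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 28963736 ms to first row
EOF
egrep 'Max|first row|Avg' plan.out
```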


Example

-> Sort (cost=9282847395.67..9284369187.77 rows=6340801 width=618)
     Sort Key: coplan.mdu, coplan.corp_franchise_tax_area, coplan.disconnect_code, coplan.tms_program_id
     Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 28963736 ms to first row, 30538370 ms to end, start offset by 5825 ms.
     Executor memory: 38244K bytes avg, 43450K bytes max (seg68). Work_mem used: 38244K bytes avg, 43450K bytes max (seg68). Work_mem wanted: 52319K bytes avg, 57648K bytes max (seg34) to lessen workfile I/O affecting 96 workers.
   -> Subquery Scan coplan (cost=8563442302.60..8574094847.32 rows=6340801 width=618)
        Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 23688003 ms to first row, 24541413 ms to end, start offset by 5826 ms.
      -> Window (cost=8563442302.60..8568007678.91 rows=6340801 width=618)
           Partition By: coplan.mdu
           Order By: coplan.quarter_hour_of_day_offset
           Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 23688003 ms to first row, 24478925 ms to end, start offset by 5826 ms.
         -> Sort (cost=8563442302.60..8564964094.71 rows=6340801 width=618)
              Sort Key: coplan.mdu, coplan.quarter_hour_of_day_offset
              Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 23688003 ms to first row, 24095876 ms to end, start offset by 5826 ms.
              Executor memory: 38500K bytes avg, 43450K bytes max (seg34). Work_mem used: 38500K bytes avg, 43450K bytes max (seg34). Work_mem wanted: 53469K bytes avg, 58538K bytes max (seg95) to lessen workfile I/O affecting 96 workers.
            -> Subquery Scan coplan (cost=7844037209.54..7854689754.25 rows=6340801 width=618)
                 Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 19434546 ms to first row, 21846399 ms to end, start offset by 5826 ms.
               -> Window (cost=7844037209.54..7848602585.84 rows=6340801 width=618)
                    Partition By: coplan.mdu
                    Order By: coplan.tms_program_id
                    Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 19434546 ms to first row, 21811833 ms to end, start offset by 5826 ms.
                  -> Sort (cost=7844037209.54..7845559001.64 rows=6340801 width=618)
                       Sort Key: coplan.mdu, coplan.tms_program_id
                       Rows out: Avg 2299741.1 rows x 96 workers. Max 115592480 rows (seg17) with 19434546 ms to first row, 21502044 ms to end, start offset by 5826 ms.


Solutions and Workarounds

�  Rewriting the query, or changing the plan by using optimizer GUCs, is the typical workaround

�  Use of a forced broadcast of a dimension table can help in eliminating certain join skews
–  Forced broadcast happens only if GUCs like gp_segments_for_planner are used

�  Rewrite the query by creating temp tables to eliminate the skew
–  A temp table can be randomly distributed so that two-stage aggregation is forced

�  It is difficult to work around issues related to OLAP queries


Original Query Workaround Example

\set P2 20101213000000000
\set P3 20101214000000000

SELECT fivemin,
       sum(case when trade_price between bid_price and ask_price then 1 else 0 end) as within_nbbo,
       sum(case when trade_price < bid_price or trade_price > ask_price then 1 else 0 end) as outside_nbbo,
       sum(case when trade_price < bid_price or trade_price > ask_price
                then ((greatest(bid_price-trade_price, trade_price-ask_price) * trade_volume) / 10000)::numeric
                else 0 end) as outside_nbbo_$
FROM (
    SELECT tt.symbol,
           (tt.event_ts / 100000) / 5 * 5 as fivemin,
           tt.trade_price,
           tt.trade_volume,
           tq.bid_price,
           tq.ask_price
    FROM (
        SELECT t.symbol, t.event_ts, t.trade_price, t.trade_volume
        FROM taqtrades t, portfolios p
        WHERE t.event_ts between :P2 and :P3
          AND p.symbol = t.symbol
          AND p.portfolio_id = 'MD000499'
    ) tt,
    (
        SELECT ets, sym, bid_price, ask_price,
               lead(ets, 1) over (partition by sym order by ets) as end_ts
        FROM (
            SELECT tq.event_ts as ets,
                   tq.symbol as sym,
                   (max_event_agg( tq )).*
            FROM taqquotes tq, portfolios p
            WHERE tq.event_ts between :P2 and :P3
              AND p.symbol = tq.symbol
              AND p.portfolio_id = 'MD000499'
              AND tq.national_bbo_indicator in ('1', '4', '6')
            GROUP BY ets, sym
        ) it
    ) tq
    WHERE tq.sym = tt.symbol
      AND tt.event_ts >= tq.ets and tt.event_ts < tq.end_ts
) foo
GROUP BY 1
ORDER BY 1 asc


Workaround Example: Part of the Plan With Skew

Gather Motion 30:1 (slice7; segments: 30) (cost=91300101.88..91300155.93 rows=145 width=8)
  -> HashAggregate (cost=91300101.88..91300155.93 rows=145 width=8)
       Group By: "?column1?"
     -> Redistribute Motion 30:30 (slice6; segments: 30) (cost=91299918.11..91300037.02 rows=145 width=8)
          Hash Key: unnamed_attr_1
        -> HashAggregate (cost=91299918.11..91299950.54 rows=145 width=8)
             Group By: t.event_ts / 100000 / 5 * 5
           -> Hash Join (cost=13299375.51..87036744.22 rows=28421160 width=8)
                Hash Cond: t.symbol::text = tq.sym::text
                Join Filter: t.event_ts >= tq.ets AND t.event_ts < tq.end_ts
              -> Redistribute Motion 30:30 (slice2; segments: 30) (cost=15748.71..1270184.88 rows=119685 width=16)
                   Hash Key: p.symbol
                 -> Hash Join (cost=15748.71..1198374.41 rows=119685 width=16)
                      Hash Cond: t.symbol::text = p.symbol
                    -> Append (cost=0.00..814860.08 rows=834282 width=13)
                       -> Append-only Columnar Scan on taqtrades_1_prt_other t (cost=0.00..1.00 rows=1 width=40)
                            Filter: event_ts >= 20101213000000000::bigint AND event_ts <= 20101214000000000::bigint
                       -> Append-only Columnar Scan on taqtrades_1_prt_p20101213 t (cost=0.00..407169.66 rows=834282 width=12)
                            Filter: event_ts >= 20101213000000000::bigint AND event_ts <= 20101214000000000::bigint
                       -> Append-only Columnar Scan on taqtrades_1_prt_p20101214 t (cost=0.00..407689.42 rows=1 width=12)
                            Filter: event_ts >= 20101213000000000::bigint AND event_ts <= 20101214000000000::bigint
                    -> Hash (cost=15546.05..15546.05 rows=541 width=4)
                       -> Broadcast Motion 30:30 (slice1; segments: 30) (cost=0.00..15546.05 rows=541 width=4)
                          -> Seq Scan on portfolios p (cost=0.00..15378.53 rows=19 width=4)
                               Filter: portfolio_id = 'MD000499'::text
              -> Hash (cost=13183225.16..13183225.16 rows=267738 width=48)
                 -> Subquery Scan tq (cost=13042662.87..13183225.16 rows=267738 width=48)
                    -> Window (cost=13042662.87..13102903.85 rows=267738 width=56)
                         Partition By: it.sym
                         Order By: it.ets
                       -> Sort (cost=13042662.87..13062743.19 rows=267738 width=56)
                            Sort Key: it.sym, it.ets
                          -> Redistribute Motion 30:30 (slice5; segments: 30) (cost=11200259.57..12121483.81 rows=267738 width=56)
                               Hash Key: it.sym


Rewritten Query Example Workaround

SET gp_segments_for_planner=1; -- results in forced broadcast that helps

CREATE TEMPORARY TABLE my_tq_agg as
SELECT tq.event_ts ets, tq.symbol as sym, bid_price, ask_price,
       lead( tq.event_ts ) over ( partition by tq.symbol order by tq.event_ts ) as end_ts
FROM taqquotes tq, portfolios p
WHERE tq.event_ts between :P2 and :P3
  AND p.symbol = tq.symbol
  AND p.portfolio_id = :P1
  AND tq.national_bbo_indicator in ('1', '4', '6')
DISTRIBUTED RANDOMLY;

CREATE TEMPORARY TABLE my_tt_agg as
SELECT t.symbol, t.event_ts, t.trade_price, t.trade_volume
FROM taqtrades t, portfolios p
WHERE t.event_ts between :P2 and :P3
  AND p.symbol = t.symbol
  AND p.portfolio_id = :P1
DISTRIBUTED RANDOMLY;

ANALYZE my_tq_agg;
ANALYZE my_tt_agg;

SELECT (tt.event_ts / 100000) / 5 * 5 as fivemin,
       sum(CASE WHEN trade_price between bid_price and ask_price then 1 else 0 end) as within_nbbo,
       sum(case when trade_price < bid_price or trade_price > ask_price then 1 else 0 end) as outside_nbbo,
       sum(case when trade_price < bid_price or trade_price > ask_price
                then ((greatest(bid_price-trade_price, trade_price-ask_price) * trade_volume) / 10000.0)::numeric
                else 0 end) as outside_nbbo_$
FROM my_tt_agg tt, my_tq_agg tq
WHERE tq.sym = tt.symbol
  AND tt.event_ts >= tq.ets
  AND tt.event_ts < tq.end_ts
GROUP BY 1
ORDER BY 1 asc


Query Plan After Workaround

Gather Motion 72:1 (slice3; segments: 72) (cost=170589014.20..170589783.80 rows=4276 width=40)
  Merge Key: partial_aggregation.unnamed_attr_1
  -> Sort (cost=170589014.20..170589783.80 rows=4276 width=40)
       Sort Key: partial_aggregation.unnamed_attr_1
       Executor memory: 33K bytes avg, 33K bytes max (seg0). Work_mem used: 33K bytes avg, 33K bytes max (seg0).
     -> HashAggregate (cost=170553602.63..170560951.48 rows=4276 width=40)
          Group By: "?column1?"
          Executor memory: 8249K bytes avg, 8273K bytes max (seg0).
        -> Redistribute Motion 72:72 (slice2; segments: 72) (cost=156150493.72..170545484.18 rows=4276 width=40)
             Hash Key: unnamed_attr_1
           -> HashAggregate (cost=156150493.72..170539327.32 rows=4276 width=40)
                Group By: tt.event_ts / 100000 / 5 * 5
                Executor memory: 8273K bytes avg, 8273K bytes max (seg0).
              -> Hash Join (cost=42672.80..139721504.61 rows=22818041 width=40)
                   Hash Cond: tq.sym::bpchar = tt.symbol
                   Join Filter: tt.event_ts >= tq.ets AND tt.event_ts < tq.end_ts
                   Executor memory: 63349K bytes avg, 63349K bytes max (seg0). Work_mem used: 63349K bytes avg, 63349K bytes max (seg0).
                   (seg40) Hash chain length 9009.6 avg, 32025 max, using 100 of 524341 buckets.
                 -> Seq Scan on my_tq_agg tq (cost=0.00..19903.54 rows=22844 width=35)
                 -> Hash (cost=29413.39..29413.39 rows=12530 width=41)
                      Rows in: Avg 900960.0 rows x 72 workers. Max 900960 rows (seg0) with 1332 ms to end, start offset by -13056 ms.
                    -> Broadcast Motion 72:72 (slice1; segments: 72) (cost=0.00..29413.39 rows=12530 width=41)
                       -> Seq Scan on my_tt_agg tt (cost=0.00..11371.13 rows=12530 width=41)


Workload Management and Resource Queues


Workload Management

� Manage the number of concurrent active queries
–  Balance CPU, memory, and disk resources

� Uses role-based resource queues
–  Limit the size/number of queries in a queue
–  Assign priority levels

�  If a role is created and not assigned to a resource queue, the role is assigned to the default queue, pg_default


Greenplum Workload Management

Connection Management
• Control over how many users can be connected
• Provides pooling (to allow large numbers) and caps (to restrict numbers if desired)
• Intelligently frees and reacquires temporarily idle session resources

User-Based Resource Queues
• Each user is assigned to a resource queue that performs ‘admission control’ of queries into the database
• Allows DBAs to control the total number or total cost of queries allowed in at any point in time

Dynamic Query Prioritization
• Patent-pending technique of dynamically balancing resources across running queries
• Allows DBAs to control query priorities in real time, or determine default priorities by resource queue


Configurable Limits for Queues

�  Active statement count
–  The maximum number of statements that can run concurrently

�  Active statement memory
–  The total amount of memory that all queries submitted through this queue can consume

�  Active statement priority
–  This value defines a queue’s priority relative to other queues in terms of available CPU

�  Active statement cost
–  This value is compared with the cost estimated by the query planner, measured in units of disk page fetches


How Resource Queue Limits Work

Queries are evaluated by First In, First Out

Enabling Workload Management

• Create the resource queues and set limits
  – CREATE RESOURCE QUEUE command

• Assign a queue to one or more roles
  – CREATE ROLE or ALTER ROLE command

• Use workload management views to monitor and manage the resource queues
  – gp_toolkit.gp_resq_activity_by_queue
  – gp_toolkit.gp_resqueue_status
  – gp_toolkit.gp_resq_priority_backend
  – gp_toolkit.gp_resq_activity
  – gp_toolkit.gp_resq_role
  – gp_toolkit.gp_resq_priority_statement
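The three steps can be sketched as follows; the queue and role names are hypothetical:

```sql
-- 1. Create a queue that allows at most 3 concurrent statements at low priority
CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS=3, PRIORITY=LOW);

-- 2. Assign an existing role to the queue (role name is hypothetical)
ALTER ROLE analyst_joe RESOURCE QUEUE adhoc_queue;

-- 3. Monitor queue state through the gp_toolkit views
SELECT * FROM gp_toolkit.gp_resqueue_status;
```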

Resource Queues

• Control query concurrency and priority
• Roles are assigned to resource queues
  – A queue can have multiple roles assigned
  – A role can be assigned to only one resource queue
• Queue limit options
  – Count limit: maximum number of active queries in the queue
  – Cost limit: maximum query planner cost of active queries in the queue
  – Minimum cost limit: queries below this cost are not queued but run immediately; can be specified with either a max count limit or a max cost limit
• Queue priority
  – Specifies Min, Low, Medium, High, or Max resource utilization

    Priority   Weight
    Min             100
    Low             200
    Medium          500
    High          1,000
    Max       1,000,000

Resource Queues

[Diagram: individual users (Joe, Larry, Sue) and application accounts (Apache, BOBJ, ETL) are organized into groups (Finance, Exec, Web, Batch, Ad Hoc), and each group maps to a resource queue]

Queue Limit: Active Statement Count

[Diagram: a resource queue with Queue Limit = 4 and Min Cost = 10K; four statements run as active statements while additional statements wait]

Queue Limit: Active Statement Cost

[Diagram: a resource queue with Query Cost Limit = 100K and Min Cost = 10K; statements with estimated costs of 25K, 25K, and 75K cannot all be active at once under the 100K limit, so further statements wait]

Queue Limit: Active Statement Cost

[Diagram: a statement with an estimated cost of 101K exceeds the Query Cost Limit of 100K and waits rather than becoming active]

Special Tips: COST_OVERCOMMIT

• If a resource queue is limited based on a cost threshold, the administrator can allow COST_OVERCOMMIT (the default)
• With a cost threshold and overcommit enabled, a query that exceeds the cost threshold is allowed to run, provided there are no other queries in the system at the time the query is submitted; the cost threshold is still enforced when there are concurrent workloads on the system
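A cost-based queue with overcommit might be defined as follows; the queue name and cost values are illustrative:

```sql
-- Statements with planner cost above 100000 normally wait; very cheap
-- statements (below 10000) bypass the queue and run immediately; an
-- oversized statement may still run when the system is otherwise idle
-- because COST_OVERCOMMIT is enabled (the default)
CREATE RESOURCE QUEUE report_queue
  WITH (MAX_COST=100000.0, MIN_COST=10000.0, COST_OVERCOMMIT=TRUE);
```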

MoreVRP

• Provides monitoring
• Real-time CPU and I/O throttling
• Rules engine for workload management
• Variance and BI reports

MoreVRP

[Screenshot of the MoreVRP monitoring interface]

resource_queues.sql Example

CREATE RESOURCE QUEUE etl_queue WITH (ACTIVE_STATEMENTS=2, PRIORITY=MIN);

CREATE RESOURCE QUEUE user_queue WITH (ACTIVE_STATEMENTS=60, PRIORITY=LOW);

CREATE RESOURCE QUEUE high_queue WITH (ACTIVE_STATEMENTS=20, PRIORITY=HIGH);

CREATE RESOURCE QUEUE app_queue WITH (ACTIVE_STATEMENTS=10, PRIORITY=MEDIUM);
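Queues only take effect once roles are assigned to them. A sketch of that assignment, using hypothetical role names:

```sql
-- Assign roles to the queues above (role names are hypothetical)
ALTER ROLE etl_user RESOURCE QUEUE etl_queue;
ALTER ROLE app_user RESOURCE QUEUE app_queue;

-- Verify the role-to-queue mapping
SELECT * FROM gp_toolkit.gp_resq_role;
```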

Resource Queue GUCs

•  gp_resqueue_priority
  – Default value: on
  – Enables or disables query prioritization
  – When this parameter is disabled, existing priority settings are not evaluated at query run time

•  gp_resqueue_memory_policy
  – Default value: eager_free
  – Enables Pivotal memory management features
  – When set to none, memory management is the same as in releases prior to 4.1
  – When set to auto or eager_free, query memory usage is controlled by statement_mem and resource queue memory limits

GUCs for Improving Short-Running Queries

•  gp_connections_per_thread
  – A value greater than or equal to the number of primary segments means each slice in a query plan gets its own thread when dispatching to segments
  – Lower values use more threads, which consumes more resources on the master
  – Reducing this value improves the performance of queries that run for a couple of seconds

•  gp_enable_direct_dispatch
  – Enables or disables the dispatching of targeted query plans for queries that access data on a single segment
  – Significantly reduces the response time of qualifying queries, as no interconnect setup is involved
  – Direct dispatch requires more CPU utilization on the master
  – Improves the performance of queries that have a filter on the distribution keys
  – This needs to be accounted for when deciding on distribution keys for tables
  – Especially helpful in high-concurrency environments

•  gp_cached_segworkers_threshold
  – A higher setting may improve performance for power users who issue many complex queries in a row
  – Helpful in high-concurrency environments


GPDB Immersion GUC Reference

GUCs for Index Selection

•  random_page_cost (master/session/reload)
  – Default value: 100
  – Sets the planner's estimate of the cost of a nonsequentially fetched disk page
  – A lower value increases the chances of an index scan being picked

•  enable_indexscan (master/session/reload)
  – Default value: on
  – Enables or disables the query planner's use of index-scan plan types

•  enable_nestloop (master/session/reload)
  – Default value: off
  – Enables or disables the query planner's use of nested-loop join plans
  – Should be enabled to use an index in nested-loop joins

•  enable_bitmapscan (master/session/reload)
  – Default value: on
  – Enables or disables the query planner's use of bitmap-scan plan types
  – Bitmap scans generally provide faster access, but you can try disabling this if an index lookup returns very few rows

•  enable_seqscan (master/session/reload)
  – Default value: on
  – Disabling enable_seqscan forces use of an index
  – Use this parameter very carefully, and only as a last resort

Setting GUCs to Influence Index Usage

•  Iterative tuning steps to favor index usage
  – Start by setting or confirming the following GUCs
    ▪  Set enable_indexscan to on
    ▪  For joins via index lookup, set enable_nestloop to on
  – Then lower random_page_cost
    ▪  Set it to 20
    ▪  If the index is still not used, set it to 10
  – If the index is still not used, increase seq_page_cost
    ▪  Set it to 10
    ▪  If the index is still not used, set it to 15
  – If the index is still not used, set enable_seqscan to off
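The iterative steps above can be sketched as a session; re-run EXPLAIN after each change and stop as soon as the plan switches to an index scan (the table and filter below are hypothetical):

```sql
SET enable_indexscan = on;
SET enable_nestloop = on;            -- for joins via index lookup

SET random_page_cost = 20;           -- then 10 if the index is still not chosen
SET seq_page_cost = 10;              -- then 15 if the index is still not chosen

-- Last resort only:
SET enable_seqscan = off;

EXPLAIN SELECT * FROM sales WHERE customer_id = 42;
```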

gp_external_max_segs Optimization

• Controls the maximum number of segments that each gpfdist instance serves

• Keep gp_external_max_segs and the number of gpfdist processes at an even factor
  – gp_external_max_segs / (# of gpfdist processes) should have a remainder of 0

• Default is 64

default_statistics_target GUC

•  The system generates statistics by sampling data
•  Increases the sampling for statistics collected for ALL columns
•  Ranges from 1 to 1000 (default 25)
•  Increasing the target value may improve query planner estimates
•  The higher the value, the longer statistics collection takes

gp_analyze_relative_error GUC

•  Affects the sampling rate during statistics collection when determining the cardinality of a column
  – For example, a value of .5 is equivalent to an acceptable error of 50%
•  Default is .25
•  Decreasing the relative error fraction (accepting less error) tells the system to sample more rows
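A session-level sketch combining both statistics GUCs; the table name is hypothetical:

```sql
-- Raise the statistics target and tighten the sampling error for one
-- session, then re-collect statistics
SET default_statistics_target = 100;   -- default is 25
SET gp_analyze_relative_error = 0.10;  -- default is 0.25; lower = more rows sampled
ANALYZE sales;
```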

Optimizing Joins

• enable_nestloop
  – Validate the SQL and ensure that the results are what is intended
  – To favor hash joins over nested-loop joins, ensure enable_nestloop is set to off

• enable_mergejoin
  – Merge join is based on sorting the left- and right-hand tables and then scanning them in parallel
  – To favor merge joins over hash joins, set to on
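A quick session-level sketch for comparing join strategies; the tables and join keys are hypothetical:

```sql
SET enable_nestloop = off;   -- favor hash joins over nested-loop joins
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.cust_id = c.cust_id;

SET enable_mergejoin = on;   -- favor merge joins over hash joins
EXPLAIN SELECT * FROM orders o JOIN customers c ON o.cust_id = c.cust_id;
```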

Optimizing Sort/Aggregation Operations

• Replace large sort or aggregate operations with HashAggregate

• enable_groupagg
  – Default: on
  – Enables or disables the query planner's use of group-aggregation (sort-and-aggregate) plan types

• To force a HashAggregate operation over a sort-and-aggregate operation, set enable_groupagg to off

Optimizing Motion

• Eliminate large-table broadcast motion in favor of redistribute motion

• gp_segments_for_planner
  – Default value: 0
  – If 0, the value used is the actual number of primary segments
  – If the wrong table is being motioned, before modifying gp_segments_for_planner:
    ▪  Ensure that the tables have been analyzed
    ▪  Ensure that the join keys have been defined with the same data type (especially for join columns that are also distribution keys)
  – To influence the optimizer to redistribute a table (not broadcast it), set gp_segments_for_planner to a high number
    ▪  For example, SET gp_segments_for_planner=1000000;
  – To influence the optimizer to broadcast a table (not redistribute it), set gp_segments_for_planner to a low number
    ▪  For example, SET gp_segments_for_planner=2;

Optimizing Query Memory and Spill Files

•  statement_mem
  – Default value: 125MB
  – Replaces work_mem when gp_resqueue_memory_policy=auto
  – Allocates segment host memory per query
  – Increasing it can improve the performance of a query

•  gp_workfile_compress_algorithm
  – Default value: none
  – When a hash aggregation or hash join operation spills to disk, specifies the compression algorithm to use on the spill files
  – If using zlib, it must be in the ${PATH} on all segments for the gpadmin user
  – Compressing spill files is recommended if the system is I/O bound
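A session-level sketch applying both settings; the memory value is illustrative:

```sql
-- Give a memory-hungry query more per-query memory and compress any
-- spill files for this session
SET statement_mem = '250MB';                  -- default is 125MB
SET gp_workfile_compress_algorithm = 'zlib';  -- default is none
```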


Thank you!


Questions?

[email protected]