scaling postgresql with gridsql


Scaling PostgreSQL
with GridSQL

Who Am I?

Jim Mlodgenski

Co-organizer of NYCPUG

Founder of Cirrus Technologies

Former Chief Architect of EnterpriseDB

Agenda

What is GridSQL?

Architecture

Query Flow

Scaling

Limitations

What is GridSQL?

Shared-nothing, distributed data architecture

Leverage the power of multiple commodity servers while appearing as a single database to the application

Essentially... Open Source Greenplum, Netezza or Teradata

GridSQL Details

Designed for Parallel Querying

Not just Read-Only, can execute UPDATE, DELETE

Data Loader for parallel loading

Standard connectivity via PostgreSQL compatible connectors: JDBC, ODBC, ADO.NET, libpq (psql)

What GridSQL Is Not

A replication solution like Slony or Bucardo

A high availability solution like Streaming Replication in PostgreSQL 9.0

A scalable transactional solution like Postgres-XC

An elastic, eventually consistent NoSQL database

Configuration

Can be configured for multiple logical nodes per physical server

Take advantage of multi-core processors

Tables may be either replicated or partitioned

Replicated tables for static lookup data or dimensions

Partitioned tables for large fact tables
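The difference between replicated and partitioned placement can be sketched in a few lines of Python. This is an illustration only: the slides do not say how GridSQL maps a partitioning-key value to a node, so the CRC32-modulo hash, the node count, and the helper names `node_for` and `distribute` are all assumptions.

```python
import zlib

NODES = 4  # assumed number of logical nodes


def node_for(key, nodes=NODES):
    """Map a partitioning-key value to a node with a stable hash (assumed scheme)."""
    return zlib.crc32(str(key).encode()) % nodes


def distribute(rows, key, nodes=NODES, replicated=False):
    """Return one row list per node for a replicated or partitioned table."""
    placement = [[] for _ in range(nodes)]
    for row in rows:
        if replicated:
            # Replicated: every node keeps a full copy of the table.
            for part in placement:
                part.append(row)
        else:
            # Partitioned: each row lives on exactly one node.
            placement[node_for(row[key], nodes)].append(row)
    return placement


# Small lookup table: replicate it everywhere.
regions = [{"r_regionkey": i, "r_name": n}
           for i, n in enumerate(["AFRICA", "AMERICA", "ASIA"])]
# Large fact table: partition it on its key.
orders = [{"o_orderkey": k, "o_custkey": k % 7} for k in range(1, 21)]

region_nodes = distribute(regions, "r_regionkey", replicated=True)
order_nodes = distribute(orders, "o_orderkey")
```

Every entry in `region_nodes` holds all three regions, while each order appears in exactly one entry of `order_nodes`.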

Partitioning

Tables may simultaneously use GridSQL Partitioning with Constraint Exclusion Partitioning

Large queries scan a much smaller subset of data by using subtables

Since each subtable is also partitioned across nodes, they are scanned in parallel

Queries execute much faster

Architecture

Loosely coupled, shared-nothing architecture

Data repositories: metadata database, GridSQL database

GridSQL processes: central coordinator, agents

Query Optimization

Cost Based Optimizer

Takes into account Row Shipping (expensive)

Looks for joins with replicated tables

Can be done locally

Looks for joins between tables on partitioned columns
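The two locality checks above can be sketched as a small predicate. This is not GridSQL's optimizer: `Table` and `join_is_local` are hypothetical names, and the rule assumes both partitioned tables use the same key-to-node mapping over the same set of nodes. The `region` and `orders` definitions follow the slides; the `nation`, `lineitem`, and `customer` partitioning keys are assumed for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Table:
    name: str
    partition_key: Optional[str] = None  # None means replicated on every node


def join_is_local(left, right, left_col, right_col):
    """Return True if an equi-join can run on each node without row shipping."""
    # A replicated table is fully present on every node, so the join is local.
    if left.partition_key is None or right.partition_key is None:
        return True
    # Both partitioned: matching rows are co-located only when each table
    # is joined on its own partitioning column.
    return left.partition_key == left_col and right.partition_key == right_col


region = Table("region")                          # REPLICATED (from the slides)
orders = Table("orders", "o_orderkey")            # partitioned (from the slides)
nation = Table("nation", "n_nationkey")           # assumed partitioning
lineitem = Table("lineitem", "l_orderkey")        # assumed partitioning
customer = Table("customer", "c_custkey")         # assumed partitioning

local_replicated = join_is_local(nation, region, "n_regionkey", "r_regionkey")
local_partitioned = join_is_local(lineitem, orders, "l_orderkey", "o_orderkey")
needs_shipping = not join_is_local(orders, customer, "o_custkey", "c_custkey")
```

In the last case `orders` is partitioned on `o_orderkey` but joined on `o_custkey`, so matching rows may sit on different nodes and the plan has to pay for row shipping.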

Aggregation

First set of aggregates done in parallel at the nodes

Intermediate results for the same group are shipped to the same target node

Second aggregation done in parallel

Coordinator streams in node results, combining on the fly and sending to client result set, performing a merge sort if ORDER BY present
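The coordinator's final step with ORDER BY present can be sketched with Python's standard `heapq.merge`, which lazily merges already-sorted inputs. The function name `coordinator_merge` and the sample rows are hypothetical; the sketch assumes each node returns its results already sorted on the ORDER BY key.

```python
import heapq


def coordinator_merge(node_streams, key):
    """Lazily merge per-node result streams that are each sorted on `key`."""
    return heapq.merge(*node_streams, key=key)


# After the second aggregation, each group lives on exactly one node,
# so the node results below contain disjoint groups.
node0 = [("A", 10), ("D", 1)]
node1 = [("B", 7), ("E", 4)]
node2 = [("C", 5)]

merged = list(coordinator_merge([node0, node1, node2], key=lambda row: row[0]))
```

Because the merge is lazy, rows can be sent to the client as soon as the smallest pending key is known, without buffering any node's full result.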

Two Phase Aggregation

SUM: SUM(stat1)

SUM2(SUM(stat1))

AVG: SUM(stat1) / COUNT(stat1)

SUM2(SUM(stat1)) / SUM2(COUNT(stat1))
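The rewrite above can be checked with a short Python sketch. The names `node_partials` and `combine` and the sample data are made up, and the sketch assumes `stat1` contains no NULLs. It also shows why AVG has to be split into SUM and COUNT: averaging the per-node averages gives the wrong answer when nodes hold different numbers of rows.

```python
def node_partials(values):
    """Phase 1, run on each node: partial SUM(stat1) and COUNT(stat1)."""
    return {"sum": sum(values), "count": len(values)}


def combine(partials):
    """Phase 2: SUM2 over the partial sums and counts, then derive AVG."""
    total = sum(p["sum"] for p in partials)
    count = sum(p["count"] for p in partials)
    return {"sum": total, "avg": total / count if count else None}


# stat1 values as they might be spread over three nodes (made-up data).
stat1_by_node = [[1, 2, 3], [4, 5], [6]]

partials = [node_partials(values) for values in stat1_by_node]
result = combine(partials)

# Naive alternative for comparison: the average of per-node averages.
avg_of_avgs = sum(p["sum"] / p["count"] for p in partials) / len(partials)
```

Here `result["avg"]` is 3.5, matching a single-node AVG over all six values, while `avg_of_avgs` comes out higher because the one-row node is weighted as heavily as the three-row node.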

Creating Tables

Tables can be partitioned or replicated

CREATE TABLE region (r_regionkey INTEGER NOT NULL, r_name CHAR(25) NOT NULL, r_comment VARCHAR(152)) REPLICATED;

Creating Tables

CREATE TABLE orders (
  o_orderkey INTEGER NOT NULL,
  o_custkey INTEGER NOT NULL,
  o_orderstatus CHAR(1) NOT NULL,
  o_totalprice DECIMAL(15,2) NOT NULL,
  o_orderdate DATE NOT NULL,
  o_orderpriority CHAR(15) NOT NULL,
  o_clerk CHAR(15) NOT NULL,
  o_shippriority INTEGER NOT NULL,
  o_comment VARCHAR(79) NOT NULL)
PARTITIONING KEY o_orderkey ON ALL;

DBT3: Query 1

SELECT l_returnflag, l_linestatus,
  sum(l_quantity) as sum_qty,
  sum(l_extendedprice) as sum_base_price,
  sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
  sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
  avg(l_quantity) as avg_qty,
  avg(l_extendedprice) as avg_price,
  avg(l_discount) as avg_disc,
  count(*) as count_order
FROM lineitem
WHERE l_shipdate
