GNW05 - Extending Oracle Databases with Hadoop


Extending Databases With the Full Power of Hadoop
Tanel Poder
http://gluent.com
How Gluent Does It!


Gluent - who we are
Tanel Poder: co-founder & CEO, and still a performance geek.
I was an independent consultant for many years, doing Oracle performance & scalability work, and I also co-authored the Expert Oracle Exadata book.
We are long-term Oracle Database & Data Warehousing people, focused on performance & scale.
We got started in Dec 2014 and are ~20 people by now.
Alumni 2009-2016


Agenda
1. Intro: Why Hadoop? (3 minute overview)
2. Gluent Data Platform fundamentals
3. Offloading Oracle data to Hadoop
4. Updating Hadoop data in Oracle
5. Querying Hadoop data in Oracle
6. Sharing Hadoop & RDBMS data with multiple apps & databases
Demos

Why Hadoop?
• Scalability in Software!
• Open Data Formats
• Future-proof!
• One Data, Many Engines!
Hive, Impala, Spark, Presto, Giraph, ...
SQL-on-Hadoop is only one of many applications of Hadoop: graph processing, full text search, image processing, log processing, streams.

Processing can be done close to data – awesome scalability!
• One of the mantras of Hadoop: “Moving computation is cheaper than moving data”
• Hadoop data processing frameworks hide the complexity of MPP
• Now you can build a supercomputer from commodity “pizzaboxes” with cheap locally attached disks!

[Diagram: Node 1 .. Node N, each with locally attached disks, running your code on a distributed storage + parallel execution framework – Impala, Hive, Spark, Presto, HBase/coprocessors, ...]

“Affordable at Scale” (Jeffrey Needham)
• No SAN storage cost & bottlenecks
• No expensive big iron hardware
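As a minimal, hedged illustration of “processing close to data” (not from the original slides; the table, columns and HDFS path are made up), a SQL-on-Hadoop engine such as Hive or Impala declares a table over files already sitting in HDFS and pushes the scan and aggregation to the nodes that store the blocks:

-- Hypothetical table over raw Parquet files in HDFS (Hive-style DDL)
CREATE EXTERNAL TABLE web_logs (
  log_ts     TIMESTAMP,
  user_id    BIGINT,
  url        STRING,
  bytes_sent BIGINT
)
STORED AS PARQUET
LOCATION '/data/web_logs';

-- The scan and partial aggregation run on the nodes holding the data blocks;
-- only the small aggregated result travels over the network.
SELECT url, COUNT(*) AS hits, SUM(bytes_sent) AS total_bytes
FROM   web_logs
GROUP  BY url;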

One Data, Many Engines!
• Decoupling storage from compute + open data formats = flexible, future-proof data platforms!
[Diagram: storage layers – HDFS (Parquet, ORC, XML, Avro), Amazon S3 (Parquet, weblogs), Kudu column-store – accessed by many engines: Impala SQL, Hive SQL, Solr/Search, Spark, MR, Kudu API, libparquet, ...]
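A small hedged illustration of the same idea (it assumes a Parquet-backed table called sales_parquet, stored under hdfs:///data/sales, has already been registered in the shared Hive metastore; all names are hypothetical):

-- The table definition lives in the shared metastore; the data is just Parquet files.
-- Hive, Impala and Spark SQL can all run the same statement against the same files:
SELECT order_date, SUM(amount)
FROM   sales_parquet
GROUP  BY order_date;

-- Engines that bypass SQL (a Spark job, libparquet, etc.) can still read
-- hdfs:///data/sales directly, because the file format is open.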

Hadoop, NoSQL, RDBMS – one way to look at it
• Hadoop: the data lake – all data; scalable, flexible
• RDBMS: complex transactions & analytics – sophisticated, out-of-the-box, mature
• “NoSQL”: transaction ingest – scalable, simple transactions
Access to all enterprise data?


Gluent as a data virtualization layer
[Diagram: App X, App Y and App Z connect to Oracle, Teradata, Postgres and MSSQL; Gluent sits between the databases and the big data sources, which are stored in open data formats]


Push computation down to Hadoop
[Diagram: the same apps and databases, with Gluent pushing SQL computation down to the big data sources via Hive, Impala, Spark, etc.]


Gluent Extends Databases with the Full Power of Hadoop

Offload RDBMS Data

Query any Hadoop Data

Transparent: All queries work – no code changes required!


Data Offload


Gluent’s Data Offload tool
• Gluent provides the full orchestration for syncing entire tables to Hadoop (and keeping them in sync) with just a single command
• Lots of steps for ensuring proper hybrid query performance, partitioning and data correctness
• No ETL development needed!

Before & After
[Diagram: Before – a time-partitioned fact table with hot and cold data, plus dimension tables, all on expensive storage. After – only the hot data and dimension tables remain on expensive storage; the cold data lives on cheap scalable storage (e.g. HDFS + Hive/Impala) spread across Hadoop nodes]

Large Database Offloading
• DW fact tables in terabytes, e.g. partitioned by HASH(customer_id) and RANGE(order_date); dimension tables in gigabytes
• Fact tables are time-partitioned, with months to years of history; old fact data is rarely updated (the hotness of data decreases with age)
• After multiple joins on dimension tables, a full scan is done on the fact table
• A few filter predicates apply directly on the fact table, most predicates are on dimension tables


Hybrid table types
• 100% RDBMS table (no offload)
• “90/10” table (union-all): most, but not all, data offloaded & dropped from RDBMS
• “100/10” table: entire table available in Hadoop
• “100/100” table
• 100% Hadoop table: usually presented to other DBs

Decoupled Partitioning Models (examples)
• DB table partitioned weekly, Hadoop monthly
• Multi-level partitioning
• No DB partitioning, full Hadoop partitioning
[Diagram: example partitioning schemes – HASH(customer_id) + RANGE(order_date); HASH(store_id), RANGE(customer_id) + RANGE(order_date); HASH(customer_id), HASH(store_id) + RANGE(order_date) – with the hotness of data decreasing along the order_date range]
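A hedged sketch of what decoupled partitioning can look like (the table names, columns and DDL are illustrative, not Gluent's generated definitions): the RDBMS side keeps fine-grained weekly range partitions while the offloaded Hadoop copy is partitioned monthly.

-- Oracle side: fact table range-partitioned by week (illustrative)
CREATE TABLE sales (
  order_id   NUMBER,
  cust_id    NUMBER,
  amount     NUMBER(10,2),
  order_date DATE
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTODSINTERVAL(7, 'DAY'))
(PARTITION p0 VALUES LESS THAN (DATE '2016-01-01'));

-- Hadoop side: the offloaded copy partitioned by month (Hive-style DDL)
CREATE TABLE sales_offloaded (
  order_id   BIGINT,
  cust_id    BIGINT,
  amount     DECIMAL(10,2),
  order_date TIMESTAMP
)
PARTITIONED BY (order_month STRING)
STORED AS PARQUET;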

Data Model Virtualization (example)
• Large RDBMS tables always joined together (“fact to fact” joins)
• Large “fact to fact” joins offloaded to Hadoop
[Diagram: T1 and T2 are presented through a View (T1_T2); the join is either executed in Hadoop SQL at query time or materialized in Hadoop, so accessing the “wide” table does not require further joining]
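A minimal sketch of the idea (object names are hypothetical, not Gluent's generated objects): the two large tables are joined once in Hadoop and the result is presented back to the database as a single wide view.

-- Hadoop side: materialize the fact-to-fact join once (Hive/Impala-style SQL)
CREATE TABLE t1_t2_joined STORED AS PARQUET AS
SELECT t1.order_id, t1.cust_id, t1.amount, t2.ship_date, t2.carrier
FROM   t1 JOIN t2 ON t1.order_id = t2.order_id;

-- Oracle side: present the pre-joined result under a single view name
-- (t1_t2_ext is assumed to be the Hadoop-backed hybrid external table),
-- so queries against the wide view never repeat the join
CREATE OR REPLACE VIEW t1_t2 AS
SELECT * FROM t1_t2_ext;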


Offloading options
• How to schedule regular (additional partition) offloading?
• And incremental changed data offloading?

Offload table to Hadoop, append new partitions as needed:
./offload -t schema.table --older-than-days=30 -x

Enable batch incremental changed data syncing to Hadoop:
./offload -t schema.table --incremental-enabled -x

Enable incremental offloading, with DML & data update support:
./offload -t schema.table --updates-enabled -x


Updating offloaded data?
• Sometimes historical updates are needed, or deletion of specific records
• Gluent allows you to update Hadoop data from your familiar database, using existing RDBMS SQL (see the sketch below)
• We call it “data patching” internally, because it’s not meant for full-blown hybrid OLTP on Hadoop
• You do get some transactional properties for the hybrid transaction: atomicity (either all RDBMS + Hadoop updates go through, or none do)
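A hedged sketch of the kind of statement data patching is meant for (the table and predicates are made up, and exactly how Gluent routes the change to the offloaded rows is its own mechanism, not shown here): you issue ordinary RDBMS SQL and the offloaded Hadoop rows are patched along with the RDBMS rows, atomically.

-- Correct a historical error in offloaded data
UPDATE sales
SET    amount_sold = 0
WHERE  promo_id = 999
AND    time_id < DATE '2015-01-01';

-- Delete specific records
DELETE FROM sales
WHERE  cust_id = 12345;

COMMIT;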


SQL Query & Workload offloading?


Gluent Demo
• First I’ll show some demos: demo.sql, demo_plsql.sql
• And then explain how it works :-)

Gluent Smart Connector
[Diagram: reports, queries, ETL and data feeds hit the Oracle RDBMS on SAN/Exadata; the Gluent Smart Connector bridges to the Hadoop ecosystem – HDFS onsite or in the cloud, SQL on Hadoop via Hive/LLAP, Impala, Spark SQL and HBase – holding petabytes of data, scanning billions of rows/s and reading hundreds of GB/s]
1. Secret sauce: read the SQL execution plan, analyze the SQL and its variables
2. Run parts of the SQL in Hadoop: push down scanning, filtering, aggregations, joins
3. Return only the required rows and columns to Oracle
For every query, billions of rows can be scanned in Hadoop and only the relevant millions returned to Oracle. Returned data size and processing is greatly reduced.
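A hedged illustration of the pushdown (both statements are hypothetical, loosely modeled on the demo schema used later in this deck): the application issues its normal Oracle SQL, and the connector generates a much more selective Hadoop query that returns only the needed rows and columns.

-- What the application runs in Oracle (unchanged):
SELECT c.cust_gender, SUM(s.amount_sold)
FROM   ssh.customers c, sales s
WHERE  c.cust_id = s.cust_id
AND    s.time_id < DATE '2015-06-01'
GROUP  BY c.cust_gender;

-- The kind of data access SQL the connector could send to Impala/Hive:
-- only two columns are projected and the filter is applied in Hadoop,
-- so only the matching rows flow back to Oracle.
SELECT `cust_id`, `amount_sold`
FROM   `sh`.`sales`
WHERE  `time_id` < '2015-06-01 00:00:00';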


Table scan & filter pushdown - how do we do it?
• Gluent virtualizes individual tables – nodes of the execution plan
• Everything else still runs in the RDBMS (100% compatible)
[Diagram: the SALES (FACT) table access node is offloaded to Hadoop behind a union-all]

Gluent Data Offload creates Hybrid Schemas with virtualized tables
[Diagram: Original Schema – Dim, Dim, Original Fact. Hybrid Schema – Dim, Dim, a FACT union-all view over the remaining RDBMS fact plus a Hadoop-backed FACT_EXT external table, wired up with synonyms; the offloaded history is dropped from the RDBMS]
• ETL unchanged (as it does not need to see a long fact history)
• Report SQLs unchanged, just logging in to a different schema (see the synonym sketch below)
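As a hedged sketch (the object names are illustrative, and the union-all view itself is shown on a later slide), the synonym layer is what lets report SQL stay unchanged while resolving to the hybrid objects:

-- Dimension tables are not offloaded, so the hybrid schema simply points back at them
CREATE SYNONYM ssh_h.customers FOR ssh.customers;
CREATE SYNONYM ssh_h.products  FOR ssh.products;

-- The fact table name resolves to the hybrid union-all view (defined later),
-- so a report that used to query SSH.SALES now just logs in to the hybrid
-- schema and runs the exact same statement text.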


Smart Connector Architecture + Oracle direct filter pushdown
[Diagram: App SQL → Oracle external table preprocessor → Gluent Smart Connector → data access SQL → Hive/Impala SQL → HDFS]
1. Oracle DB compiles the SQL to an execution plan
2. Some tables in the plan are hybrid objects (Gluent external tables)
3. The external table preprocessor launches the Gluent Smart Connector
4. The Smart Connector reads Oracle execution plan memory
5. The Smart Connector constructs “data access SQL” for Hadoop, with filter and projection pushdown
6. Hadoop SQL returns only the requested rows and columns to the connector
7. The connector returns the table results to Oracle via the external table
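For context, this is roughly how Oracle's generic external table preprocessor mechanism is declared (a hypothetical ORACLE_LOADER example with made-up directory, script and column names; it is not Gluent's actual hybrid table DDL): the PREPROCESSOR program is executed at query time and its output is what the external table returns.

-- Generic external table whose rows come from a program launched at query time
CREATE TABLE sales_ext (
  prod_id     NUMBER,
  cust_id     NUMBER,
  amount_sold NUMBER
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR exec_dir:'smart_connector.sh'  -- program run for each query
    FIELDS TERMINATED BY ','
  )
  LOCATION ('dummy.dat')
);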

Execution plan reading: Oracle conceptual view

SQL> SELECT SUM(object_id)
     FROM test_users u, test_objects o
     WHERE u.username = o.owner
     AND o.status = 'VALID'
     GROUP BY u.created;

----------------------------------------------------
| Id | Operation                 | Name         |
----------------------------------------------------
|  0 | SELECT STATEMENT          |              |
|  1 |  HASH GROUP BY            |              |
|* 2 |   HASH JOIN               |              |
|  3 |    TABLE ACCESS FULL      | TEST_USERS   |
|* 4 |    EXTERNAL TABLE ACCESS  | TEST_OBJECTS |
----------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("U"."USERNAME"="O"."OWNER")
   4 - filter("O"."STATUS"='VALID')

Column Projection Information (identified by operation id):
-----------------------------------------------------------
   1 - "U"."CREATED"[DATE,7], SUM("OBJECT_ID")[22]
   2 - (#keys=1) "U"."CREATED"[DATE,7], "OBJECT_ID"[NUMBER,22]
   3 - "U"."USERNAME"[VARCHAR2,30], "U"."CREATED"[DATE,7]
   4 - "O"."OWNER"[VARCHAR2,30], "OBJECT_ID"[NUMBER,22]

• Gluent identifies which tables in the plan are “hybrid” and which are local
• All required information is available in the execution plan memory
• V$SQL_PLAN is not enough – we read what we need from the SGA
• What about bind vars? We read bind values from process memory (PGA)
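For comparison, the predicate and projection sections above are the same ones DBMS_XPLAN can print for any cached cursor (a standard Oracle facility, shown only to illustrate what information exists in the plan; it is not how Gluent itself reads it):

-- Show the last executed statement's plan from the cursor cache,
-- including the predicate and column projection sections
SELECT *
FROM   TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL,
                                       'TYPICAL +PREDICATE +PROJECTION'));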



90/10 Hybrid Table: Under the hood
• UNION ALL view
• Latest partitions in Oracle (SALES table)
• Offloaded data in Hadoop (SALES_EXT smart table)

CREATE VIEW hybrid_schema.SALES AS
SELECT PROD_ID, CUST_ID, TIME_ID, CHANNEL_ID, PROMO_ID,
       QUANTITY_SOLD, AMOUNT_SOLD
FROM   SSH.SALES
WHERE  TIME_ID >= TO_DATE(' 2016-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')
UNION ALL
SELECT "prod_id", "cust_id", "time_id", "channel_id", "promo_id",
       "quantity_sold", "amount_sold"
FROM   SSH_H.SALES_EXT
WHERE  "time_id" < TO_DATE(' 2016-01-01 00:00:00', 'SYYYY-MM-DD HH24:MI:SS')


Selective Offload Processing
[Diagram: hot data and dimension tables stay on expensive storage; offloaded data lives on cheap scalable storage (e.g. HDFS + Hive/Impala) across Hadoop nodes. In the plan, the SALES union-all view combines a TABLE ACCESS FULL of the local partitions with an EXTERNAL TABLE ACCESS / DBLINK to Hadoop, feeding the query's HASH JOIN and GROUP BY]

SELECT c.cust_gender, SUM(s.amount_sold)
FROM   ssh.customers c, sales_v s
WHERE  c.cust_id = s.cust_id
GROUP  BY c.cust_gender

Partially Offloaded Oracle Execution Plan (90/10 table)

-----------------------------------------------------------
| Id | Operation                           | Name       |
-----------------------------------------------------------
|  0 | SELECT STATEMENT                    |            |
|  1 |  SORT GROUP BY ROLLUP               |            |
|* 2 |   HASH JOIN                         |            |
|  3 |    TABLE ACCESS STORAGE FULL        | TIMES      |
|* 4 |    HASH JOIN                        |            |
|* 5 |     TABLE ACCESS STORAGE FULL       | CHANNELS   |
|* 6 |     HASH JOIN                       |            |
|* 7 |      TABLE ACCESS STORAGE FULL      | PRODUCTS   |
|* 8 |      HASH JOIN                      |            |
|* 9 |       TABLE ACCESS STORAGE FULL     | PROMOTIONS |
| 10 |       VIEW                          | SALES      |
| 11 |        UNION-ALL                    |            |
| 12 |         PARTITION RANGE ITERATOR    |            |
| 13 |          TABLE ACCESS STORAGE FULL  | SALES      |
|*14 |          EXTERNAL TABLE ACCESS FULL | SALES_EXT  |
-----------------------------------------------------------
...
   5 - filter("CH"."CHANNEL_CLASS"='Direct')
   6 - access("P"."PROD_ID"="S"."PROD_ID")
   7 - filter("P"."PROD_CATEGORY"='Peripherals and Accessories')
   8 - access("PM"."PROMO_ID"="S"."PROMO_ID")
   9 - filter("PM"."PROMO_NAME"='blowout sale')
  14 - filter("TIME_ID"<TO_DATE(' 2015-06-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Visit the Hadoop rowsource only when access to old data is required


MSSQL 90/10 table execution plan

Concatenation == UNION ALL

Remote Query to Hadoop


Aggregation & full join pushdown - how do we do it?
• We go beyond single-table virtualization – push down joins, aggregations
• Entire execution plan branches are offloaded; everything else still runs in the RDBMS (100% compatible)
• For transparent optimization on Oracle: dbms_advanced_rewrite
• The connector runs more complex SQL in Hadoop
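For background, dbms_advanced_rewrite is a standard Oracle package that substitutes one statement for an equivalent one at parse time. A hedged, generic example of declaring such an equivalence (the statements and object names are hypothetical; this is not Gluent's actual rewrite definition):

BEGIN
  SYS.DBMS_ADVANCED_REWRITE.DECLARE_REWRITE_EQUIVALENCE(
    name             => 'sales_join_offload',
    source_stmt      => 'SELECT t1.cust_id, SUM(t2.amount) amt
                           FROM t1, t2
                          WHERE t1.order_id = t2.order_id
                          GROUP BY t1.cust_id',
    destination_stmt => 'SELECT cust_id, amt FROM t1_t2_agg_ext',  -- Hadoop-backed object
    validate         => FALSE,
    rewrite_mode     => 'TEXT_MATCH');
END;
/

For the rewrite to kick in, the session typically needs query_rewrite_enabled = true and query_rewrite_integrity set to trusted (or stale_tolerated).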


Adaptive Join Filter Pulldown
[Labels on the data access SQL below: the first predicates are read directly from the offloaded table's plan node; the IN-subqueries are the Gluent Adaptive Join Filter Pulldown]

SELECT ...
FROM  `SH`.`SALES_ALL_INT`
WHERE `PROMO_ID` = 37
  AND `PROD_ID` >= 14 AND `PROD_ID` <= 130
  AND `PROD_ID` IN (SELECT "P"."PROD_ID"
                    FROM ["SH"."PRODUCTS" "P"]
                    WHERE ("P"."PROD_CATEGORY" = 'Peripherals and Accessories'))
  AND `CHANNEL_ID` >= 3 AND `CHANNEL_ID` <= 9
  AND `CHANNEL_ID` IN (SELECT "CH"."CHANNEL_ID"
                       FROM ["SH"."CHANNELS" "CH"]
                       WHERE ("CH"."CHANNEL_CLASS" = 'Direct'))


Phases of Data(base) platform modernization
• Enterprise Data Warehouse with no Offload – the starting point: too expensive & too slow (src → ETL → Oracle/Exadata DW)
• Offload Phase 1: Data Offload – your ETL and queries remain unchanged; the Oracle/Exadata DW offloads data to Hadoop
• Offload Phase 2: Query Offload – ETL and original queries still unchanged; some queries running directly on Hadoop
• Offload Phase 3: Hadoop-first – some ETL and data feeds land directly in Hadoop; the Oracle DB can access all Hadoop data


Hadoop as a data sharing backend (data hub)
[Diagram: multiple databases share a Hadoop hub, which also receives new big data feeds, machine-generated data and new IoT feeds]



Demo


Gluent is designed to…
1. Liberate enterprise data: RDBMS data is offloaded from isolated silos to scalable storage & processing platforms, in open data formats (Hadoop/HDFS)
2. Use modern distributed processing platforms for heavy lifting: as these platforms improve & evolve, so does your Gluent experience
3. Require no disruptive forklift upgrade or switchover: your existing applications still log in to the RDBMS as they’ve always done
4. Require no application code rewrites: access all your data – code doesn’t change, application architecture doesn’t change


Next Steps?
• Next webinar: Gluent Real World Results from Customer Deployments, 17 January 2017
  Sign up here: https://gluent.com/event/gnw06-modernizing-enterprise-data-architecture-with-gluent-cloud-and-hadoop/
• Training: Hadoop for Database Professionals (1-day overview training), TBD February 2017
  Register interest here: https://gluent.com/hadoop-for-database-professionals/


More info about Gluent
• Gluent Whitepapers, including Advisor: https://gluent.com/whitepapers/
• Gluent New World videos (will include this one): http://vimeo.com/gluent
• Podcast about moving to the “New World”: the “Drill to Detail” podcast with Mark Rittman – http://www.drilltodetail.com/podcast/2016/12/6/drill-to-detail-ep12-gluent-and-the-new-world-of-hybrid-data-with-special-guest-tanel-poder


We are hiring awesome developers & data engineers ;-)

http://gluent.com/careers

Thanks!!!
