course ds314-px v3[1].0

Student Guide v3.0: February 28, 2003 DS314-PX DataStage Parallel Extender Essentials

Upload: nandvero

Post on 23-Nov-2014


TRANSCRIPT

Page 1: Course DS314-PX v3[1].0

Student Guide

v3.0: February 28, 2003

DS314-PX DataStage Parallel Extender Essentials

Page 2: Course DS314-PX v3[1].0


Copyright

This document and the software described herein are the property of Ascential Software Corporation and its licensors and contain confidential trade secrets. All rights to this publication are reserved. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system or translated into any language, in any form or by any means, without prior permission from Ascential Software Corporation. Copyright © 2003 Ascential Software Corporation. All rights reserved. Ascential Software Corporation reserves the right to make changes to this document and the software described herein at any time and without notice. No warranty is expressed or implied other than any contained in the terms and conditions of sale.

Ascential Software Corporation
50 Washington Street
Westboro, MA 01581-1021 USA
Phone: (508) 366-3888
Fax: (508) 366-3669

Ascential, DataStage, INTEGRITY, MetaRecon, MetaStage and MetaBroker are trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and registered trademarks are the property of the respective trademark holder.

 02-25-2003

  

Page 3: Course DS314-PX v3[1].0

3© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Processing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 4: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

1a. Overview; Two Complete Jobs

1b. Client-Server Architecture

1c. The Job Sequencer

Page 5: Course DS314-PX v3[1].0


Module 1: Introduction to Parallel Extender

In a nutshell:

Parallel Extender harnesses the power of parallel computers for processing large volumes of rows in a minimum amount of time

Page 6: Course DS314-PX v3[1].0


Key PX Concepts

• Application scalability
• Parallel computer systems
• Flow-based programming
• Explicit and implicit parallelisms
• Pipeline and partition parallelisms
• The Framework (the parallel engine)
• Datasets (uniform set of rows in the Framework's internal representation)
• Table definitions/schemas (metadata)
• Configuration files (only one active at a time, describes H/W)

Page 7: Course DS314-PX v3[1].0


Automatic, Flexible Scalability

With Parallel Extender:

• Don’t worry about:
– How data is moved around
– Today’s machine configuration
– Possible deadlocks/synchronization bugs

• Job Designs (programs) are completely architecture-independent
– SMP or MPP, clustered SMPs, SMPs within MPPs

Page 8: Course DS314-PX v3[1].0


Origins of DataStage XE Parallel Extender

[Figure: an Orchestrate Program (a sequential data flow: Import, Clean 1, Clean 2, Merge, Analyze) and a Configuration File are handed to the Orchestrate Application Framework and Runtime System, which runs the same flow in parallel. Other diagram labels: Flat Files, Relational Data, Performance Visualization. Framework services shown:]

• Centralized Error Handling and Event Logging
• Parallel access to data in files
• Parallel access to data in RDBMS
• Inter-node communications
• Parallel pipelining
• Parallelization of operations

DataStage XE: Best-of-breed data integration platform

Orchestrate: Best-of-breed application scalability

DataStage XE Parallel Extender: Best-of-breed scalable data integration platform. No limitations on data volumes or throughput

Page 9: Course DS314-PX v3[1].0


Synonyms

schema = table definition
property = format
underlying type = SQL type + length [and scale]
virtual dataset = link
record/field = row/column
operator = stage
step, flow, OSH command = job
Framework = DS engine

Page 10: Course DS314-PX v3[1].0


Types of Scalability: Throughput and Speedup

Throughput (User Scalability): adding more users and small jobs to the server

Speedup (Application and Data Scalability): One application running faster against more data by using more processors.
• 1 processor, 10 Gbytes storage
• 10 processors, 100 Gbytes storage
• 100 processors, 1000 Gbytes storage

Parallel Extender:
• focus is on data scalability (speedup)
• can support both throughput and speedup

Page 11: Course DS314-PX v3[1].0


User's View: Two Inputs, Parallel Deployment

User
1) assembles a sequential flow (design) using the Designer
2) provides a configuration file…

…and gets: parallel access, propagation, transformation, and load.

The design is good for 1 node, 4 nodes, or N nodes. To change # nodes, just swap configuration file.

No need to modify or recompile your design!

Page 12: Course DS314-PX v3[1].0


Traditional Batch Processing

[Figure, "Without Parallel Extender": data from the Source (Operational Data, Archived Data) passes through Transform, Clean, and Load to the Target (Data Warehouse), with a Disk shown under each of the three steps]

Page 13: Course DS314-PX v3[1].0


Data Pipelining

[Figure: data flows from the Source (Operational Data, Archived Data) through Transform, Clean, and Load to the Target (Data Warehouse)]

• Start a downstream process (e.g., "Clean") on the first rows while an upstream (Transform) process is still processing the later rows.
• This eliminates intermediate storing to disk, a very expensive operation

Pipeline Multiprocessing
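The pipelining idea can be sketched in a few lines of Python, using generators as stand-ins for stages. This is only an illustration, not DataStage code: the stage names `transform`, `clean`, and `load` echo the slide, while their bodies, the `log` list, and the sample rows are invented here. Generators interleave work on a single thread, so the sketch shows the ordering of work (no intermediate landing of the full row set), not real multiprocessing.

```python
def transform(rows, log):
    """Upstream stage: upper-case each row as it arrives."""
    for row in rows:
        log.append(("transform", row))
        yield row.upper()


def clean(rows, log):
    """Downstream stage: runs on the first row before transform has seen the rest."""
    for row in rows:
        log.append(("clean", row))
        yield row.strip()


def load(rows, log):
    """Final stage: collect rows into an in-memory 'target'."""
    target = []
    for row in rows:
        log.append(("load", row))
        target.append(row)
    return target


log = []
source = ["smith ", "jones "]  # hypothetical sample rows
target = load(clean(transform(source, log), log), log)
```

In the resulting `log`, the entry for cleaning the first row appears before the entry for transforming the second row, which is the behavior the slide describes.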

Page 14: Course DS314-PX v3[1].0


Data Partitioning

[Figure: Source Data is split by key range across four nodes, each running its own copy of Transform: Node 1 (A-F), Node 2 (G-M), Node 3 (N-T), Node 4 (U-Z)]

Partitioning "partitions" the incoming set of rows into smaller subsets ("partitions") to be processed independently and concurrently by different nodes.

Partition Parallelism
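As a rough sketch of the range partitioning pictured on this slide, the Python below assigns names to four partitions by first letter. The ranges A-F, G-M, N-T, U-Z come from the slide; the function name `partition_by_last_name` and the sample names are assumptions made for illustration and are not part of Parallel Extender.

```python
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]  # one range per node


def partition_by_last_name(names):
    """Return one list of names per range, based on the first letter."""
    partitions = [[] for _ in RANGES]
    for name in names:
        first = name[0].upper()
        for index, (low, high) in enumerate(RANGES):
            if low <= first <= high:
                partitions[index].append(name)
                break
        else:
            # No range matched: fail loudly rather than drop the row.
            raise ValueError(f"no partition for {name!r}")
    return partitions


names = ["Adams", "Garcia", "Nguyen", "Young", "Baker", "Miller"]
partitions = partition_by_last_name(names)
```

Each inner list could then be handed to a separate worker, since no row appears in more than one partition.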

Page 15: Course DS314-PX v3[1].0


Putting It All Together: Parallel Dataflow

[Figure: Source Data is partitioned across parallel copies of the Transform, Clean, and Load stages, which are pipelined from Source to Target (Data Warehouse)]

Partition and Pipeline Parallelisms can Occur Simultaneously

Data partitioning and pipelining are "orthogonal" mechanisms

Page 16: Course DS314-PX v3[1].0


Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly

Without Landing To Disk!

[Figure: Source Data flows through Transform, Clean, and Load to the Data Warehouse (Target) with pipelining. The rows are partitioned (A-F, G-M, N-T, U-Z) and then repartitioned between stages; the keys shown are Customer last name, Customer zip code, and Credit card number.]

Repartitioning
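A minimal sketch of repartitioning between stages, under stated assumptions: it uses hash partitioning (the slide's figure shows key ranges instead), the field names `last_name` and `zip` are hypothetical, and `zlib.crc32` is chosen only because it gives the same result on every run. The rows move straight from one partitioning to the next in memory, which mirrors the "without landing to disk" point.

```python
import zlib


def hash_partition(rows, key, n_nodes):
    """Send each row to the partition chosen by a stable hash of row[key]."""
    parts = [[] for _ in range(n_nodes)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[digest % n_nodes].append(row)
    return parts


def repartition(parts, key, n_nodes):
    """Stream rows out of the existing partitions and partition them on a new key."""
    rows = (row for part in parts for row in part)
    return hash_partition(rows, key, n_nodes)


customers = [  # hypothetical sample rows
    {"last_name": "Adams", "zip": "01581"},
    {"last_name": "Baker", "zip": "94105"},
    {"last_name": "Adams", "zip": "94105"},
    {"last_name": "Young", "zip": "01581"},
]
by_name = hash_partition(customers, "last_name", 4)
by_zip = repartition(by_name, "zip", 4)
```

After the first step all rows sharing a last name sit in one partition; after the second, all rows sharing a zip code do.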

Page 17: Course DS314-PX v3[1].0


Some Popular Stages

• Usual ETL sources/targets:
– RDBMS, Sequential File, Data Set

• Combine Data:
– Lookup, Joins, Merge
– Aggregator

• Transform Data:
– Transformer, Remove Duplicates

• Ancillary:
– Row Generator, Peek, Sort

Page 18: Course DS314-PX v3[1].0


A Real-Life PX Job

Mini Star-Schema Warehousing: Merge and Aggregate

Page 19: Course DS314-PX v3[1].0


Another Real-Life PX Job

• Householding

• Alternative representation of same job:

Page 20: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

1a. Overview; Two Complete Jobs

1b. Client-Server Architecture

1c. The Job Sequencer

Page 21: Course DS314-PX v3[1].0


Client-Server Architecture: Overview

Client - Microsoft® Windows NT/2000/XP: Designer, Director, Manager, Administrator

Server – UNIX (AIX, Solaris, TRU64, HP-UX)

Page 22: Course DS314-PX v3[1].0


Client-Server Architecture: One Fat Server, Four Thin Clients

• Fat server does most of the work:
– Compiles, runs programs, generates output
– Keeps Repository
– Release 6.0: Unix (AIX, Solaris, Tru64, HP-UX)

• Four thin clients for administration and UI
– Administrator, Manager, Director, & Designer
– Need to be connected to Server to operate

• NT Server allows developing (but not running) PX jobs

• Clients can be bypassed
– OSH from Unix prompt

Page 23: Course DS314-PX v3[1].0


Client Hierarchy

• Clients:
– Administrator: handles all users' projects
– Manager: import/export jobs, metadata…
– Director: runs, monitors jobs, displays stdout/stderr
– Designer: GUI for creating, editing jobs, schemas

• Also:
– multiple users per server
– multiple projects per user
– multiple jobs (and related objects) per project

Page 24: Course DS314-PX v3[1].0


Environment Scoping

• Environment defaults set at install for all users
– Administrator can override settings for user, projects
– Designer can override in "Job Properties" on a per-job basis
– Director can override Job Properties from one run to the next, without recompile. Very handy for selecting, on a per-run basis, the level of
• parallelism
• reporting
• debugging
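The layering of overrides described above (install defaults, then Administrator, then Designer, then Director) behaves like a lookup chain in which the most specific level wins. The sketch below models that with Python's `collections.ChainMap`. It is an analogy only: `APT_CONFIG_FILE` is the variable named later in this guide, while `REPORTING`, the file names, and the values are hypothetical.

```python
from collections import ChainMap

# Each dict is one scope; all values here are invented for illustration.
install_defaults = {"APT_CONFIG_FILE": "default.apt", "REPORTING": "off"}
project_settings = {"REPORTING": "summary"}        # set in the Administrator
job_properties = {"APT_CONFIG_FILE": "4node.apt"}  # set in the Designer
run_overrides = {"REPORTING": "verbose"}           # set in the Director for one run

# ChainMap searches left to right, so the most specific scope goes first.
effective = ChainMap(run_overrides, job_properties, project_settings, install_defaults)
```

Looking up `effective["REPORTING"]` returns the Director's per-run value, while `effective["APT_CONFIG_FILE"]` falls through to the job-level setting because the run did not override it.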

Page 25: Course DS314-PX v3[1].0


The Administrator

• Add, remove projects

• Set project-wide attributes

Always check

Remember to renew license by that date!

Page 26: Course DS314-PX v3[1].0


The Administrator: Projects

Add New Projects

Add or Modify Project Properties

Project Listing

Page 27: Course DS314-PX v3[1].0


The Administrator: Project Properties

Make sure these are checked!

Review / Modify a Project’s Environment Variable Settings

Runtime Column Propagation – From the Project / General tab

Page 28: Course DS314-PX v3[1].0


Environment Variables: General

Page 29: Course DS314-PX v3[1].0


Environment Variables: Parallel

Page 30: Course DS314-PX v3[1].0


Environment Variables: Operator Specific

Page 31: Course DS314-PX v3[1].0


Environment Variables: Reporting

Page 32: Course DS314-PX v3[1].0


Environment Variables: Compiler

Page 33: Course DS314-PX v3[1].0


The Manager: Overview

Select this to provide more detailed view of all jobs.

Page 34: Course DS314-PX v3[1].0


The Manager: Exporting to File

Backing up the job designs of one "Category" in a .dsx file

Export + Import:

The only way to move jobs between projects

Page 35: Course DS314-PX v3[1].0


The Manager Importing External Definitions

Use The Manager to Import:
• COBOL Copybooks
• Framework Schemas
• Database Table Layouts
• Custom stage

Page 36: Course DS314-PX v3[1].0


The Manager Usage Analysis

Use The Manager to analyze where any given object is referenced.

Results show exactly where the object is being used.

Selecting Edit will open corresponding job

Page 37: Course DS314-PX v3[1].0


The Manager Configuration Editor

Use the Manager to Create / Edit / Check Configuration Files
• Configuration Files are saved under DataStage Server directory path.

Page 38: Course DS314-PX v3[1].0


The Designer

Enter Server Name, Username & Password, and select appropriate Project

Create a New Job

Open an Existing Job

Open a recently accessed job

Page 39: Course DS314-PX v3[1].0


The Designer: Paradigm

Drag Stages from Repository or Palette and drop on Parallel Canvas:

Page 40: Course DS314-PX v3[1].0


The Designer Stage Properties Editor

Required options are highlighted red & marked with ?

Pull-down menu of valid options

Property Icons:
• Non-Repeating Property with no dependents
• Non-Repeating Property with dependents
• Repeating Property with no dependents
• Repeating Property with dependents

Quick Help / Tips

Page 41: Course DS314-PX v3[1].0


The Designer Stage Input Link

Optional insertion of Sort and Rem. Dup.

Partitioner/Collectors inserted on Input links

View Input Column Definitions

One page per input link

Page 42: Course DS314-PX v3[1].0


The Designer Stage Output Link - Mapping Page

Column Mapping: Specify relationship between input and output columns, or how output columns are derived.

Click & Drag to create mappings

Will attempt to automatically map input to output

Output column definitions

Page 43: Course DS314-PX v3[1].0


The Designer Stage Output Link - Column Page

Output column definitions

Make sure this is checked!

One page per output link

Page 44: Course DS314-PX v3[1].0


The Director

• Typical Use: Director allows user to select jobs already compiled (from Designer), fill in run-specific environments and parameters, validate and:

1. Run
2. Display log messages
3. Inspect highlighted message
4. Inspect previous/next message

more on these steps next slide…

Page 45: Course DS314-PX v3[1].0


The Director

[Screenshot of the Director window with numbered callouts 1 to 4]

Page 46: Course DS314-PX v3[1].0


The Director

Typical Job Log Messages:

• Environment variables
• Configuration File information
• Framework Info/Warning/Error messages
• Output from the Peek Stage
• Additional info with "Reporting" environments
• Tracing/Debug output
– Must compile job in trace mode
– Adds overhead

Page 47: Course DS314-PX v3[1].0


Job Level Environments

• Job Properties, from Menu Bar of Designer
• Director will prompt you before each run

Page 48: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

1a. Overview; Two Complete Jobs

1b. Client-Server Architecture

1c. The Job Sequencer (optional material)

Page 49: Course DS314-PX v3[1].0


Managing Multiple Jobs: The Job Sequencer Canvas

• ‘Job Sequencer’ canvas controls job execution, conditional on successful completion (or failure, etc.) of other jobs
– Job Activity stages reference job paths and activity options
• Links between Job Activity stages specify the sequence of execution
• Triggers Tab states the condition for execution of the Job down the specified link
– All sequence stages, from the Repository pane:

Page 50: Course DS314-PX v3[1].0


Job Sequencer

• Job Sequence
– FirstJob is executed first
– If the job results in a run with warnings, SecondJob is executed

• Each job is referenced using ‘Job Activity’ Stage
• Sequence of execution is specified by linking the two stages
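The control flow of this two-job sequence can be written out as a short Python sketch. It is not how the Job Sequencer is implemented: `run_job` is a hypothetical callable standing in for whatever launches a job, and the status string `"warnings"` is an invented label for "run with warnings". Only the job names `FirstJob` and `SecondJob` come from the slide.

```python
def run_sequence(run_job):
    """Run FirstJob, then SecondJob only if FirstJob finished with warnings.

    run_job is a caller-supplied function: job name in, status string out.
    """
    executed = ["FirstJob"]
    status = run_job("FirstJob")
    if status == "warnings":  # the trigger condition on the link between the two stages
        executed.append("SecondJob")
        run_job("SecondJob")
    return executed


# Usage with stub launchers that always report the same status:
with_warnings = run_sequence(lambda name: "warnings")
clean_run = run_sequence(lambda name: "ok")
```

With the first stub both jobs run; with the second, the sequence stops after `FirstJob`.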

Page 51: Course DS314-PX v3[1].0


Job Activity Stage

For job named here, 4 Activity Options

Page 52: Course DS314-PX v3[1].0


Job Activity Stage Triggers Tab

Condition options that trigger Activity

Page 53: Course DS314-PX v3[1].0


Sequencer Stage

• Sequencer
– Specify ANY or ALL option
– In above sequencer, if either GeneratorPeek or GeneratorPeekSortDataset runs successfully, then DS314PXSequencer is run

Sequencer

Page 54: Course DS314-PX v3[1].0


Sequencer Stage: Sequencer Tab

Two modes:

• Any: Start Activity if any input condition holds

• All: Start Activity only if all input conditions hold
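The two modes map directly onto Boolean "or" and "and" over the incoming conditions. A minimal Python sketch follows; the function name `sequencer_fires` and the mode strings are chosen here for illustration.

```python
def sequencer_fires(mode, input_conditions):
    """Decide whether the downstream activity starts.

    mode is "any" or "all"; input_conditions is a list of booleans,
    one per input link of the Sequencer stage.
    """
    if mode == "any":
        return any(input_conditions)
    if mode == "all":
        return all(input_conditions)
    raise ValueError(f"unknown mode: {mode!r}")


# One upstream job succeeded, the other did not:
fires_any = sequencer_fires("any", [True, False])
fires_all = sequencer_fires("all", [True, False])
```

Here `fires_any` is true and `fires_all` is false, matching the ANY example on the previous slide, where one successful upstream job is enough to start the next one.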

Page 55: Course DS314-PX v3[1].0


Notification

Notification Stage

Page 56: Course DS314-PX v3[1].0


Notification Activity

Page 57: Course DS314-PX v3[1].0


Notification Activity (cont’d)

• Sample DataStage log from Mail Notification

Page 58: Course DS314-PX v3[1].0


Notification Activity (cont’d)

• E-Mail Message

Page 59: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 60: Course DS314-PX v3[1].0


Agenda

2. Concepts in Parallel Computing

2a. Zoology of Parallel H/W

2b. DS-PX Program Elements

2c. Three Types of Parallelisms

Page 61: Course DS314-PX v3[1].0


Scalable Systems

• Parallel processing = executing on multiple CPUs
• Scalable processing = add more resources (CPUs, RAM, and disks) to increase performance

[Figure: units numbered 1 to 6]

Page 62: Course DS314-PX v3[1].0


Scalable Systems: Examples

Two main types of scalable systems

• Symmetric MultiProcessors (SMP), shared memory
– Sun Starfire™
– IBM S80
– Compaq GS Series
– HP Superdome

• Clusters: UNIX systems connected via networks
– Sun Cluster
– Compaq TruCluster

• Massively Parallel Computers (MPP)
– IBM SP (formerly SP/2)

Page 63: Course DS314-PX v3[1].0


Scalable Architectures: Typology

• Symmetric Multiprocessor (SMP): Shared Everything
[Figure: four CPUs attached to one Shared Memory]

• Loosely-Coupled Clusters
• Massively Parallel Systems (MPP)
Both are Shared Nothing
[Figure: four CPUs, each with its own Memory]

Page 64: Course DS314-PX v3[1].0


SMP: Shared Everything

When used with Parallel Extender:
• Data transport uses shared memory
• Simpler install and startup

Parallel Extender treats NUMA (NonUniform Memory Access) as plain SMP

[Figure, Shared Everything: four CPUs attached to one Shared Memory]

Page 65: Course DS314-PX v3[1].0


MPP: Shared Nothing

• Cluster of multiple independent systems (e.g., Unix boxes) connected by a medium-speed network, typically Ethernet.

• True MPP: cluster in a single box, connected by high-speed switch.

DS-PX treats them the same.

[Figure, Shared Nothing: four CPUs, each with its own Memory]

Page 66: Course DS314-PX v3[1].0


Agenda

2. Concepts in Parallel Computing

2a. Zoology of Parallel H/W

2b. DS-PX Program Elements

2c. Three Types of Parallelisms

Page 67: Course DS314-PX v3[1].0


DS-PX Program Elements: Datasets, Partitions, Nodes

• Dataset: uniform set of rows in the Framework's internal representation
- Two flavors:
1. persistent: *.ds : stored in multiple Unix files in Framework format, read and written using the DataSet Stage
2. virtual: *.v : links, in Framework format, NOT stored on disk

• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).

Page 68: Course DS314-PX v3[1].0


DS-PX Program Elements: Persistent Datasets

• Accessed from/to disk with DataSet Stage.
• Two parts:
– Descriptor file:
• contains metadata, data location, but NOT the data itself
– Data file(s)
• contain the data
• multiple Unix files (one per node), accessible in parallel

[Figure: the descriptor file input.ds holds the schema
record ( partno: int32; description: string; )
and the data file locations
node1:/local/disk1/…
node2:/local/disk2/…]

Page 69: Course DS314-PX v3[1].0


Persistent Datasets: Partitioned Storage

• Datasets remain data-partitioned when stored to disks, allowing parallel I/O in addition to parallel processing
• They implement end-to-end parallelism
• Whenever possible, it is advantageous to use DataSet Stages rather than SequentialFile Stages in flows such as [figure not captured in transcript]
• Second advantage: DataSet Stages avoid the cost of data translation (see Module 4)

Page 70: Course DS314-PX v3[1].0


DS-PX Program Elements: Partitioners and Collectors

• Partitioners distribute rows into partitions (No icon)
– implement data-partition parallelism

• Collectors = inverse partitioners

• Live on input links of stages running
– in parallel (partitioners)
– sequentially (collectors)

• Use a choice of methods (see next Module)
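To make "collectors = inverse partitioners" concrete, here is a small Python sketch of one partitioner and one collector. The round-robin method and the simple "one partition after another" collection order are picked only for illustration (the methods actually available are covered in the next module), and the function names are invented.

```python
def round_robin_partition(rows, n_partitions):
    """Deal rows out to partitions in turn, like dealing cards."""
    partitions = [[] for _ in range(n_partitions)]
    for index, row in enumerate(rows):
        partitions[index % n_partitions].append(row)
    return partitions


def collect(partitions):
    """Merge all partitions back into a single sequential stream of rows."""
    return [row for partition in partitions for row in partition]


rows = list(range(7))
partitions = round_robin_partition(rows, 3)
collected = collect(partitions)
```

Collecting returns every row exactly once, but not necessarily in the original order: here `collected` starts with 0, 3, 6 because the first partition is emptied first.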

Page 71: Course DS314-PX v3[1].0


Quiz!

• True or False?
Everything that has been data-partitioned must eventually be collected

• Hint:
Two slides up: leave the source SequentialFile stages as they are but replace the target SequentialFile with a DataSet stage.

Answer: FALSE!

Counterexample:

Page 72: Course DS314-PX v3[1].0


PX Program Elements: Stages

• Act on data.
• Can be parallel or sequential
• Parallel Extender supports:
– Over 40 Built-In stages
– Custom stages
• Multiple Type
• All use same generic intelligent stage editor
• Supports Stream, Reference and Reject links

Page 73: Course DS314-PX v3[1].0


Stages Control Partition Parallelism

• Execution mode (sequential/parallel) is controlled by Stage
– default = parallel for most Ascential-supplied Stages
– user can override default mode
– parallel Stage inserts the default partitioner (Auto) on its input links
– sequential Stage inserts the default collector (Auto) on its input links
– user can override default
• execution mode (parallel/sequential) of Stage (Advanced tab)
• choice of partitioner/collector

• Degree of parallelism (How many nodes?) is determined by the configuration file
– Total number of logical nodes in nameless default pool, or a subset using "constraints" (Advanced Topic).

Page 74: Course DS314-PX v3[1].0


Agenda

2. Concepts in Parallel Computing

2a. Zoology of Parallel H/W

2b. DS-PX Program Elements

2c. Three Types of Parallelisms

Page 75: Course DS314-PX v3[1].0


Flow-Based Programming I: Sequential Features

[Figure: a flow of stages labeled Sort, Derivation, Sample, Lookup, Constraint]

• Reusable operators, data and metadata are connected into a logical flow
– Data driven vs. demand driven
– Eager vs. lazy
– Push vs. pull

Page 76: Course DS314-PX v3[1].0


Flow-Based Programming II: Parallel Features

• Three types of parallelism may occur in this picture:
>> 1 explicit
• e.g., constraint
>> 2 implicit
• pipeline
• data partitioning

[Figure label: Constraint]

Page 77: Course DS314-PX v3[1].0


Three Types of Parallelism

[Figure: the flow of stages Sort, Derivation, Sample, Lookup, Constraint, with boxes labeled explicit, pipeline, and data-partition]

• Explicit parallelism

• Implicit pipeline "parallelism"

• Implicit data-partition parallelism
QUIZ! Where should the blue box go?

Page 78: Course DS314-PX v3[1].0


Implicit Parallelisms: Partitioned and Pipeline

[Figure with two panels]

Data Partition Parallelism
Multiple data partitions: node 1 and node 2 execute Stage 1, each on their own partition, independently and concurrently

Pipeline 'Parallelism'
One data partition: CPU 2 starts executing Stage 2 before completion of Stage 1 by CPU 1

Page 79: Course DS314-PX v3[1].0


Another view on Pipeline Multiprocessing

• Sequential jobs
– Process a row at a time all the way through all operations

• Pipelined processing
– Operations run whenever data available
– Except at boundaries
– Operations in separate processes with inter-process buffering

[Figure: the operations import, clean, and transform drawn against a time axis]
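The "separate processes with inter-process buffering" arrangement can be imitated in Python with one worker per operation and a bounded queue between neighbors. This is a sketch under stated assumptions: threads stand in for the Unix processes, `queue.Queue` stands in for the Framework's buffers, and the names `stage`, `run_pipeline`, and `DONE` are invented here.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of the row stream


def stage(func, inbox, outbox):
    """Apply func to each row as soon as it arrives, then pass the sentinel on."""
    while True:
        row = inbox.get()
        if row is DONE:
            outbox.put(DONE)
            return
        outbox.put(func(row))


def run_pipeline(rows, funcs, buffer_size=2):
    """Run each function in its own thread, connected by bounded buffers."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(funcs) + 1)]

    def feed():
        for row in rows:
            queues[0].put(row)  # blocks when the first buffer is full
        queues[0].put(DONE)

    workers = [threading.Thread(target=feed)]
    workers += [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for worker in workers:
        worker.start()

    results = []
    while True:  # the caller drains the last buffer while the stages run
        row = queues[-1].get()
        if row is DONE:
            break
        results.append(row)
    for worker in workers:
        worker.join()
    return results


cleaned = run_pipeline([" a", "b "], [str.strip, str.upper])
```

Because each buffer holds only `buffer_size` rows, an operation that runs ahead simply waits for its neighbor, and no stage ever needs the whole row set at once.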

Page 80: Course DS314-PX v3[1].0


Runtime Unix Processes

Given a Job Design:

…Parallel Extender creates at runtime N Unix processes for each Stage, where N is the number of logical nodes defined in the configuration file (NOT in the Job Design, which is good for any value of N).
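The rule on this slide amounts to simple arithmetic, sketched below in Python. The sketch assumes that a stage set to sequential mode runs as a single process, and it ignores any coordinating processes or runtime optimizations; the function name and the mode strings are hypothetical.

```python
def runtime_process_count(stage_modes, n_nodes):
    """Estimate the number of stage processes for a job design.

    stage_modes lists "parallel" or "sequential" for each stage;
    n_nodes is the number of logical nodes in the configuration file.
    Assumption: a sequential stage runs as one process.
    """
    return sum(n_nodes if mode == "parallel" else 1 for mode in stage_modes)


# The same three-stage design under two different configuration files:
on_one_node = runtime_process_count(["parallel"] * 3, 1)
on_four_nodes = runtime_process_count(["parallel"] * 3, 4)
```

The design itself does not change between the two calls; only the node count taken from the configuration file does.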

Page 81: Course DS314-PX v3[1].0


The Configuration File

{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}

Two key aspects:

1. # nodes declared

2. defining subset of resources "pool" for execution under "constraints," i.e., using a subset of resources (Advanced Topic)

Advanced topic!
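To check the first key aspect (the number of nodes declared) without opening the file by hand, a small Python sketch can list the nodes and their pools. It assumes the file follows the simplified layout shown on this slide; `CONFIG_TEXT`, `summarize_config`, and the regular expressions are illustrative and are not part of Parallel Extender.

```python
import re

# Shortened copy of the slide's configuration, used only as sample input.
CONFIG_TEXT = '''
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource scratchdisk "/temp" {}
  }
}
'''

NODE_RE = re.compile(r'\bnode\s+"([^"]+)"\s*\{')
POOL_RE = re.compile(r'\bpool((?:\s+"[^"]*")+)')


def summarize_config(text):
    """Return {node name: [pool names]} in declaration order."""
    summary = {}
    matches = list(NODE_RE.finditer(text))
    for i, match in enumerate(matches):
        # A node's body runs up to the next node declaration (or end of text).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end]
        pool = POOL_RE.search(body)
        names = re.findall(r'"([^"]*)"', pool.group(1)) if pool else []
        summary[match.group(1)] = names
    return summary


if __name__ == "__main__":
    nodes = summarize_config(CONFIG_TEXT)
    print(len(nodes), "nodes declared:", list(nodes))
```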

Page 82: Course DS314-PX v3[1].0


Config File Key Aspect #1: # Nodes Declared

• Rather than editing one config file, keep several versions!

• In Director, you can switch config files before each run, e.g.:
  • '1-node' file - for sequential execution, lighter reports, handy for testing
  - aims at max pipeline
  • 'MedN-nodes' file - aims at a mix of pipeline and data-partitioned parallelisms
  • 'BigN-nodes' file - aims at full data-partitioned parallelism

• Only one file is active while a job is running
  • the one pointed to by the environment variable $APT_CONFIG_FILE

• # nodes declared in the config file need not match # CPUs
  • The same configuration file can be used on development and target machines

Page 83: Course DS314-PX v3[1].0


Quiz! Is this scenario plausible?

1. Anna Liz runs the flow below using a one-node configuration file. Her computer has one CPU only.

2. At dinner, she breaks a tooth on an olive pit.

3. At exactly midnight, the Tooth Fairy secretly adds a second CPU to Anna Liz's computer.

4. The next morning, Anna Liz runs the same flow with the same data and the same one-node configuration file.

5. Anna Liz's program runs much faster.

The correct answer is:
NO: the Tooth Fairy does not exist.
YES! Pipeline Multiprocessing: 4 Unix processes keep both CPUs busy.

[Figure: the flow, with stages labeled SeqFile, SeqFile, Derivation, Sample]

Page 84: Course DS314-PX v3[1].0


QUIZ!

• True or False?

The Sort stage buffers all the incoming rows (in the partition) before outputting the first outgoing row.

The Sort stage inhibits pipeline multiprocessing.

The Sort stage creates a synchronization point: the fastest node will wait for all nodes to complete the Sort Stage before they all attack the next stage.

Page 85: Course DS314-PX v3[1].0


Answers

The Sort stage buffers all the incoming rows (in the partition) before outputting the first outgoing row.
TRUE! The Sort stage needs to read all the rows before deciding which row to output first.

The Sort stage inhibits pipeline multiprocessing.
TRUE! Buffering inhibits streaming: the stages downstream must wait until Sort completes reading all the rows.

The Sort stage creates a synchronization point: the fastest node will wait for all nodes to complete the Sort Stage before they all attack the next stage.
FALSE! Nodes independently process their own instantiation of the stages; partitions do not communicate. The only way to ensure synchronization is to land to disk, terminate the job, and start a new job -- See Lab 4a.

Page 86: Course DS314-PX v3[1].0


LABS

• Lab 1: Getting Around

• Lab 2: A Simple Parallel Extender Program

• Lab 3: Modify Job, Create a Dataset

Learn by doing!

Page 87: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Custom Stages

Page 88: Course DS314-PX v3[1].0


Partitioning and Collecting Data

In a nutshell:

- To distribute rows among nodes, Parallel Extender employs an effective default method. The user can override the default with a choice of alternative methods. (Partitioning)

- The same applies for programs that require recollecting the rows into a sequential stream. (Collecting)

Page 89: Course DS314-PX v3[1].0


Partitioning and Collecting Data

• Partitioning breaks the dataset into smaller sets of rows (partitions), which can be processed independently by multiple nodes. Each node executes in parallel its own instantiation of the stages.

• Collecting brings back data partitions into a sequential stream.

Page 90: Course DS314-PX v3[1].0


Partitioning Methods

Partitioning methods:

– (Auto)       Parallel Extender decides (default): Same or Round Robin
– Same         Existing partitioning is not altered
– Round Robin  Rows are alternated among partitions
– Entire       Each partition gets the entire dataset (rows duplicated)
– Random       Rows randomly assigned to partitions
– Hash         Rows with same key column value go to the same partition
– Range        Similar to hash, but partition mapping is user-determined and partitions are ordered
– Modulus      Assigns each row of an input dataset to a partition, as determined by a specified numeric key column in the input dataset
– DB2          Matches DB2 EEE hash semantics

Page 91: Course DS314-PX v3[1].0


Same: rows retain current distribution

[Figure: row IDs in three partitions, before and after: {0, 3, 6}, {1, 4, 7}, {2, 5, 8} in both cases]

Partitioning and Collecting Data Partitioning Methods

Same = identity transform, does nothing

Page 92: Course DS314-PX v3[1].0


Partitioning and Collecting Data Partitioning Methods

Round Robin: rows are distributed evenly among partitions, as in dealing cards

[Figure: input stream … 8 7 6 5 4 3 2 1 0; the three partitions receive rows {0, 3, 6}, {1, 4, 7}, {2, 5, 8}]
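The card-dealing analogy can be written as a few lines of Python. This is a sketch of the idea only; the function name `round_robin_partition` is made up for the example.

```python
def round_robin_partition(rows, num_partitions):
    """Deal rows to partitions in turn, like dealing cards to players."""
    partitions = [[] for _ in range(num_partitions)]
    for position, row in enumerate(rows):
        partitions[position % num_partitions].append(row)
    return partitions


if __name__ == "__main__":
    # Rows 0..8 dealt to three partitions.
    print(round_robin_partition(range(9), 3))
```

With rows 0 to 8 and three partitions, the result is [0, 3, 6], [1, 4, 7], [2, 5, 8], the same distribution as in the slide's figure.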

Page 93: Course DS314-PX v3[1].0


Entire: each partition gets a complete copy of the data

• Useful for distributing lookup tables, parameter files, etc.

• WARNING: Increases the data volume!

[Figure: input stream … 8 7 6 5 4 3 2 1 0; each of the three partitions receives the complete sequence 0, 1, 2, 3, …]

Partitioning and Collecting Data Partitioning Methods

Page 94: Course DS314-PX v3[1].0


Hash: rows are distributed according to the values in one or more user-defined key columns

• Rows with identical values in key columns end up in the same partition

• Prevents "matching" rows (such as sought by the Remove Duplicates, Joins, and Merge Stages) from hiding in other partitions.

Partitioning and Collecting Data Partitioning Methods

Page 95: Course DS314-PX v3[1].0


Hash (continued):

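[Figure: values of key column in the input stream: … 0 3 2 1 0 2 3 2 1 1; after hashing, the three partitions hold key values {0, 3, 0, 3}, {1, 1, 1}, {2, 2, 2}]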
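A minimal Python sketch of hash partitioning follows. The hash function used here (CRC-32 of the key's text) is an assumption chosen because it is repeatable from run to run; it is not the function Parallel Extender uses, and `hash_partition` is an illustrative name.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Send every row to the partition selected by a hash of its key column."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        # Stand-in hash: CRC-32 of the key rendered as text.
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


if __name__ == "__main__":
    rows = [{"key": k} for k in [1, 1, 2, 3, 2, 0, 1, 2, 3, 0]]
    for number, partition in enumerate(hash_partition(rows, "key", 3)):
        print(number, [row["key"] for row in partition])
```

Whatever hash is used, the property that matters is the one on the slide: rows with identical key values always land in the same partition. Partition sizes may be uneven.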

Partitioning and Collecting Data Partitioning Methods

Page 96: Course DS314-PX v3[1].0


Range: an expensive refinement of Hash

• A given partition will contain only rows with key values within some "range."

• Must first run the 'Write Range Map' Stage.

Partitioning and Collecting Data Partitioning Methods

Page 97: Course DS314-PX v3[1].0


Range (continued):

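[Figure: key values in the input stream: 4 0 5 1 6 0 5 4 3; after range partitioning, the three partitions hold key values {0, 1, 0}, {4, 4, 3}, {5, 6, 5}]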

Partitioning and Collecting Data Partitioning Methods

• QUIZ! If incoming data is ordered on key, something bad happens. WHAT?
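A minimal Python sketch of range partitioning, assuming a hypothetical range map given as a sorted list of boundaries. In Parallel Extender the map is produced by the 'Write Range Map' Stage; the boundaries [3, 5] and the name `range_partition` are chosen here only to mirror the key values in the figure.

```python
from bisect import bisect_right


def range_partition(keys, boundaries):
    """Partition i holds the keys k with boundaries[i-1] <= k < boundaries[i]."""
    partitions = [[] for _ in range(len(boundaries) + 1)]
    for key in keys:
        partitions[bisect_right(boundaries, key)].append(key)
    return partitions


if __name__ == "__main__":
    # Keys below 3 go to partition 0, keys 3..4 to partition 1, the rest to 2.
    print(range_partition([4, 0, 5, 1, 6, 0, 5, 4, 3], [3, 5]))
```

Unlike hash, the partitions are ordered: every key in partition 0 is smaller than every key in partition 1, and so on.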

Page 98: Course DS314-PX v3[1].0


Partitioning and Collecting Data Partitioning Methods

Modulus: an inexpensive form of Hash

• Allows one key column, of type integer:

• Partition = key_value (mod # of partitions)
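The formula translates directly into Python. This is a sketch with an illustrative function name; note that Python's `%` always returns a non-negative result for a positive divisor, which may differ from other languages when keys are negative.

```python
def modulus_partition(keys, num_partitions):
    """Assign each integer key to partition key % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


if __name__ == "__main__":
    print(modulus_partition([10, 11, 12, 13, 14, 15], 3))
```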

Page 99: Course DS314-PX v3[1].0


Partitioning and Collecting Data Collectors

• Collectors combine partitions of a dataset into a single input stream to a sequential Stage

[Figure: data partitions (NOT links) feed a collector, which feeds a sequential Stage]

Page 100: Course DS314-PX v3[1].0


Partitioning and Collecting Data Collectors (continued)

Collector Methods:

– (Auto) Eagerly read any row from any input partition (default, non-deterministic)

– Round Robin Patiently pick row from input partitions

– Ordered Read all rows from first partition, then second,… (e.g., print chapter 1, then 2,…)

– Sort Merge Read rows based on key columns, produce fully sorted from within-partition sorted (non-deterministic on un-keyed columns)
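The three deterministic collector methods can be sketched in Python, with in-memory lists standing in for partitions. The function names are illustrative. The (Auto) method is not shown because its output order depends on run-time timing.

```python
import heapq
from itertools import chain, zip_longest

_MISSING = object()  # filler for partitions that run out of rows early


def collect_ordered(partitions):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def collect_round_robin(partitions):
    """Pick one row from each partition in turn."""
    collected = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        collected.extend(row for row in group if row is not _MISSING)
    return collected


def collect_sort_merge(partitions, key=None):
    """Merge partitions that are each already sorted into one sorted stream."""
    return list(heapq.merge(*partitions, key=key))


if __name__ == "__main__":
    parts = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
    print(collect_ordered(parts))
    print(collect_round_robin(parts))
    print(collect_sort_merge(parts))
```

Sort Merge only produces a fully sorted stream if each input partition is already sorted on the key, which is why the slide says "from within-partition sorted".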

Page 101: Course DS314-PX v3[1].0


Collectors:Non-Deterministic Execution

Collectors can cause surprises:

• Default (Auto) is "eager" to output rows and may be nondeterministic: row order may vary from run to run.

• This can be seen using the most common "effective collectors": the standard out/error

Page 102: Course DS314-PX v3[1].0


Simple Example of Partitioning & Collecting

[Figure: job design; the links are labeled "old" and "new"]

The sequential Generator generates 4 rows with one integer column named "old" and taking values 0,1,2,3.

Default Round Robin happens to make the values 0,1,2,3 label the partition #

The First Peek transports data and has a side effect: it sends messages to the monitor (not shown). It also renames the column from "old" to "new"

The Second Peek sends data to the bit bucket and also sends messages

Page 103: Course DS314-PX v3[1].0


Job Log Shows Messages out of Flow Order

[Figure: job log, annotated with a time axis and a snapshot marker]

Example (line 5 above):

SecondPeek,2: new: 2

= SecondPeek, partition 2, showed value 2 in column "new"

SecondPeek sends messages prior to completion by FirstPeek

Page 104: Course DS314-PX v3[1].0


What Happened? The Standard Out Acted as a Collector

Job Design:

[Figure: job design]

Conceptually correct, but impractical deployment:

[Figure: partitions 0, 1, 2, 3] No collector!

Actual deployment:

[Figure: partitions 0, 1, 2, 3] Hidden collector: the Conductor Node collects messages and sends a sequential stream to standard out

Page 105: Course DS314-PX v3[1].0


Sequence of Events

Data transport (one row/partition, values 0,1,2,3)

Side effects: messages to log from First, Second Peek

Snapshot during run. Row in partition:
– 2 passed both Peeks
– 1 and 3 passed FirstPeek only
– 0 still upstream of FirstPeek

[Figure: position of each row in partitions p0 to p3 (partition IDs), at the snapshot during the run and after job completion]

Page 106: Course DS314-PX v3[1].0


Synchronization Points

• Repeated runs yield different message orders, in violation of determinism ("same causes, same effects")

• SecondPeek's 1st message often precedes FirstPeek's last message, in apparent violation of flow order

– In Lab 4a, you will force flow order while maintaining parallelism: SecondPeek will wait for completion of FirstPeek. You will create a "synchronization point." (This lab is optional.)

Page 107: Course DS314-PX v3[1].0


DUMP SCORE Output

Setting APT_DUMP_SCORE yields:

[Figure: score dump from the job log. Callouts: "Double-click"; "Mapping Node--> partition"; "Confirms no combination occurred"]

QUIZ! Why 9 Unix processes?

Page 108: Course DS314-PX v3[1].0


DataSets: Best of Both Worlds

• Our previous design faced a deployment dilemma:

– Conceptually correct but impractical (N stdouts), or
– Feasible (1 stdout) but conceptually weak (unnecessary collector)

• Adding a DataSet stage solves the dilemma and keeps the best of both worlds:

– Single file view: "twin.ds", but
– Parallelism is maintained: N partitions stored in N actual files

– Note: other stages (FileSet, RDBMS) achieve the same

Page 109: Course DS314-PX v3[1].0


The Copy Stage

With 1 link in, 1 link out:

the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column

… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop.

Page 110: Course DS314-PX v3[1].0


The Column Generator Stage

• Affixes new column(s) to incoming rows

• Intrinsic functions:
  'part' = current partition number
  'partcount' = number of partitions

• Same format properties as Row Generator, e.g.,
  Type = Cycle
  Initial value =
  Increment =
  Limit =

• Examples:
  Partition number (Initial value = 'part', Increment = 0)
  Unique row ID (Initial value = 'part', Increment = 'partcount'), usable as surrogate key
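Why Initial value = 'part' with Increment = 'partcount' gives unique IDs can be seen in a short Python sketch (the function name is illustrative): partition p generates p, p + partcount, p + 2·partcount, …, so no two partitions ever produce the same number.

```python
def generate_row_ids(part, partcount, num_rows):
    """Cycle with Initial value = part and Increment = partcount."""
    return [part + i * partcount for i in range(num_rows)]


if __name__ == "__main__":
    # Three partitions, four rows each.
    for part in range(3):
        print(part, generate_row_ids(part, 3, 4))
```

Each partition can number its own rows without communicating with the others, which is what makes the scheme usable as a surrogate key in a parallel job.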

Page 111: Course DS314-PX v3[1].0


LABS

• Lab 4a: Synchronization (optional)

• Lab 4b: Explore Partitioners and Collectors

Learn by doing!

Page 112: Course DS314-PX v3[1].0


Page 113: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data (SequentialFile)

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Custom Stages

Page 114: Course DS314-PX v3[1].0


The Sequential File Stage: Importing/Exporting Data

In a nutshell:

– If one knows how some data was written: Parallel Extender can read it!

– If one specifies how some data must be read: Parallel Extender can write it!

Sequential File

Page 115: Course DS314-PX v3[1].0


The Sequential File Stage: Introduction

• The Framework processes only datasets

• For files other than datasets, Parallel Extender must perform import and export operations (i.e., format translations in addition to extract and load)

• External data formats fall in two major categories:
– Favorable: the format translation is automatic or semi-automatic
  • data stored in a relational database (DB2, Informix, Oracle, Teradata)
  • data stored in a SAS data set
  • data stored in a COBOL data file
– Other: user needs to specify external formats manually
  • everything else: flat text files, binary files

Use the SequentialFile Stage

Page 116: Course DS314-PX v3[1].0


Importing/Exporting Data Data Compatibility

How is the data described to Parallel Extender?

• Relational data: automatic
  Parallel Extender gets the information directly from the RDBMS catalog

• SAS datasets: automatic
  Parallel Extender gets the information directly from the SAS header

• COBOL data from IBM mainframes: semi-automatic
  Parallel Extender can convert a COBOL record layout into an appropriate import/export table definition

• Generic data (binary from any source, Unix text files): user creates/imports metadata
– Table Definition
– Framework "schema"

Page 117: Course DS314-PX v3[1].0


From one Dimension to Two

• Memory is one-dimensional, tables are two-dimensional

• Two major steps:
  1. recordization: identify rows
  2. column parsing: within a row, identify columns

• Tools: fixed length, prefixes, delimiters
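The two steps can be sketched in Python on a small text sample. Only two of the three tools are shown (delimiters and fixed length; prefixes are omitted), and all function names are illustrative.

```python
def recordize(data, record_delim="\n"):
    """Step 1: cut the one-dimensional stream into records (rows)."""
    records = data.split(record_delim)
    if records and records[-1] == "":
        records.pop()  # drop the empty piece after a trailing delimiter
    return records


def parse_delimited(record, column_delim=","):
    """Step 2, with delimiters: cut a record into columns."""
    return record.split(column_delim)


def parse_fixed(record, widths):
    """Step 2, with fixed lengths: cut a record at known column widths."""
    columns, start = [], 0
    for width in widths:
        columns.append(record[start:start + width])
        start += width
    return columns


if __name__ == "__main__":
    stream = "Smith,John\nDoe,Jane\n"
    print([parse_delimited(r) for r in recordize(stream)])
    print(parse_fixed("MA01581", [2, 5]))
```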

Page 118: Course DS314-PX v3[1].0


Additional Format Conversions

– Often additional data conversion is needed, e.g.,
  – convert “numbers as text” into binary data
  – may have to do character set conversion (EBCDIC to ASCII)
  – may have to process packed decimal data
  – may have to scan dates in various formats
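Each of these conversions can be illustrated with standard Python. The sketch makes several assumptions that are not from the course: code page cp037 as the EBCDIC variant, the common packed-decimal layout (two digits per byte, sign in the last half-byte, 0xD meaning negative), and made-up function names.

```python
from datetime import datetime
from decimal import Decimal


def text_to_int(text):
    """Convert a "number as text" into a binary integer."""
    return int(text.strip())


def ebcdic_to_text(raw):
    """Character set conversion, assuming EBCDIC code page 037."""
    return raw.decode("cp037")


def unpack_packed_decimal(raw, scale=0):
    """Decode packed decimal: two digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign = nibbles.pop()
    value = int("".join(str(digit) for digit in nibbles))
    if sign == 0x0D:  # 0xD marks a negative number in this layout
        value = -value
    return Decimal(value).scaleb(-scale)


def scan_date(text, formats):
    """Try each candidate format until one matches."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("no format matched: " + repr(text))


if __name__ == "__main__":
    print(text_to_int(" 28500 "))
    print(ebcdic_to_text(bytes.fromhex("C8C5D3D3D6")))
    print(unpack_packed_decimal(bytes.fromhex("12345C"), scale=2))
    print(scan_date("2003-02-28", ["%m/%d/%Y", "%Y-%m-%d"]))
```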

Page 119: Course DS314-PX v3[1].0


Using the SequentialFile Stage

Importing/Exporting Data

Both import and export of general files (text, binary) are performed by the SequentialFile Stage.

– Data import:

– Data export

QUIZ! True or false: the links here represent the flat files.

Page 120: Course DS314-PX v3[1].0


Sequential File Stage

GUI can generate arbitrary Framework schemas

Page 121: Course DS314-PX v3[1].0


Textual Representation of Schemas

Text is an equivalent alternative to the representation in the previous slide

• Text can be imported into a Table Definition

• Schemafile option for one-time use by SequentialFile and Generator Stages

• Example:

record (
  name : string[];
  zip : int32;
  income : sfloat;
)

Page 122: Course DS314-PX v3[1].0


Another Look at Datasets

• Contain data organized as rows
• Provide “single-file” view of data stored in parallel
• Include metadata information: “schema”
• Support many data types

dataset:

First Name  Last Name  Birthdate  Gender  Salary  Dependents
John        Smith      1/3/62     M       28,500  1
Jane        Doe        4/23/61    F       31,350  2
Ralph       Mann       11/14/57   M       41,200  5

row:

John  Smith  1/3/62  M  $28,500  1

Columns, defined by
- schema's "unformatted core":
  • column name
  • data type (string, date, etc.)
  • nullability
- format properties, for external representation, listed within {} in schemas

Page 123: Course DS314-PX v3[1].0


Nullable Columns

All data types are nullable

– Null columns do not have a value

– DS-PX null is represented by an out-of-band indicator

– Nulls can be detected by any stage

– Nulls can be converted to/from a value

– Null columns can be ignored by a stage, or can trigger an error or other action

Page 124: Course DS314-PX v3[1].0


Textual-GUI Mapping

Unformatted Core = Columns Tab

Format Properties { } = Format Tab (in Parallel Stages) = Parallel Tab (in Table Definitions)

Page 125: Course DS314-PX v3[1].0


PX Program Elements: Table/Column Definitions

Textual Equivalent:

record {final_delim=end, delim=',', quote=double}
(
  LastName :string[max=25];
  FirstName :string[max=15];
  Address :string[max=60];
  City :string[max=20];
  State :string[2];
  Zip :string[5];
)
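To make the format properties concrete, here is a Python sketch that reads text laid out the way this schema describes: comma-delimited columns, double-quoted values, and the record ending after the last column. It is an approximation using Python's csv module, not the Parallel Extender import operator; `COLUMNS` and `import_rows` are illustrative names, and the length checks are an assumption about how `string[max=N]` and `string[N]` would be enforced.

```python
import csv
import io

# (column name, length, fixed): string[max=N] is a limit, string[N] is exact.
COLUMNS = [
    ("LastName", 25, False),
    ("FirstName", 15, False),
    ("Address", 60, False),
    ("City", 20, False),
    ("State", 2, True),
    ("Zip", 5, True),
]


def import_rows(text):
    """Parse delim=',' quote=double text into one dict per row."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(fields)))
        row = {}
        for (name, length, fixed), value in zip(COLUMNS, fields):
            if fixed and len(value) != length:
                raise ValueError("%s must be exactly %d characters" % (name, length))
            if not fixed and len(value) > length:
                raise ValueError("%s exceeds %d characters" % (name, length))
            row[name] = value
        rows.append(row)
    return rows


if __name__ == "__main__":
    sample = 'Smith,John,"12 Main St, Apt 4",Westboro,MA,01581\n'
    print(import_rows(sample))
```

The quoted Address value shows why the quote property matters: the comma inside the quotes is data, not a column delimiter.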

Page 126: Course DS314-PX v3[1].0


Importing/Exporting Data Schema Import for COBOL

• COBOL schema import takes COBOL program, program excerpt, or data definition file (“copy book”)

• Creates schema from an FD (File Description) section, or 01 Level statement

• Generated schema can be used to import from or export to a COBOL data file.

Page 127: Course DS314-PX v3[1].0


LABS

• Lab 5a: Simple Format Translation

• Lab 5b: Flat File Import

• Lab 6: Cobol Data Import (Optional)

Learn by doing!

Page 128: Course DS314-PX v3[1].0


Page 129: Course DS314-PX v3[1].0


Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 130: Course DS314-PX v3[1].0


Combining Data

• There are two ways to combine data:

– Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,
  • Joins
  • Lookup
  • Merge

– Vertically: One input link, output with column combining values from all input rows. E.g.,
  • Aggregator

Page 131: Course DS314-PX v3[1].0


Agenda

5. Overview of Some Parallel Extender Stages
5a. Combining Data, Horizontally: Join/Lookup/Merge
5b. Combining Data, Vertically: Aggregator
5c. The Transformer

Page 132: Course DS314-PX v3[1].0


The Join, Lookup & Merge Stages: Introduction

• These "three Stages" combine two or more input links according to values of user-designated "key" column(s).

• They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)

• Complete summary on slide 153

Page 133: Course DS314-PX v3[1].0


Joins - Lookup - Merge: Not all Links are Created Equal!

• Parallel Extender distinguishes between:
- The Primary Input (Framework port 0)
- Secondary, in some cases "Reference" (other ports)

• Naming convention:

                               Joins   Lookup        Merge
Primary Input: port 0          Left    Source        Master
Secondary Input(s): ports 1,…  Right   LU Table(s)   Update(s)

Tip: Check the "Link Ordering" tab to make sure the intended Primary is listed first

Page 134: Course DS314-PX v3[1].0


The Join, Lookup & MergeStages: Behavior

We shall first use the simplest case, optimal input:
• two input links: "left" as primary, "right" as secondary
• sorted on key column (here "Citizen")
• without duplicates on key

Left link (primary input):

Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous

Right link (secondary input):

Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE

Page 135: Course DS314-PX v3[1].0


1. The Join Stage

• 2 sorted input links, 1 output link
– "left" on primary input, "right" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM

Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer

Page 136: Course DS314-PX v3[1].0


Join Stage Editor

One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer

Several key columns allowed

Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)

Page 137: Course DS314-PX v3[1].0


Inner Join

• Transfers rows from both data sets whose key columns contain equal values to the output link

• Treats both inputs symmetrically

Output of innerjoin on key Citizen, using "Simplest Case" input:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as lookup/reject and merge/drop

Page 138: Course DS314-PX v3[1].0


Left Outer Join

• Transfers all values from the left link and transfers values from the right link only where key columns match.

Can perform lookup with:
- Left link as Source
- Right link as LU Table

Revolution  Citizen       Exchange
1789        Lefty         (empty string)
1776        M_B_Dextrous  Nasdaq

Same output as lookup/continue and merge/keep

Page 139: Course DS314-PX v3[1].0


Lookup using Join/LeftOuter

Check Link Ordering Tab to make sure intended Primary is listed first

Page 140: Course DS314-PX v3[1].0


Right Outer Join

• Transfers all values from the right link and transfers values from the left link only where key columns match.

• Can perform lookup with:

- Right link as Source

- Left link as LU Table

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
0           Righty        NYSE

(0 = Integer 0)

Page 141: Course DS314-PX v3[1].0


Full Outer Join

• Transfers rows from both data sets, whose key columns contain equal values, to the output link.

• It also transfers rows, whose key columns contain unequal values, from both input links to the output link.

• Creates new columns, with new column names!

Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
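The four join types can be compared side by side with a small Python sketch of a merge join over two inputs that are sorted on the key and free of duplicates (the "Simplest Case" input). It is a simplification, not the Join Stage itself: columns missing from an unmatched row are left out rather than filled with an empty string or 0, the Full Outer variant does not rename the key columns, and `merge_join` is an illustrative name.

```python
def merge_join(left, right, key, how="inner"):
    """Join two key-sorted, duplicate-free row lists in a single pass."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError("unknown join type: " + how)
    output = []
    i = j = 0
    while i < len(left) or j < len(right):
        left_key = left[i][key] if i < len(left) else None
        right_key = right[j][key] if j < len(right) else None
        if right_key is None or (left_key is not None and left_key < right_key):
            # Left row has no partner on the right.
            if how in ("left", "full"):
                output.append(dict(left[i]))
            i += 1
        elif left_key is None or right_key < left_key:
            # Right row has no partner on the left.
            if how in ("right", "full"):
                output.append(dict(right[j]))
            j += 1
        else:
            # Keys match: combine the columns of both rows.
            output.append({**left[i], **right[j]})
            i += 1
            j += 1
    return output


if __name__ == "__main__":
    left = [{"Revolution": 1789, "Citizen": "Lefty"},
            {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
    right = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
             {"Citizen": "Righty", "Exchange": "NYSE"}]
    for how in ("inner", "left", "right", "full"):
        print(how, merge_join(left, right, "Citizen", how))
```

Because both inputs are sorted, only the current row of each input has to be held at any time, which is the sense in which pre-sorted joins are "lightweight".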

Page 142: Course DS314-PX v3[1].0


2. The Lookup Stage

Combines:
– one source link, on primary input (0), with
– one or more duplicate-free table links, on "reference" input ports (>0)

• no pre-sort necessary
• allows multiple keys, LUTs
• flexible exception handling for source input rows with no match

[Figure: Lookup stage with Source input on input port 0 and one or more tables (LUTs) on input ports 1, 2; Output on output port 0 and Reject on output port 1]

Page 143: Course DS314-PX v3[1].0


In Case of Missing LU entry

• If a Source key column value is present in no LUT, the following options are provided:

– fail: the Lookup Stage reports an error and the step fails immediately. This is the default.

– drop: the input row with the failed lookup(s) is dropped

– continue: the input row is transferred to the output (output port 0), together with the successful table entries. The failed table entries are not transferred, resulting in either default output values or null output values.

– output: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link (output port 1). This output option generates the OSH reject option

• There is no option to capture unused LU Table entries (input ports > 0).

– There is nothing wrong with these!
– Compare with the Merge Stage
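The four options can be sketched in Python with a dictionary playing the role of a single in-memory lookup table. This is an illustration of the semantics described above, not the Lookup Stage: the names `lookup` and `LookupFailed` are made up, and with "continue" the missing columns are simply left out instead of receiving default or null values.

```python
class LookupFailed(Exception):
    """Raised by the 'fail' option when a source key has no table entry."""


def lookup(source_rows, table, key, if_not_found="fail"):
    """Return (output rows, reject rows) for one lookup table."""
    if if_not_found not in ("fail", "drop", "continue", "output"):
        raise ValueError("unknown option: " + if_not_found)
    output, reject = [], []
    for row in source_rows:
        entry = table.get(row[key])
        if entry is not None:
            output.append({**row, **entry})       # successful lookup
        elif if_not_found == "fail":
            raise LookupFailed("no entry for key %r" % (row[key],))
        elif if_not_found == "continue":
            output.append(dict(row))              # pass through without table columns
        elif if_not_found == "output":
            reject.append(dict(row))              # send to the reject link
        # "drop": the row is discarded
    return output, reject


if __name__ == "__main__":
    source = [{"Revolution": 1789, "Citizen": "Lefty"},
              {"Revolution": 1776, "Citizen": "M_B_Dextrous"}]
    table = {"M_B_Dextrous": {"Exchange": "Nasdaq"},
             "Righty": {"Exchange": "NYSE"}}
    print(lookup(source, table, "Citizen", "output"))
```

Note that the unused table entry (Righty) appears nowhere in the results, in line with the last bullet above.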

Page 144: Course DS314-PX v3[1].0


The Lookup Stage

• Uses a key column as an index into a table usually containing other values associated with each key.

• Follows the Source/In-Memory Table model.

Key column: state_code = “TX”

Lookup table:

Index  Associated Value
[…]
SC     South Carolina
SD     South Dakota
TN     Tennessee
TX     Texas
UT     Utah
VT     Vermont
[…]

Page 145: Course DS314-PX v3[1].0


The Lookup Stage

• Use Cases: Table Lookup

– Map small, dense numeric codes to long character strings to improve storage density (the most common use)

– Column validation (if the key isn’t in the table, the key’s value isn’t permitted)

Page 146: Course DS314-PX v3[1].0


The Lookup Stage

• Lookup Tables should be small enough to fit into physical memory (otherwise, performance suffers due to paging)

• On an MPP, you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link

• On an SMP, no physical duplication of the LUT occurs

Page 147: Course DS314-PX v3[1].0


Lookup Stage Editor

"If Not Found" = Source input row with no LUT match

One of four options:
– Fail [default]
– Drop
– Continue
– Output (to reject link)

Check that Source is listed first

Page 148: Course DS314-PX v3[1].0


LOOKUP: Action on Simplest Input

"Continue" option:

Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq

Same output as join/leftouter and merge/keep

"Output" option:

output link:

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Same output as join/inner and merge/drop

reject link (Unmatched Source Entries):

Revolution  Citizen
1789        Lefty

QUIZ! Why does the row in the reject link have only 2 columns?

Page 149: Course DS314-PX v3[1].0


3. The Merge Stage

• Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).

• Follows the Master-Update model:
– A master row and one or more update rows are merged iff they have the same value in the user-specified key column(s).
– If a non-key column occurs in several inputs, the lowest input port number prevails (e.g., master over update; update values are ignored).
– Unmatched ("Bad") master rows can be either
  • kept (default: transferred to output port 0)
  • dropped
– Unmatched ("Bad") update rows in input link n can be captured in a "reject" link in the corresponding output link n.
– Matched update rows are consumed.
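
The Master-Update model can be sketched in Python as a single pass over two pre-sorted inputs. This is a simplified illustration under stated assumptions: one update link only, rows represented as dictionaries, and a hypothetical function name `merge`. It is not the stage's actual implementation.

```python
def merge(master, update, key, keep_unmatched_master=True):
    """Merge a sorted, duplicate-free master with one sorted update link.

    Returns (output_rows, rejected_update_rows).
    """
    output, rejects = [], []
    i = 0
    for m in master:
        # Updates with a smaller key can no longer match any master row.
        while i < len(update) and update[i][key] < m[key]:
            rejects.append(update[i])
            i += 1
        if i < len(update) and update[i][key] == m[key]:
            # Master values win when a non-key column occurs in both rows.
            output.append({**update[i], **m})
            i += 1  # the matched update row is consumed
        elif keep_unmatched_master:
            output.append(dict(m))
    rejects.extend(update[i:])  # leftover updates matched nothing
    return output, rejects


master = [{"Citizen": "Lefty", "Revolution": 1789},
          {"Citizen": "M_B_Dextrous", "Revolution": 1776}]
update = [{"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
          {"Citizen": "Righty", "Exchange": "NYSE"}]
```

Because both inputs are sorted on the key, only the current master row and the current update row need to be in memory at any time.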

Page 150: Course DS314-PX v3[1].0

The Merge Stage

• Allows composite keys

• Multiple update links

• Matched update rows are consumed

• Unmatched updates in input port n can be captured in output port n

• Lightweight

[Figure: Merge stage with the Master on input port 0 and one or more updates on input ports 1 and 2; Output on output port 0 and Rejects on output ports 1 and 2]

Page 151: Course DS314-PX v3[1].0

Merge Stage Editor

Unmatched Master rows

One of two options:
– Keep [default]
– Drop

(Capture in reject link is NOT an option)

Unmatched Update rows option:

– Capture in reject link(s). Implemented by adding outgoing links

Page 152: Course DS314-PX v3[1].0

Merge: Action on Simplest Input

Unmatched Master Mode = Keep (same output as leftouterjoin and lookup/continue):

Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq

Unmatched Master Mode = Drop (same output as innerjoin and lookup/reject):

Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq

Both options yield the same "reject" link of unmatched updates:

Citizen  Exchange
Righty   NYSE

Page 153: Course DS314-PX v3[1].0

Synopsis: Joins, Lookup, & Merge

In this table:

• , <comma> = separator between primary and secondary input links (out and reject links)

                                  Joins                       Lookup                             Merge
Model                             RDBMS-style relational      Source - in RAM LU Table           Master - Update(s)
Memory usage                      light                       heavy                              light
# and names of Inputs             exactly 2: 1 left, 1 right  1 Source, N LU Tables              1 Master, N Update(s)
Mandatory Input Sort              both inputs                 no                                 all inputs
Duplicates in primary input       OK (x-product)              OK                                 Warning!
Duplicates in secondary input(s)  OK (x-product)              Warning!                           OK only when N = 1
Options on unmatched primary      NONE                        [fail] | continue | drop | reject  [keep] | drop
Options on unmatched secondary    NONE                        NONE                               capture in reject set(s)
On match, secondary entries are   reusable                    reusable                           consumed
# Outputs                         1                           1 out, (1 reject)                  1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)               unmatched primary entries          unmatched secondary entries
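
To make the "OK (x-product)" entries concrete, here is a small Python sketch of inner and left outer joins on inputs that contain duplicate keys. It is an illustration only: the function name `join` and the sample rows are invented, and for brevity it groups the right input in a dictionary rather than using the sort-based approach the table describes for the Join stage.

```python
def join(left, right, key, how="inner"):
    """Join two lists of dict rows on `key`; `how` is "inner" or "leftouter"."""
    right_by_key = {}
    for r in right:
        right_by_key.setdefault(r[key], []).append(r)
    out = []
    for l in left:
        matches = right_by_key.get(l[key], [])
        for r in matches:
            # Right rows are reusable: every matching pair yields one row.
            out.append({**l, **r})
        if not matches and how == "leftouter":
            out.append(dict(l))
    return out


left = [{"k": 1, "a": "x"}, {"k": 1, "a": "y"}, {"k": 2, "a": "z"}]
right = [{"k": 1, "b": "p"}, {"k": 1, "b": "q"}]
inner = join(left, right, "k")
outer = join(left, right, "k", how="leftouter")
```

Two left rows and two right rows with the same key produce 2 x 2 = 4 output rows; the left outer join adds the unmatched left row as a fifth.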

Page 154: Course DS314-PX v3[1].0

LABS

• Lab 7: Table Lookup

• Lab 8: InnerJoin

Learn by doing!

Page 155: Course DS314-PX v3[1].0

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

5a. Combining Data, Horizontally: Join/Lookup/Merge

5b. Combining Data, Vertically: Aggregator

5c. The Transformer

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 156: Course DS314-PX v3[1].0

Grouping and Summary Data

• Often we don’t want to look at data at the row level, but rather at some summary level:

– How many of each product have been ordered?
  • Group by product and return row count for each group

– What’s the average dollar amount of all orders?
  • No grouping, return average of transaction amount column

– How many customers do we have in each state, and what are the average and stddev of transaction amount by state?
  • Group by state and return average and stddev of transaction amount column. De-dupe on customer id within state group and return row count

• SQL supports “select product, sum(order_quantity) ... group by product”, etc. SAS supports “PROC MEANS; by state; var transaction”, etc.

• Parallel Extender has the Aggregator Stage (a.k.a. group operator).
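
The three questions above map onto three small aggregations. The Python sketch below is only an illustration of the grouping logic, using the standard library; the `orders` rows and column names are invented sample data.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

orders = [
    {"customer": "c1", "state": "MA", "product": "widget", "amount": 10.0},
    {"customer": "c1", "state": "MA", "product": "gadget", "amount": 30.0},
    {"customer": "c2", "state": "MA", "product": "widget", "amount": 20.0},
    {"customer": "c3", "state": "NH", "product": "widget", "amount": 40.0},
    {"customer": "c3", "state": "NH", "product": "widget", "amount": 60.0},
]

# 1. Group by product and return the row count for each group.
count_by_product = Counter(o["product"] for o in orders)

# 2. No grouping: average of the amount column over all rows.
average_amount = mean(o["amount"] for o in orders)

# 3. Group by state: mean and stddev of amount, plus distinct customers.
amounts, customers = defaultdict(list), defaultdict(set)
for o in orders:
    amounts[o["state"]].append(o["amount"])
    customers[o["state"]].add(o["customer"])  # a set de-dupes customer ids
by_state = {
    state: {"customers": len(customers[state]),
            "mean": mean(values),
            "stdev": stdev(values)}  # sample standard deviation
    for state, values in amounts.items()
}
```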

Page 157: Course DS314-PX v3[1].0

The Aggregator Stage

Purpose: perform data aggregations

Specify:

WHAT (semantics):
• zero or more key columns that define the aggregation units (or groups)
• columns to be aggregated (or ‘reduced’)*
• aggregation (reduction) functions, which include: count (nulls/non-nulls), sum, max/min/range, standard error, % coeff. of variation, sum of weights, un/corrected sum of squares, variance, mean, standard deviation

HOW (performance):
• a grouping method (hash table or pre-sort)

* So-called because a 1D array is ‘reduced’ to a 0D scalar

Page 158: Course DS314-PX v3[1].0

Aggregator Stage Editor

One of two Methods:
– Sort
– Hash

Allow several grouping keys

Name of result column

Column(s) to aggregate

Page 159: Course DS314-PX v3[1].0

Grouping Methods

• Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
– doesn’t require sorted data
– good when the number of unique groups is small. The running tally for each group’s aggregate calculations needs to fit easily into memory. Requires about 1KB of RAM per group.
– Example: average family income by state requires about 0.05MB of RAM

• Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
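
The difference between the two methods can be sketched in Python for a single aggregate (the mean). This is an illustration of the memory trade-off only; the function names and the (key, value) row layout are assumptions.

```python
from itertools import groupby
from operator import itemgetter


def mean_by_hash(rows):
    """Hash method: one running (sum, count) tally per group, all in memory."""
    tallies = {}
    for key, value in rows:
        total, count = tallies.get(key, (0.0, 0))
        tallies[key] = (total + value, count + 1)
    # Results are written out only after all input has been processed.
    return {key: total / count for key, (total, count) in tallies.items()}


def mean_by_sort(sorted_rows):
    """Sort method: input must be sorted on the key; only the current
    group's tally is held, and it is emitted when the key changes."""
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        total, count = 0.0, 0
        for _, value in group:
            total += value
            count += 1
        yield key, total / count


rows = [("MA", 10.0), ("NH", 40.0), ("MA", 30.0)]
hash_result = mean_by_hash(rows)                 # any input order
sort_result = dict(mean_by_sort(sorted(rows)))   # needs sorted input
```

The hash version's memory grows with the number of distinct keys; the sort version's memory stays constant, at the cost of requiring sorted input.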

Page 160: Course DS314-PX v3[1].0

LABS

• Lab 9: Grouping Data

Learn by doing!

Page 161: Course DS314-PX v3[1].0

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

5a. Combining Data, Horizontally: Join/Lookup/Merge

5b. Combining Data, Vertically: Aggregator

5c. The Transformer

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 162: Course DS314-PX v3[1].0

The Transformer Stage

Page 163: Course DS314-PX v3[1].0

The Transformer Stage

The Transformer stage provides a one-stop location for you to easily create simple or complex transformations.

[Figure: Input(s) on the left-hand side; Output(s) on the right-hand side]

Page 164: Course DS314-PX v3[1].0

Mapping Source Columns

• To map source columns to target, select the columns to be mapped, then drag and drop them on the target.

• Metadata for all mapped columns is carried to the target.

[Screenshot labels: Source, Target, Source Metadata, Target Metadata]

Page 165: Course DS314-PX v3[1].0

Adding New Column

Add a new column:

• Right-click on the desired link and select {Append|Insert} New Column

• Selecting “Append New Column” will append a new column at the end

• Selecting “Insert New Column” will insert a new column above the selected column

Page 166: Course DS314-PX v3[1].0

Renaming New Column

Rename a new column:

• To rename a column, change its name in the metadata.

• SQL types and lengths can also be defined.

Rename column

Page 167: Course DS314-PX v3[1].0

Constraint:

• A constraint applies to the entire row.

• A constraint specifies a condition under which incoming rows of data will be written to an output link.

• If no constraint is specified, all records are passed through the link.

Transformer implements Constraints and Derivations

Page 168: Course DS314-PX v3[1].0

Constraint Window

Expression Editor Window

• A constraint is defined for each individual link.

• Checking the ‘Reject Row’ box makes the link receive only those records that did not meet the conditions specified in the constraints.

• No constraint is required for a ‘Reject’ link.
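
The routing behaviour of constraints and a reject link can be sketched in Python as follows. This is an illustration, not Transformer syntax: the function name `route`, the link names and the rule that a row failing every constraint goes to the reject link are assumptions made for the example.

```python
def route(rows, constraints):
    """constraints maps link name -> predicate, or None for "no constraint".

    Returns (rows written per link, rows sent to the reject link).
    """
    links = {name: [] for name in constraints}
    rejects = []
    for row in rows:
        written = False
        for name, predicate in constraints.items():
            # No constraint means every row passes through the link.
            if predicate is None or predicate(row):
                links[name].append(row)
                written = True
        if not written:
            rejects.append(row)  # met none of the constraints
    return links, rejects


orders = [{"id": 1, "amount": 500}, {"id": 2, "amount": 50}, {"id": 3, "amount": -5}]
constraints = {
    "large": lambda r: r["amount"] >= 100,
    "small": lambda r: 0 <= r["amount"] < 100,
}
links, rejects = route(orders, constraints)
```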

Page 169: Course DS314-PX v3[1].0

Derivation:

• A derivation can be applied to each output column.

• Specifies the value to be moved to an output column

• Every output column must have a derivation

• An output column does not require an input column

Transformer implements Constraints And Derivations
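
A derivation can be thought of as one expression per output column. The Python sketch below is an analogy only (the Transformer uses its own expression editor); the column names and the `derive` helper are invented.

```python
def derive(row, derivations):
    """Build one output row: every output column gets its own derivation."""
    return {column: expression(row) for column, expression in derivations.items()}


derivations = {
    # Built from two input columns.
    "full_name": lambda r: r["first"] + " " + r["last"],
    # Built from one input column.
    "amount_usd": lambda r: r["amount_cents"] / 100,
    # A constant: this output column needs no input column at all.
    "source": lambda r: "DS314",
}

out_row = derive({"first": "Ada", "last": "Lovelace", "amount_cents": 1250}, derivations)
```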

Page 170: Course DS314-PX v3[1].0

Transformer Menus

Operand Menu Operator Menu

Stage Variables, Constraints and Derivations are defined using the context-sensitive expression editor. This can be accessed by clicking on the ellipsis button or by right-clicking.

Page 171: Course DS314-PX v3[1].0

The Transformer's Expression Editor Window

Context-sensitive menu: easy access to transforms

Extensive list of available transformation functions to select from:

Page 172: Course DS314-PX v3[1].0

Transformer Stage: Error Handling

Immediate notification when there’s a problem!

Page 173: Course DS314-PX v3[1].0

Transformer: Stage Variables

Stage Variables:

• Similar to program variables

• Scope is limited to the Transformer

• Use to simplify derivations and constraints

• Use to avoid duplicate coding

• Retain values across reads

• Use to accumulate values and compare current values with prior reads
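
The "retain values across reads" behaviour can be mimicked in Python with ordinary local variables that live outside the per-row loop. This is an analogy rather than Transformer syntax; the column names and the function `transform` are invented for the example.

```python
def transform(rows):
    # Stage variables: initialised once, then retained from row to row.
    previous_account = None
    running_total = 0
    out = []
    for row in rows:
        # Compare the current value with the prior read.
        is_new_account = row["account"] != previous_account
        # Accumulate within an account; restart when the account changes.
        if is_new_account:
            running_total = row["amount"]
        else:
            running_total += row["amount"]
        # Derivations reuse the variables instead of repeating the logic.
        out.append({**row,
                    "first_in_group": is_new_account,
                    "running_total": running_total})
        previous_account = row["account"]
    return out


rows = [{"account": "A", "amount": 5},
        {"account": "A", "amount": 7},
        {"account": "B", "amount": 1}]
result = transform(rows)
```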

Page 174: Course DS314-PX v3[1].0

Transformer: Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

Page 175: Course DS314-PX v3[1].0

LABS

• Lab 10: The Transformer

Learn by doing!

Page 176: Course DS314-PX v3[1].0

Page 177: Course DS314-PX v3[1].0

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Operators

Page 178: Course DS314-PX v3[1].0

RDBMS Access: Supported Databases

Parallel Extender provides high performance / scalable interfaces for:
• DB2
• Oracle
• Informix
• Teradata

Users must be granted specific privileges, depending on RDBMS.

Page 179: Course DS314-PX v3[1].0

RDBMS Access: Supported Databases

• DB2, ORACLE, INFORMIX, and TERADATA supported
• Automatically convert RDBMS table layouts to/from Parallel Extender Table Definitions
• RDBMS nulls converted to/from nullable column values
• Support for standard SQL syntax for specifying:
– column list for SELECT statement
– filter for WHERE clause
– open command, close command
• Can write an explicit SQL query to access RDBMS
• Need to supply additional information in the SQL query
– A parallel RDBMS table is stored on multiple disks, connected to multiple CPUs, in a parallel system
– To optimize table access, only the CPU with a direct connection to the disk should read data from it, minimizing data movement
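
The idea that each reader touches only its local partition can be sketched in Python. This is a toy model under loud assumptions: the `PARTITIONS` dictionary stands in for a table spread over three nodes, and threads stand in for reader processes placed on those nodes. No real database API is involved.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitioned table: node name -> rows stored on that node's disk.
PARTITIONS = {
    "node0": [(1, "a"), (4, "d")],
    "node1": [(2, "b")],
    "node2": [(3, "c")],
}


def read_local(node):
    """A reader placed on `node` reads only that node's partition."""
    return node, list(PARTITIONS[node])


def parallel_read(nodes):
    """Run one reader per node; no row crosses from one node to another."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return dict(pool.map(read_local, nodes))


streams = parallel_read(list(PARTITIONS))
```

The result is one stream of rows per partition, which downstream stages can consume in parallel without first funnelling everything through a single connection.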

Page 180: Course DS314-PX v3[1].0

Parallel Database Connectivity

Traditional Client-Server:
• Parallel server running only the RDBMS
• Each application has only one connection
• Suitable only for small data

Parallel Extender:
• Parallel server runs APPLICATIONS
• Application has parallel connections to RDBMS
• Suitable for large data volumes

[Figure labels: Client, Sort, Load, Parallel RDBMS]

Page 181: Course DS314-PX v3[1].0

Oracle Database Stage: Extracts

• Supports parallel database table access
• Some queries cannot be run in parallel:
– Queries containing a "GROUP BY" clause that are not also hash partitioned by the same column
– Queries performing a non-collocated join
• Oracle Stage options:
– table Table_Name
  Table_Name specifies the name of the ORACLE table. The table must exist and you must have select privileges on the table.
– query Query
  Specifies an SQL query to read a table. The query specifies the table and any processing that you want to perform on the table as it is read into Parallel Extender. This statement can contain joins, views, database links, synonyms, and so on.

Page 182: Course DS314-PX v3[1].0

Oracle Database Stage: Extracts

• Oracle stage options (cont’d):
– DB Options { user = username, password = password } (Required)
  Specifies a username and password for connecting to ORACLE. These options are required by the Oracle stage.

Page 183: Course DS314-PX v3[1].0

Oracle Database Stage: Loads

• Load data into Oracle in parallel
• Oracle stage options:
– Write Method: Load or Upsert
– Write Mode: specifies the write mode of the stage.
  • append (default): New rows are appended to the table.
  • create: Create a new table. Parallel Extender reports an error if the ORACLE table already exists. You must specify this mode or the replace mode if the ORACLE table does not exist.
  • truncate: The existing table attributes (including schema) and the ORACLE partitioning keys are retained, but any existing rows are discarded. New rows are then appended to the table.
  • replace: The existing table is first dropped and an entirely new table is created in its place. ORACLE uses the default partitioning method for the new table.
– Table Table_Name
  Table_Name specifies the name of the ORACLE table to write to.
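
The semantics of the four write modes can be tried out against any SQL database. The sketch below uses Python's built-in `sqlite3` module as a stand-in for Oracle, so it shows the mode logic only; the function name `write_rows` and the fixed two-column table are assumptions, and partitioning and loader behaviour are not modelled.

```python
import sqlite3


def write_rows(conn, table, rows, mode="append"):
    """Write (id, name) rows to `table` using one of the four write modes."""
    cur = conn.cursor()
    exists = cur.execute(
        "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (table,)
    ).fetchone() is not None
    ddl = f'CREATE TABLE "{table}" (id INTEGER, name TEXT)'
    if mode == "create":
        if exists:
            raise RuntimeError(f"table {table} already exists")
        cur.execute(ddl)
    elif mode == "replace":
        cur.execute(f'DROP TABLE IF EXISTS "{table}"')  # drop, then recreate
        cur.execute(ddl)
    elif mode == "truncate":
        cur.execute(f'DELETE FROM "{table}"')  # keep the schema, discard rows
    elif mode != "append":
        raise ValueError(f"unknown write mode {mode!r}")
    # In every mode the new rows are then appended to the table.
    cur.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
write_rows(conn, "customers", [(1, "a")], mode="create")
write_rows(conn, "customers", [(2, "b")], mode="append")
```

As on the slide, append and truncate fail here if the table does not exist, which is why create or replace must be used for a new table.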

Page 184: Course DS314-PX v3[1].0

Oracle Stage Editor: Loads

• Oracle stage options (cont’d):
– Truncate Column Names
  Configures the stage to truncate Parallel Extender column names to 30 characters if they are longer.
– DB Options { user = username, password = password }

Page 185: Course DS314-PX v3[1].0

DB2 and Viper

A good match by design

DB2 EEE and

DataStage Parallel Extender

Page 186: Course DS314-PX v3[1].0

DB2’s Parallel Strengths

• Intelligent data distribution in multiple partitions

• Openness: table layouts described in system catalog

• Flexible table spaces

• Co-located join plans

• SP: Shared nothing architecture

Page 187: Course DS314-PX v3[1].0

DB2’s Parallel Strength is Wasted Unless:

– Applications upstream and downstream of DB2 run in parallel to produce and consume as many streams as DB2 consumes and produces

– Records stream in parallel between DB2 and applications, and between applications, to build end-to-end data flows

[Figure: ... APP (N), APP (N+1) ...]

Page 188: Course DS314-PX v3[1].0

Leveraging DB2’s Parallel Strengths To Eliminate Switch Traffic and Attain Scalability

DB2: Intelligent data distribution in multiple partitions
DS/PX: Match UDB hash semantics; make no assumption about data statistical distribution

DB2: Openness: table layouts described in system catalog
DS/PX: Read metadata from the UDB catalog and exploit local keyword access

DB2: Flexible table spaces
DS/PX: Use node groups attached to table spaces for automatic reader/writer placement

DB2: Co-located join plans
DS/PX: Respect locality of joins

DB2: SP: Shared nothing
DS/PX: Favor local disks

Page 189: Course DS314-PX v3[1].0

DB2 Stage Overview

• db2read:
– Reads data in parallel from a DB2 table and automatically places it into a Framework dataset, along with the table definition.

• db2write:
– Writes data in parallel from a Framework dataset into a DB2 table. The operator automatically converts the DS-PX types of the columns to corresponding DB2 types.

• db2load:
– The only difference between this option and db2write is that the db2load option takes advantage of the fast DB2 loader technology when writing data to the database.

Page 190: Course DS314-PX v3[1].0

DB2 Database Operators: db2read

• db2read Options:
– server
  • DB2 instance name (if different from default in user profile)
– dbname
  • Database name (if different from default in user profile)
– table
  • TableName: this translates to “Select * from TableName”
– query
  • SQL query to read a table. This statement can contain joins, views, database links, synonyms, and so on. Specifying a partition name forces execution of the query in parallel on the processing nodes that have a partition of the table. If you do not specify a partition, the stage executes the query sequentially on a single processing node.

Page 191: Course DS314-PX v3[1].0

DB2 Database Operators: db2write/db2load

• db2write/db2load Options:
– dbname
  • Database name (if not specified, uses default from user profile)
– dboptions
  • By default, DS-PX creates the table on all processing nodes in the default DB2 node group (DB2 PE) or table space (UDB), and uses the first column as the partitioning key.
  • where:
    nodegroup = group defines the DB2 PE node group used to store the table.
    tablespace = t_space defines the DB2 UDB table space used to store the table.
    key = field0, ... key = fieldN specifies a partitioning key for the table.
– table
  • TableName: name of table to write to.
– mode
  • Determines the write mode of the operation (more on next page)
– server
  • Name of a DB2 server (if different from default).

Page 192: Course DS314-PX v3[1].0

DB2 Database Operators: db2write/db2load (continued)

• db2write/db2load Modes:
– append
  • appends new records to the table; the database user must have TABLE CREATE privileges. This mode is the default.
– create
  • creates a new table; the database user must have TABLE CREATE privileges. DS-PX reports an error if the DB2 table already exists. You must specify this mode if the DB2 table does not exist.
– replace
  • drops the existing table and creates a new one in its place; the database user must have TABLE CREATE privileges.
– truncate
  • retains the table attributes (including the schema) but discards existing records and appends new ones; the database user must have TABLE DELETE privileges.

Page 193: Course DS314-PX v3[1].0

LABS

• Lab 11: Extract & Load RDBMS Tables

Learn by doing!

Page 194: Course DS314-PX v3[1].0

Page 195: Course DS314-PX v3[1].0

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Stages

Page 196: Course DS314-PX v3[1].0

Creating custom stages

Overview

From Manager (or Designer), in the Repository pane:

Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

• Custom: not covered in this course

• Build: "Build" stages from within Parallel Extender (next section)

• Wrapped: "Wrapping" existing "Unix" executables (this section)

Page 197: Course DS314-PX v3[1].0

Building “Wrapped” Stages

• In a nutshell:

You can “wrap” a legacy executable:
• binary,
• Unix command,
• shell script

… and turn it into a bona fide Parallel Extender stage capable, among other things, of parallel execution,

… as long as the legacy executable is
• amenable to data-partition parallelism
  » no dependencies between rows
• pipe-safe
  » can read rows sequentially
  » no random access to data (e.g., no use of fseek())

Page 198: Course DS314-PX v3[1].0

Wrap and Generate

Fill in at least these two pages and click on

Any executable in your PATH

Avoid conflict with existing Stage or executable names

… and your new stage will appear in the Repository!

Page 199: Course DS314-PX v3[1].0

The "Creator" Page

Tip: Conscientiously maintaining the Creator page for all your wrapped stages will eventually bring you fame and fortune.

Page 200: Course DS314-PX v3[1].0

The "Properties" Page

Expresses the executable's Unix switches

Switch name, without the minus sign

Type of switch argument

Used by Director before each run

Page 201: Course DS314-PX v3[1].0

The "Wrapped" Page

Specifies Interfaces and Environment

Exit codes ($?) and other variables

Framework port #

Already built Interface Table Definition

Other options available

Page 202: Course DS314-PX v3[1].0

Some Background often Helps

Stage Interface Table Definitions
• Define link columns referenced by operator
• Similar to the signature of a C function

4 Table Definitions in this picture:
• input link
• input stage interface
• output stage interface
• output link

Page 203: Course DS314-PX v3[1].0

General Background: Interface schemas

• Layout interfaces describe what columns the stage:
– needs for its inputs
– creates for its outputs

• Two kinds of interfaces: dynamic and static

• Dynamic: adjusts to its inputs automatically (strongly typed parametric polymorphism)
– Ascential-supplied operators are dynamic

• Static: expects input to contain columns with specific names and types
– Custom stages

Page 204: Course DS314-PX v3[1].0

How does the Wrapping Work?

– Define the schema for export and import
  • Schemas become interface schemas of the operator and allow for by-name column access

– Define multiple inputs/outputs required by UNIX executable (see next slide)

[Figure: input schema > export > stdin or named pipe > UNIX executable > stdout or named pipe > import > output schema]

• QUIZ: Why does export precede import?
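
The export / executable / import chain can be imitated in Python with `subprocess`. This is a sketch of the data flow only, not of how Parallel Extender generates a wrapped stage: the child command (a one-line Python program that upper-cases its input, standing in for a legacy executable), the comma-delimited format and all function names are assumptions.

```python
import subprocess
import sys

# Stand-in for a legacy, pipe-safe executable: it reads lines from stdin
# sequentially and writes the upper-cased lines to stdout.
CHILD = [sys.executable, "-c",
         "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())"]


def export_rows(rows):
    """Export: typed rows -> delimited text the executable can read."""
    return "".join(f"{row_id},{name}\n" for row_id, name in rows)


def import_rows(text):
    """Import: the executable's text output -> typed rows again."""
    rows = []
    for line in text.splitlines():
        row_id, name = line.split(",")
        rows.append((int(row_id), name))
    return rows


def run_wrapped(partition):
    """Pipe one partition through its own copy of the executable."""
    done = subprocess.run(CHILD, input=export_rows(partition),
                          capture_output=True, text=True, check=True)
    return import_rows(done.stdout)


# Rows are independent, so each partition can be processed separately.
partitions = [[(1, "ann"), (2, "bob")], [(3, "cy")]]
results = [run_wrapped(p) for p in partitions]
```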

Page 205: Course DS314-PX v3[1].0

Multiple In and Out Links

Now we are ready to complete the "Wrapped" Page:

Page 206: Course DS314-PX v3[1].0

LABS

• Lab 12: Wrapped Unix Executables

Learn by doing!

Page 207: Course DS314-PX v3[1].0

Agenda

1. Introduction to Parallel Extender

2. Concepts in Parallel Computing

3. Partitioning and Collecting Data

4. Importing/Exporting Data

5. Overview of Some Parallel Extender Stages

6. Using RDBMS with Parallel Extender

7. Wrapping Unix Executables

8. Building Native Operators

Page 208: Course DS314-PX v3[1].0

Creating custom stages

Overview

From Manager (or Designer), in the Repository pane:

Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}

• Custom: not covered in this course

• Build: "Build" stages from within Parallel Extender (this section)

• Wrapped: "Wrapping" existing "Unix" executables (last section)

Page 209: Course DS314-PX v3[1].0

Building Operators
Summary: the Wizard

• In a nutshell:

– The user performs the fun, glamorous tasks: encapsulating business logic and arithmetic in a custom operator

– A Parallel Extender wizard called “buildop” automatically performs the unglamorous, tedious, error-prone tasks: invoking the needed header files and building the necessary “plumbing” for a correct and efficient parallel execution.

Page 210: Course DS314-PX v3[1].0

General Page

Identical to Wrapped's, except:

Under the Build Tab, your program!

Page 211: Course DS314-PX v3[1].0

Logic Tab for Business Logic

Enter business C++ logic and arithmetic in four pages under the Logic tab

The main code section goes in the Per-Record page; it will be applied to all rows

Page 212: Course DS314-PX v3[1].0

Code Sections under Logic Tab

Definitions: temporary variables are declared [and initialized] here

Pre-Loop: logic here is executed once BEFORE processing the FIRST row

Post-Loop: logic here is executed once AFTER processing the LAST row
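
The relationship between the code sections can be pictured as the skeleton of one loop. The sketch below is written in Python purely to show the order of execution; a real Build stage takes C++ in each page, and the column name `a`, the added `row_id` column and the function name are invented for the example.

```python
def run_build_stage(rows):
    # Definitions: temporary variables are declared here.
    index = None
    total = None
    output = []

    # Pre-Loop: executed once, before the first row is processed.
    index = 0
    total = 0

    # Per-Record: executed once for every input row.
    for row in rows:
        index += 1            # temporaries keep their value across rows
        total += row["a"]
        output.append({**row, "row_id": index})

    # Post-Loop: executed once, after the last row has been processed.
    summary = {"rows": index, "total": total}
    return output, summary


output, summary = run_build_stage([{"a": 10}, {"a": 5}])
```

Because the counter is a temporary variable rather than a column, incrementing it in the per-record code yields a running row id.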

Page 213: Course DS314-PX v3[1].0

I/O and Transfer

Under Interface tab: Input, Output & Transfer pages

Screenshot callouts:
• First line: output 0
• Optional renaming of output port from default "out0"
• In-Repository Table Definition
• Write row
• Input page: 'Auto Read' (read next row)
• 'False' setting, not to interfere with Transfer page

Page 214: Course DS314-PX v3[1].0

I/O and Transfer

• Transfer from input in0 to output out0.

• If page left blank or Auto Transfer = "False" (and RCP = "False"), only columns in the output Table Definition are written

First line: Transfer of index 0

Page 215: Course DS314-PX v3[1].0

Building Operators: Native Operators

• Example: sumNoTransfer
– Adds input columns "a" and "b"; ignores other columns that might be present in input
– Produces a new "sum" column
– Does not transfer input columns

[Figure: input a:int32; b:int32 > sumNoTransfer > output sum:int32]

Page 216: Course DS314-PX v3[1].0

No Transfer

• Causes:

- RCP set to "False" in stage definition

and

- Transfer page left blank, or Auto Transfer = "False"

• Effects:

- input columns "a" and "b" are not transferred

- only new column "sum" is transferred

Compare with transfer ON…

From Peek:

Page 217: Course DS314-PX v3[1].0

Transfer

• Causes:
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"

• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but not mentioned in stage)
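
The effect of switching transfer on or off can be shown with a small Python sketch of the sum example. It models only which columns reach the output; the function name `sum_operator` and the boolean `transfer` flag (standing in for the RCP / Auto Transfer settings) are assumptions.

```python
def sum_operator(row, transfer):
    """Add columns "a" and "b"; optionally carry the input columns along."""
    result = {"sum": row["a"] + row["b"]}
    if transfer:
        # Transfer ON: every input column is copied to the output,
        # including columns the stage never mentions.
        result = {**row, **result}
    return result


row = {"a": 1, "b": 2, "ignored": "x"}
without_transfer = sum_operator(row, transfer=False)
with_transfer = sum_operator(row, transfer=True)
```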

Page 218: Course DS314-PX v3[1].0

Columns vs. Temporary C++ Variables

Columns
• DS-PX type
• Defined in Table Definitions
• Value refreshed from row to row

Temp C++ variables
• C/C++ type
• Need declaration (in Definitions or Pre-Loop page)
• Value persistent throughout "loop" over rows, unless modified in code

Page 219: Course DS314-PX v3[1].0

Adding a column with Row ID

Out Table;

QUIZ! Would index++ work?

YES!

Page 220: Course DS314-PX v3[1].0

LABS

• Lab 13: Build Stages

Learn by doing!