Student Guide
v3.0: February 28, 2003
DS314-PX DataStage Parallel Extender Essentials
Copyright
This document and the software described herein are the property of Ascential Software Corporation and its licensors and contain confidential trade secrets. All rights to this publication are reserved. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system or translated into any language, in any form or by any means, without prior permission from Ascential Software Corporation. Copyright © 2003 Ascential Software Corporation. All rights reserved. Ascential Software Corporation reserves the right to make changes to this document and the software described herein at any time and without notice. No warranty is expressed or implied other than any contained in the terms and conditions of sale.
Ascential Software Corporation
50 Washington Street
Westboro, MA 01581-1021 USA
Phone: (508) 366-3888
Fax: (508) 366-3669
Ascential, DataStage, INTEGRITY, MetaRecon, MetaStage and MetaBroker are trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and registered trademarks are the property of the respective trademark holder.
02-25-2003
3© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Processing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
Agenda
1. Introduction to Parallel Extender
1a. Overview; Two Complete Jobs
1b. Client-Server Architecture
1c. The Job Sequencer
Module 1: Introduction to Parallel Extender
In a nutshell:
Parallel Extender harnesses the power of parallel computers for processing large volumes of rows in a minimum amount of time
Key PX Concepts
• Application scalability
• Parallel computer systems
• Flow-based programming
• Explicit and implicit parallelisms
• Pipeline and partition parallelisms
• The Framework (the parallel engine)
• Datasets (uniform set of rows in the Framework's internal representation)
• Table definitions/schemas (metadata)
• Configuration files (only one active at a time, describes H/W)
Automatic, Flexible Scalability
With Parallel Extender:
• Don’t worry about:
  – How data is moved around
  – Today’s machine configuration
  – Possible deadlocks/synchronization bugs
• Job Designs (programs) are completely architecture-independent
  – SMP or MPP, clustered SMPs, SMPs within MPPs
Origins of DataStage XE Parallel Extender
[Figure: Orchestrate Program (sequential data flow): Import, Clean 1, Clean 2, Merge, Analyze. Orchestrate Application Framework and Runtime System: Configuration File; Centralized Error Handling and Event Logging; Parallel access to data in files; Parallel access to data in RDBMS; Inter-node communications; Parallel pipelining; Parallelization of operations; Performance Visualization. Data: Flat Files, Relational Data.]
DataStage XE: Best-of-breed data integration platform
Orchestrate: Best-of-breed application scalability
DataStage XE Parallel Extender: Best-of-breed scalable data integration platform. No limitations on data volumes or throughput
Synonyms
schema = table definition
property = format
underlying type = SQL type + length [and scale]
virtual dataset = link
record/field = row/column
operator = stage
step, flow, OSH command = job
Framework = DS engine
Types of Scalability: Throughput and Speedup
User Scalability (Throughput): adding more users and small jobs to the server
Application and Data Scalability (Speedup): one application running faster against more data by using more processors.
  – 1 processor, 10 Gbytes storage
  – 10 processors, 100 Gbytes storage
  – 100 processors, 1000 Gbytes storage
Parallel Extender:
• focus is on data scalability (speedup)
• can support both throughput and speedup
User's View: Two Inputs, Parallel Deployment
User
1) assembles a sequential flow (design) using the Designer
2) provides a configuration file…
…and gets: parallel access, propagation, transformation, and load.
The design is good for 1 node, 4 nodes, or N nodes. To change the # of nodes, just swap the configuration file.
No need to modify or recompile your design!
Traditional Batch Processing
[Figure: Without Parallel Extender. Source (Operational Data, Archived Data), then Transform, Clean, Load, then Target (Data Warehouse), with three Disks shown along the flow.]
Data Pipelining
[Figure: Source (Operational Data, Archived Data), then Transform, Clean, Load, then Target (Data Warehouse).]
• Start a downstream process (e.g., "Clean") on the first rows while an upstream (Transform) process is still processing the later rows.
• This eliminates intermediate storing to disk, a very expensive operation
Pipeline Multiprocessing
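A minimal Python sketch of the pipelining idea above, not how the Framework itself is implemented: generators hand each row downstream as soon as it is produced, so nothing is staged in between. The stage functions, the `events` trace, and the sample rows are illustrative assumptions, and a single Python process only shows the streaming order, not true multiprocessing.

```python
events = []  # trace of which stage handled which row, in order


def transform(rows):
    """Upstream stage: normalize each row and pass it on immediately."""
    for row in rows:
        value = row.strip().upper()
        events.append(("transform", value))
        yield value


def clean(rows):
    """Downstream stage: drop empty rows as they arrive."""
    for row in rows:
        events.append(("clean", row))
        if row:
            yield row


def load(rows):
    """Final stage: collect rows into an in-memory stand-in for the target."""
    return list(rows)


source = [" alice ", "", "bob ", " carol"]
warehouse = load(clean(transform(source)))
```

The `events` trace shows "clean" working on the first row before "transform" has touched the second one, which is the behavior the slide describes.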
Data Partitioning
[Figure: Source Data is split into four ranges (A-F, G-M, N-T, U-Z), one per node (Node 1 to Node 4), each node running its own Transform.]
Partitioning "partitions" the incoming set of rows into smaller subsets ("partitions") to be processed independently and concurrently by different nodes.
Partition Parallelism
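The A-F / G-M / N-T / U-Z split on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the sample names, and the choice of the first letter of the last name as the key are hypothetical, and the real engine distributes rows across processes rather than Python lists.

```python
# Letter ranges taken from the slide; index 0 stands for Node 1, and so on.
RANGES = [("A", "F"), ("G", "M"), ("N", "T"), ("U", "Z")]


def partition_by_initial(last_names):
    """Assign each name to the partition whose letter range holds its initial."""
    partitions = [[] for _ in RANGES]
    for name in last_names:
        initial = name[0].upper()
        for node, (low, high) in enumerate(RANGES):
            if low <= initial <= high:
                partitions[node].append(name)
                break
        else:
            # No range matched (for example a name starting with a digit).
            raise ValueError(f"no partition for {name!r}")
    return partitions


parts = partition_by_initial(["Adams", "Nguyen", "Garcia", "Young", "Baker", "Smith"])
```

Each inner list could then be handed to a separate worker, since no partition depends on another.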
Putting It All Together: Parallel Dataflow
[Figure: Source Data flows from Source through Transform, Clean, and Load to the Data Warehouse (Target), with Pipelining marked along the flow and Partitioning across it.]
Partition and Pipeline Parallelisms can Occur Simultaneously
Data partitioning and pipelining are "orthogonal" mechanisms
Putting It All Together: Parallel Dataflow with Repartitioning on-the-fly
Without Landing To Disk!
[Figure: Source Data flows from Source through Transform, Clean, and Load to the Data Warehouse (Target) with Pipelining. The data is partitioned (A-F, G-M, N-T, U-Z) and then repartitioned twice along the flow; the keys shown are Customer last name, Customer zip code, and Credit card number.]
Repartitioning
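As a rough illustration of repartitioning between stages, the Python sketch below re-distributes rows that were partitioned on one key (last name) onto another key (zip code) by streaming them straight from the old partitions into the new ones. The helper names, the use of `zlib.crc32` as a stable hash, and the sample rows are assumptions made for the example; the engine's own hashing is not described in this course section.

```python
import zlib


def stable_hash(value):
    """Deterministic hash (Python's built-in hash() of str varies per process)."""
    return zlib.crc32(str(value).encode("utf-8"))


def partition(rows, key, n_partitions):
    """Send each row to the partition chosen by hashing its key column."""
    parts = [[] for _ in range(n_partitions)]
    for row in rows:
        parts[stable_hash(row[key]) % n_partitions].append(row)
    return parts


def repartition(partitions, key, n_partitions):
    """Re-key existing partitions without building an intermediate copy."""
    streamed = (row for part in partitions for row in part)
    return partition(streamed, key, n_partitions)


rows = [
    {"last_name": "Adams", "zip": "01581"},
    {"last_name": "Young", "zip": "01581"},
    {"last_name": "Garcia", "zip": "10001"},
    {"last_name": "Adams", "zip": "10001"},
]
by_name = partition(rows, "last_name", 4)
by_zip = repartition(by_name, "zip", 4)
```

After the second step, all rows sharing a zip code sit in the same partition, whichever partition they occupied before.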
Some Popular Stages
• Usual ETL sources/targets: - RDBMS, Sequential File, Data Set
• Combine Data: - Lookup, Joins, Merge - Aggregator
• Transform Data: - Transformer, Remove Duplicates
• Ancillary: - Row Generator, Peek, Sort
A Real-Life PX Job
Mini Star-Schema Warehousing: Merge and Aggregate
Another Real-Life PX Job
• Householding
• Alternative representation of same job:
Agenda
1. Introduction to Parallel Extender
1a. Overview; Two Complete Jobs
1b. Client-Server Architecture
1c. The Job Sequencer
Client-Server Architecture: Overview
Client – Microsoft® Windows NT/2000/XP: Designer, Director, Manager, Administrator
Server – UNIX (AIX, Solaris, TRU64, HP-UX)
Client-Server Architecture: One Fat Server, Four Thin Clients
• Fat server does most of the work:
  – Compiles, runs programs, generates output
  – Keeps Repository
  – Release 6.0: Unix (AIX, Solaris, Tru64, HP-UX)
• Four thin clients for administration and UI
  – Administrator, Manager, Director, & Designer
  – Need to be connected to Server to operate
• NT Server allows developing (but not running) PX jobs
• Clients can be bypassed
  – OSH from Unix prompt
Client Hierarchy
• Clients:
  – Administrator: handles all user's projects
  – Manager: import/export jobs, metadata…
  – Director: runs, monitors jobs, displays stdout/stderr
  – Designer: GUI for creating, editing jobs, schemas
• Also:
  – multiple users per server
  – multiple projects per user
  – multiple jobs (and related objects) per project
Environment Scoping
• Environment defaults set at install for all users
  – Administrator can override settings for user, projects
  – Designer can override in "Job Properties" on a per-job basis
  – Director can override Job Properties from one run to the next, without recompile. Very handy for selecting, on a per-run basis, the level of:
    • parallelism
    • reporting
    • debugging
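The override order described above (install defaults, then Administrator, then Designer, then Director) behaves like a layered lookup in which the most specific layer wins. The Python sketch below models that with `collections.ChainMap`. Only `APT_CONFIG_FILE` is a name taken from this course; the file paths and the `REPORTING` and `DEBUG` keys are hypothetical placeholders.

```python
from collections import ChainMap

# Each dict is one scope; all values here are made up for illustration.
install_defaults = {
    "APT_CONFIG_FILE": "/configs/default.apt",
    "REPORTING": "off",
    "DEBUG": "off",
}
project_settings = {"REPORTING": "on"}                       # set in Administrator
job_properties = {"APT_CONFIG_FILE": "/configs/4node.apt"}   # set in Designer
run_overrides = {"APT_CONFIG_FILE": "/configs/1node.apt"}    # chosen in Director

# ChainMap searches left to right, so the run-level value shadows the rest.
effective = ChainMap(run_overrides, job_properties, project_settings, install_defaults)

config_file = effective["APT_CONFIG_FILE"]
reporting = effective["REPORTING"]
debug = effective["DEBUG"]
```

Dropping `run_overrides` from the chain would make the job-level value visible again, which mirrors switching a setting for one run without recompiling the job.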
The Administrator
• Add, remove projects
• Set project-wide attributes
Always check
Remember to renew license by that date!
The Administrator: Projects
Add New Projects
Add or Modify Project Properties
Project Listing
The Administrator: Project Properties
Make sure these are checked!
Review / Modify a Project’s Environment Variable Settings
Runtime Column Propagation – From the Project / General tab
Environment Variables: General
Environment Variables: Parallel
Environment Variables: Operator Specific
Environment Variables: Reporting
Environment Variables: Compiler
The Manager: Overview
Select this to provide more detailed view of all jobs.
The Manager: Exporting to File
Backing up the job designs of one "Category" in a .dsx file
Export + Import:
The only way to move jobs between projects
The Manager Importing External Definitions
Use The Manager to Import:
• COBOL Copybooks
• Framework Schemas
• Database Table Layouts
• Custom stage
The Manager Usage Analysis
Use The Manager to analyze where any given object is referenced.
Results show exactly where the object is being used.
Selecting Edit will open corresponding job
The Manager Configuration Editor
Use the Manager to Create / Edit / Check Configuration Files
• Configuration Files are saved under the DataStage Server directory path.
The Designer
Enter Server Name, Username & Password, and select appropriate Project
Create a New Job
Open an Existing Job
Open a recently accessed job
The Designer: Paradigm
Drag Stages from Repository or Palette and drop on Parallel Canvas:
The Designer Stage Properties Editor
Required options are highlighted red & marked with ?
Pull-down menu of valid options
Property Icons:
• Non-Repeating Property with no dependents
• Non-Repeating Property with dependents
• Repeating Property with no dependents
• Repeating Property with dependents
Quick Help / Tips
The Designer Stage Input Link
Optional insertion of Sort and Rem. Dup.
Partitioner/Collectors inserted on Input links
View Input Column Definitions
One page per input link
The Designer Stage Output Link - Mapping Page
Column Mapping: Specify relationship between input and output columns, or how output columns are derived.
Click & Drag to create mappings
Will attempt to automatically map input to output
Output column definitions
The Designer Stage Output Link - Column Page
Output column definitions
Make sure this is checked!
One page per output link
The Director
• Typical Use: Director allows user to select jobs already compiled (from Designer), fill-in run-specific environments and parameters, validate and:
1. Run
2. Display log messages
3. Inspect highlighted message
4. Inspect previous/next message
more on these steps next slide…
The Director
[Screenshot: Director window with callouts 1 to 4 marking the four steps listed above.]
The Director
Typical Job Log Messages:
• Environment variables
• Configuration File information
• Framework Info/Warning/Error messages
• Output from the Peek Stage
• Additional info with "Reporting" environments
• Tracing/Debug output
  – Must compile job in trace mode
  – Adds overhead
Job Level Environments
• Job Properties, from Menu Bar of Designer
• Director will prompt you before each run
Agenda
1. Introduction to Parallel Extender
1a. Overview; Two Complete Jobs
1b. Client-Server Architecture
1c. The Job Sequencer (optional material)
Managing Multiple Jobs: The Job Sequencer Canvas
• ‘Job Sequencer’ canvas controls job execution, conditional on successful completion (or failure, etc.) of other jobs
  – Job Activity stages reference job paths and activity options
• Links between Job Activity stages specify the sequence of execution
• Triggers Tab states condition for execution of Job down specified link
– All sequence stages, from the Repository pane:
Job Sequencer
• Job Sequence
  – FirstJob is executed first
  – If the job results in a run with warnings, SecondJob is executed
• Each job is referenced using ‘Job Activity’ Stage
• Sequence of execution is specified by linking the two stages
Job Activity Stage
For job named here, 4 Activity Options
Job Activity Stage Triggers Tab
Condition options that trigger Activity
Sequencer Stage
• Sequencer
  – Specify ANY or ALL option
  – In the sequencer above, if either GeneratorPeek or GeneratorPeekSortDataset runs successfully, then DS314PXSequencer is run
Sequencer
Sequencer Stage: Sequencer Tab
Two modes:
• Any: Start Activity if any input condition holds
• All: Start Activity only if all input conditions hold
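The two Sequencer modes map directly onto Python's built-in `any` and `all`. The sketch below is only a model of the rule stated on the slide; the function name and the boolean encoding of "input condition holds" are assumptions, not part of the product.

```python
def sequencer_fires(mode, input_conditions):
    """Return True when the downstream Activity should start.

    mode: "Any" or "All", as on the Sequencer tab.
    input_conditions: one boolean per incoming link.
    """
    if mode == "Any":
        return any(input_conditions)
    if mode == "All":
        return all(input_conditions)
    raise ValueError(f"unknown mode: {mode!r}")


# One of two upstream jobs succeeded.
any_result = sequencer_fires("Any", [True, False])
all_result = sequencer_fires("All", [True, False])
```

With one successful input out of two, "Any" starts the next activity while "All" keeps waiting.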
Notification
Notification Stage
Notification Activity
Notification Activity (cont'd)
• Sample DataStage log from Mail Notification
Notification Activity (cont'd)
• E-Mail Message
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
Agenda
2. Concepts in Parallel Computing
2a. Zoology of Parallel H/W
2b. DS-PX Program Elements
2c. Three Types of Parallelisms
Scalable Systems
• Parallel processing = executing on multiple CPUs
• Scalable processing = add more resources (CPUs, RAM, and disks) to increase performance
Scalable Systems: Examples
Two main types of scalable systems
• Symmetric MultiProcessors (SMP), shared memory
  – Sun Starfire™
  – IBM S80
  – Compaq GS Series
  – HP Superdome
• Clusters: UNIX systems connected via networks
  – Sun Cluster
  – Compaq TruCluster
• Massively Parallel Computers (MPP)
  – IBM SP (formerly SP/2)
Scalable Architectures: Typology
• Symmetric Multiprocessor (SMP): Shared Everything
  [Figure: four CPUs attached to one Shared Memory]
• Loosely-Coupled Clusters
• Massively Parallel Systems (MPP): Shared Nothing
  [Figure: four CPUs, each with its own Memory]
SMP: Shared Everything
When used with Parallel Extender:
• Data transport uses shared memory
• Simpler install and startup
Parallel Extender treats NUMA (NonUniform Memory Access) as plain SMP
[Figure: four CPUs attached to one Shared Memory (Shared Everything)]
MPP: Shared Nothing
• Cluster of multiple independent systems (e.g., Unix boxes) connected by a medium-speed network, typically Ethernet.
• True MPP: cluster in a single box, connected by high-speed switch.
DS-PX treats them the same.
[Figure: four CPUs, each with its own Memory (Shared Nothing)]
Agenda
2. Concepts in Parallel Computing
2a. Zoology of Parallel H/W
2b. DS-PX Program Elements
2c. Three Types of Parallelisms
DS-PX Program Elements: Datasets, Partitions, Nodes
• Dataset: uniform set of rows in the Framework's internal representation. Two flavors:
  1. persistent: *.ds : stored on multiple Unix files in Framework format, read and written using the DataSet Stage
  2. virtual: *.v : links, in Framework format, NOT stored on disk
• Partition: subset of rows in a dataset earmarked for processing by the same node (virtual CPU, declared in a configuration file).
DS-PX Program Elements: Persistent Datasets
• Accessed from/to disk with DataSet Stage.
• Two parts:
  – Descriptor file: contains metadata, data location, but NOT the data itself
  – Data file(s): contain the data; multiple Unix files (one per node), accessible in parallel
input.ds
node1:/local/disk1/…
node2:/local/disk2/…
record ( partno: int32; description: string; )
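To make the descriptor/data split concrete, here is a loose Python analogy, not the Framework's actual on-disk format (which this section does not describe): a small descriptor file records the schema and the locations of the data files, and each partition lives in its own file that can be read independently. The JSON encoding, file naming, and function names are all assumptions made for the sketch.

```python
import json
import os
import tempfile


def write_dataset(directory, name, schema, partitions):
    """Write one data file per partition plus a descriptor pointing at them."""
    data_files = []
    for index, rows in enumerate(partitions):
        path = os.path.join(directory, f"{name}.part{index}")
        with open(path, "w", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        data_files.append(path)
    descriptor_path = os.path.join(directory, f"{name}.ds")
    with open(descriptor_path, "w", encoding="utf-8") as handle:
        # The descriptor holds metadata and locations, never the rows.
        json.dump({"schema": schema, "data_files": data_files}, handle)
    return descriptor_path


def read_partition(descriptor_path, index):
    """Read a single partition without touching the other data files."""
    with open(descriptor_path, encoding="utf-8") as handle:
        descriptor = json.load(handle)
    with open(descriptor["data_files"][index], encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


with tempfile.TemporaryDirectory() as tmp:
    descriptor = write_dataset(
        tmp,
        "input",
        {"partno": "int32", "description": "string"},
        [[{"partno": 1, "description": "bolt"}], [{"partno": 2, "description": "nut"}]],
    )
    second = read_partition(descriptor, 1)
```

Because each partition has its own file, several readers could call `read_partition` with different indexes at the same time, which is the parallel access the slide refers to.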
Persistent Datasets: Partitioned Storage
• Datasets remain data-partitioned when stored to disks, allowing parallel I/O in addition to parallel processing
• They implement end-to-end parallelism
• Whenever possible, it is advantageous to use DataSet Stages rather than SequentialFile Stages in flows such as
• Second advantage: DataSet Stages save the cost of data translation (see Module 4)
DS-PX Program Elements: Partitioners and Collectors
• Partitioners distribute rows into partitions (no icon)
  – implement data-partition parallelism
• Collectors = inverse partitioners
• Live on input links of stages running
  – in parallel (partitioners)
  – sequentially (collectors)
• Use a choice of methods (see next Module)
Quiz!
• True or False? Everything that has been data-partitioned must eventually be collected
• Hint: Two slides up: leave the source SequentialFile stages as they are but replace the target SequentialFile with a DataSet stage.
Answer: FALSE!
Counterexample:
PX Program Elements: Stages
• Act on data.
• Can be parallel or sequential
• Parallel Extender supports:
  – Over 40 Built-In stages
  – Custom stages
• Multiple Type
• All use same generic intelligent stage editor
• Supports Stream, Reference and Reject links
Stages Control Partition Parallelism
• Execution mode (sequential/parallel) is controlled by Stage
  – default = parallel for most Ascential-supplied Stages
  – user can override default mode
  – parallel Stage inserts the default partitioner (Auto) on its input links
  – sequential Stage inserts the default collector (Auto) on its input links
  – user can override default
    • execution mode (parallel/sequential) of Stage (Advanced tab)
    • choice of partitioner/collector
• Degree of parallelism (How many nodes?) is determined by the configuration file
  – Total number of logical nodes in nameless default pool, or a subset using "constraints" (Advanced Topic).
Agenda
2. Concepts in Parallel Computing
2a. Zoology of Parallel H/W
2b. DS-PX Program Elements
2c. Three Types of Parallelisms
Flow-Based Programming I: Sequential Features
[Figure: flow of stages labeled Sort, Derivation, Sample, Lookup, Constraint]
• Reusable operators, data and metadata are connected into a logical flow
  – Data driven vs. demand driven
  – Eager vs. lazy
  – Push vs. pull
Flow-Based Programming II: Parallel Features
• Three types of parallelism may occur in this picture:
  >> 1 explicit
     • e.g., constraint
  >> 2 implicit
     • pipeline
     • data partitioning
Three Types of Parallelism
[Figure: the flow (Sort, Derivation, Sample, Lookup, Constraint) annotated with boxes labeled explicit, pipeline, and data-partition]
• Explicit parallelism
• Implicit pipeline "parallelism"
• Implicit data-partition parallelism
QUIZ! Where should the blue box go?
Implicit Parallelisms: Partitioned and Pipeline
Data Partition Parallelism
Multiple data partitions: node 1 and node 2 execute Stage 1, each on their own partition, independently and concurrently
Pipeline 'Parallelism'
One data partition: CPU 2 starts executing Stage 2 before completion of Stage 1 by CPU 1
Another view on Pipeline Multiprocessing
• Sequential jobs
  – Process a row at a time all the way through all operations
• Pipelined processing
  – Operations run whenever data available
  – Except at boundaries
  – Operations in separate processes with inter-process buffering
[Figure: import, clean, and transform drawn along a time axis, once end to end and once stacked]
Runtime Unix Processes
Given a Job Design:
…Parallel Extender creates at runtime N Unix processes for each Stage, where N is the number of logical nodes defined in the configuration file (NOT in the Job Design, which is good for any value of N).
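The rule above is simple arithmetic, sketched here in Python. The function name is hypothetical, and the sketch assumes every stage runs in parallel mode and counts only the per-stage processes named on the slide, not any supporting processes the engine may also start.

```python
def count_stage_processes(n_stages, n_logical_nodes):
    """Processes started for the stages alone: N per stage."""
    return n_stages * n_logical_nodes


# The same four-stage design under two different configuration files.
one_node = count_stage_processes(4, 1)
four_nodes = count_stage_processes(4, 4)
```

The job design contributes only the number of stages; the configuration file contributes N.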
The Configuration File
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
Two key aspects:
1. # nodes declared
2. defining a subset of resources ("pool") for execution under "constraints," i.e., using a subset of resources (Advanced Topic)
Advanced topic!
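Both key aspects can be read off the file mechanically. The Python sketch below counts the declared nodes and lists the nodes in a named pool, using plain regular expressions on the text exactly as it appears on the slide. It is not a real parser for the engine's configuration grammar: it assumes one `pool` line per node, each on its own line, and the function names are made up for the example.

```python
import re

CONFIG_TEXT = '''
{
  node "n1" {
    fastname "s1"
    pool "" "n1" "s1" "app2" "sort"
    resource disk "/orch/n1/d1" {}
    resource disk "/orch/n1/d2" {"bigdata"}
    resource scratchdisk "/temp" {"sort"}
  }
  node "n2" {
    fastname "s2"
    pool "" "n2" "s2" "app1"
    resource disk "/orch/n2/d1" {}
    resource disk "/orch/n2/d2" {"bigdata"}
    resource scratchdisk "/temp" {}
  }
  node "n3" {
    fastname "s3"
    pool "" "n3" "s3" "app1"
    resource disk "/orch/n3/d1" {}
    resource scratchdisk "/temp" {}
  }
  node "n4" {
    fastname "s4"
    pool "" "n4" "s4" "app1"
    resource disk "/orch/n4/d1" {}
    resource scratchdisk "/temp" {}
  }
}
'''


def node_pools(config_text):
    """Map each node name to the pool names on its pool line."""
    names = re.findall(r'\bnode\s+"([^"]+)"', config_text)
    pool_lines = re.findall(r'^\s*pool\s+(.*)$', config_text, re.MULTILINE)
    if len(names) != len(pool_lines):
        raise ValueError("expected exactly one pool line per node")
    return {
        name: re.findall(r'"([^"]*)"', line)
        for name, line in zip(names, pool_lines)
    }


def nodes_in_pool(config_text, pool):
    """Nodes whose pool line mentions the given pool name."""
    return [n for n, pools in node_pools(config_text).items() if pool in pools]


default_pool = nodes_in_pool(CONFIG_TEXT, "")  # the nameless default pool
app1_pool = nodes_in_pool(CONFIG_TEXT, "app1")
sort_pool = nodes_in_pool(CONFIG_TEXT, "sort")
```

In this file the nameless default pool contains all four nodes, so an unconstrained stage would run four ways, while a stage constrained to "app1" would use three nodes and one constrained to "sort" only one.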
Config File Key Aspect #1: # Nodes Declared
• Better than editing one config file, keep several versions!
• In Director, you can switch config file before each run, e.g.:
  – '1-node' file: for sequential execution, lighter reports; handy for testing; aims at max pipeline
  – 'MedN-nodes' file: aims at a mix of pipeline and data-partitioned parallelisms
  – 'BigN-nodes' file: aims at full data-partitioned parallelism
• Only one file is active while a job is running: the one pointed to by the environment variable $APT_CONFIG_FILE
• The # of nodes declared in the config file need not match the # of CPUs
• The same configuration file can be used on development and target machines
Quiz! Is this scenario plausible?
1. Anna Liz runs the flow below using a one-node configuration file. Her computer has one CPU only.
2. At dinner, she breaks a tooth on an olive pit.
3. At exactly midnight, the Tooth Fairy secretly adds a second CPU to Anna Liz's computer.
4. The next morning, Anna Liz runs the same flow with the same data and the same one-node configuration file.
5. Anna Liz's program runs much faster.
The correct answer is NO: the tooth fairy does not exist.
YES! Pipeline Multiprocessing: 4 Unix processes keep both CPUs busy
[Figure: job flow with four stages: SeqFile, SeqFile, Derivation, Sample]
QUIZ!
• True or False?
The Sort stage buffers all the incoming rows (in the partition) before outputting the first outgoing row.
The Sort stage inhibits pipeline multiprocessing.
The Sort stage creates a synchronization point: the fastest node will wait for all nodes to complete the Sort Stage before they all attack the next stage.
Answers
The Sort stage buffers all the incoming rows (in the partition) before outputting the first outgoing row.
TRUE! The Sort stage needs to read all the rows before deciding which row to output first.
The Sort stage inhibits pipeline multiprocessing.
TRUE! Buffering inhibits streaming: the stages downstream must wait until Sort completes reading all the rows.
The Sort stage creates a synchronization point: the fastest node will wait for all nodes to complete the Sort Stage before they all attack the next stage.
FALSE! Nodes independently process their own instantiation of the stages; partitions do not communicate. The only way to ensure synchronization is to land to disk, terminate the job, and start a new job. See Lab 4a.
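The first two answers can be observed in a small Python model of a pipeline with a sort in the middle. The stage names and the `events` trace are illustrative assumptions; the point is only that a sort cannot emit its first row until it has consumed its last input row, whereas a streaming stage would interleave reads and writes.

```python
events = []  # order in which rows are read upstream and emitted downstream


def source(rows):
    """Upstream stage: log each row as it is handed over."""
    for row in rows:
        events.append(f"read {row}")
        yield row


def sort_stage(rows):
    """Sort must buffer: sorted() drains the whole input before returning."""
    buffered = sorted(rows)
    for row in buffered:
        yield row


def peek(rows):
    """Downstream stage: log each row as it arrives."""
    for row in rows:
        events.append(f"emit {row}")
        yield row


result = list(peek(sort_stage(source([3, 1, 2]))))
```

All three "read" events come before the first "emit", so everything downstream of the sort waits for it, within that one partition.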
• Lab 1: Getting Around
• Lab 2: A Simple Parallel Extender Program
• Lab 3: Modify Job, Create a Dataset
Learn by doing!
LABS
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Custom Stages
88© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data
In a nutshell:
- To distribute rows among nodes, Parallel Extender employs an effective default method. The user can override the default with a choice of alternative methods. (Partitioning)
- The same applies for programs that require recollecting the rows into a sequential stream. (Collecting)
89© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data
• Partitioning breaks the dataset into smaller sets of rows (partitions), which can be processed independently by multiple nodes. Each node executes in parallel its own instantiation of the stages.
• Collecting brings back data partitions into a sequential stream.
90© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning Methods
Partitioning methods:
– (Auto): Parallel Extender decides (default): Same or Round Robin
– Same: Existing partitioning is not altered
– Round Robin: Rows are alternated among partitions
– Entire: Each partition gets the entire dataset (rows duplicated)
– Random: Rows randomly assigned to partitions
– Hash: Rows with same key column value go to the same partition
– Range: Similar to hash, but partition mapping is user-determined and partitions are ordered
– Modulus: Assigns each row of an input dataset to a partition, as determined by a specified numeric key column in the input dataset
– DB2: Matches DB2 EEE hash semantics
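The methods listed above can be illustrated with a small Python sketch. This is not Parallel Extender code: the function names are hypothetical, partitions are modeled as plain lists, and zlib.crc32 merely stands in for whatever hash function the Framework actually uses.

```python
import zlib


def round_robin(rows, n):
    """Deal rows to n partitions in turn, like dealing cards."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def entire(rows, n):
    """Give every partition its own complete copy of the rows."""
    rows = list(rows)
    return [list(rows) for _ in range(n)]


def modulus(rows, n, key):
    """Partition number = integer key value mod number of partitions."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def hash_partition(rows, n, key):
    """Rows with the same key value always land in the same partition.

    zlib.crc32 is only a stand-in hash (an assumption of this sketch),
    chosen because it gives the same result on every run.
    """
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(repr(row[key]).encode("utf-8"))
        parts[digest % n].append(row)
    return parts


# Nine row IDs dealt to three partitions: [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
dealt = round_robin(range(9), 3)
```

Round Robin and Entire ignore the row content, while Modulus and Hash look only at the key, which is why only the latter two keep "matching" rows together.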
91© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Same: rows retain current distribution
[Figure: three partitions holding row IDs (0, 3, 6), (1, 4, 7), (2, 5, 8); the distribution is identical before and after]
Partitioning and Collecting Data Partitioning Methods
Same = identity transform, does nothing
92© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data Partitioning Methods
Round Robin: rows are distributed evenly among partitions, as in dealing cards
[Figure: input stream …8 7 6 5 4 3 2 1 0 is dealt to three partitions holding row IDs (0, 3, 6), (1, 4, 7), (2, 5, 8)]
93© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Entire: each partition gets a complete copy of the data
• Useful for distributing lookup tables, parameter files, etc.
• WARNING: Increases the data volume!
[Figure: input stream …8 7 6 5 4 3 2 1 0; each of the three partitions receives every row (0, 1, 2, 3, …)]
Partitioning and Collecting Data Partitioning Methods
94© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Hash: rows are distributed according to the values in one or more user-defined key columns
• Rows with identical values in key columns end up in the same partition
• Prevents "matching" rows (such as sought by the Remove Duplicates, Joins, and Merge Stages) from hiding in other partitions.
Partitioning and Collecting Data Partitioning Methods
95© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Hash (continued):
[Figure: input stream with key column values …0 3 2 1 0 2 3 2 1 1 is split into three partitions holding key values (0, 3, 0, 3), (1, 1, 1), (2, 2, 2)]
Partitioning and Collecting Data Partitioning Methods
96© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Range: an expensive refinement of Hash
• A given partition will contain only rows with key values within some "range."
• Must first run the 'Write Range Map' Stage.
Partitioning and Collecting Data Partitioning Methods
97© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Range (continued):
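[Figure: input stream with key values 4 0 5 1 6 0 5 4 3 is split into three ordered partitions holding key values (0, 1, 0), (4, 4, 3), (5, 6, 5)]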
Partitioning and Collecting Data Partitioning Methods
•QUIZ! If incoming data is ordered on key, something bad happens. WHAT?
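Range partitioning can be sketched in Python as a lookup of each key in a sorted list of split points. The split points [2, 5] below are invented to reproduce the slide's picture; in Parallel Extender they would come from the range map written by the 'Write Range Map' Stage. The function name is hypothetical.

```python
import bisect


def range_partition(rows, boundaries, key):
    """Assign each row to the partition whose key range contains its key.

    boundaries is a sorted list of split points: with [2, 5], keys below 2
    go to partition 0, keys 2 to 4 to partition 1, keys 5 and up to partition 2.
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_right(boundaries, row[key])].append(row)
    return parts


# The slide's stream, read in arrival order (rightmost value first).
rows = [{"k": k} for k in (3, 4, 5, 0, 6, 1, 5, 0, 4)]
parts = range_partition(rows, [2, 5], "k")
```

Unlike Hash, the partitions are ordered: every key in partition 0 is smaller than every key in partition 1, and so on.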
98© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data Partitioning Methods
Modulus: an inexpensive form of Hash
• Allows one key column, of type integer:
• Partition = key_value (mod # of partitions)
99© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data Collectors
• Collectors combine partitions of a dataset into a single input stream to a sequential Stage
[Figure: several data partitions (NOT links) feed a collector, which feeds a sequential Stage]
100© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Partitioning and Collecting Data Collectors (continued)
Collector Methods:
– (Auto): Eagerly read any row from any input partition (default, non-deterministic)
– Round Robin: Patiently pick row from input partitions
– Ordered: Read all rows from first partition, then second,… (e.g., print chapter 1, then 2,…)
– Sort Merge: Read rows based on key columns, produce fully sorted from within-partition sorted (non-deterministic on un-keyed columns)
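The three deterministic collector methods above can be sketched in Python as follows. Partitions are modeled as lists and the function names are hypothetical; the eager (Auto) collector is left out because its order depends on run-time timing.

```python
import heapq
from itertools import chain, zip_longest


def collect_ordered(parts):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(parts))


def collect_round_robin(parts):
    """Take one row from each partition in turn, skipping exhausted ones."""
    missing = object()
    out = []
    for group in zip_longest(*parts, fillvalue=missing):
        out.extend(row for row in group if row is not missing)
    return out


def collect_sort_merge(parts, key=None):
    """Merge partitions that are each already sorted into one sorted stream."""
    return list(heapq.merge(*parts, key=key))


parts = [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
ordered = collect_ordered(parts)       # [0, 3, 6, 1, 4, 7, 2, 5, 8]
dealt_back = collect_round_robin(parts)  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Sort Merge only produces a fully sorted stream if every input partition is already sorted on the same key.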
101© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Collectors: Non-Deterministic Execution
Collectors can cause surprises:
• Default (Auto) is "eager" to output rows and may be nondeterministic: row order may vary from run to run.
• This can be seen using the most common "effective collectors": the standard out/error
102© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The sequential Generator generates 4 rows with one integer column named "old" and taking values 0,1,2,3.
Simple Example of Partitioning & Collecting
Default Round Robin happens to make the values 0,1,2,3 label the partition #
The First Peek transports data and has a side-effect: it sends messages to the monitor (not shown). It also renames the column from "old" to "new"
The second peek sends data to the bit bucket and also sends messages
"old" "new"
103© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Job Log Shows Messages out of Flow Order
time
snapshot
Example (line 5 above):
SecondPeek,2: new: 2
= SecondPeek, partition 2, showed value 2 in column "new"
SecondPeek sends messages prior to completion by FirstPeek
104© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
What Happened? The Standard Out Acted as a Collector
Job Design: [Figure: job design carrying one row per partition, values 0, 1, 2, 3]
Conceptually correct, but impractical deployment: no collector! [Figure: each partition writes to its own standard out]
Actual deployment: hidden collector. The Conductor Node collects messages and sends a sequential stream to standard out.
105© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Sequence of Events
Snapshot during run. Row in partition:
– 2 passed both Peeks
– 1 and 3 passed FirstPeek only
– 0 still upstream of FirstPeek
[Figure: data transport (one row/partition, values 0, 1, 2, 3) across partitions p0 to p3, with side effects (messages to log from First and Second Peek); shows the snapshot during the run and the state after job completion, labeled with partition IDs]
106© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Synchronization Points
• Repeated runs yield different message orders, in violation of determinism ("same causes, same effects")
• SecondPeek's 1st message often precedes FirstPeek's last message, in apparent violation of flow order
– In Lab 4a, you will force flow order while maintaining parallelism: SecondPeek will wait for completion of FirstPeek. You will create a "synchronization point." (This lab is optional.)
107© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
DUMP SCORE Output
Double-click
Mapping Node--> partition
Setting APT_DUMP_SCORE yields:
QUIZ!Why 9 Unix processes?
Confirms no combination occurred
108© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
DataSets: Best of Both Worlds
• Our previous design faced a deployment dilemma:
– Conceptually correct but unpractical (N stdouts), or
– Feasible (1 stdout) but conceptually weak (unnecessary collector)
• Adding a DataSet stage solves the dilemma and keeps the best of both worlds:
– Single file view: "twin.ds" but
– Parallelism is maintained: N partitions stored in N actual files
– Note: other stages (FileSet, RDBMS) achieve the same
109© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Copy Stage
With 1 link in, 1 link out: the Copy Stage is the ultimate "no-op" (place-holder):
– Partitioners
– Sort / Remove Duplicates
– Rename, Drop column
… can be inserted on:
– input link (Partitioning): Partitioners, Sort, Remove Duplicates
– output link (Mapping page): Rename, Drop.
110© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Column Generator Stage
• Affixes new column(s) to incoming rows
• Intrinsic functions:
'part' = current partition number
'partcount' = number of partitions
• Same format properties as Row Generator, e.g.,
Type = Cycle
Initial value =
Increment =
Limit =
• Examples:
Partition number (Initial value = 'part', Increment = 0)
Unique row ID (Initial value = 'part', Increment = 'partcount'), usable as surrogate key
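Why Initial value = 'part' with Increment = 'partcount' yields unique row IDs can be checked with a few lines of Python. This is only an arithmetic sketch; row_ids is a hypothetical helper, not a Parallel Extender function.

```python
from itertools import count, islice


def row_ids(part, partcount):
    """Yield part, part + partcount, part + 2 * partcount, ..."""
    return count(start=part, step=partcount)


partcount = 3
# First three IDs generated independently in each of the three partitions.
ids = [list(islice(row_ids(part, partcount), 3)) for part in range(partcount)]
# ids == [[0, 3, 6], [1, 4, 7], [2, 5, 8]]
```

Each partition only ever produces values congruent to its own partition number modulo partcount, so no two partitions can generate the same ID and no communication between partitions is needed.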
111© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
LABS
• Lab 4a: Synchronization (optional)
• Lab 4b: Explore Partitioners and Collectors
Learn by doing!
113© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data (SequentialFile)
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Custom Stages
114© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Sequential File Stage Importing/Exporting Data
In a nutshell:
– If one knows how some data was written: Parallel Extender can read it!
– If one specifies how some data must be read: Parallel Extender can write it!
Sequential File
115© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Sequential File Stage Introduction
• The Framework processes only datasets
• For files other than datasets, Parallel Extender must perform import and export operations (i.e., format translations in addition to extract and load)
• External data formats fall in two major categories:
– Favorable: the format translation is automatic or semi-automatic
 • data stored in a relational database (DB2, Informix, Oracle, Teradata)
 • data stored in a SAS data set
 • data stored in a COBOL data file
– Other: user needs to specify manually external formats
 • everything else: flat text files, binary files. Use the SequentialFile Stage
116© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Importing/Exporting Data Data Compatibility
How is the data described to Parallel Extender?
• Relational data: automatic. Parallel Extender gets the information directly from the RDBMS catalog
• SAS datasets: automatic. Parallel Extender gets the information directly from the SAS header
• COBOL data from IBM mainframes: semi-automatic. Parallel Extender can convert a COBOL record layout into an appropriate import/export table definition
• Generic data (binary from any source, Unix text files): user creates/imports metadata
– Table Definition
– Framework "schema"
117© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
From one Dimension to Two
• Memory is one-dimensional, tables are two-dimensional
• Two major steps:
1. recordization: identify rows
2. column parsing: within a row, identify columns
• Tools: fixed length, prefixes, delimiters
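The three tools named above (delimiters, fixed length, prefixes) can each be sketched in Python on a flat byte string. The function names and the 2-byte big-endian length prefix are assumptions made for illustration; they are not the Framework's import operators or its actual prefix format.

```python
import struct


def split_delimited(data, record_delim=b"\n", field_delim=b","):
    """Recordize on a record delimiter, then parse columns on a field delimiter."""
    records = [rec for rec in data.split(record_delim) if rec]
    return [rec.split(field_delim) for rec in records]


def split_fixed(data, widths):
    """Recordize and parse columns purely by byte counts."""
    record_size = sum(widths)
    rows = []
    for start in range(0, len(data), record_size):
        record = data[start:start + record_size]
        fields, pos = [], 0
        for width in widths:
            fields.append(record[pos:pos + width])
            pos += width
        rows.append(fields)
    return rows


def split_prefixed(data):
    """Recordize using a 2-byte big-endian length prefix before each record."""
    records, pos = [], 0
    while pos < len(data):
        (length,) = struct.unpack_from(">H", data, pos)
        pos += 2
        records.append(data[pos:pos + length])
        pos += length
    return records


rows = split_delimited(b"a,b\nc,d\n")  # [[b"a", b"b"], [b"c", b"d"]]
```

In every variant the one-dimensional byte sequence is first cut into rows and only then into columns, which mirrors the two major steps on the slide.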
118© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Additional Format Conversions
Often additional data conversion is needed, e.g.,
– convert “numbers as text” into binary data
– may have to do character set conversion (EBCDIC to ASCII)
– may have to process packed decimal data
– may have to scan dates in various formats
119© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Using the SequentialFile Stage
Importing/Exporting Data
Both import and export of general files (text, binary) are performed by the SequentialFile Stage.
– Data import:
– Data export
QUIZ! True or false: the links here represent the flat files.
120© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Sequential File Stage
GUI can generate arbitrary Framework schemas
121© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Textual Representation of Schemas
Text is an equivalent alternative to the representation in the previous slide
• Text can be imported into a Table Definition
• Schemafile option for one-time use by SequentialFile and Generator Stages
• Example:
record (
  name : string[];
  zip : int32;
  income : sfloat;
)
122© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Another Look at Datasets
• Contain data organized as rows
• Provide “single-file” view of data stored in parallel
• Include metadata information: “schema”
• Support many data types

dataset:
First Name  Last Name  Birthdate  Gender  Salary   Dependents
John        Smith      1/3/62     M       28,500   1
Jane        Doe        4/23/61    F       31,350   2
Ralph       Mann       11/14/57   M       41,200   5
…

row:
John        Smith      1/3/62     M       $28,500  1

Columns, defined by
- schema's "unformatted core":
 • column name
 • data type (string, date, etc.)
 • nullability
- format properties, for external representation, listed within {} in schemas
123© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Nullable Columns
All data types are nullable
– Null columns do not have a value
– DS-PX null is represented by an out-of-band indicator
– Nulls can be detected by any stage
– Nulls can be converted to/from a value
– Null columns can be ignored by a stage, or can trigger an error or other action
124© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Textual-GUI Mapping
Unformatted Core = Columns Tab
Format Properties { } = Format Tab (in Parallel Stages) = Parallel Tab (in Table Definitions)
125© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
PX Program Elements Table/Column Definitions
Textual Equivalent:
record {final_delim=end, delim=',', quote=double}
(
  LastName :string[max=25];
  FirstName :string[max=15];
  Address :string[max=60];
  City :string[max=20];
  State :string[2];
  Zip :string[5];
)
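To make the format properties concrete, here is a Python sketch that reads text laid out the way this schema describes: comma-delimited, double-quoted fields, one record per line. It uses Python's csv module, not the SequentialFile Stage; string[max=N] is read here as "at most N characters" and string[N] as "exactly N characters", and the sample record is invented.

```python
import csv
import io

# (column name, length, fixed-length?) read off the schema above.
COLUMNS = [
    ("LastName", 25, False),
    ("FirstName", 15, False),
    ("Address", 60, False),
    ("City", 20, False),
    ("State", 2, True),
    ("Zip", 5, True),
]


def read_rows(text):
    """Parse delimited, quoted text into one dict per record, checking lengths."""
    reader = csv.reader(io.StringIO(text), delimiter=",", quotechar='"')
    rows = []
    for fields in reader:
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        row = {}
        for (name, length, fixed), value in zip(COLUMNS, fields):
            ok = len(value) == length if fixed else len(value) <= length
            if not ok:
                raise ValueError(f"bad length for column {name}: {value!r}")
            row[name] = value
        rows.append(row)
    return rows


# Invented sample record; note the comma inside the quoted Address field.
SAMPLE = '"Doe","Jane","1 Main St, Apt 2","Springfield","MA","01101"\n'
rows = read_rows(SAMPLE)
```

The quote property is what lets the Address field contain the delimiter itself without being split into two columns.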
126© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Importing/Exporting Data Schema Import for COBOL
• COBOL schema import takes COBOL program, program excerpt, or data definition file (“copy book”)
• Creates schema from an FD (File Description) section, or 01 Level statement
• Generated schema can be used to import from or export to a COBOL data file.
127© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
LABS
• Lab 5a: Simple Format Translation
• Lab 5b: Flat File Import
• Lab 6: Cobol Data Import (Optional)
Learn by doing!
129© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
130© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Combining Data
• There are two ways to combine data:
– Horizontally: Several input links; one output link (+ optional rejects) made of columns from different input links. E.g.,
 • Joins
 • Lookup
 • Merge
– Vertically: One input link, output with column combining values from all input rows. E.g.,
 • Aggregator
131© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Agenda
5. Overview of Some Parallel Extender Stages
5a. Combining Data, Horizontally: Join/Lookup/Merge
5b. Combining Data, Vertically: Aggregator
5c. The Transformer
132© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Join, Lookup & Merge Stages: Introduction
• These "three Stages" combine two or more input links according to values of user-designated "key" column(s).
• They differ mainly in:
– Memory usage
– Treatment of rows with unmatched key values
– Input requirements (sorted, de-duplicated)
• Complete summary on slide 153
133© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Joins - Lookup - Merge: Not all Links are Created Equal!
• Parallel Extender distinguishes between:
- The Primary Input (Framework port 0)
- Secondary - in some cases "Reference" (other ports)
• Naming convention:
                               Joins   Lookup       Merge
Primary Input: port 0          Left    Source       Master
Secondary Input(s): ports 1,…  Right   LU Table(s)  Update(s)
Tip: Check "Input Ordering" tab to make sure intended Primary is listed first
134© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Join, Lookup & Merge Stages: Behavior
We shall first use a simplest case, optimal input:
• two input links: "left" as primary, "right" as secondary
• sorted on key column (here "Citizen"),
• without duplicates on key
Left link (primary input):
Revolution  Citizen
1789        Lefty
1776        M_B_Dextrous
Right link (secondary input):
Citizen       Exchange
M_B_Dextrous  Nasdaq
Righty        NYSE
135© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
1. The Join Stage
• 2 sorted input links, 1 output link
– "left" on primary input, "right" on secondary input
– Pre-sort makes joins "lightweight": few rows need to be in RAM
Four types:
• Inner
• Left Outer
• Right Outer
• Full Outer
136© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Join Stage Editor
One of four variants:
– Inner
– Left Outer
– Right Outer
– Full Outer
Several key columns allowed
Link Order immaterial for Inner and Full Outer Joins (but VERY important for Left/Right Outer and Lookup and Merge)
137© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Inner Join
• Transfers rows from both data sets whose key columns contain equal values to the output link
• Treats both inputs symmetrically
Output of innerjoin on key Citizen, using "Simplest Case" input:
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Same output as lookup/reject and merge/drop
138© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Left Outer Join
• Transfers all values from the left link and transfers values from the right link only where key columns match.
Can perform lookup with: - Left link as Source - Right link as LU Table
Revolution  Citizen       Exchange
1789        Lefty         (empty string)
1776        M_B_Dextrous  Nasdaq
Same output as lookup/continue and merge/keep
139© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Lookup using Join/LeftOuter
Check Link Ordering Tab to make sure intended Primary is listed first
140© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Right Outer Join
• Transfers all values from the right link and transfers values from the left link only where key columns match.
• Can perform lookup with:
- Right link as Source
- Left link as LU Table
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
0           Righty        NYSE
Callout on the unmatched Revolution value: Integer 0
141© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Full Outer Join
• Transfers rows from both data sets, whose key columns contain equal values, to the output link.
• It also transfers rows, whose key columns contain unequal values, from both input links to the output link.
• Creates new columns, with new column names!
Revolution  leftRec_Citizen  rightRec_Citizen  Exchange
1789        Lefty
1776        M_B_Dextrous     M_B_Dextrous      Nasdaq
0                            Righty            NYSE
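The four join types can be compared side by side with a small Python sketch on the "Simplest Case" input. It illustrates the join semantics only: it uses a dictionary lookup instead of the Join Stage's sorted-merge mechanics, assumes duplicate-free keys, fills unmatched columns with the values shown on the slides (empty string, integer 0), and does not rename the key into leftRec_/rightRec_ columns for the full outer join. The join function and its parameters are hypothetical.

```python
LEFT = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
RIGHT = [
    {"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
    {"Citizen": "Righty", "Exchange": "NYSE"},
]
# Fill-in values for columns missing from an unmatched row, as on the slides.
LEFT_DEFAULTS = {"Revolution": 0}
RIGHT_DEFAULTS = {"Exchange": ""}


def join(left, right, key, how="inner"):
    """Join two duplicate-free inputs; how is inner, left, right, or full."""
    if how not in ("inner", "left", "right", "full"):
        raise ValueError(f"unknown join type: {how}")
    right_by_key = {row[key]: row for row in right}
    left_keys = {row[key] for row in left}
    out = []
    for lrow in left:
        rrow = right_by_key.get(lrow[key])
        if rrow is not None:
            out.append({**lrow, **rrow})
        elif how in ("left", "full"):
            out.append({**lrow, **RIGHT_DEFAULTS})
    if how in ("right", "full"):
        for rrow in right:
            if rrow[key] not in left_keys:
                out.append({**LEFT_DEFAULTS, **rrow})
    return out


inner = join(LEFT, RIGHT, "Citizen", "inner")
full = join(LEFT, RIGHT, "Citizen", "full")
```

Swapping LEFT and RIGHT changes nothing for the inner and full outer results apart from column order, which is why link order only matters for the left and right outer variants.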
142© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
2. The Lookup Stage
Combines:
– one source link, on primary input (0), with
– one or more duplicate-free table links, on "reference" input ports (>0)
• no pre-sort necessary
• allows multiple keys, LUTs
• flexible exception handling for source input rows with no match
[Figure: Lookup stage with the Source input on input port 0 and one or more tables (LUTs) on input ports 1, 2; Output on output port 0 and Reject on output port 1]
143© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
In Case of Missing LU entry
• If a Source key column value is present in no LUT, the following options are provided:
– fail: the lookup Stage reports an error and the step fails immediately. This is the default.
– drop: the input row with the failed lookup(s) is dropped
– continue: the input row is transferred to the output (output port 0), together with the successful table entries. The failed table entries are not transferred, resulting in either default output values or null output values.
– output: the input row with the failed lookup(s) is transferred to a second output link, the "reject" link (output port 1). This output option generates the OSH reject option
• There is no option to capture unused LU Table entries (input ports > 0).
– There is nothing wrong with these!
– Compare with the Merge Stage
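The four "missing entry" options can be mimicked in Python with an in-memory dictionary as the lookup table. This is a sketch of the behavior described above, not the Lookup Stage itself: the lookup function, the LookupFailed exception, and the sample rows are hypothetical, and the "continue" branch fills the looked-up columns with None to stand in for null output values.

```python
class LookupFailed(Exception):
    """Raised by the 'fail' option when a source key has no table entry."""


def lookup(source, table, key, lookup_columns, if_not_found="fail"):
    """Return (output rows, reject rows) for one in-memory lookup table."""
    output, reject = [], []
    for row in source:
        entry = table.get(row[key])
        if entry is not None:
            output.append({**row, **entry})
        elif if_not_found == "fail":
            raise LookupFailed(f"no entry for {key}={row[key]!r}")
        elif if_not_found == "drop":
            continue
        elif if_not_found == "continue":
            output.append({**row, **{col: None for col in lookup_columns}})
        elif if_not_found == "output":
            reject.append(dict(row))
        else:
            raise ValueError(f"unknown option: {if_not_found}")
    return output, reject


STATES = {"TX": {"state_name": "Texas"}, "UT": {"state_name": "Utah"}}
SOURCE = [{"id": 1, "state_code": "TX"}, {"id": 2, "state_code": "ZZ"}]

out_rows, rejects = lookup(SOURCE, STATES, "state_code", ["state_name"], "output")
```

Rejected rows carry only the source columns, since no table entry was found to contribute anything. Unused table entries (here UT) are never reported, in line with the last bullet.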
144© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Lookup Stage
• Uses a key column as an index into a table usually containing other values associated with each key.
• Follows the Source/In-Memory Table model.
Key column: state_code, e.g. “TX”
Lookup table:
Index  Associated Value
[…]
SC     South Carolina
SD     South Dakota
TN     Tennessee
TX     Texas
UT     Utah
VT     Vermont
[…]
145© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Lookup Stage
• Use Cases: Table Lookup
– Map small, dense numeric codes to long character strings to improve storage density (the most common use)
– Column validation (if the key isn’t in the table, the key’s value isn’t permitted)
146© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Lookup Stage
• Lookup Tables should be small enough to fit into physical memory (otherwise, performance hit due to paging)
• On an MPP you should partition the lookup tables using the Entire partitioning method, or partition them the same way you partition the source link
• On an SMP, no physical duplication of LUT occurs
147© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Lookup Stage Editor
"If Not Found" = Source input row with no LUT match
One of four options:
– Fail [default]
– Drop
– Continue
– Output (to reject link)
Check that Source is listed first
148© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
LOOKUP: Action on Simplest Input
"Continue" option (same output as join/leftouter and merge/keep):
output link:
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq
"Output" option (same output as join/inner and merge/drop):
output link:
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
reject link (Unmatched Source Entries):
Revolution  Citizen
1789        Lefty
QUIZ! Why does the row in the reject link have only 2 columns?
149© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
3. The Merge Stage
• Combines
– one sorted, duplicate-free master (primary) link with
– one or more sorted update (secondary) links.
– Pre-sort makes merge "lightweight": few rows need to be in RAM (as with joins, but opposite to lookup).
• Follows the Master-Update model:
– A master row and one or more update rows are merged iff they have the same value in user-specified key column(s).
– A non-key column occurs in several inputs? The lowest input port number prevails (e.g., master over update; update values are ignored)
– Unmatched ("Bad") master rows can be either
 • kept (default: transferred to output port 0)
 • dropped
– Unmatched ("Bad") update rows in input link n can be captured in a "reject" link in corresponding output link n.
– Matched update rows are consumed.
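The Master-Update model above can be sketched in Python as a single pass over two sorted inputs. The sketch handles one update link only and assumes both inputs are sorted and duplicate-free on the key; the merge function and its unmatched_master parameter are hypothetical names, not Merge Stage options.

```python
def merge(master, update, key, unmatched_master="keep"):
    """Merge a sorted master with one sorted update link.

    Returns (output rows, rejected update rows).
    """
    if unmatched_master not in ("keep", "drop"):
        raise ValueError(f"unknown option: {unmatched_master}")
    output, reject = [], []
    i = 0
    for mrow in master:
        # Updates whose key sorts before this master row matched nothing.
        while i < len(update) and update[i][key] < mrow[key]:
            reject.append(update[i])
            i += 1
        if i < len(update) and update[i][key] == mrow[key]:
            urow = update[i]
            i += 1  # the matched update row is consumed
            # Master values prevail for columns present in both inputs.
            output.append({**urow, **mrow})
        elif unmatched_master == "keep":
            output.append(dict(mrow))
    reject.extend(update[i:])  # leftover updates never found a master
    return output, reject


MASTER = [
    {"Revolution": 1789, "Citizen": "Lefty"},
    {"Revolution": 1776, "Citizen": "M_B_Dextrous"},
]
UPDATE = [
    {"Citizen": "M_B_Dextrous", "Exchange": "Nasdaq"},
    {"Citizen": "Righty", "Exchange": "NYSE"},
]
kept, rejected = merge(MASTER, UPDATE, "Citizen", "keep")
```

Because both inputs are sorted, only the current master row and the current update row need to be held in memory, which is the sense in which the slide calls the merge "lightweight".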
150© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Merge Stage
• Allows composite keys
• Multiple update links
• Matched update rows are consumed
• Unmatched updates in input port n can be captured in output port n
• Lightweight
[Figure: Merge stage with the Master on input port 0 and one or more updates on input ports 1, 2; Output on output port 0 and Rejects on output ports 1, 2]
151© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Merge Stage Editor
Unmatched Master rows
One of two options:
– Keep [default]
– Drop
(Capture in reject link is NOT an option)
Unmatched Update rows option:
– Capture in reject link(s). Implemented by adding outgoing links
152© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Merge: Action on Simplest Input
Unmatched Master Mode = Keep (same output as leftouterjoin and lookup/continue):
Revolution  Citizen       Exchange
1789        Lefty
1776        M_B_Dextrous  Nasdaq
Unmatched Master Mode = Drop (same output as innerjoin and lookup/reject):
Revolution  Citizen       Exchange
1776        M_B_Dextrous  Nasdaq
Both options yield the same "reject" link of unmatched updates:
Citizen  Exchange
Righty   NYSE
153© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Synopsis: Joins, Lookup, & Merge
In this table, a comma separates primary and secondary input links (and out and reject links).

                                  Joins                       Lookup                             Merge
Model                             RDBMS-style relational      Source - in RAM LU Table           Master - Update(s)
Memory usage                      light                       heavy                              light
# and names of Inputs             exactly 2: 1 left, 1 right  1 Source, N LU Tables              1 Master, N Update(s)
Mandatory Input Sort              both inputs                 no                                 all inputs
Duplicates in primary input       OK (x-product)              OK                                 Warning!
Duplicates in secondary input(s)  OK (x-product)              Warning!                           OK only when N = 1
Options on unmatched primary      NONE                        [fail] | continue | drop | reject  [keep] | drop
Options on unmatched secondary    NONE                        NONE                               capture in reject set(s)
On match, secondary entries are   reusable                    reusable                           consumed
# Outputs                         1                           1 out, (1 reject)                  1 out, (N rejects)
Captured in reject set(s)         Nothing (N/A)               unmatched primary entries          unmatched secondary entries
154© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
LABS
• Lab 7: Table Lookup
• Lab 8: InnerJoin
Learn by doing!
155© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
5a. Combining Data, Horizontally: Join/Lookup/Merge
5b. Combining Data, Vertically: Aggregator
5c. The Transformer
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
156© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Grouping and Summary Data
• Often we don’t want to look at data at the row level, but rather at some summary level:
– How many of each product have been ordered?
 • Group by product and return row count for each group
– What’s the average dollar amount of all orders?
 • No grouping, return average of transaction amount column
– How many customers do we have in each state and what is the average and stddev of transaction amount by state?
 • Group by state and return average and stddev of transaction amount column. De-dupe on customer id within state group and return row count
• SQL supports “select sum(order_quantity) by product”, etc. SAS supports “PROC MEANS; by state; var transaction”, etc.
• Parallel Extender has the Aggregator Stage (a.k.a. group operator).
157© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
The Aggregator Stage
Purpose: perform data aggregations
Specify:
WHAT (semantics):
• zero or more key columns that define the aggregation units (or groups)
• columns to be aggregated (or ‘reduced’)*
• aggregation (reduction) functions include: count (nulls/non-nulls), sum, max/min/range, standard error, % coeff. of variation, sum of weights, un/corrected sum of squares, variance, mean, standard deviation
HOW (performance):
• a grouping method (hash table or pre-sort)
* So-called because a 1D array is ‘reduced’ to a 0D scalar
158© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Aggregator Stage Editor
One of two Methods:
– Sort
– Hash
Allow several grouping keys
Name of result column
Column(s) to aggregate
159© 2003.Ascential Software Corporation. All rights reserved. Confidential and proprietary information. v3.0 February 28, 2003
Grouping Methods
• Hash: results for each aggregation group are stored in a hash table, and the table is written out after all input has been processed
– doesn’t require sorted data
– good when number of unique groups is small. The running tally for each group’s aggregate calculations needs to fit easily into memory. Requires about 1KB/group of RAM.
– Example: average family income by state, requires .05MB of RAM
• Sort: results for only a single aggregation group are kept in memory; when a new group is seen (key value changes), the current group is written out.
– requires input sorted by grouping keys
– can handle unlimited numbers of groups
– Example: average daily balance by credit card
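The difference between the two grouping methods can be sketched in Python for a single aggregation (the mean). The function names and sample rows are invented for illustration; neither function is the Aggregator Stage's implementation.

```python
from itertools import groupby
from operator import itemgetter


def aggregate_hash(rows, key, value):
    """Hash method: keep a running tally per group, emit after all input."""
    tallies = {}  # group value -> [row count, running total]
    for row in rows:
        tally = tallies.setdefault(row[key], [0, 0.0])
        tally[0] += 1
        tally[1] += row[value]
    return {group: total / n for group, (n, total) in tallies.items()}


def aggregate_sort(rows, key, value):
    """Sort method: input must be sorted on key; one group in memory at a time."""
    for group, members in groupby(rows, key=itemgetter(key)):
        n, total = 0, 0.0
        for row in members:
            n += 1
            total += row[value]
        yield group, total / n  # emitted as soon as the key value changes


ROWS = [
    {"state": "MA", "income": 10.0},
    {"state": "TX", "income": 30.0},
    {"state": "MA", "income": 20.0},
]
by_hash = aggregate_hash(ROWS, "state", "income")
by_sort = dict(aggregate_sort(sorted(ROWS, key=itemgetter("state")), "state", "income"))
```

The hash version's memory grows with the number of distinct groups, while the sort version's memory stays constant but shifts the cost to the upstream sort.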
LABS
• Lab 9: Grouping Data
Learn by doing!
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
5a. Combining Data, Horizontally: Join/Lookup/Merge
5b. Combining Data, Vertically: Aggregator
5c. The Transformer
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
The Transformer Stage
The Transformer stage provides a one-stop location for you to easily create simple or complex transformations.
Input(s): left-hand side
Output(s): right-hand side
Mapping Source Columns
• To map source columns to the target, select the columns to be mapped, then drag & drop them on the target.
• Metadata for all mapped columns is carried to the target.
Target
Source
Source Metadata
Target Metadata
Adding New Column
Add a new column:
• Right-click on the desired link and select {Append|Insert} New Column
• Selecting “Append New Column” appends a new column at the end
• Selecting “Insert New Column” inserts a new column above the selected column
Renaming New Column
Rename a new column:
• To rename a column, change its name in the metadata.
• SQL types and lengths can also be defined.
Rename column
Constraint:
•A constraint applies to the entire row.
•A constraint specifies a condition under which incoming rows of data will be written to an output link
•If no constraint is specified, all records are passed through the link
Transformer implements Constraints and Derivations
Constraint Window
Expression Editor Window
• A constraint is defined for each individual link
• Checking the ‘Reject Row’ box sends to that link only those records that did not meet the conditions specified in the constraints.
• No constraint is required for a ‘Reject’ link.
Derivation:
•A Derivation can be applied to each output column.
•Specifies the value to be moved to an output column
•Every output column must have a derivation
•An output column does not require an input column
Transformer implements Constraints And Derivations
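How constraints and derivations fit together can be sketched in Python. This is a rough model, not the Transformer's engine: the function `transformer`, the link names, and the sample columns are all hypothetical. A constraint decides whether a whole row is written to a link, and each output column gets its own derivation; rows written to no link go to the reject output.

```python
def transformer(rows, links, reject=None):
    """Write each row to every link whose constraint holds; derive its columns."""
    outputs = {name: [] for name in links}
    if reject is not None:
        outputs[reject] = []
    for row in rows:
        written = False
        for name, (constraint, derivations) in links.items():
            # No constraint (None) means every row passes through the link.
            if constraint is None or constraint(row):
                outputs[name].append(
                    {col: derive(row) for col, derive in derivations.items()}
                )
                written = True
        if not written and reject is not None:
            outputs[reject].append(dict(row))
    return outputs

rows = [
    {"id": 1, "qty": 2, "price": 5.0},
    {"id": 2, "qty": 0, "price": 9.0},
]
links = {
    "orders": (
        lambda r: r["qty"] > 0,                         # constraint: whole row
        {
            "id": lambda r: r["id"],                    # straight column mapping
            "total": lambda r: r["qty"] * r["price"],   # computed derivation
            "source": lambda r: "web",                  # needs no input column
        },
    ),
}
out = transformer(rows, links, reject="rejects")
```

The "source" column shows an output column with a derivation but no corresponding input column.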
Transformer Menus
Operand Menu Operator Menu
Stage Variables, Constraints and Derivations are defined using the context-sensitive expression editor. This can be accessed by clicking on the ellipsis or by right-clicking.
The Transformer's Expression Editor Window
Context-sensitive menu: easy access to transforms
Extensive list of available transformation functions to select from:
Transformer Stage: Error Handling
Immediate notification when there’s a problem!
Transformer: Stage Variables
Stage Variables:
• Similar to program variables
• Scope is limited to the Transformer
• Use to simplify derivations and constraints
• Use to avoid duplicate coding
• Retain values across reads
• Use to accumulate values and compare current values with prior reads
Transformer: Execution Order
• Derivations in stage variables are executed first
• Constraints are executed before derivations
• Column derivations in earlier links are executed before later links
• Derivations in higher columns are executed before lower columns
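The execution order, together with the way stage variables retain values across reads, can be sketched as a Python loop. The function `run_transformer` and the column names are hypothetical; the sketch assumes a single output link and input already ordered by account.

```python
def run_transformer(rows):
    """Per row: stage variables first, then the constraint, then derivations."""
    prev_account = None   # stage variable: retains its value across reads
    running_total = 0.0   # stage variable: accumulates within an account
    output = []
    for row in rows:
        # 1. Stage variable derivations run first, for every incoming row.
        is_new_account = row["account"] != prev_account
        if is_new_account:
            running_total = row["amount"]
        else:
            running_total += row["amount"]
        prev_account = row["account"]
        # 2. The constraint is checked before any column derivation.
        if row["amount"] <= 0:
            continue
        # 3. Column derivations, top column first.
        output.append({
            "account": row["account"],
            "running_total": running_total,
            "first_in_group": is_new_account,
        })
    return output

rows = [
    {"account": "A", "amount": 10.0},
    {"account": "A", "amount": 5.0},
    {"account": "B", "amount": 0.0},   # fails the constraint
    {"account": "B", "amount": 7.0},
]
result = run_transformer(rows)
```

Because the stage variables are updated before the constraint is tested, the rejected row for account "B" still resets the running total, so the next "B" row is no longer flagged as first in its group.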
LABS
• Lab 10: The Transformer
Learn by doing!
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Operators
RDBMS Access: Supported Databases
Parallel Extender provides high performance / scalable interfaces for:
• DB2
• Oracle
• Informix
• Teradata
Users must be granted specific privileges, depending on RDBMS.
RDBMS Access: Supported Databases
• DB2, ORACLE, INFORMIX, and TERADATA supported
• Automatically convert RDBMS table layouts to/from Parallel Extender Table Definitions
• RDBMS nulls converted to/from nullable column values
• Support for standard SQL syntax for specifying:
– column list for SELECT statement
– filter for WHERE clause
– open command, close command
• Can write an explicit SQL query to access RDBMS
• Need to supply additional information in the SQL query
– A parallel RDBMS table is stored on multiple disks, connected to multiple CPUs, in a parallel system
– To optimize table access, only the CPU with a direct connection to the disk should read data from it, minimizing data movement
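As a rough illustration of how a column list and a WHERE filter combine into a query, here is a small Python helper. The function `build_select` is hypothetical and is not the SQL the stages actually generate; it does no quoting or validation and only shows the shape of the statement.

```python
def build_select(table, columns=None, where=None):
    """Assemble a SELECT from a table name, optional column list and filter."""
    column_list = ", ".join(columns) if columns else "*"
    sql = f"SELECT {column_list} FROM {table}"
    if where:
        sql += f" WHERE {where}"
    return sql

whole_table = build_select("customers")
filtered = build_select("customers", ["id", "state"], "state = 'MA'")
```

With only a table name the result is the "select everything" form; the column list and filter narrow what is read.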
Parallel Database Connectivity
[Figure: two diagrams labeled Client, Sort, Load, and Parallel RDBMS]
Traditional Client-Server:
• Parallel server runs only the RDBMS
• Each application has only one connection
• Suitable only for small data volumes
Parallel Extender:
• Parallel server runs APPLICATIONS
• Application has parallel connections to RDBMS
• Suitable for large data volumes
Oracle Database Stage: Extracts
• Supports parallel database table access
• Some queries cannot be run in parallel:
– Queries containing a "GROUP BY" clause which are not also hash partitioned by the same column
– Queries performing a non-collocated join
• Oracle stage options:
– table Table_Name: Table_Name specifies the name of the ORACLE table. The table must exist and you must have select privileges on the table.
– query Query: Query specifies an SQL query to read a table. The query specifies the table and any processing that you want to perform on the table as it is read into Parallel Extender. This statement can contain joins, views, database links, synonyms, and so on.
Oracle Database Stage: Extracts
• Oracle stage options (cont’d):
– DB Options { user = username, password = password } (Required): specifies a username and password for connecting to ORACLE. These options are required by the Oracle stage.
Oracle Database Stage: Loads
• Load data into Oracle in parallel
• Oracle stage options:
– Write Method: Load or Upsert
– Write Mode: specifies the write mode of the stage.
• append (default): New rows are appended to the table.
• create: Create a new table. Parallel Extender reports an error if the ORACLE table already exists. You must specify this mode or the replace mode if the ORACLE table does not exist.
• truncate: The existing table attributes (including schema) and the ORACLE partitioning keys are retained, but any existing rows are discarded. New rows are then appended to the table.
• replace: The existing table is first dropped and an entirely new table is created in its place. ORACLE uses the default partitioning method for the new table.
– Table Table_Name: Table_Name specifies the name of the ORACLE table to write to.
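The four write modes can be emulated against an in-memory SQLite database to make their differences concrete. This is only an analogy: SQLite stands in for the RDBMS, the function `write_rows` and the fixed two-column schema are hypothetical, and nothing here reflects how the stage talks to ORACLE.

```python
import sqlite3

MODES = {"append", "create", "truncate", "replace"}

def table_exists(conn, table):
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    return row is not None

def write_rows(conn, table, rows, mode="append"):
    """Emulate the write modes for a hypothetical (id, name) table."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    exists = table_exists(conn, table)
    if mode in ("append", "truncate") and not exists:
        raise RuntimeError("table does not exist; use create or replace")
    if mode == "create" and exists:
        raise RuntimeError("table already exists")
    if mode == "replace" and exists:
        conn.execute(f'DROP TABLE "{table}"')      # drop, then rebuild below
        exists = False
    if mode == "truncate":
        conn.execute(f'DELETE FROM "{table}"')     # keep schema, discard rows
    if not exists:
        conn.execute(f'CREATE TABLE "{table}" (id INTEGER, name TEXT)')
    conn.executemany(f'INSERT INTO "{table}" VALUES (?, ?)', rows)

def row_count(conn, table):
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]

conn = sqlite3.connect(":memory:")
write_rows(conn, "t", [(1, "a")], mode="create")
write_rows(conn, "t", [(2, "b")])                  # append is the default
after_append = row_count(conn, "t")
write_rows(conn, "t", [(3, "c")], mode="truncate")
after_truncate = row_count(conn, "t")
write_rows(conn, "t", [(4, "d"), (5, "e")], mode="replace")
after_replace = row_count(conn, "t")
```

In this sketch truncate and replace look alike from the row count alone; the difference described above is that truncate keeps the existing table definition while replace builds a new one.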
Oracle Stage Editor: Loads
• Oracle stage options (cont’d):
– Truncate Column Names: configures the stage to truncate Parallel Extender column names to 30 characters if they are longer.
– DB Options { user = username, password = password }
DB2 and Viper
A good match by design
DB2 EEE and DataStage Parallel Extender
DB2’s Parallel Strengths
• Intelligent data distribution in multiple partitions
• Openness: table layouts described in system catalog
• Flexible table spaces
• Co-located join plans
• SP: Shared nothing architecture
DB2’s Parallel Strength is Wasted Unless:
– Applications upstream and downstream of DB2 run in parallel to produce and consume as many streams as DB2 consumes and produces
– Records stream in parallel between DB2 and applications, and between applications, to build end-to-end data flows
[Figure: chain of applications ... APP (N), APP (N+1) ...]
Leveraging DB2’s Parallel Strengths To Eliminate Switch Traffic and Attain Scalability
DB2 strength → DS/PX response
• Intelligent data distribution in multiple partitions → Match UDB hash semantics; make no assumption about data statistical distribution
• Openness: table layouts described in system catalog → Read metadata from the UDB catalog and exploit local keyword access
• Flexible table spaces → Use node groups attached to table spaces for automatic reader/writer placement
• Co-located join plans → Respect locality of joins
• SP: Shared nothing → Favor local disks
DB2 Stage Overview
• db2read:
– Reads data in parallel from a DB2 table and automatically places it into a Framework dataset, along with the table definition.
• db2write:
– Writes data in parallel from a Framework dataset into a DB2 table. The operator automatically converts the DS-PX types of the columns to corresponding DB2 types.
• db2load:
– The only difference between this option and db2write is that the db2load option takes advantage of the fast DB2 loader technology when writing data to the database.
DB2 Database Operators: db2read
• db2read Options:
– server
• DB2 instance name (if different from default in user profile)
– dbname
• Database name (if different from default in user profile)
– table
• TableName: this translates to “Select * from TableName”
– query
• SQL query to read a table. This statement can contain joins, views, database links, synonyms, and so on. Specifying a partition name forces execution of the query in parallel on the processing nodes that have a partition of the table. If you do not specify a partition, the stage executes the query sequentially on a single processing node.
DB2 Database Operators: db2write/db2load
• db2write/db2load Options:
– dbname
• Database name (if not specified, uses default from user profile)
– dboptions
• By default, DS-PX creates the table on all processing nodes in the default DB2 node group (DB2 PE) or table space (UDB), and uses the first column as the partitioning key.
• nodegroup = group defines the DB2 PE node group used to store the table.
• tablespace = t_space defines the DB2 UDB table space used to store the table.
• key = field0, ... key = fieldN specifies a partitioning key for the table.
– table
• TableName: name of table to write to.
– mode
• Determines the write mode of the operation (more on next page)
– server
• Name of a DB2 server (if different from default).
DB2 Database Operators: db2write/db2load (continued)
• db2write/db2load Modes:
– append
• appends new records to the table; the database user must have TABLE CREATE privileges. This mode is the default.
– create
• creates a new table; the database user must have TABLE CREATE privileges. DS-PX reports an error if the DB2 table already exists. You must specify this mode if the DB2 table does not exist.
– replace
• drops the existing table and creates a new one in its place; the database user must have TABLE CREATE privileges.
– truncate
• retains the table attributes (including the schema) but discards existing records and appends new ones; the database user must have TABLE DELETE privileges.
LABS
• Lab 11: Extract & Load RDBMS Tables
Learn by doing!
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Stages
Creating custom stages
From Manager (or Designer), Repository pane:
Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}
Overview
• Custom: not covered in this course
• Build: "Build" stages from within Parallel Extender (next section)
• Wrapped: “Wrapping” existing “Unix” executables (this section)
Building “Wrapped” Stages
• In a nutshell:
You can “wrap” a legacy executable:
• binary,
• Unix command,
• shell script
… and turn it into a bona fide Parallel Extender stage capable, among other things, of parallel execution,
… as long as the legacy executable is
• amenable to data-partition parallelism
» no dependencies between rows
• pipe-safe
» can read rows sequentially
» no random access to data, e.g., use of fseek()
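Why "no dependencies between rows" matters can be shown with a small Python experiment. The helper names below are hypothetical, and the round-robin split merely stands in for whatever partitioning the job uses. A row-independent function gives the same rows whether it runs on the whole input or on each partition separately; a function that depends on earlier rows does not.

```python
def partition(rows, n):
    """Round-robin the rows into n partitions."""
    return [rows[i::n] for i in range(n)]

def to_upper(lines):
    """No dependency between rows: each output depends only on its own input."""
    return [line.upper() for line in lines]

def running_count(lines):
    """Depends on earlier rows: prefixes each line with its position so far."""
    return [f"{i}:{line}" for i, line in enumerate(lines, start=1)]

def run_partitioned(func, rows, n):
    results = []
    for part in partition(rows, n):   # each partition would run independently
        results.extend(func(part))
    return results

rows = ["a", "b", "c", "d"]
safe = sorted(run_partitioned(to_upper, rows, 2)) == sorted(to_upper(rows))
unsafe = sorted(run_partitioned(running_count, rows, 2)) == sorted(running_count(rows))
```

Here `safe` comes out true and `unsafe` false: the counter restarts in every partition, so the partitioned run no longer matches the sequential one.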
Wrap and Generate
Fill in at least these two pages and click on
Any executable in your PATH
Avoid conflict with existing Stage or executable names
… and your new stage will appear in the Repository!
The "Creator" Page
Tip: Conscientiously maintaining the Creator page for all your wrapped stages will eventually bring you fame and fortune.
The "Properties" Page
Expresses the executable's Unix switches
Switch name, without the minus sign
Type of switch argument
Used by Director before each run
The "Wrapped" Page
Specifies Interfaces and Environment
Exit codes ($?) and other variables
Framework port #
Already built Interface Table Definition
Other options available
Some Background often Helps
Stage Interface Table Definitions
• Define link columns referenced by operator
• Similar to the signature of a C function
4 Table Definitions in this picture:
• input link
• input stage interface
• output stage interface
• output link
General Background: Interface schemas
• Layout interfaces describe what columns the stage:
– needs for its inputs
– creates for its outputs
• Two kinds of interfaces: dynamic and static
• Dynamic: adjusts to its inputs automatically (strongly typed parametric polymorphism)
– Ascential-supplied operators are dynamic
• Static: expects input to contain columns with specific names and types
– Custom stages
How does the Wrapping Work?
– Define the schema for export and import
• Schemas become interface schemas of the operator and allow for by-name column access
– Define multiple inputs/outputs required by UNIX executable (see next slide)
[Figure: input schema → export → stdin or named pipe → UNIX executable → stdout or named pipe → import → output schema]
• QUIZ: Why does export precede import?
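The export, pipe, import sequence can be imitated in Python with `subprocess`. This is a sketch under stated assumptions: a one-line Python script stands in for the legacy Unix executable so the example runs anywhere, and the helper names and the comma-separated text format are hypothetical rather than what the wrapper generates.

```python
import subprocess
import sys

# Stand-in for a legacy executable: reads lines on stdin and writes them
# upper-cased on stdout. Any pipe-safe program could take its place.
LEGACY_COMMAND = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin: sys.stdout.write(line.upper())",
]

def export_rows(rows, columns):
    """Export: turn rows (dicts) into the text the executable reads."""
    return "".join(
        ",".join(str(row[c]) for c in columns) + "\n" for row in rows
    )

def import_rows(text, columns):
    """Import: parse the executable's output text back into rows."""
    return [dict(zip(columns, line.split(","))) for line in text.splitlines()]

def run_wrapped(rows, in_columns, out_columns):
    exported = export_rows(rows, in_columns)          # rows -> text on stdin
    completed = subprocess.run(
        LEGACY_COMMAND, input=exported, capture_output=True, text=True, check=True
    )
    return import_rows(completed.stdout, out_columns)  # stdout text -> rows

result = run_wrapped(
    [{"id": 1, "name": "ann"}, {"id": 2, "name": "bob"}],
    ["id", "name"],
    ["id", "name"],
)
```

The column lists play the role of the input and output schemas: they give the rows their by-name column access on either side of the executable.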
Multiple In and Out Links
Now we are ready to complete the "Wrapped" Page:
LABS
• Lab 12: Wrapped Unix Executables
Learn by doing!
Agenda
1. Introduction to Parallel Extender
2. Concepts in Parallel Computing
3. Partitioning and Collecting Data
4. Importing/Exporting Data
5. Overview of Some Parallel Extender Stages
6. Using RDBMS with Parallel Extender
7. Wrapping Unix Executables
8. Building Native Operators
Creating custom stages
From Manager (or Designer), Repository pane:
Right-click on Stage Type > New Parallel Stage > {Custom | Build | Wrapped}
Overview
• Custom: not covered in this course
• Build: "Build" stages from within Parallel Extender (this section)
• Wrapped: “Wrapping” existing “Unix” executables (last section)
Building Operators
Summary: the Wizard
• In a nutshell:
– The user performs the fun, glamorous tasks: encapsulate business logic and arithmetic in a custom operator
– A Parallel Extender wizard called “buildop” automatically performs the unglamorous, tedious, error-prone tasks: invoke needed header files, build the necessary “plumbing” for a correct and efficient parallel execution.
General Page
Identical to Wrapped's, except:
Under the Build tab, your program!
Logic Tab for Business Logic
Enter business C++ logic and arithmetic in four pages under the Logic tab
Main code section goes in the Per-Record page; it will be applied to all rows
Code Sections under Logic Tab
Temporary variables declared [and initialized] here
Logic here is executed once BEFORE processing the FIRST row
Logic here is executed once AFTER processing the LAST row
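The way the four code sections relate to the row loop can be sketched in Python. In a real Build stage each section holds C++ entered on its own page; the function `run_build_stage` and the column names here are hypothetical and only show when each section runs.

```python
def run_build_stage(rows):
    # Definitions: temporary variables are declared here.
    row_count = 0
    grand_total = 0
    log = []
    output = []

    # Pre-Loop: executed once BEFORE processing the FIRST row.
    log.append("start")

    # Per-Record: the main code, applied to every row.
    for row in rows:
        row_count += 1
        row_sum = row["a"] + row["b"]
        grand_total += row_sum
        output.append({"sum": row_sum})

    # Post-Loop: executed once AFTER processing the LAST row.
    log.append(f"processed {row_count} rows, grand total {grand_total}")
    return output, log

output, log = run_build_stage([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
```

The loop itself is part of the generated plumbing; the author only supplies the code that goes inside and around it.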
I/O and Transfer
Under Interface tab: Input, Output & Transfer pages
Optional renaming of output port from default "out0"
Write row
Input page: 'Auto Read'
Read next row
In-Repository Table Definition
'False' setting, not to interfere with Transfer page
First line: output 0
I/O and Transfer
• Transfer from input in0 to output out0.
• If page left blank or Auto Transfer = "False" (and RCP = "False"), only columns in output Table Definition are written
First line: Transfer of index 0
Building Operators: Native Operators
• Example - sumNoTransfer
– Adds input columns "a" and "b"; ignores other columns that might be present in input
– Produces a new "sum" column
– Does not transfer input columns
[Figure: a:int32; b:int32 → sumNoTransfer → sum:int32]
No Transfer
• Causes:
- RCP set to "False" in stage definition
and
- Transfer page left blank, or Auto Transfer = "False"
• Effects:
- input columns "a" and "b" are not transferred
- only new column "sum" is transferred
Compare with transfer ON…
[Screenshot: output from Peek]
Transfer
• Causes:
- RCP set to "True" in stage definition
or
- Auto Transfer set to "True"
• Effects:
- new column "sum" is transferred, as well as
- input columns "a" and "b" and
- input column "ignored" (present in input, but not mentioned in stage)
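The effect of the transfer setting on the sum example can be modeled in a few lines of Python. The function `sum_stage` and its `transfer` flag are hypothetical stand-ins for the stage and for the combined RCP / Auto Transfer settings; the sample row includes the extra "ignored" column mentioned above.

```python
def sum_stage(rows, transfer=False):
    """Add columns a and b; optionally carry all input columns to the output."""
    output = []
    for row in rows:
        # With transfer on, every input column is copied through, including
        # columns the stage never mentions.
        out_row = dict(row) if transfer else {}
        out_row["sum"] = row["a"] + row["b"]
        output.append(out_row)
    return output

rows = [{"a": 1, "b": 2, "ignored": "x"}]
no_transfer = sum_stage(rows)
with_transfer = sum_stage(rows, transfer=True)
```

Without transfer only "sum" reaches the output; with transfer the output row also carries "a", "b" and "ignored".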
Columns vs. Temporary C++ Variables
Columns:
• DS-PX type
• Defined in Table Definitions
• Value refreshed from row to row
Temp C++ variables:
• C/C++ type
• Need declaration (in Definitions or Pre-Loop page)
• Value persistent throughout "loop" over rows, unless modified in code
Adding a column with Row ID
Out Table;
QUIZ! Would index++ work?
YES!
LABS
• Lab 13: Build Stages
Learn by doing!