Ab Initio Training
Post on 30-Dec-2015
ABINITIO TRAINING
DAY ONE
• Introduction to Data Warehouse
• ETL
• Ab Initio
• Ab Initio Features
• Architecture
• GDE
• Co>Operating System
• EME
• Setting up the Environment
• Dataset Types and Components
• Data Types and DML
• Input File, Output File, Intermediate File and Lookup File
• Filter by Expression, Replicate, Reformat and Redefine
Introduction to Data Warehouse
A Data Warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
ETL
• Reading the source data.
• Applying business, transformation, and technical rules.
• Loading the data.
Ab Initio
Ab Initio is Latin for "from the beginning".
Ab Initio software is a general-purpose data processing platform for mission-critical applications such as:
• Data warehousing
• Batch processing
• Click-stream analysis
• Data movement
• Data transformation
Ab Initio Features
• Transformation of disparate sources.
• Aggregation and other processing.
• Referential integrity checking.
• Database loading.
• Extraction for external processing.
• Aggregation and loading of data marts.
• Processing of virtually any form and volume of data.
• Parallel sort/merge processing.
• Data transformation.
• Re-hosting of corporate data.
• Parallel execution of existing applications.
Architecture
From top to bottom:
• User Application
• Development Environment: GDE, Shell
• Component Library, user-defined components, 3rd-party components, EME
• Ab Initio Co>Operating System
• Native Operating System
GDE (Graphical Development Environment)
Co>Operating System
• Parallel and distributed application execution.
• Control.
• Data transport.
• Transactional semantics at the application level.
• Checkpointing.
• Monitoring and debugging.
• Parallel file management.
• Metadata-driven components.
Co>Operating System
The Ab Initio Co>Operating System runs on:
• Sun Solaris
• IBM AIX
• Hewlett-Packard HP-UX
• Siemens Pyramid Reliant Unix
• IBM DYNIX/ptx
• Silicon Graphics IRIX
• Red Hat Linux
• Windows NT 4.0 (x86)
• Windows 2000 (x86)
• Compaq Tru64 UNIX
• IBM OS/390
• NCR MP-RAS
EME
• Repository for version control.
• Used for documentation.
Setting up the Environment
Dataset Types and Components
• Dataset components
• Flow components
• Transform components
• Partitioning components
Data Types and DML
Types:
• Base: void, number (integer, decimal, real), string, date, datetime
• Compound: vector, record, union
DML
• Used to define the complete record structure.
• Can be defined either in grid mode or in text mode.
• Can be stored in a file that can be referenced multiple times, or can be embedded.
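As a rough illustration (the field names are hypothetical, and delimited fields are only one of several layout options), a record structure stored in a .dml file might look like:

```
record
    decimal(",") id;     /* comma-delimited numeric field */
    string(",")  name;   /* comma-delimited text field */
    string("\n") city;   /* last field, terminated by newline */
end;
```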
Input File, Output File, Intermediate File and Lookup File
Input File:
• Reads data records from a serial file or multifile in the file system.
Output File:
• Writes data records to a serial file or multifile in the file system.
Intermediate File:
• Writes data records to a file in the middle of the graph.
• Helps in debugging and in further processing of the intermediate file.
Lookup File:
• Represents one or more serial files, or a multifile, of data records small enough to be held in main memory, letting a transform function retrieve records much more quickly than it could if they were stored on disk.
• A lookup file is not connected to other components in the graph.
Filter by Expression, Replicate, Reformat and Redefine
Filter by Expression:
• Enables the user to track down a particular record or records, or to put together a sample of records to assist with analysis.
• Filters the data based on an expression that identifies only the records you need.
• Can also be used for data validation.
Replicate:
• Used when the user wants to make multiple copies of a flow for separate processing.
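The behavior of these two components can be sketched in plain Python (the record fields here are invented for illustration): Filter by Expression keeps only the records a predicate selects, and Replicate gives each downstream flow a full copy.

```python
records = [
    {"id": 1, "amount": 250},
    {"id": 2, "amount": -10},   # invalid: negative amount
    {"id": 3, "amount": 900},
]

# Filter by Expression: keep only records the expression selects
selected = [r for r in records if r["amount"] > 0]

# Replicate: every downstream flow gets a full copy of the records
flow_a = list(selected)
flow_b = list(selected)

print([r["id"] for r in selected])  # [1, 3]
```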
Filter by Expression, Replicate, Reformat and Redefine
Reformat:
• Changes the record format of data records by dropping fields, or by using DML expressions to add fields, combine fields, or transform the data in the records.
• Manipulates one record at a time and does work like validation and cleansing, e.g. deleting bad values, setting default values, standardizing field formats, or rejecting records with invalid dates.
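A per-record reformat of this kind can be sketched in Python (field names and cleansing rules are made up for illustration, not taken from any real graph):

```python
def reformat(rec):
    """Per-record cleansing: standardize a format, default a bad value."""
    out = {}
    out["id"] = rec["id"]
    out["name"] = rec["name"].strip().upper()          # standardize field format
    out["qty"] = rec["qty"] if rec["qty"] >= 0 else 0  # set default for bad value
    return out

rows = [{"id": 1, "name": " alice ", "qty": -5}]
print([reformat(r) for r in rows])
# [{'id': 1, 'name': 'ALICE', 'qty': 0}]
```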
Filter by Expression, Replicate, Reformat and Redefine
• Transformation rules are defined for transform(0).
• The Reformat component is used to "clean" input data so that all of the records conform to the same convention.
Redefine:
• Copies data records from its input to its output without changing the values in the data records.
• Used to change or rename fields in a record format without changing the values in the records.
DAY TWO
• Sort, Sort within Group, Dedup Sort
• Rollup and Scan
• Reject, Error Handling and Debugging
Sort, Sort within Group, Dedup Sort
Sort:
• Used to sort a group of records in a specific order with a key.
• Looks at all the records in the flow before it produces the final output.
Sort, Sort within Group, Dedup Sort
Sort within Group:
• Refines the order of a sorted dataset by further sorting according to an order specified by a minor-key parameter within the order specified by a major-key parameter.
• Imposes an order on those records according to the minor key.
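The effect can be sketched in Python (keys invented for illustration): the input is already ordered by the major key, and each major-key group is then ordered by the minor key.

```python
from itertools import groupby

# Input already sorted by the major key (region)
rows = [
    ("east", 30), ("east", 10),
    ("west", 20), ("west", 5),
]

# Sort within Group: order records by the minor key inside each major-key group
result = []
for _, group in groupby(rows, key=lambda r: r[0]):
    result.extend(sorted(group, key=lambda r: r[1]))

print(result)
# [('east', 10), ('east', 30), ('west', 5), ('west', 20)]
```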
Sort, Sort within Group, Dedup Sort
Dedup Sort:
• Used to remove duplicate records (a group of records that share the same key), keeping a single record.
What it does:
• First sorts the data.
• Sets the key for grouping in the Dedup component.
• Finally chooses which duplicate to keep.
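The steps above can be sketched in Python for the keep-first case (data and keys invented for illustration):

```python
from itertools import groupby

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("b", 5)]  # already sorted by key

# Dedup Sort, keeping the first record of each key group
deduped = [next(group) for _, group in groupby(rows, key=lambda r: r[0])]

print(deduped)  # [('a', 1), ('b', 3)]
```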
Rollup and Scan
Rollup:
• Produces a single record from a group of records identified by a common key (or keys).
• Useful for summarizing groups of records, e.g. totals, averages, max, min.
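A rollup that totals each key group can be sketched in Python (data invented for illustration):

```python
from itertools import groupby

sales = [("east", 100), ("east", 250), ("west", 75)]  # sorted by key

# Rollup: one summary record per key group (here, a total)
totals = [(key, sum(amount for _, amount in group))
          for key, group in groupby(sales, key=lambda r: r[0])]

print(totals)  # [('east', 350), ('west', 75)]
```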
Rollup and Scan
Scan:
• Generates a series of cumulative summary records (such as successive year-to-date totals) for groups of data records.
• Produces intermediate summary records.
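The contrast with Rollup can be sketched in Python (same invented data as above): where Rollup emits one record per group, Scan emits a running total for every input record.

```python
from itertools import groupby

sales = [("east", 100), ("east", 250), ("west", 75)]  # sorted by key

# Scan: emit a cumulative summary record for every input record
out = []
for key, group in groupby(sales, key=lambda r: r[0]):
    running = 0
    for _, amount in group:
        running += amount
        out.append((key, running))

print(out)  # [('east', 100), ('east', 350), ('west', 75)]
```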
Reject, Error Handling and Debugging
• Invalid data goes to the reject port.
• Set the reject-threshold parameter inside the component.
• The GDE has a built-in debugging capability.
• Add a watcher file.
DAY THREE
• Join
• Multifiles
• Parallelism
• Partition and Departition
• Layout, Fan-in, Fan-out and All-to-All
Join
Join:
• Used to combine data from two or more flows of records based on a matching key (or keys).
Join deals with two activities:
1. Transforming data sources with different record formats.
2. Combining data sources with the same record format.
Join
Join types:
• Inner Join
• Full Outer Join
• Explicit Join
Inner Join:
• Uses only records with matching keys on both inputs.
Full Outer Join:
• Uses all records from both inputs. If a record from one input does not have a matching record in the other input, a NULL record is used for the missing record.
Join
Explicit Join:
• Uses all records in one specified input (based upon True/False), but records with matching keys in the other inputs are optional. Again, a NULL record is used for the missing records.
Multifiles
• Essentially the "global view" of a set of ordinary files, each of which may be located anywhere the Ab Initio Co>Operating System is installed.
• Each partition of a multifile is an ordinary file.
• Resides in multidirectories.
• Identified using URL syntax with "mfile:" as the protocol part.
• One control file.
Parallelism
• Processing of datasets in parallel for better performance.
Types of parallelism:
1. Component
2. Pipeline
3. Data
Component parallelism:
• When more than one component is running at the same time on different data streams.
• Comes "for free" with graph programming.
Limitation: scales to the number of "branches" in a graph.
Parallelism
Pipeline parallelism:
• When two or more connected components process data one record at a time.
Limitations:
• Scales to the length of "branches" in a graph.
• Some operations, like sorting, do not pipeline.
Data parallelism:
• Occurs when multiple copies of a process act on different sets of data at the same time.
• Processes the whole dataset more quickly by using multiple CPUs at the same time.
Partition and Departition
Partition:
• Used to divide datasets into multiple sets for further processing.
Types:
• The Partition by Expression component partitions data by dividing it according to a DML expression.
• The Partition by Key component partitions data by grouping it by a key, like dealing cards into piles according to their suit.
Partition and Departition
• The Partition with Load Balance component partitions data by dynamic load balancing: more data goes to CPUs that are less busy and vice versa, maximizing throughput.
• The Partition by Percentage component partitions data by distributing it so that the output is proportional to fractions of 100.
• The Partition by Range component partitions data by dividing it evenly among nodes, based on a key and a set of partitioning ranges.
• The Partition by Round-robin component partitions data by distributing it evenly, in block-size chunks, across the output partitions, like dealing cards.
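Two of these strategies can be sketched in Python (a toy model, not the real components): partition by key sends records with the same key to the same partition, while round-robin deals records out evenly.

```python
def partition_by_key(records, key, n):
    """Records with the same key always land in the same partition."""
    parts = [[] for _ in range(n)]
    for rec in records:
        parts[hash(key(rec)) % n].append(rec)
    return parts

def partition_round_robin(records, n):
    """Records are dealt evenly across partitions, like cards."""
    parts = [[] for _ in range(n)]
    for i, rec in enumerate(records):
        parts[i % n].append(rec)
    return parts

print(partition_round_robin([1, 2, 3, 4, 5], 2))  # [[1, 3, 5], [2, 4]]
```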
Partition and Departition
Departition:
• Reads data from multiple flows or operations; used to recombine data records from different flows.
• The opposite of Partition.
Types:
• The Concatenate component produces a single output flow that contains first all the records from the first input partition, then all the records from the second input partition, and so on.
• The Gather component collects inputs from multiple partitions in an arbitrary manner and produces a single output flow. It does not maintain sort order, but it is the most efficient departitioner.
Partition and Departition
• The Interleave component collects records from many sources in round-robin fashion. The effect is like taking a card from each player in turn, forming a deck of cards.
• The Merge component collects inputs from multiple sorted partitions and maintains the sort order.
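The difference between Merge and Interleave can be sketched in Python (toy partitions, not real flows):

```python
import heapq

part_a = [1, 4, 7]   # sorted partition
part_b = [2, 3, 9]   # sorted partition

# Merge: recombines sorted partitions and maintains sort order
merged = list(heapq.merge(part_a, part_b))
print(merged)        # [1, 2, 3, 4, 7, 9]

# Interleave: takes one record from each partition in turn
interleaved = [x for pair in zip(part_a, part_b) for x in pair]
print(interleaved)   # [1, 2, 4, 3, 7, 9]
```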
Layout, Fan-in, Fan-out and All-to-All
Layout:
• Determines the location of a resource.
• Either serial or parallel.
Fan-in:
• When departition components collect data from different flows after a data partition, a special fan-in symbol appears on the flow.
Fan-out:
• When partition components divide a dataset into multiple sets for further processing, a special fan-out symbol appears on the flow.
All-to-All:
• When data is repartitioned, every source partition can send records to every destination partition, and an all-to-all symbol appears on the flow.
DAY FOUR
• DBC File, Input Table, Output Table, Join with DB
• Subgraph, Phasing, Checkpoint, Recovery
• Normalize, Denormalize Sorted
DBC File, Input Table, Output Table, Join with DB
DBC File:
• Required by Ab Initio when connecting to any database system.
• By default it comes with the extension .dbc.
DBC file fields:
• The dbms_version field is the version of your database.
• The db_home field is the location of your database software (ORACLE_HOME).
• The db_name field is the identifier for your database instance. For Oracle, this is the value of the ORACLE_SID environment variable. For SQL*Net, use @db_name.
• The db_nodes field is a list of database-accessible nodes with Ab Initio installed. Note: if Oracle is on an SMP machine, you usually use one host name; if you are running Oracle OPS (parallel), you may need a list of all the nodes the database runs on.
• The #user and #password comment fields list your name and password. If your database is Oracle and you are identified externally, leave these fields as comments.
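Assembled from the fields above, a minimal .dbc sketch for an Oracle connection might look roughly like this (every value is a placeholder, and the exact file syntax may differ between Co>Operating System versions):

```
dbms_version: 8.1
db_home: /opt/oracle/product/8.1
db_name: ORCL
db_nodes: dbnode1
#user: scott
#password: tiger
```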
DBC File, Input Table, Output Table, Join with DB
Input Table:
• Unloads data records from a database into an Ab Initio graph.
• Allows you to specify as the source either a database table, or an SQL statement that selects data records from one or more tables.
DBC File, Input Table, Output Table, Join with DB
Output Table:
• Loads data records from a graph into a database.
• Specify the destination either directly as a single database table, or through an SQL statement that inserts records into one or more tables.
• By default, calls the database fast loader to perform the output operation(s).
DBC File, Input Table, Output Table, Join with DB
Join with DB:
• Joins records from the flow or flows connected to its input port with records read directly from a database, and outputs new records containing data based on, or calculated from, the joined records.
Subgraph, Phasing, Checkpoint, Recovery
Subgraph:
• A logical subset of a graph.
• Used for manageability.
Phasing:
• Breaking an application into separate processing units.
• Breaking an application into phases limits the contention for:
  • main memory
  • processor(s)
• Breaking an application into phases costs:
  • disk space
Subgraph, Phasing, Checkpoint, Recovery
Checkpoint:
• Any phase break can be a checkpoint.
Recovery:
• On failure, the graph can be restarted from the most recent completed checkpoint instead of from the beginning.
Normalize, Denormalize Sorted
Normalize:
• Generates multiple data records from each input record; you can specify the number of output records, or the number of output records can depend on a field or fields in each input record.
• Separates a data record with a vector field into several individual records, each containing one element of the vector.
• Generates a series of output records from each input record by calling a transform function repeatedly.
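The vector-to-records case can be sketched in Python (field names invented for illustration):

```python
# Each input record carries a vector field; Normalize emits one
# output record per vector element.
records = [
    {"order": 1, "items": ["pen", "ink"]},
    {"order": 2, "items": ["pad"]},
]

normalized = [{"order": r["order"], "item": item}
              for r in records
              for item in r["items"]]

print(normalized)
# [{'order': 1, 'item': 'pen'}, {'order': 1, 'item': 'ink'},
#  {'order': 2, 'item': 'pad'}]
```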
Normalize, Denormalize Sorted
Denormalize Sorted:
• Consolidates groups of related data records into a single output record with a vector field for each group.
• Optionally computes summary fields in the output record for each group.
• Denormalize Sorted requires grouped input.
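The inverse of the Normalize sketch above can also be written in Python (field names invented for illustration): grouped records are folded back into one record per group, with a vector field and a summary field.

```python
from itertools import groupby

# Grouped (sorted-by-key) input, as Denormalize Sorted requires
rows = [(1, "pen"), (1, "ink"), (2, "pad")]

denormalized = []
for order, group in groupby(rows, key=lambda r: r[0]):
    items = [item for _, item in group]
    # one output record per group, with a vector field and a summary field
    denormalized.append({"order": order, "items": items, "count": len(items)})

print(denormalized)
# [{'order': 1, 'items': ['pen', 'ink'], 'count': 2},
#  {'order': 2, 'items': ['pad'], 'count': 1}]
```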
DAY FIVE
• Memory Management
• Deadlock
• Sandbox Setting, Graph and Project Parameters
• User-defined Functions and Built-in Functions
Memory Management
• Memory is required for Sort, Rollup and Join.
• "Input must be sorted" vs. in-memory sort.
• AI_GRAPH_MAX_CORE_SETTING
Deadlock
How to avoid deadlock:
• Use Concatenate and Merge with care.
• Use flow buffering (the GDE default for a new graph). [*Automatic flow buffering "is enabled"]
• Insert a phase break before the departitioner.
• Don't serialize data unnecessarily; repartition instead of departitioning.
Sandbox Setting, Graph and Project Parameters
Sandbox Setting:
• The workspace is called a "sandbox".
• Setting up a standard working environment helps a development team, or other teams, work together.
• Allows an application to be designed to be portable.
Sandbox Setting, Graph and Project Parameters
Default sandbox directories:
• $AI_RUN: run directory
• $AI_DML: record format files
• $AI_XFR: transform files
• $AI_MP: graphs
• $AI_DB: database config files
Sandbox Setting, Graph and Project Parameters
• A parameter is simply a name-value pair with a number of additional attributes.
• Parameters that reside in your sandbox are known as sandbox parameters; they set the context of your sandbox. Those that reside in the repository are called project parameters.
• Graph parameters apply only to the graph in which they are defined.
User-defined Functions and Built-in Functions
User-defined functions:
• Work like Ab Initio built-in functions.
• Globally usable across applications.
• Like built-in functions, stored as .xfr files.
Built-in functions include:
• next_in_sequence()
• is_blank()
• is_defined()