question&ans abi

Upload: lingaiah-naidu

Post on 07-Apr-2018


  • 8/6/2019 Question&Ans Abi

    1/12

    Question / Answer
    ==========================================================

    Phases vs Checkpoints

    Phases - are used to break the graph into pieces. Temporary files created during a phase are deleted after it completes. Phases let you manage the resource-consuming (memory, CPU, disk) parts of the application separately.

    Checkpoints - created for recovery purposes. These are points where everything is written to disk. You can recover to the latest saved point and rerun from it.

    You can have phase breaks with or without checkpoints.

    xfr

    A new sandbox will have many directories: mp, dml, xfr, db, ... . xfr is a directory where you put files with the extension .xfr containing your own custom functions (and then use: include "somepath/xfr/yourfile.xfr"). Usually xfr stores mappings.

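    three types of parallelism

    1) Data parallelism - partitioning of data into parallel streams for parallel processing.

    2) Component parallelism - components execute simultaneously on different branches of the graph.

    3) Pipeline parallelism - sequential components on the same branch run at the same time, each working on a different record of the stream.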

    MFS

    Multi-File System

    m_mkfs - create a multifile system (m_mkfs ctrlfile mpfile1 ... mpfileN)

    m_ls - list all the multifiles

    m_rm - remove the multifile

    m_cp - copy a multifile

    m_mkdir - add more directories to an existing directory structure

    Memory requirements of a graph

    Each partition of a component uses: ~ 8 MB + max-core (if any).

    Add the size of lookup files used in the phase (if multiple components use the same lookup, count it only once).

    Multiply by the degree of parallelism.

    Add up all components in a phase; that is how much memory is used in that phase.

    Select the largest-memory phase in the graph; that is the memory requirement of the graph.
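The estimate above can be sketched in a few lines of Python. This is only one reading of the rule: it assumes every component in a phase runs at the same degree of parallelism and that lookup sizes are added before multiplying. The function names and the sample numbers are hypothetical.

```python
BASE_MB = 8  # approximate per-partition overhead quoted in the text


def phase_memory_mb(components, lookups_mb, parallelism):
    """Estimate the memory of one phase in MB.

    components  -- list of max-core sizes in MB, one per component (0 if none)
    lookups_mb  -- dict of lookup name -> size in MB (dict keys count each lookup once)
    parallelism -- degree of parallelism, assumed equal for the whole phase
    """
    per_partition = sum(BASE_MB + max_core for max_core in components)
    per_partition += sum(lookups_mb.values())
    return per_partition * parallelism


def graph_memory_mb(phases):
    """The graph needs as much memory as its most demanding phase."""
    return max(phase_memory_mb(**phase) for phase in phases)


# Hypothetical two-phase graph.
phases = [
    {"components": [100, 0, 0], "lookups_mb": {"customers": 50}, "parallelism": 4},
    {"components": [0, 64], "lookups_mb": {}, "parallelism": 4},
]
print(graph_memory_mb(phases))
```

Only one phase runs at a time, which is why the function takes the maximum over phases rather than the sum.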

    How to calculate a SUM

    SCAN

    ROLLUP

    SCAN WITH ROLLUP

    Scan followed by Dedup sort, selecting the last record


    dedup sort with null key

    If we don't use any key in the sort component while using the dedup sort, all records fall into a single group, and the output depends on the keep parameter.

    first - only the first record

    last - only the last record

    unique_only - there will be no records in the output file (unless the input has exactly one record).
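A rough Python model of the keep parameter may make the null-key case easier to see. It is not Ab Initio code; the function name, the keep values as strings, and the sample data are assumptions made for illustration.

```python
from itertools import groupby


def dedup_sorted(records, key, keep):
    """Model of deduplication over records already grouped by `key`."""
    if keep not in ("first", "last", "unique_only"):
        raise ValueError(f"unknown keep value: {keep}")
    out = []
    for _, group in groupby(records, key=key):
        group = list(group)
        if keep == "first":
            out.append(group[0])
        elif keep == "last":
            out.append(group[-1])
        elif len(group) == 1:  # unique_only keeps groups of exactly one record
            out.append(group[0])
    return out


def null_key(record):
    """A null key: every record gets the same (empty) key, so there is one group."""
    return ()


records = [10, 20, 30]
print(dedup_sorted(records, null_key, "first"))
print(dedup_sorted(records, null_key, "last"))
print(dedup_sorted(records, null_key, "unique_only"))
```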

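    join on partitioned flow

    file1 (A,B,C), file2 (A,B,D). We partition both files by "A", and then join by "A,B". Is it OK? Or should we partition by "A,B"? Partitioning by "A" is sufficient: two records that match on "A,B" also match on "A", so they always land in the same partition.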
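One way to check the question above is a small Python model (not Ab Initio code). It partitions both inputs by A only, joins each partition pair on (A, B), and compares the result with a serial join. The partition function, the hash choice, and the sample records are hypothetical.

```python
import zlib


def partition_by_key(records, key_fields, n_partitions):
    """Route each record to a partition using a stable hash of its key fields."""
    partitions = [[] for _ in range(n_partitions)]
    for rec in records:
        key = "|".join(str(rec[f]) for f in key_fields)
        partitions[zlib.crc32(key.encode()) % n_partitions].append(rec)
    return partitions


def join(left, right, key_fields):
    """Naive inner join on key_fields."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[f] == r[f] for f in key_fields)
    ]


def canonical(rows):
    """Order-independent view of the join result."""
    return sorted((r["A"], r["B"], r["C"], r["D"]) for r in rows)


file1 = [{"A": a, "B": b, "C": a * 10 + b} for a in range(4) for b in range(3)]
file2 = [{"A": a, "B": b, "D": a * 100 + b} for a in range(4) for b in range(2)]

# Partition both inputs by A only, then join each partition pair on (A, B).
parts1 = partition_by_key(file1, ["A"], 3)
parts2 = partition_by_key(file2, ["A"], 3)
partitioned = [row for p1, p2 in zip(parts1, parts2) for row in join(p1, p2, ["A", "B"])]

serial = join(file1, file2, ["A", "B"])
print(canonical(partitioned) == canonical(serial))
```

The two results agree because records with equal (A, B) necessarily have equal A and therefore hash to the same partition.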

    checkin, checkout

    You can do checkin/checkout using the wizard right from the GDE, using versions and tags.

    how to have different passwords for QA and production

    Parameterize the .dbc file, or use an environment variable.

    How to get records 50-75 out of 100

    Use Scan and Filter, or m_dump -start 50 -end 75.

    Or use the next_in_sequence() function and the Filter by Expression component (next_in_sequence() >50 && next_in_sequence()


    departitioning

    departitioning - removing partitioning (Gather and Merge components)

    re-partitioning - changing the number of partitions (e.g., from 2 to 4 flows)

    lookup file

    For large amounts of data use an MFS lookup file (instead of a serial one).

    indexing

    No indexes as such. But there is "output indexing": use a Reformat and do the necessary coding in the transform part.

    Environment project

    Environment project - a special public project that exists in every Ab Initio environment. It contains all the environment parameters required by the private or public projects which constitute the AI Standard Environment.

    Aggregate vs Rollup

    Aggregate - old component.

    Rollup - newer, extended, recommended to use instead of Aggregate (built-in functions like sum, count, avg, min, max, product, ...).

    EME, GDE, Co>Operating System

    EME = Enterprise Meta>Environment. Functions: repository, version control, statistical analysis, dependency analysis. It is on the server side and holds all the projects (metadata of transformations, config info, source and target info: graph, dml, xfr, ksh, sql, etc.). This is where you checkin/checkout. The /Project dir of the EME contains common directories for all application sandboxes connected to it. It also helps in dependency analysis of code. Ab Initio has a series of air commands to manipulate repository objects.

    GDE = Graphical Development Environment (on the client box).

    Co>Operating System = the Ab Initio server, installed on top of the native (unix) OS on the server.

    fencing

    Fencing means controlling jobs on a priority basis. In AI it actually refers to customized phase breaking. A well fenced graph means that, no matter what the source data volume is, the process will not end up in deadlocks. It actually limits the number of simultaneous processes.

    Fencing - changing the priority of a job.

    Phasing - managing resources to avoid deadlocks, for example by limiting the number of simultaneous processes (by breaking the graph into phases, only 1 of which can run at any given time).

    Continuous components

    Continuous components - produce usable output while running continuously. For example: Continuous rollup, Continuous update, Batch subscribe.


    deadlock

    A deadlock is when two or more processes each wait for a resource held by another, so none of them can proceed. To avoid it, use phasing and resource pooling.

    environment

    AB_HOME - where the Co>Operating System is installed

    AB_AIR_ROOT - default location for the EME datastore

    Sandbox standard environment: AI_SORT_MAX_CORE, AI_HOME, AI_SERIAL, AI_MFS, etc.

    From the unix prompt: env | grep AI

    wrapper script

    A unix script that runs graphs.

    multistage component

    A multistage component is a component which transforms input records in 5 stages (1. input selection, 2. temporary initialization, 3. processing, 4. output selection, 5. finalize). So it is a transform component which has packages. Examples: Rollup, Scan, Normalize, and Denormalize Sorted.

    Dynamic DML

    Dynamic DML is used if the input metadata can change. Example: at different times, input files with different DML are received for processing. In that case we can use a flag in the DML: the flag is read first from the received input file, and the DML corresponding to that flag is used.

    fan in, fan out

    fan out - partition component (increases parallelism)

    fan in - departition component (decreases parallelism)

    lock

    A user can lock the graph for editing so that others will see the message and cannot edit the same graph.

    join vs lookup

    Lookup is good for speed with small files (it loads the whole file into memory). For large files use join. You may need to increase the max-core limit to handle big joins.

    multi update

    Multi update executes SQL statements - it treats each input record as a completely separate piece of work.

    scheduler

    We can use Autosys, Control-M, or any other external scheduler.

    We can take care of dependencies in many ways. For example, if scripts should run sequentially, we can arrange for this in Autosys, or we can create a wrapper script and put several sequential commands there (nohup command1.ksh; nohup command2.ksh; etc. - without a trailing &, which would start each command in the background instead of waiting for it). We can even create a special graph in Ab Initio to execute individual scripts as needed.
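As an illustration of the wrapper idea, here is a minimal Python sketch that runs commands strictly one after another and stops at the first failure. It is a stand-in for a ksh wrapper, not something taken from Ab Initio; the function name is hypothetical, and the sample commands are placeholder Python one-liners where a real wrapper would call graph scripts.

```python
import subprocess
import sys


def run_sequentially(commands):
    """Run each command in order and wait for it to finish.

    Returns 0 if every command succeeds, otherwise the exit code of the
    first command that fails (later commands are not started).
    """
    for cmd in commands:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return result.returncode
    return 0


# Placeholder commands standing in for command1.ksh, command2.ksh, ...
steps = [
    [sys.executable, "-c", "print('step 1 done')"],
    [sys.executable, "-c", "print('step 2 done')"],
]
print(run_sequentially(steps))
```

Returning the failing exit code lets an external scheduler such as Autosys or Control-M see which run failed.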


    API and Utility modes in input table

    These are database interfaces (API - uses SQL; utility - bulk loads, whatever the vendor provides).

    lookup file

    Lookup file component. Functions: lookup, lookup_count, lookup_next, lookup_match, lookup_local.

    Lookups are typically used in combination with Reformat components.

    Calling stored proc in DB

    You can call a stored proc (for example, from an input component). In fact, you can even write an SP in Ab Initio. Make it "with recompile" to assure good performance.

    Frequently used functions

    string_ltrim, string_lrtrim, string_substring, reinterpret_as, today(), now()

    data validation

    is_valid, is_null, is_blank, is_defined

    driving port

    When joining inputs (in0, in1, ...) one of the ports is used as the "driving" port (by default, in0). The driving input is usually the largest one, whereas the smallest can have the "Sorted-Input" parameter set to "Input need not be sorted" because it will be loaded completely into memory.

    Ab Initio vs Informatica for ETL

    Ab Initio benefits: parallelism built in, multifile system, handles huge amounts of data, easy to build and run. Generates scripts which can be easily modified as needed (if something couldn't be done in the ETL tool itself). The scripts can be easily scheduled using any external scheduler and easily integrated with other systems.

    Ab Initio doesn't require a dedicated administrator.

    Ab Initio doesn't have built-in CDC capabilities (CDC = Change Data Capture).

    Ab Initio allows you to attach error / reject files to each transformation and to capture and analyze the message and data separately (as opposed to Informatica, which has just one huge log). Ab Initio provides immediate metrics for each component.

    override key

    The override key option is used when we need to join 2 fields which have different field names.

    control file

    The control file should be in the multifile directory (it contains the addresses of the serial files).

    max-core

    The max-core parameter (for example, sort 100 MBytes) specifies the amount of memory used by a component (like Sort or Rollup) - per partition - before spilling to disk. Usually you don't need to change it - just use the default value. Setting it too high may degrade performance because of OS swapping and because it degrades the performance of other components.

    Input Parameters

    graph > select parameters tab > click "create" - and create a parameter. Usage: $paramname. Edit > parameters. These parameters will be substituted at run time. You may need to declare your parameter scope as formal.

    Error Trapping

    Each component has reject, error, and log ports. Reject captures rejected records, error captures the corresponding error, and log captures the execution statistics of the component. You can control the reject status of each component by setting the reject threshold to either Never Abort, Abort on first reject, or by setting ramp/limit. You can also use the force_error() function in a transform function.


    How to see resource usage

    In GDE go to View > Tracking Details - you will see each component's CPU and memory usage, etc.

    assign keys component

    Easy and saves development time. You need to understand how to feed parameters, and you can't control it easily.

    Join in DB vs join in Ab Initio

    Scenario 1 (preferred): we run a query which joins 2 tables in the DB and gives us the result in just 1 DB component.

    Scenario 2 (much slower): we use 2 database components, extract all the data, and join them in Ab Initio.

    Join with DB

    Not recommended if the number of records is big. It is better to retrieve the data and then join in Ab Initio.

    Data Skew

    Parameter showing how unevenly data is distributed between partitions.

    skew = (partition size - avg. partition size) * 100 / (size of the largest partition)
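The skew formula can be tried out on made-up partition sizes. The Python sketch below applies it per partition; the function name and the sample sizes are hypothetical.

```python
def skew_percent(partition_sizes):
    """Skew of each partition, in percent, using the formula from the text."""
    avg = sum(partition_sizes) / len(partition_sizes)
    largest = max(partition_sizes)
    if largest == 0:
        # All partitions are empty, so there is nothing to be skewed.
        return [0.0] * len(partition_sizes)
    return [(size - avg) * 100 / largest for size in partition_sizes]


# Three partitions of 100 records and one of 300: the average is 150.
sizes = [100, 100, 100, 300]
for size, skew in zip(sizes, skew_percent(sizes)):
    print(size, round(skew, 1))
```

A partition at the average size has a skew of 0; positive values mark overloaded partitions and negative values underloaded ones.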

    dbc vs cfg

    .dbc - database configuration file (dbname, nodes, version, user/pwd) - resides in the db directory.

    .cfg - any type of config file. For example, a remote connection config (name of the remote server, user/pwd to connect to the db, location of the OS on the remote machine, connection method). The .cfg file resides in the config dir.

    compilation errors

    Depth not equal, data format error, etc.

    Depth error: we get this error when two components are connected together but their layouts do not match.

    types of partitions

    Broadcast, Partition by Expression, Partition by Round-robin, Partition by Key, Partition with Load Balance.

    unused port

    When joining, used records go to the output port, and unused records go to the unused port.

    tuning performance

    Go parallel using partitioning. Round-robin partitioning gives good balance.

    Use the Multi-file system (MFS).

    Use Ad Hoc MFS to read many serial files in parallel, and use the concat component.

    Once data is partitioned, do not switch it to serial and back. Repartition instead.

    Do not access large files via NFS - use FTP instead.

    Use lookup local rather than lookup (especially for big lookups).

    Use Rollup and Filter as soon as possible to reduce the number of records. Ideally do it in the source (database ?) before you get the data.

    Remove unnecessary components. For example, instead of using filter by exp, you can implement the same function in Reformat/Join/Rollup. Another example - when joining data from 2 files, use the union function instead of adding an additional component for removing duplicates.

    Use gather instead of concatenate.

    It is faster to do a sort after a partition than to do a sort before a partition.

    Try to avoid using a join with the "db" component.

    When getting data from a database, make sure your queries are fast (use indexes, etc.). If possible, do the necessary selection / aggregation / sorting in the database before getting data into Ab Initio.

    Tune max-core for optimal performance (for sort it depends on the size of the input file).

    Note - if an in-memory join cannot fit its non-driving inputs in the provided max-core, then it will drop all the inputs to disk, and in-memory does not make sense.

    Using phase breaks lets you allocate more memory to individual components, thus improving performance.

    Use a checkpoint after sort to land data on disk.

    Use the in-memory feature of Join and Rollup.

    When joining a very small dataset to a very large dataset, it is more efficient to broadcast the small dataset to MFS using the broadcast component, or to use the small file as a lookup. But for a large dataset don't use broadcast as a partitioner.


    Use Ab Initio layout instead of the database default to achieve parallel loads.

    Change the AB_REPORT parameter to increase the monitoring duration.

    Use catalogs for reusability.

    Components like Join / Rollup should have the option "Input must be sorted" if they are placed after a sort component.

    Minimize the number of sort components. Minimize usage of the sorted join component, and if possible replace it with an in-memory join / hash join. Use only the required fields in the Sort, Reformat, and Join components. Use "Sort within Groups" instead of just Sort when the data was already presorted.

    Use phasing / flow buffers in case of merge sorted joins.

    Minimize the use of regular expression functions like re_index in the transform functions.

    Avoid repartitioning data unnecessarily.

    When splitting records into more than two flows, use Reformat rather than the Broadcast component.

    For joining records from 2 flows use the Concatenate component ONLY when there is a need to follow some specific order in joining records. If no order is required then it is preferable to use the Gather component.

    Instead of putting many Reformat components consecutively, use the output indexes parameter in the first Reformat component and mention the condition there.

    delta table

    A delta table maintains the sequencer of each data table.

    Master (or base) table - a table on top of which we create a view.

    scan vs rollup

    rollup - performs aggregate calculations on groups

    scan - calculates cumulative totals
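The difference between the two can be shown with a short Python model (not Ab Initio code): rollup emits one record per group, while scan emits one record per input record carrying the running total. The function names and sample data are hypothetical, and the input is assumed to be sorted by key.

```python
from itertools import accumulate, groupby


def rollup_sum(records):
    """One output record per key group: (key, total)."""
    return [
        (key, sum(amount for _, amount in group))
        for key, group in groupby(records, key=lambda r: r[0])
    ]


def scan_sum(records):
    """One output record per input record: (key, running total within the group)."""
    out = []
    for key, group in groupby(records, key=lambda r: r[0]):
        out.extend((key, total) for total in accumulate(amount for _, amount in group))
    return out


records = [("a", 1), ("a", 2), ("b", 5), ("b", 1), ("b", 4)]
print(rollup_sum(records))
print(scan_sum(records))
```

The last scan record of each group carries the same total as the rollup record, which is why a scan followed by a dedup that keeps the last record is another way to compute a sum.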

    packages

    Used in multistage components or transform components.

    Reformat vs "Redefine Format"

    Reformat - deriving new data by adding/dropping fields.

    Redefine Format - renaming fields (the record format changes, the data does not).

    Conditional DML

    DML which is separated based on a condition.

    SORT WITHIN GROUPS

    The prerequisite for using Sort within Groups is that the data is already sorted by the major key. Sort within Groups outputs the data as soon as it has finished reading a major key group. It is like an implicit phase.
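A minimal Python sketch of the idea, assuming the input is already ordered by the major key: each group is sorted by the minor key and emitted as soon as the group ends, so only one group is held in memory at a time. The function name and sample records are hypothetical.

```python
from itertools import groupby


def sort_within_groups(records, major, minor):
    """Yield records group by group, each group sorted by the minor key.

    Assumes `records` is already ordered by the major key.
    """
    for _, group in groupby(records, key=major):
        # Only the current group is buffered and sorted.
        yield from sorted(group, key=minor)


records = [("a", 3), ("a", 1), ("b", 2), ("b", 0)]
result = list(sort_within_groups(records, major=lambda r: r[0], minor=lambda r: r[1]))
print(result)
```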


    passing a condition as a parameter

    Define a Formal Keyword Parameter of type string. For example, you call it FilterCondition, and you want it to filter on COUNT > 0. In your graph, in your "Filter by expression" component, enter the following condition: $FilterCondition

    Now on your command line or in a wrapper script give the following command (the condition is quoted so that the shell does not treat > as a redirection):

    YourGraphname.ksh -FilterCondition "COUNT > 0"

    Passing file name as a parameter

    #!/bin/ksh
    # Run the setup script for the environment
    typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
    . $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
    # Pass the two script parameters on to the graphs
    if [ $# -eq 2 ]; then
        INPUT_FILE_PARAMETER_1=$1
        INPUT_FILE_PARAMETER_2=$2
        # This graph is using the first input file
        cd $AI_RUN
        ./my_graph1.ksh $INPUT_FILE_PARAMETER_1
        # This graph is using the second input file
        ./my_graph2.ksh $INPUT_FILE_PARAMETER_2
        exit 0
    else
        echo Insufficient parameters
        exit 1
    fi

    -------------------------------------

    #!/bin/ksh
    # Run the setup script for the environment
    typeset PROJ_DIR=$(cd $(dirname $0)/..; pwd)
    . $PROJ_DIR/ab_project_setup.ksh $PROJ_DIR
    # Export script parameter 1 as INPUT_FILE_NAME
    export INPUT_FILE_NAME=$1
    # This graph is using the input file
    cd $AI_RUN
    ./my_graph1.ksh
    # This graph also is using the input file
    ./my_graph2.ksh
    exit 0

    How to remove header and trailer lines?

    Use conditional DML where you can separate detail from header and trailer. For validations use Reformat with count: 3 (out0: header, out1: detail, out2: trailer).

    How to create a multifile system on Windows

    First method: in GDE go to RUN > Execute Command - and run

    m_mkfs c:control c:dp1 c:dp2 c:dp3 c:dp4

    Second method: double-click on the file component, and in the ports tab double-click on partitions - there you can enter the number of partitions.

    Vector

    A vector is simply an array. It is an ordered set of elements of the same type (the type can be any type, including a vector or a record).

    Dependency Analysis

    Dependency analysis answers questions regarding data lineage, that is: where does the data come from, what applications produce and depend on this data, etc.


    Surrogate key

    There are many ways to create a surrogate key. For example, you can use the next_in_sequence() function in your transform. Or you can use the "Assign key values" component. Or you can write a stored procedure and call it.

    Note: if you use partitions, then do something like this:

    (next_in_sequence()-1)*number_of_partitions()+this_partition()
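The partition-aware formula can be checked with a small Python model. It assumes, as a labeled assumption rather than a statement about the product, that the sequence starts at 1 in every partition and that partitions are numbered from 0; the function name and the sizes are hypothetical.

```python
def surrogate_key(seq, n_partitions, partition):
    """Key for the seq-th record (1-based) seen by a partition (0-based)."""
    return (seq - 1) * n_partitions + partition


n_partitions = 3
records_per_partition = 4

keys = [
    surrogate_key(seq, n_partitions, partition)
    for partition in range(n_partitions)
    for seq in range(1, records_per_partition + 1)
]
print(sorted(keys))
```

Each partition interleaves its keys with the others (partition 0 gets 0, 3, 6, ...; partition 1 gets 1, 4, 7, ...), so no two partitions can produce the same value without coordinating.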

    .abinitiorc

    This is a config file for Ab Initio - in the user's home directory and in $AB_HOME/Config. It sets the Ab Initio home path, configuration variables (AB_WORK_DIR, AB_DATA_DIR, etc.), login info (id, encrypted password), login methods for hosts for execution (like the EME host), etc.

    .profile

    Your ksh init file (environment, aliases, path variables, history file settings, command prompt settings, etc.).

    data mapping, data modelling

    How to execute the graph

    From GDE - the whole graph or by phases. From a checkpoint. Also using ksh scripts.

    Write Multiple Files

    A component which allows you to write simultaneously into multiple local files.

    Testing

    Run the graph - see the results. Use components from the Validate category.


    Sandbox vs EME

    A sandbox is your private area where you develop and test. Only one project and one version can be in the sandbox at any time. The EME Datastore contains all versions of the code that have been checked into it (source control).

    Layout

    Where the data files are and where the components are running. For example, for data - serial or partitioned (multi-file). The layout is defined by the location of the file (or of a control file for the multifile). In the graph the layout can propagate automatically (for a multifile you have to provide details).

    Latest versions

    April 2009: GDE ver. 1.15.6, Co>Operating System ver. 2.14.

    Graph parameters

    Menu edit > parameters - allows you to specify private parameters for the graph. They can be of 2 types - local and formal.

    Plan>It

    You can define pre- and post-processes and triggers. You can also specify methods to run on success or on failure of the graphs.

    Frequently used components

    input file / output file

    input table / output table

    lookup / lookup_local

    reformat

    gather / concatenate

    join

    runsql

    join with db

    compression components

    filter by expression

    sort (single or multiple keys)

    rollup

    trash

    partition by expression / partition by key

    running on hosts

    The Co>Operating System is layered on top of the native OS (unix). When running from GDE, GDE generates a script (according to the "run" settings). The Co>Operating System will execute the scripts on different machines (using the specified host settings and connection methods, like rexec, telnet, rsh, rlogin) and then return error or success codes back.

    conventional loading vs direct loading

    This is basically an Oracle question regarding the SQLLDR (SQL Loader) utility.

    Conventional load - uses insert statements. All triggers will fire, all constraints will be checked, all indexes will be updated.

    Direct load - data is written directly block by block. Can load into a specific partition. Some constraints are checked, and indexes may be disabled - you need to specify native options to skip index maintenance.


    semi-join

    Ab Initio online help gives 3 examples of joins: inner join, outer join, and semi join.

    For inner join the 'record_requiredN' parameter is true for all "in" ports.

    For outer join it is false for all the "in" ports.

    For semi join it is true for both ports (like inner join), but the dedup option is set only on one side.
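How the record-required flags select between inner and outer behaviour can be modelled in Python. This is a simplified two-input sketch, not the component's actual implementation; the function signature, the parameter names record_required0 / record_required1, and the sample records are assumptions, and the dedup option is left out.

```python
from itertools import product


def join(in0, in1, key, record_required0=True, record_required1=True):
    """Join two lists of dicts on `key`.

    A key group is dropped when a port marked as required has no record
    for that key; a missing record on a non-required port becomes None.
    """
    keys = sorted({r[key] for r in in0} | {r[key] for r in in1})
    out = []
    for k in keys:
        left = [r for r in in0 if r[key] == k]
        right = [r for r in in1 if r[key] == k]
        if (record_required0 and not left) or (record_required1 and not right):
            continue
        out.extend(product(left or [None], right or [None]))
    return out


in0 = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
in1 = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]

inner = join(in0, in1, "id")                # both ports required
outer = join(in0, in1, "id", False, False)  # no port required
print(len(inner), len(outer))
```

Setting only one flag to true gives a one-sided outer join, for example join(in0, in1, "id", True, False) keeps every in0 record.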

    http://www.geekinterview.com/Interview-Questions/Data-Warehouse/Abinitio/page10