pl standard toolkit reference

Upload: ramanavg

Post on 08-Aug-2018


    IBM InfoSphere Streams Version 2.0.0.4

    IBM Streams Processing Language Standard Toolkit Reference


    Note: Before using this information and the product it supports, read the general information under Notices.

    Edition Notice

    This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this manual should not be interpreted as such.

    You can order IBM publications online or through your local IBM representative.

    • To order publications online, go to the IBM Publications Center at www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss

    • To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at www.ibm.com/planetwide

    When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

    Copyright IBM Corporation 2011, 2012. US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


    Summary of changes

    This topic describes updates to this documentation for IBM InfoSphere Streams Version 2.0 (all releases).

    Note: The following revision characters are used in the InfoSphere Streams documentation to indicate updates for Version 2.0.0.4:

    • In PDF files, updates are indicated by a vertical bar (|) to the left of each new or changed line of text.

    • In HTML files, updates are surrounded by double angle brackets (>> and <<).


    • Multiple DirectoryScan operators can scan the same directory simultaneously if the processed files are moved to a different directory before generating the output tuple.

    • The DirectoryScan operator supports custom output functions to provide additional information about the generated file.

    • The interface parameter is added to the TCPSource, TCPSink, and UDPSource operators to specify the network interface to use when registering the address with the name parameter.

    • The nConnections metric is added to the TCPSource and TCPSink operators to indicate the number of active TCP/IP connections.

    • The append parameter is added to the FileSink operator to append the generated tuples to the output file. For more information, see FileSink.

    • The ignoreOpenErrors parameter is added to the FileSource operator to read successive files if a file cannot be opened for reading. For more information, see FileSource.

    • An optional output port is added to the FileSource operator to indicate the files that were processed and those that could not be opened successfully.

    • If an SPL program or a toolkit uses the new features that are added to the Standard Toolkit in IBM InfoSphere Streams Version 2.0.0.3, you must set the Standard Toolkit version to 1.0.1 in the info.xml file. For more information about the info.xml file and how to set dependencies on other toolkits, see how to create toolkits in the IBM Streams Processing Language Toolkit Development Reference.

    Updates for Version 2.0.0.2 (Version 2.0, Fix Pack 2)

    • The DirectoryScan operator uses change time (ctime) of the file to detect if the file has been recreated. For more information, see DirectoryScan.

    • The hasHeaderLine parameter of the FileSource operator supports multiple lines of column names for csv format. For more information, see FileSource.

    • A logic clause cannot be specified for the Export operator.

    • A config clause cannot be specified for the Import and Export operators.

    • If a file is moved to a directory that is on a different file system, a .rename subdirectory might be created in the target directory for the file move operation to be atomic. For more information, see FileSink and FileSource.

    Updates for Version 2.0.0.1 (Version 2.0, Fix Pack 1)

    This guide was not updated for Version 2.0.0.1.


    Abstract

    This document describes the operators that are provided by the IBM Streams Processing Language (SPL) standard toolkit. This standard toolkit is specific to IBM InfoSphere Streams.


    Contents

    Summary of changes

    Abstract

    Chapter 1. Relational Operators
      Filter
      Functor
      Punctor
      Sort
      Join
      Aggregate

    Chapter 2. Adapter Operators
      FileSource
      FileSink
      DirectoryScan
      TCPSource
      TCPSink
      UDPSource
      UDPSink
      Export
      Import
      MetricsSink

    Chapter 3. Utility Operators
      Custom
      Beacon
      Throttle
      Delay
      Barrier
      Pair
      Split
      DeDuplicate
      Union
      ThreadedSplit
      DynamicFilter
      Gate
      JavaOp

    Chapter 4. Compat Operators
      V1TCPSource
      V1TCPSink
      Compat.Sample

    Notices


    Chapter 1. Relational Operators

    Filter

    Description
    The Filter operator removes tuples from a stream by passing along only those that satisfy a user-specified condition. Non-matching tuples may be sent to a second optional output.

    Input Ports
    The Filter operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.

    Output Ports
    The Filter operator is configurable with one or two output ports. The first output port is mandatory, non-mutating, and its punctuation mode is Preserving. The second output port is optional, non-mutating, and its punctuation mode is Preserving. The Filter operator requires that the stream type of the output port(s) match the stream type of the input port. The first output port will receive the tuples that match the filter expression. The second output port, if present, will receive the tuples that fail to match the filter expression.

    Parameters
    The Filter operator has the following parameters:

    filter
    This is an optional parameter, which specifies the condition that determines the tuples to be passed along by the Filter operator. It takes a single expression of type boolean as its value. When not specified, it is assumed to be true.

    Windowing
    The Filter operator does not accept any window configurations.

    Assignments
    The Filter operator does not allow assignments to output attributes. The output tuple attributes are automatically forwarded from the input ones.

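    composite Main {
      graph
        // Schema assumed from the attributes used in these examples;
        // the stream type declarations were lost from the source text.
        stream<rstring name, uint32 age, uint64 salary> Beat = Beacon() {}
        // single output: only matching tuples are passed along
        stream<Beat> Youngs = Filter(Beat)
        {
          param filter : age < 30u;
        }
        // two outputs: matches go to Younger, non-matches go to Older
        (stream<Beat> Younger; stream<Beat> Older) = Filter(Beat)
        {
          param filter : age < 30u;
        }
    }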
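
The routing rule of the Filter operator can also be modeled outside SPL. The following Python sketch is not the toolkit implementation; `filter_stream`, the `second_port` flag, and the sample tuples are hypothetical names chosen for illustration. It forwards matching tuples unchanged to the first output and, only when a second output is requested, sends the remaining tuples there.

```python
from typing import Callable, Iterable, List, Tuple

Record = dict  # a tuple is modeled as a plain dict of attribute values


def filter_stream(
    tuples: Iterable[Record],
    predicate: Callable[[Record], bool] = lambda t: True,  # omitted filter means true
    second_port: bool = False,
) -> Tuple[List[Record], List[Record]]:
    """Model of Filter routing: matches go to port 0, the rest to optional port 1."""
    matched: List[Record] = []
    unmatched: List[Record] = []
    for t in tuples:
        if predicate(t):
            matched.append(t)    # forwarded unchanged, no output assignments
        elif second_port:
            unmatched.append(t)  # kept only when the second port exists
        # without a second port, non-matching tuples are dropped
    return matched, unmatched


beat = [{"name": "ann", "age": 25}, {"name": "bob", "age": 41}]
younger, older = filter_stream(beat, lambda t: t["age"] < 30, second_port=True)
```

Calling `filter_stream(beat)` with no predicate passes every tuple to the first output, which mirrors the default of `true` for the filter parameter.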

    Functor

    Description
    The Functor operator is used to transform input tuples into output ones, and optionally filter them as in a Filter operator. If you do not filter an input tuple, any incoming tuple results in a tuple on each output port.

    Input Ports
    The Functor operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.


    Output Ports
    The Functor operator is configurable with one or more output ports. The output ports are mutating and their punctuation mode is Preserving.

    Parameters
    The Functor operator has the following parameters:

    filter
    This is an optional parameter, which specifies the condition that determines which input tuples are to be operated on by the Functor operator. It takes a single expression of type boolean as its value. When not specified, it is assumed to be true, i.e., tuples are transformed, but no filtering is performed.

    Windowing
    The Functor operator does not accept any window configurations.

    Assignments
    The Functor operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Functor operator expects all output tuple attributes to be completely assigned.

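    composite Main {
      graph
        // Schemas below are assumed from attribute usage; the stream type
        // declarations were lost from the source text.
        stream<rstring name, uint32 age, uint64 salary> Beat = Beacon() {}
        stream<rstring name, uint32 age, rstring login,
               tuple<boolean young, boolean rich> info>
          Annotated = Functor(Beat)
        {
          param filter : age >= 18u;
          output Annotated : login = lower(name),
            // the comparison operators and the "young" threshold are assumed;
            // only "age" and "1000000ul" survive in the source text
            info = { young = (age <= 30u), rich = (salary > 1000000ul) };
        }
        // no output clause: the listed attributes are forwarded automatically
        (stream<rstring name, uint32 age> Age;
         stream<rstring name, uint64 salary> Salary) = Functor(Beat)
        {
          param filter : age >= 18u;
        }
    }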
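
The assignment and forwarding rule of the Functor operator can be sketched in Python. This is an illustrative model, not the operator's implementation: `functor`, `output_attrs`, and `assignments` are hypothetical names, and the missing-attribute check that SPL performs at compile time is approximated here by a runtime `KeyError`.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Record = Dict[str, Any]


def functor(
    tuples: Iterable[Record],
    output_attrs: List[str],
    assignments: Dict[str, Callable[[Record], Any]],
    predicate: Callable[[Record], bool] = lambda t: True,
) -> Iterator[Record]:
    """Model of Functor: filter, then assign or forward every output attribute."""
    for t in tuples:
        if not predicate(t):
            continue  # filtered tuples produce no output
        out: Record = {}
        for attr in output_attrs:
            if attr in assignments:
                out[attr] = assignments[attr](t)  # explicit output assignment
            elif attr in t:
                out[attr] = t[attr]               # automatic forwarding by name
            else:
                raise KeyError(f"output attribute {attr!r} is not assigned")
        yield out


beat = [{"name": "Ann", "age": 20}, {"name": "Bob", "age": 10}]
annotated = list(
    functor(
        beat,
        output_attrs=["name", "age", "login"],
        assignments={"login": lambda t: t["name"].lower()},
        predicate=lambda t: t["age"] >= 18,
    )
)
```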

    Punctor

    Description
    The Punctor operator is used to transform input tuples into output ones and add window punctuations to the output.

    Input Ports
    The Punctor operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.

    Output Ports
    The Punctor operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating.

    Parameters
    The Punctor operator has the following parameters:

    punctuate
    This is a mandatory parameter, which specifies the condition that determines when a window punctuation is to be generated. It takes a single expression of type boolean as its value.

    position
    This is a mandatory parameter, which specifies the position of the generated window punctuation with respect to the current tuple. The valid values are before and after. If the value is before, the punctuation will be generated before the output tuple, otherwise it will be generated after the output tuple.

    Windowing
    The Punctor operator does not accept any window configurations.

    Assignments
    The Punctor operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Punctor operator expects all output tuple attributes to be completely assigned.

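    composite Main {
      graph
        // Schemas below are assumed from attribute usage; the stream type
        // declarations were lost from the source text.
        stream<rstring name, uint32 age, uint64 salary> Beat = Beacon() {}
        stream<rstring name, uint32 age, rstring login,
               tuple<boolean young, boolean rich> info>
          Annotated = Punctor(Beat)
        {
          param punctuate : age >= 18u;
                position : after; // add a punctuation after the generated tuple,
                                  // if the age is >= 18
          output Annotated : login = lower(name),
            // the comparison operators and the "young" threshold are assumed;
            // only "age" and "1000000ul" survive in the source text
            info = { young = (age <= 30u), rich = (salary > 1000000ul) };
        }
    }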
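
The effect of the position parameter is easiest to see on a short stream. The Python sketch below is a model only; `punctor` and the `PUNCT` marker are hypothetical stand-ins for the operator and for a window punctuation. Output assignments are left out so that only the placement of the punctuation is shown.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Union

Record = Dict[str, Any]
PUNCT = "<window-punctuation>"  # stand-in marker for a window punctuation


def punctor(
    tuples: Iterable[Record],
    punctuate: Callable[[Record], bool],
    position: str,
) -> Iterator[Union[Record, str]]:
    """Model of Punctor: emit a punctuation before or after qualifying tuples."""
    if position not in ("before", "after"):
        raise ValueError("position must be 'before' or 'after'")
    for t in tuples:
        fire = punctuate(t)
        if fire and position == "before":
            yield PUNCT
        yield t  # every input tuple is passed along
        if fire and position == "after":
            yield PUNCT


beat = [{"age": 17}, {"age": 20}]
after = list(punctor(beat, lambda t: t["age"] >= 18, "after"))
before = list(punctor(beat, lambda t: t["age"] >= 18, "before"))
```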

    Sort

    Description
    The Sort operator is used to order tuples based on user-specified ordering expressions and window configurations.

    Input Ports
    The Sort operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is WindowBound. The Sort operator will process window marker punctuations when configured with a punctuation based window.

    Output Ports
    The Sort operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The Sort operator will generate a punctuation after each batch of sorted tuples it outputs. The Sort operator requires that the stream type for the output port matches the stream type for the input port.

    Parameters
    The Sort operator has the following parameters:

    sortBy
    This is a mandatory parameter that specifies one or more expressions to be used for sorting the tuples. The sort is performed in lexicographical manner in ascending order. That is, the first expression is used first for the comparison, in the case of equality the second expression is considered, and so on. The default sort order of ascending implies that the output stream will produce tuples in non-decreasing order. The sort order can be changed using the order parameter.

    order
    This is an optional parameter that specifies either the global sort order, or the sort order for the individual expressions that appear in the sortBy parameter. The valid values are ascending and descending. When a single value is specified for the order parameter, it determines the global sort order. When multiple values are specified, the number of values must match the number of sortBy expressions.

    partitionBy
    This is an optional parameter that is only valid for a Sort operator configured with a partitioned window (see below). It specifies one or more expressions to be used for partitioning the input tuples into sub-windows, where all window and parameter configurations apply to the sub-windows, independently.
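
The interaction of the sortBy and order parameters can be illustrated with a small Python model. It is a sketch under stated assumptions, not the operator's code: `sort_window` is a hypothetical helper, key functions stand in for sortBy expressions, and a list stands in for the contents of one window.

```python
from typing import Any, Callable, Dict, List, Sequence

Record = Dict[str, Any]


def sort_window(
    window: Sequence[Record],
    sort_by: Sequence[Callable[[Record], Any]],
    order: Sequence[str] = ("ascending",),
) -> List[Record]:
    """Lexicographic sort with one global order or one order per expression."""
    directions = list(order)
    if len(directions) == 1:
        directions = directions * len(sort_by)  # a single value is the global order
    if len(directions) != len(sort_by):
        raise ValueError("number of order values must match number of sortBy expressions")
    for d in directions:
        if d not in ("ascending", "descending"):
            raise ValueError(f"invalid order value: {d!r}")
    result = list(window)
    # Stable sorts applied from the least significant expression to the most
    # significant one give the lexicographic order described above.
    for key, direction in reversed(list(zip(sort_by, directions))):
        result.sort(key=key, reverse=(direction == "descending"))
    return result


rows = [
    {"name": "bob", "ratio": 2.0},
    {"name": "ann", "ratio": 1.0},
    {"name": "ann", "ratio": 3.0},
]
keys = [lambda t: t["name"], lambda t: t["ratio"]]
mixed = sort_window(rows, keys, ["ascending", "descending"])
```

Here `mixed` lists both `ann` tuples first, with the higher ratio ahead of the lower one, followed by `bob`.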

    Windowing
    The Sort operator supports the following window configurations:

    • tumbling, (count | delta | time | punctuation)-based eviction (, partitioned (, partitionEvictionSpec)? )?

    • sliding, count-based eviction, count-based trigger of 1 (, partitioned (, partitionEvictionSpec)? )?

    For the tumbling variants, tuples are sorted when the window gets full and are output at once. A window marker punctuation is output at the end.

    For the sliding variants, tuples are always kept in sorted order. Once the window gets full, every new tuple causes the first one in the sorted order to be removed from the window and output. This type of sort is referred to as progressive sort.

    For the partitioned variants, the window specification applies to individual sub-windows identified by the partitionBy parameter.

    For the tumbling variants, the final punctuation marker does not flush the window (so as not to break invariants on the output), whereas for the sliding variants (progressive), the final punctuation marker does flush the window.
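
Progressive sort can be modeled with a bounded min-heap. The Python sketch below is an illustration, not the toolkit implementation; `progressive_sort` is a hypothetical name, and it assumes that the smallest tuple is removed before the new tuple is inserted, a detail the text above does not spell out. The end of the input plays the role of the final punctuation and flushes the window.

```python
import heapq
from itertools import count
from typing import Any, Callable, Iterable, Iterator, List, Tuple


def progressive_sort(
    tuples: Iterable[Any],
    size: int,
    key: Callable[[Any], Any],
) -> Iterator[Any]:
    """Model of a sliding, count(size) Sort: emit the smallest once the window is full."""
    if size < 1:
        raise ValueError("size must be at least 1")
    heap: List[Tuple[Any, int, Any]] = []
    seq = count()  # tie-breaker so equal keys never compare the tuples themselves
    for t in tuples:
        if len(heap) == size:
            # window is full: the first tuple in sorted order leaves the window
            yield heapq.heappop(heap)[2]
        heapq.heappush(heap, (key(t), next(seq), t))
    # final punctuation: flush the remaining tuples in sorted order
    while heap:
        yield heapq.heappop(heap)[2]


partial = list(progressive_sort([5, 1, 4, 2, 3], size=2, key=lambda x: x))
full = list(progressive_sort([5, 1, 4, 2, 3], size=5, key=lambda x: x))
```

With a window of 2 the output is only locally ordered, while a window that holds the whole input yields a fully sorted sequence. This is the trade-off between latency and ordering quality that a progressive sort makes.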

    Assignments
    The Sort operator does not allow assignments to output attributes. The output tuple attributes are automatically forwarded from the input ones.

    Metrics
    The Sort operator has the following metrics:

    • nCurrentPartitions: The number of partitions currently in the window for the Sort operator.

    composite Main {
      graph
        // Schemas are assumed from attribute usage; the stream type
        // declarations were lost from the source text.
        stream<rstring name, uint32 age, uint64 salary> Beat = Beacon() {}
        // count based window
        stream<Beat> Sorted0 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // count based partitioned window
        stream<Beat> Sorted1 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10), partitioned;
          param
            partitionBy : name;
            sortBy : (float64)salary/(float64)age;
        }
        // count based window, with sort order
        stream<Beat> Sorted2 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
            order : descending;
        }
        // count based window, with sort order for each sortBy expression
        stream<Beat> Sorted3 = Sort(Beat)
        {
          window
            Beat : tumbling, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
            order : ascending, descending;
        }
        // punctuation based window
        stream<Beat> Sorted4 = Sort(Beat)
        {
          window
            Beat : tumbling, punct();
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // time based window
        stream<Beat> Sorted5 = Sort(Beat)
        {
          window
            Beat : tumbling, time(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
        // delta based window
        stream<rstring name, uint32 age, uint64 salary, uint32 id> BeatId = Beacon() {}
        stream<BeatId> Sorted6 = Sort(BeatId)
        {
          window
            BeatId : tumbling, delta(id, 10u);
          param
            sortBy : (float64)salary/(float64)age;
        }
        // progressive sort
        stream<Beat> Sorted = Sort(Beat)
        {
          window
            Beat : sliding, count(10);
          param
            sortBy : name, (float64)salary/(float64)age;
        }
    }

    Join

    Description
    The Join operator is used to correlate tuples from two streams based on user-specified match predicates and window configurations. When a tuple is received on an input port, it is inserted into the window corresponding to the input port, which causes the window to trigger. As part of the trigger processing, the tuple is compared against all tuples inside the window of the opposing input port. If the tuples match, then an output tuple will be produced for each match. If at least one output was generated, a window punctuation will be generated after all the outputs.

    If equalityRHS and equalityLHS parameters are specified, the matching will be done using a hash table. Otherwise a scan of the tuples in the window will be done to find the matches.

    In an outer join configuration, if a tuple does not get involved in a match during its stay in the join window, then it will be sent out to an output port right before its eviction from the window. See the algorithm parameter for details.


    Partitioning may be used to split the tuples into partitioned windows.

    Input Ports
    The Join operator is configurable with two input ports. The input ports are non-mutating and their punctuation mode is Oblivious.

    Output Ports
    The Join operator is configurable with a single output port in the case of an inner join, one or two output ports in the case of a rightOuter or leftOuter join, and one or three output ports in the case of an outer join. The output ports are mutating. The punctuation mode is Generating for the first output port and Free for any other output ports that may exist. The Join operator will generate a punctuation after each batch of joined tuples it outputs on its first output port.

    Parameters
    The Join operator has the following parameters:

    match
    This optional parameter specifies an expression of type boolean to be used for matching the tuples. The expression can refer to attributes from both input ports. When omitted, the default value of true is used.

    algorithm
    This optional parameter is used to specify the join algorithm to be used. The valid options are leftOuter, rightOuter, outer, and inner. In a left outer join, a tuple that is being evicted from the left port's window and has never been involved in a match earlier is paired with a default initialized tuple (whose attributes are default constructed) from the right port and output. If a defaultTupleRHS parameter is specified, its value is used instead of the default constructed tuple. A right outer join is similar, but applies to tuples that are being evicted from the right port's window and employs the defaultTupleLHS parameter if present. An outer join is a combination of left and right outer joins. The default for this parameter is the inner join option, which does not perform any action upon eviction of tuples.

    For leftOuter and rightOuter joins, an optional second output port can be specified. In this case, the evicted tuples that have no matches are output on the second output port and are not joined with an empty tuple from the opposite window. The schema of the second output port must match that of the left input port in the case of a leftOuter join and the right input port in the case of a rightOuter join. For an outer join, optional second and third output ports can be specified. This means that the outer join can have either one output port or three output ports. When specified, the second port is used to output evicted tuples from the left input port that have no matches and the third port is used to output the ones from the right input port. The schemas of the second and third output ports must match the schemas of the first and second input ports, respectively.

    defaultTupleLHS
    This optional parameter can be specified to indicate the tuple to be used from the left stream, for matching an expiring tuple from the right window that needs to be output as part of a right outer join or outer join algorithm. It is only valid for join operators with a single output port and those that have rightOuter or outer as the join algorithm. It can take a single value of tuple type, which must match the type of the tuples from the left stream.

    defaultTupleRHS
    This optional parameter can be specified to indicate the tuple to be used from the right stream, for matching an expiring tuple from the left window that needs to be output as part of a left outer join or outer join algorithm. It is only valid for join operators with a single output port and those that have leftOuter or outer as the join algorithm. It can take a single value of tuple type, which must match the type of the tuples from the right stream.

    equalityLHS
    This optional parameter is used to specify equality condition expressions from the left port. The number of expressions and their types must match those from the equalityRHS parameter. The expressions can refer to attributes from the left input port only.

    equalityRHS
    This optional parameter is used to specify equality condition expressions from the right port. The number of expressions and their types must match those from the equalityLHS parameter. The expressions can refer to attributes from the right input port only.

    The equalityLHS and equalityRHS parameters can be used to specify equi-join match predicates, which results in using a hash-based join implementation, rather than a nested-loop one. They are not mutually exclusive with the match parameter and can be used together.

    partitionByLHS
    This optional parameter specifies one or more expressions to be used for partitioning the input tuples from the left port into sub-windows, where all window and parameter configurations apply to the sub-windows, independently. It can only be used if a partitioned window is defined for the left port (see below). The expressions can refer to attributes from the left input port only.

    partitionByRHS
    This optional parameter specifies one or more expressions to be used for partitioning the input tuples from the right port into sub-windows, where all window and parameter configurations apply to the sub-windows, independently. It can only be used if a partitioned window is defined for the right port (see below). The expressions can refer to attributes from the right input port only.
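
The difference between a nested-loop match and a hash-based equi-join, and the way the match parameter combines with the equality parameters, can be sketched in Python. This is a model under stated assumptions, not the operator's internals: `nested_loop_matches`, `build_index`, and `hash_matches` are hypothetical helpers, and lambdas stand in for the equalityLHS, equalityRHS, and match expressions.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Record = Dict[str, Any]
Expr = Callable[[Record], Any]


def nested_loop_matches(left: Record, right_window: Sequence[Record],
                        match: Callable[[Record, Record], bool]) -> List[Record]:
    """Scan every tuple in the opposite window (no equality parameters)."""
    return [r for r in right_window if match(left, r)]


def build_index(right_window: Sequence[Record],
                equality_rhs: Sequence[Expr]) -> Dict[Tuple[Any, ...], List[Record]]:
    """Hash the opposite window on the values of its equality expressions."""
    index: Dict[Tuple[Any, ...], List[Record]] = defaultdict(list)
    for r in right_window:
        index[tuple(e(r) for e in equality_rhs)].append(r)
    return index


def hash_matches(left: Record, index: Dict[Tuple[Any, ...], List[Record]],
                 equality_lhs: Sequence[Expr],
                 match: Callable[[Record, Record], bool] = lambda l, r: True) -> List[Record]:
    """Probe the hash table, then apply the remaining match predicate."""
    key = tuple(e(left) for e in equality_lhs)
    return [r for r in index.get(key, []) if match(left, r)]


right_window = [
    {"name": "Ann Lee", "department": "HR", "salary": 10},
    {"name": "Ann Lee", "department": "IT", "salary": 20},
    {"name": "Bob Ray", "department": "HR", "salary": 30},
]
left = {"firstName": "Ann", "lastName": "Lee"}
lhs = [lambda l: l["firstName"] + " " + l["lastName"]]
rhs = [lambda r: r["name"]]
in_hr = lambda l, r: r["department"] == "HR"

via_hash = hash_matches(left, build_index(right_window, rhs), lhs, in_hr)
via_scan = nested_loop_matches(
    left, right_window, lambda l, r: lhs[0](l) == rhs[0](r) and in_hr(l, r)
)
```

Both approaches return the same matches. The hash-based variant only inspects the tuples that share the probe key, which is why supplying the equality parameters avoids a scan of the whole window.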

    Windowing
    The Join operator supports the following window configurations for a given input port:

    • sliding, (count | delta | time)-based eviction, count-based trigger of 1 (, partitioned (, partitionEvictionSpec)? )?

    All window configurations have a count-based trigger of 1. This means that every time a tuple is received on a port, it is inserted into its window, which triggers the join processing. The newly inserted tuple is matched against the tuples resident in the window defined over the other input port. In case of matches, a result is output for each match and a window marker punctuation is output at the end.
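
The trigger-on-insert behavior can be modeled for the simplest case of an inner join with count-based sliding windows. The Python sketch below is illustrative only: `SlidingJoin` and `PUNCT` are hypothetical names, output tuples are formed by merging the two attribute sets (a simplification of output assignments), and a `deque` with `maxlen` stands in for count-based eviction, so a size of 0 models a window that retains nothing.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List, Union

Record = Dict[str, Any]
PUNCT = "<window-punctuation>"  # stand-in marker for a window punctuation


class SlidingJoin:
    """Inner join model: each arriving tuple triggers a match against the other window."""

    def __init__(self, left_size: int, right_size: int,
                 match: Callable[[Record, Record], bool]) -> None:
        self.windows: Dict[str, Deque[Record]] = {
            "left": deque(maxlen=left_size),
            "right": deque(maxlen=right_size),
        }
        self.match = match

    def submit(self, port: str, t: Record) -> List[Union[Record, str]]:
        other = "right" if port == "left" else "left"
        self.windows[port].append(t)  # oldest tuple is evicted when the window is full
        out: List[Union[Record, str]] = []
        for o in self.windows[other]:
            l, r = (t, o) if port == "left" else (o, t)
            if self.match(l, r):
                out.append({**l, **r})  # one result per match
        if out:
            out.append(PUNCT)  # punctuation only when at least one result was produced
        return out


join = SlidingJoin(1, 0, lambda l, r: l["ticker"] == r["ticker"])
first = join.submit("left", {"ticker": "IBM", "vwap": 10.0})
second = join.submit("right", {"ticker": "IBM", "ask": 9.0})
third = join.submit("right", {"ticker": "HP", "ask": 1.0})
fourth = join.submit("left", {"ticker": "HP", "vwap": 2.0})
```

Because the right window has size 0, right-hand tuples are matched once on arrival and never retained, so the later left-hand `HP` tuple finds nothing to join with.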


    For the partitioned variants, the window specification applies to individual sub-windows identified by the partitionBy parameter corresponding to the port. The left input port of the join cannot have a partitioned window defined unless a partitionByLHS parameter is specified. Similarly, the right input port of the join cannot have a partitioned window defined unless a partitionByRHS parameter is specified.

    Assignments
    The Join operator allows assignments to output attributes. The output tuple attributes whose assignments are not specified are automatically forwarded from the input ones. After the automatic forwarding, the Join operator expects all output tuple attributes to be completely assigned.

    Metrics
    The Join operator has the following metrics:

    • nCurrentPartitionsLHS: The number of partitions currently in the left hand side window for the Join operator.

    • nCurrentPartitionsRHS: The number of partitions currently in the right hand side window for the Join operator.

    composite Main {
      graph
        // All stream schemas in this example are assumed from attribute usage;
        // the stream type declarations were lost from the source text.
        stream<rstring firstName, rstring lastName> BeatL = Beacon() {}
        stream<rstring name, rstring department, uint64 salary> BeatR = Beacon() {}
        // join with a match condition
        stream<BeatL, BeatR> Join1 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            match : BeatR.name == BeatL.firstName + " " + BeatL.lastName &&
                    department == "HR";
          output
            Join1 : salary = salary * 2ul;
        }
        // equi-join with an additional match condition
        stream<BeatL, BeatR> Join2 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            match : department == "HR";
            equalityLHS : BeatL.firstName + " " + BeatL.lastName;
            equalityRHS : name;
          output
            Join2 : salary = salary * 2ul;
        }
        // equi-join with multiple equality expressions
        stream<BeatL, BeatR> Join3 = Join(BeatL; BeatR) {
          window
            BeatL : sliding, count(100);
            BeatR : sliding, time(10);
          param
            equalityLHS : BeatL.firstName + " " + BeatL.lastName, "HR";
            equalityRHS : name, department;
          output
            Join3 : salary = salary * 2ul;
        }
        // single-sided partitioned join with a 0 sized window on the right hand side
        // and a partitioned window of 1 on the left hand side
        stream<rstring ticker, decimal64 vwap> VWAP = Beacon() {}
        stream<rstring ticker, decimal64 askprice, decimal64 asksize> Quote = Beacon() {}
        stream<decimal64 bargainIndex>
          Bargain = Join(VWAP; Quote)
        {
          window
            VWAP : sliding, count(1), partitioned;
            Quote : sliding, count(0);
          param
            match : vwap > askprice*100.0d;
            partitionByLHS : VWAP.ticker;
            equalityLHS : VWAP.ticker;
            equalityRHS : Quote.ticker;
          output
            Bargain : bargainIndex = exp(vwap-askprice*100.0d)*asksize;
        }
        // a left outer join with single output
        stream<rstring message, uint32 kind> MsgLHS = Beacon() {}
        stream<rstring message, uint32 kind, uint64 tm> MsgRHS = Beacon() {}
        stream<rstring message1, rstring message2>
          Msgs1 = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : leftOuter;
            partitionByRHS : R.kind;
            defaultTupleRHS : { message = "N/A", kind = 0u, tm = 0ul};
            equalityLHS : L.message, L.kind;
            equalityRHS : R.message, R.kind;
          output
            Msgs1 : message1 = L.message, message2 = R.message;
        }
        // a right outer join with two outputs
        (stream<rstring message1, rstring message2> Msgs2;
         stream<MsgRHS> MsgsRHS2)
          = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : rightOuter;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs2 : message1 = L.message, message2 = R.message;
        }
        // an outer join with three outputs
        (stream<rstring message1, rstring message2> Msgs3;
         stream<MsgLHS> MsgsLHS3;
         stream<MsgRHS> MsgsRHS3)
          = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned;
          param
            algorithm : outer;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs3 : message1 = L.message, message2 = R.message;
        }
        // an outer join with a single output.
        // Discard unreferenced partitions after 60 seconds.
        stream<rstring message1, rstring message2>
          Msgs4 = Join(MsgLHS as L; MsgRHS as R)
        {
          window
            L : sliding, count(0);
            R : sliding, delta(tm, 10ul), partitioned, partitionAge(60.0);
          param
            algorithm : outer;
            partitionByRHS : R.kind;
            equalityLHS : L.message;
            equalityRHS : R.message;
          output
            Msgs4 : message1 = L.message, message2 = R.message;
        }
    }

    Aggregate

    Description
    The Aggregate operator is used to compute user-specified aggregations over tuples gathered in a window.


    Input Ports
    The Aggregate operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is WindowBound. The Aggregate operator will process window marker punctuations when configured with a punctuation based window.

    Output Ports
    The Aggregate operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The Aggregate operator will generate a window punctuation after each batch of aggregations it outputs.

    Parameters
    The Aggregate operator has the following parameters:

    groupBy
    This is an optional parameter that specifies one or more expressions to be used for dividing the tuples in a window into groups. When a window fires (a sliding window triggers or a tumbling window flushes), one tuple with the user-specified aggregations is computed for each group in the window and these tuples are output as a batch. A window marker punctuation is output after the tuples.

    partitionBy
    This is an optional parameter that is only valid for an Aggregate operator configured with a partitioned window (see below). It specifies one or more expressions to be used for partitioning the input tuples into sub-windows, where all window and parameter configurations apply to the sub-windows, independently.

    aggregateIncompleteWindows
    This optional parameter of type boolean is valid only for sliding windows. The default value is false. When set to true, aggregations will be done when a trigger occurs, even if the window has not filled up. If set to false, triggers before the window is full will be ignored.

    Windowing
    The Aggregate operator supports the following window configurations:

    tumbling, (count | delta | time | punctuation)-based eviction (, partitioned (, partitionEvictionSpec)? )?

    sliding, (count | delta | time)-based eviction, (count | delta | time)-based trigger (, partitioned (, partitionEvictionSpec)? )?

    For the tumbling variants, tuples are aggregated when the window gets full (and flushes). The tuples containing the aggregates are output at once, followed by a window marker punctuation. Note that more than one tuple can be output when the groupBy parameter is specified.

    For the sliding variants, tuples are aggregated when the window triggers. The tuples containing the aggregates are output at once, followed by a window marker punctuation. Note that more than one tuple can be output when the groupBy parameter is specified.

    The sliding windows for an Aggregate operator do not fire until the window is full for the first time unless aggregateIncompleteWindows is true. This rule does not apply to sliding windows with time-based trigger policies. Such windows are assumed to be full when they start out.
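The firing rule for sliding windows can be pictured with a small simulation. The following Python sketch is an illustrative model only, not the Streams runtime: it assumes a sliding window with count-based eviction and a count-based trigger, treats the window as full once it holds `size` tuples, and uses a hypothetical function name (`sliding_max`) and hypothetical parameter names.

```python
from collections import deque


def sliding_max(values, size, trigger, aggregate_incomplete=False):
    """Model of a `sliding, count(size), count(trigger)` window computing Max.

    Returns one aggregate per window firing. This is a simplified
    illustration: the window counts as full once it holds `size` tuples.
    """
    window = deque(maxlen=size)  # the oldest tuple is evicted automatically
    results = []
    for n, value in enumerate(values, start=1):
        window.append(value)
        if n % trigger != 0:
            continue  # the trigger policy is not satisfied yet
        if len(window) == size or aggregate_incomplete:
            results.append(max(window))
        # otherwise the trigger is ignored because the window is not full
    return results


data = [3, 1, 4, 1, 5, 9]
print(sliding_max(data, size=3, trigger=1))
print(sliding_max(data, size=3, trigger=1, aggregate_incomplete=True))
```

With the default setting, the first two triggers produce no output because the window holds fewer than three tuples; with `aggregate_incomplete=True`, every trigger produces an aggregate.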

    Both for tumbling and sliding windows, when a time-based window with no tuples in it fires, just a window marker punctuation is output. When a tumbling, punctuation-based window with no tuples in it receives a window marker punctuation, just a window marker punctuation is output.

    For the partitioned variants, the window specification and parameters apply to individual sub-windows identified by the partitionBy parameter, as if there were separate Aggregate operators for each partition.

    The final punctuation marker does not flush any of the pending windows.

    Assignments
    The Aggregate operator allows aggregated assignments to output attributes. An aggregated assignment has an aggregation function appearing on the right-hand side of the assignment. The following aggregation functions are supported:

    - int32 Count(): number of tuples in the group.
    - int32 CountGroups(): number of groups in a window.
    - int32 CountAll(): number of tuples in the window.
    - list CountByGroup(): list of group sizes (number of tuples in the group) in a window.
    - T Any(T v): expression value (v) computed for any tuple in the group (useful for expressions that depend on the groupBy expressions).
    - T First(T v): expression value (v) computed for the first (earliest) tuple in the group.
    - T Last(T v): expression value (v) computed for the last (latest) tuple in the group.
    - list Collect(T v): collection of expression values (v's) computed for the tuples in the group.
    - list CollectDistinct(T v): collection of unique expression values (v's) computed for the tuples in the group.
    - int32 CountDistinct(T v): number of distinct expression values (v's) computed for the tuples in the group.
    - list CountByDistinct(T v): collection of cardinalities for the distinct expression values (v's) computed for the tuples in the group, where the cardinality is the number of times the distinct value appears. The order of entries in a CountByDistinct result matches the order of entries in a corresponding CollectDistinct result.
    - T Average(T v): average of the expression values (v's) computed for the tuples in the group.
    - list Average(list v): list of per element averages of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
    - T Sum(T v): sum of the expression values (v's) computed for the tuples in the group.
    - T Sum(T v): same as above, but for strings (concatenation).
    - list Sum(list v): list of per element sums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
    - T Max(T v): maximum of the expression values (v's) computed for the tuples in the group.
    - list Max(list v): list of per element maximums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.
    - T Min(T v): minimum of the expression values (v's) computed for the tuples in the group.
    - list Min(list v): list of per element minimums of the expression list values (v's) computed for the tuples in the group. All lists must have the same size.

      Remember: The Min/Max aggregate functions do a column-wise min/max on the lists. For example,

      Min([1,2,1], [1,1,2]) == [1,1,1]

      which is a column-wise comparison, whereas the InfoSphere Streams Version 1.2 Min/Max aggregate functions return the smallest/largest list. For example,

      Min([1,2,1], [1,1,2]) == [1,1,2]

      which is a lexicographic comparison.

    - int32 MaxCount(T v): similar to Max, but returns the number of tuples for which the maximum value occurs, rather than the maximum value itself.
    - int32 MinCount(T v): similar to Min, but returns the number of tuples for which the minimum value occurs, rather than the minimum value itself.
    - K ArgMin(T v, K w): the argument expression value (w) corresponding to the minimum of the objective expression values (v's) computed for tuples in the group.
    - list CollectArgMin(T v, K w): similar to ArgMin, but returns a list in case of more than one argument minimizing the objective.
    - K ArgMax(T v, K w): the argument expression value (w) corresponding to the maximum of the objective expression values (v's) computed for tuples in the group.
    - list CollectArgMax(T v, K w): similar to ArgMax, but returns a list in case of more than one argument maximizing the objective.
    - T SampleStdDev(T v): sample standard deviation of the expression values (v's) computed for the tuples in the group.
    - T PopulationStdDev(T v): population standard deviation of the expression values (v's) computed for the tuples in the group.

    Output attributes that have no assignment are automatically forwarded from the corresponding input attributes using the Last aggregate.
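Some of the list-valued aggregates are easier to follow with concrete values. The Python sketch below is an illustration under stated assumptions, not the toolkit's implementation: the helper names are hypothetical, each "group" is a plain Python list, and the first-seen ordering used for distinct values is an assumption of the sketch (the text above only states that CountByDistinct and CollectDistinct use matching orders).

```python
def column_wise_min(lists):
    """Per element minimum, as described for Min on list values."""
    if len({len(x) for x in lists}) > 1:
        raise ValueError("all lists must have the same size")
    return [min(column) for column in zip(*lists)]


def collect_distinct(values):
    """Distinct values; first-seen order is an assumption of this sketch."""
    return list(dict.fromkeys(values))


def count_by_distinct(values):
    """Cardinalities in the same order as collect_distinct(values)."""
    return [values.count(d) for d in collect_distinct(values)]


def collect_arg_max(pairs):
    """pairs holds (objective v, argument w); returns every w with maximal v."""
    best = max(v for v, _ in pairs)
    return [w for v, w in pairs if v == best]


group = [[1, 2, 1], [1, 1, 2]]
print(column_wise_min(group))   # column-wise comparison
print(min(group))               # Python's built-in min compares lists lexicographically

states = ["NY", "ON", "NY", "GA"]
print(collect_distinct(states), count_by_distinct(states))

salaries = [(100, "Ann"), (250, "Bob"), (250, "Cy")]
print(collect_arg_max(salaries))
```

The two `min` calls reproduce the difference called out in the Remember note: the column-wise result is built from the smallest element in each position, while the lexicographic result is one of the input lists.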

    Metrics
    The Aggregate operator has the following metrics:

    - nCurrentPartitions: The number of partitions currently in the window for the Aggregate operator.

    composite Main {
      graph
        stream Beat = Beacon() {}

        // tumbling window with no group by
        stream
          Agg0 = Aggregate(Beat)
        {
          window
            Beat : tumbling, time(10.5);
          output
            Agg0 : maxSalary = Max(salary), ageOfMaxSalary = ArgMax(salary, age);
        }
        // tumbling window with group by
        stream
          Agg1 = Aggregate(Beat)
        {
          window
            Beat : tumbling, punct();
          param
            groupBy : country, city;
          output
            Agg1 : maxSalary = Max(salary);
        }
        // tumbling partitioned window with no group by
        stream
          Agg2 = Aggregate(Beat)
        {
          window
            Beat : tumbling, delta(id, 10lu), partitioned;
          param
            partitionBy : country, city;
          output
            Agg2 : maxSalary = Max(salary),
                   numPeopleWithMaxSalary = MaxCount(salary);
        }
        // tumbling partitioned window with group by
        stream
          Agg3 = Aggregate(Beat)
        {
          window
            Beat : tumbling, count(10), partitioned;
          param
            groupBy : city;
            partitionBy : country;
          output
            Agg3 : maxSalary = Max(salary),
                   peopleWithMaxSalary = CollectArgMax(salary, name);
        }
        // sliding window with no group by
        stream
          Agg4 = Aggregate(Beat)
        {
          window
            Beat : sliding, time(10.5), count(10);
          output
            Agg4 : maxSalary = Max(salary), ageOfMaxSalary = ArgMax(salary, age);
        }
        // sliding window with group by
        stream
          Agg5 = Aggregate(Beat)
        {
          window
            Beat : sliding, count(10), count(1);
          param
            groupBy : country, city;
          output
            Agg5 : maxSalary = Max(salary);
        }
        // sliding partitioned window with no group by
        stream
          Agg6 = Aggregate(Beat)
        {
          window
            Beat : sliding, delta(id, 10lu), count(10), partitioned;
          param
            partitionBy : country, city;
          output
            Agg6 : maxSalary = Max(salary),
                   numPeopeWithMaxSalary = MaxCount(salary);
        }
        // sliding partitioned window with group by
        stream
          Agg7 = Aggregate(Beat)
        {
          window
            Beat : sliding, count(10), time(1), partitioned;
          param
            groupBy : city;
            partitionBy : country;
          output
            Agg7 : maxSalary = Max(salary),
                   peopleWithMaxSalary = CollectArgMax(salary, name);
        }
    }


    Chapter 2. Adapter Operators

    FileSource

    Description
    The FileSource operator reads data from a file and produces tuples as a result.

    Input Ports
    The FileSource operator has one optional input port. If present, the input port schema must be a tuple with a single rstring attribute. Each tuple holds the name of a file to be read by the FileSource operator. While processing the tuple, the entire file is read and the resulting tuples are generated by the FileSource operator.

    Output Ports
    The FileSource operator is configurable with up to two output ports. The first output port is mutating and its punctuation mode is Generating. The FileSource operator will output a window marker punctuation when the file is read in full.

    The second output port is optional and its schema must be a tuple with two attributes: one of type rstring and one of type int32. This stream generates tuples with the file name and 0 as the attribute values when the end of the file being read is reached. If a file fails to open, the stream generates tuples with the file name and the system error code. This allows a downstream operator to know which files were processed, and which files could not be opened successfully.

    Parameters
    The FileSource operator has the following parameters:

    file

    This is an optional parameter that specifies the name of the source file. It must not be present if the FileSource operator has an input port, otherwise it must be present. It is of type rstring. It is valid for the file parameter to refer to a named pipe, unless the hotFile parameter is set to true. hotFile is implemented using seek, and seek is not valid on a named pipe.

    format

    This optional parameter specifies the format of the file. Valid values are txt, csv, bin, line, and block. The default format is csv. This parameter can only take a single value. The detailed descriptions of individual format options are as follows:

    - txt: This format expects the file to be structured as a series of lines, where each line is a tuple literal, free of any type suffixes. String literals must be in double quotes. The # character can be used to mark comment lines. An example is as follows:

      # tuple
      {name="John", age=40}
      {name="Mary", age=35}

    - csv: This format expects the file to be structured as a series of lines, where each line is a list of comma separated values. String literals that are used at the outermost level can appear without the double quotes, unless they have a ',' character or escaped characters, in which case double quotes are required. Both rstring and ustring values should appear as UTF-8 encoded strings. For fields missing in the csv formatted line (as in , ,), default constructed values will be used, unless the defaultTuple parameter is specified. The separator parameter may be used to change the default separator of ','. The '.' character is used as the decimal point for binary and decimal floating point data. The # character can be used to mark comment lines. An example is as follows:

      # tuple
      John, 40, [{city="New York City",state="NY"},{city="Atlanta",state="GA"}]
      "Mary, and co.", 35, [{city="Toronto",state="ON"},{city="White Plains",state="NY"}]

    - bin: This format expects the file to be structured as a series of tuples in binary, using network byte order. Tuple attributes are assumed to be serialized in sequence to form a tuple.

    - line: This format expects the file to be structured as a series of lines. It also expects the output stream schema to contain a single attribute of type rstring. Each line will be converted into a tuple, where the line text (excluding the end of line marker) becomes the rstring attribute in the output tuple. The end of line marker can be customized via the use of the eolMarker parameter.

    - block: This format expects the file to be structured as a series of binary blocks. It also expects the output stream schema to contain a single attribute of type blob. Each block will be converted into a tuple. The block size can be customized via the use of the blockSize parameter. The last block read from the file may be less than blockSize bytes.

    hasHeaderLine

    This optional attribute-free parameter of type boolean or uint32 is valid only if the format is csv. If true, then the first line in the file will be read and ignored. If false (the default), no lines will be skipped. If a uint32 expression is passed, that number of lines will be skipped. This allows column names to be present in the first several lines of the file.

    ignoreOpenErrors

    This optional parameter of type boolean specifies if the FileSource operator will continue executing if the input file cannot be opened. If the ignoreOpenErrors parameter is set to true and an input file cannot be opened, the FileSource operator logs an error and proceeds with the next input file. If not present, or the ignoreOpenErrors parameter is false, the FileSource operator will log an error and terminate. By default, the ignoreOpenErrors parameter is set to false.

    hasDelayField

    This optional parameter of type boolean is used to instruct the FileSource operator to expect an additional attribute which specifies a delay to be used to pace the generation of the output tuples. By default, it is false. This parameter can only be used with txt, csv, and bin formats. The type of the delay attribute must be float64 and it is assumed to be in seconds. The delay attribute must appear before the tuple. In the case of txt and csv formats the delay attribute is separated from the tuple via a single comma with optional spaces before and after it. For example, for txt format:

      # tuple
      1.50, {name="John", age=40}
      1.75, {name="Mary", age=35}

    And for csv format:

      # tuple
      1.50, John, 40
      1.75, Mary, 35
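The delay field layout for the txt and csv formats can be sketched as a small parser. This is a minimal Python illustration, not the FileSource implementation: the function names are hypothetical, only the leading delay value is interpreted, and the remainder of each line is returned unparsed.

```python
def split_delay(line):
    """Split 'delay, rest' into (delay in seconds as float, rest of the line)."""
    delay_text, _, rest = line.partition(",")
    return float(delay_text), rest.strip()


def read_with_delays(lines):
    """Yield (delay, payload) pairs, skipping blank lines and # comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        yield split_delay(line)


csv_lines = ["# tuple", "1.50, John, 40", "1.75, Mary, 35"]
for delay, payload in read_with_delays(csv_lines):
    # A real reader would pace its output here using the delay value.
    print(delay, payload)
```

Because only the first comma is treated as the boundary, the same helper handles the txt form, where the payload is a tuple literal that contains commas of its own.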

    defaultTuple

    This optional parameter can be specified to indicate the attribute values to be used in case of missing values in the source data. It is only valid for the csv format. It can take a single value of tuple type. This type must match the type of the output port tuples.

    parsing

    This optional parameter can be specified to customize the parsing behavior of the FileSource operator. There are three valid values, namely: strict, permissive, and fast. When strict is specified, incorrectly formatted tuples will result in a runtime error and termination of the operator. When permissive is specified, incorrectly formatted tuples will result in a runtime log entry being created, and the parser will make an effort to skip to the next tuple (formats txt and csv) and continue. If format is bin, the parser will close the current file, and start reading the next file (if FileSource has an input stream). permissive can only be used with txt, csv, and bin formats. When fast is specified, the input file is assumed to be formatted correctly, and no runtime checks will be performed. Incorrect input in fast mode causes undefined behavior. The default parsing mode is strict.

    compression

    This optional parameter is used to specify that the source file is compressed. There are three valid values, representing available compression algorithms. These are: zlib, gzip, and bzip2.

    encoding

    This optional rstring parameter can be used to specify the character set encoding used in the input file. The contents of the file will be converted to the UTF-8 character set from the given character set after any decompression and before extraction of the tuples is performed. An example of a valid character set encoding is ISO_8859-9. A list of available encodings can be retrieved using the iconv --list command. encoding is not valid with formats bin or block.

    hotFile

    This optional parameter of type boolean is used to specify if the input file is hot. As opposed to regular files, hot files are not closed when the end of the file is reached for the first time. Instead the file is continuously checked for more data. If the file size shrinks during these checks, the file offset is reset to the beginning of the file. The default value for the hotFile parameter is false. When set to true, a final marker is not sent upon reaching the end of the file, as hot files ignore that event. Instead a final marker will be sent upon shutdown, after a window marker punctuation is sent. Additionally, if the file offset is ever reset, a window marker punctuation is sent. The hotFile parameter may not be specified if the FileSource operator has an input port, or if deleteFile or moveFileToDirectory are specified.

    deleteFile

    This optional parameter of type boolean is used to specify that the file should be removed after processing of a file is finished. The deleteFile parameter cannot be specified if hotFile or moveFileToDirectory is specified.

    moveFileToDirectory

    This parameter of type rstring is used to specify that the file should be moved to the directory after processing of a file is finished. Any file in the moveFileToDirectory directory of the same name will be removed before the move is done. The moveFileToDirectory parameter cannot be specified if hotFile or deleteFile is specified.

    A .rename subdirectory may be created in the target directory if the target directory is on a different filesystem. This is used to ensure that the files appear atomically at the target directory.

    eolMarker

    This optional parameter is used to specify the end of line marker. It is of type rstring. It can only be used when the line format is specified. It defaults to "\n". Valid values include strings with one or two characters, such as "\r" and "\r\n".

    initDelay

    This optional float64 parameter is used to specify the number of seconds that the FileSource operator is to delay before starting to produce tuples. If the FileSource operator has an input stream, the delay will happen on receipt of the first tuple. During the delay, the operator is blocked, and any more input tuples will block as well.

    blockSize

    This parameter is used to specify the block size. It is of type uint32. It is mandatory when the block format is specified and cannot appear otherwise.

    separator

    This optional rstring parameter is used to specify an alternate separator character for csv format. It must be a single character string constant. separator may only be specified if the format is csv.

    ignoreExtraCSVValues

    This optional parameter of type boolean is only relevant with format : csv. If true, extra data on the current input line after the last attribute read will be skipped. If not present, or if ignoreExtraCSVValues has value false, extra data on a line in csv format will cause an error to be logged (parsing : permissive) or an exception raised (parsing : strict).


    Windowing
    The FileSource operator does not accept any window configurations.

    Assignments
    The FileSource operator does not allow assignments to output attributes.

    Metrics
    The FileSource operator has the following metrics:

    - nFilesOpened: The number of files opened by FileSource. Only interesting if the FileSource operator has an input port.
    - nInvalidTuples: The number of tuples that failed to read correctly in csv or txt format.

    Exceptions
    The FileSource operator will throw an exception and terminate in the following cases:

    - The input file cannot be opened for reading.
    - The moveFileToDirectory directory does not exist.
    - The moveFileToDirectory is not a directory.

    composite Main {
      graph
        // source operator with a relative file argument
        stream Beat = FileSource()
        {
          param
            file : "People.dat"; // looks for /data/People.dat
        }
        // source operator with a default tuple for missing arguments
        stream Beat1 = FileSource()
        {
          param
            file : "People.dat";
            defaultTuple : {name="foo", age=19u, salary=10000ul};
        }
        // source operator with an absolute file argument and hot file option
        stream Beat2 = FileSource()
        {
          param
            file : "/tmp/People.dat";
            hotFile : true;
        }
        // source operator with a csv format specifier,
        // hasDelayField option, and custom separator
        stream Beat3 = FileSource()
        {
          param
            file : "People.dat";
            format : csv;
            separator : "|";
            hasDelayField : true;
        }
        // source operator with a txt format specifier and compression
        stream Beat4 = FileSource()
        {
          param
            file : "People.dat";
            format : txt;
            compression : zlib;
        }
        // source operator with a csv format specifier and with strict parsing, waiting
        // 5 seconds before starting to process the file
        stream Beat5 = FileSource()
        {
          param
            file : "People.dat";
            format : csv;
            parsing : strict;
            initDelay : 5.0;
        }
        // source operator with a bin format specifier
        stream Beat6 = FileSource()
        {
          param
            file : "People.dat";
            format : bin;
        }
        // source operator with a line format specifier
        stream Beat7 = FileSource()
        {
          param
            file : "People.dat";
            format : line;
        }
        // source operator with a line format specifier, and an eolMarker option
        stream Beat8 = FileSource()
        {
          param
            file : "People.dat";
            format : line;
            eolMarker : "\r";
        }
        // source operator with a block format specifier
        stream Beat9 = FileSource()
        {
          param
            file : "People.dat";
            format : block;
            blockSize : 1024u;
        }

        stream Files = DirectoryScan() {
          param directory: "foo";
        }
        // source operator reading tuples of 2 int32s from files in directory foo
        // Delete the files after processing is done
        stream Beat10 = FileSource(Files)
        {
          param deleteFile : true;
        }
    }

    The following example uses the second output stream, and shows how to get the string form of the reason for failure:

    composite Main() {
      graph
        stream A = Beacon () {
          logic state : mutable int32 i = 0;
          param iterations : 4;
          output A : f = "file." + (rstring)i++;
        }

        (stream B; stream C) = FileSource (A) {
          param ignoreOpenErrors: true;
        }

        stream D = Functor (C) {
          output D : reason = strerror (e);
        }

        () as Nil = FileSink (D) {
          param file : "out";
        }
    }

    FileSink

    Description
    The FileSink operator writes tuples to a file.

    Input Ports
    The FileSink operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.

    Output Ports
    The FileSink operator is configurable with an optional output stream of type stream, which will have the file name that was just closed. If the file is moved, the destination filename will be generated as the output stream.

    Parameters
    The FileSink operator has the following parameters:

    file

    This is a mandatory parameter that specifies the name of the output file. See the corresponding parameter in the FileSource operator for details. Only the last component of the pathname will be created if it does not already exist. All directories in the pathname up to the last component must already exist. For example, in file : "/a/b/c", /a and /a/b must already exist and be directories. The file is created as an empty file, discarding any previous contents. The user id and the umask of the instance owner will be used. The tuples written to the file will be flushed to disk according to the flush and flushOnPunctuation parameters.

    append

    This optional boolean parameter is used to specify that the generated tuples will be appended to the output file. If false, or not specified, the output file will be truncated before the tuples are generated.

    format

    See the corresponding parameter in the FileSource operator for details.

    hasDelayField

    This optional parameter of type boolean is used to output an additional attribute per tuple, which specifies the inter-arrival delays between the input tuples. See the corresponding parameter in the FileSource operator for details.

    compression

    See the corresponding parameter in the FileSource operator.

    encoding

    This optional rstring parameter can be used to specify the character set encoding used in the output file. Data written to the output file will be converted from the UTF-8 character set to the given character set before any compression is performed. encoding is not valid with formats bin or block.

    eolMarker

    See the corresponding parameter in the FileSource operator.

    flush

    This optional parameter of type uint32 is used to flush the output file after a given number of tuples. By default no flushing on tuple numbers is performed.

    Note: If an application expects low volumes of data, use the flush parameter to ensure that the output file is written to disk.

    flushOnPunctuation

    This optional parameter of type boolean is used to flush the output file when punctuation is received. flushOnPunctuation defaults to true.

    writePunctuations

    This optional parameter of type boolean is used to write punctuations to the output file. It is false by default. writePunctuations can only be used with txt and csv formats.

    separator

    See the corresponding parameter in the FileSource operator.

    quoteStrings

    This optional parameter of type boolean is used to control the quoting of top-level rstrings. It is true by default. If true, rstrings in the tuple will be generated with a leading and trailing double quote ("), and control characters will be escaped. If false, rstrings in the tuple will be written as is. quoteStrings can only be used with the csv format.

    closeMode

    This is an optional parameter of type enum {punct, count, size, time, never}. The default value is never. For any other value, when the specified condition is satisfied, the current output file is closed and a new file is opened for writing. In such cases, the file parameter must contain one or more {id} fields to indicate the parts that will be updated with the file id. For example, in the file name "myfile{id}.dat", each {id} will be replaced by 0 for the first file, 1 for the next file that is opened and so on.

    tuplesPerFile

    This parameter is used to specify the maximum number of tuples that can be received for each output file. When the specified number of tuples are received, the current output file is closed and a new file is opened for writing. This parameter is of type uint64 or uint32 and must be specified if the closeMode parameter is set to count.

    timePerFile

    This parameter of type float64 is used to specify the approximate time, in seconds, after which the current output file is closed and a new file is opened. This parameter must be specified if the closeMode parameter is set to time.

    bytesPerFile

    This parameter is used to specify the approximate size of the output file, in bytes. When the file size exceeds the specified number of bytes, the current output file is closed and a new file is opened. This parameter is of type uint64 or uint32 and must be specified when the closeMode parameter is set to size.

    moveFileToDirectory

    This optional parameter of type rstring is used to specify that the file should be moved to the named directory after the file is closed. Any existing file with the same name is removed before moving the file to the moveFileToDirectory directory.

    A .rename subdirectory may be created in the target directory if the target directory is on a different filesystem. This is used to ensure that the files appear atomically at the target directory.
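The interaction between closeMode, the {id} field in the file parameter, and tuplesPerFile can be sketched as follows. This Python fragment is a simplified model with hypothetical helper names, not the FileSink implementation: it only computes which file each tuple would land in when closeMode is count, and it does not model whether an empty file is opened after the last close.

```python
def expand_file_name(pattern, file_id):
    """Replace every {id} field in the file name pattern with the file id."""
    if "{id}" not in pattern:
        raise ValueError("file must contain at least one {id} field")
    return pattern.replace("{id}", str(file_id))


def plan_files(pattern, n_tuples, tuples_per_file):
    """Return {file name: tuple count} for a closeMode of count."""
    plan = {}
    for i in range(n_tuples):
        # After tuples_per_file tuples the current file is closed and the
        # next file id is used.
        name = expand_file_name(pattern, i // tuples_per_file)
        plan[name] = plan.get(name, 0) + 1
    return plan


print(plan_files("myfile{id}.dat", n_tuples=25, tuples_per_file=10))
```

For 25 tuples and a limit of 10 tuples per file, the sketch distributes the tuples over myfile0.dat, myfile1.dat, and myfile2.dat.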

    Windowing
    The FileSink operator does not accept any window configurations.

    Assignments
    The FileSink operator does not allow assignments to output attributes.

    Exceptions
    The FileSink operator will throw an exception and terminate the operator in the following case:

    - The output file cannot be opened for writing.

    composite Main {
      graph
        stream Beat = Beacon() {}
        // sink operator with the hasDelayField option, and fields separated by ":"
        // rstrings will not be printed with double quotes
        () as Sink1 = FileSink(Beat)
        {
          param
            file : "/tmp/People.dat";
            format : csv;
            separator : ":";
            hasDelayField : true;
            quoteStrings: false;
        }
        // sink operator with a txt format specifier and compression
        () as Sink2 = FileSink(Beat)
        {
          param
            file : "People.dat";
            format : txt;
            compression : zlib;
        }
        // sink operator with a bin format specifier and flush option
        () as Sink3 = FileSink(Beat)
        {
          param
            file : "People.dat";
            format : bin;
            flush : 1u;
        }
        // sink operator with a writePunctuations option and no flushing on punctuation
        () as Sink4 = FileSink(Beat)
        {
          param
            file : "People.dat";
            writePunctuations : true;
            flushOnPunctuation: false;
        }
    }

    DirectoryScan

    DescriptionThe DirectoryScan operator watches a directory, and generates file nameson the output, one for each file that is found in the directory. The absolutepathname of the file is generated. The file name will only be generated thefirst time the file is seen during a directory scan until it is recreated. Thechange time (ctime) is used to detect if a file has been recreated. Outputclause and custom output functions can be used to specify additionalinformation about a file. All non-regular files found in the directory areignored during the scan.

    Note: The change time of the file is used to detect whether a file has been recreated, and a very large file may still be in the process of being written when the directory is scanned. In this case, the same file name may be generated multiple times if the time between scans is less than the time needed to write the file. To avoid this, write the file into a different directory on the same filesystem as the directory being scanned, and rename it into the target directory when it is complete (/bin/mv does this if both directories are on the same filesystem). If a regular expression pattern is used to match only certain files, creating the new file under a name that fails to match the pattern, and then renaming it, also works.
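The write-then-rename approach in the note can be sketched from the side of the program that produces the files. This is a minimal Python sketch, not part of the toolkit: the function name publish_file and the directory names are hypothetical, and it assumes the staging and watched directories are on the same filesystem so that the rename is a single atomic step.

```python
import os
import tempfile


def publish_file(staging_dir, target_dir, name, data):
    """Write data under staging_dir, then rename it into target_dir.

    Assumption: both directories are on the same filesystem, so the
    rename never exposes a partially written file to a scanner.
    """
    staging_path = os.path.join(staging_dir, name)
    final_path = os.path.join(target_dir, name)
    with open(staging_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before the rename
    os.rename(staging_path, final_path)
    return final_path


if __name__ == "__main__":
    # Hypothetical layout: both directories live under one temporary root.
    root = tempfile.mkdtemp()
    staging = os.path.join(root, "staging")
    watched = os.path.join(root, "work")
    os.makedirs(staging)
    os.makedirs(watched)
    print(publish_file(staging, watched, "People.dat", b"alice,30\n"))
```

A scanner that looks only at the watched directory sees the file appear once, already complete, which is the property the note relies on.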

    Before submitting the file name to the output stream, the DirectoryScan operator can optionally move processed files to a different directory using the moveToDirectory parameter. If the moveToDirectory parameter is specified, the file (or symbolic link) is moved to the moveToDirectory directory before the output tuple is generated.

    When moveToDirectory is specified, it is valid to have multiple DirectoryScan operators reading the same directory. The DirectoryScan operator ensures that each file is submitted by only one operator by creating a temporary .rename subdirectory in both the directory and moveToDirectory directories.

    Input Ports
    The DirectoryScan operator does not have any input ports.

    Output Ports
    The DirectoryScan operator is configurable with a single output port. The output port is non-mutating and its punctuation mode is Free. The output schema for the DirectoryScan operator is a tuple. The generated tuple is populated using the output clause. If there is no output clause, or an attribute in the tuple is not assigned using an output clause, then the attribute must be of type rstring.

    Parameters
    The DirectoryScan operator has the following parameters:

    directory

    This is a mandatory parameter that specifies the name of the directory to be scanned. It is of type rstring.

    moveToDirectory

    This optional parameter of type rstring specifies the name of the directory to which files should be moved before the output tuple is generated.

    pattern

    This optional parameter of type rstring is used to instruct the DirectoryScan operator to ignore file names that do not match the regular expression pattern.

    sortBy

    This optional parameter determines the order in which file names are generated during a single scan of the directory when there are multiple valid files at the same time. The valid values are date and name. If the sortBy parameter is not specified, the default sort order is set to date.

    order

    This optional parameter controls how the sortBy parameter sorts the files. The valid values are ascending and descending. If the order parameter is not specified, the default value is set to ascending.

    If sortBy is set to date, the file with the oldest change time (ctime) is generated first for ascending order. If sortBy is set to name, the file name that is lexically smallest is generated first for ascending order.

    sleepTime

    This optional parameter of type float64 specifies the minimum time between scans of the directory, in seconds. If not specified, the default is 5.0 seconds. If the time between the start of the last scan and the current time is less than sleepTime seconds, the DirectoryScan operator sleeps until sleepTime seconds have passed since the start of the last scan. If more than sleepTime seconds have already passed, the next scan begins immediately.

    initDelay

    This optional float64 parameter is used to specify the number of seconds that the DirectoryScan operator is to delay before starting to produce tuples.

    ignoreDotFiles

    This optional boolean parameter determines if the DirectoryScan operator ignores files with a leading period (.) in the directory. By default, the value is set to false and files with a leading period are processed.

    ignoreExistingFilesAtStartup

    This optional boolean parameter determines if the DirectoryScan operator ignores pre-existing files in the directory. By default, the value is set to false and all files are processed as usual. If set to true, any files present in the directory are marked as already processed, and not submitted. If initDelay is specified, this check is done before the DirectoryScan operator delays.
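The interaction of the pattern, ignoreDotFiles, sortBy, order, and sleepTime parameters described above can be modeled in a few lines. The following Python sketch is illustrative only and is not the operator's implementation: the function names, the (name, ctime) pair layout, and the use of search semantics for the regular expression are all assumptions made for the example.

```python
import re


def select_files(entries, pattern=None, ignore_dot_files=False,
                 sort_by="date", order="ascending"):
    """Return file names in the order a single scan would emit them.

    entries is a list of (name, ctime) pairs; this layout is an
    assumption of the sketch, not the operator's internal format.
    """
    if sort_by not in ("date", "name"):
        raise ValueError("sort_by must be 'date' or 'name'")
    if order not in ("ascending", "descending"):
        raise ValueError("order must be 'ascending' or 'descending'")
    regex = re.compile(pattern) if pattern is not None else None
    kept = []
    for name, ctime in entries:
        if ignore_dot_files and name.startswith("."):
            continue
        # Assumption: the pattern is applied with search semantics.
        if regex is not None and not regex.search(name):
            continue
        kept.append((name, ctime))
    if sort_by == "date":
        kept.sort(key=lambda e: e[1], reverse=(order == "descending"))
    else:
        kept.sort(key=lambda e: e[0], reverse=(order == "descending"))
    return [name for name, _ in kept]


def remaining_sleep(last_scan_start, now, sleep_time=5.0):
    """Seconds to wait before the next scan may start (never negative)."""
    return max(0.0, sleep_time - (now - last_scan_start))


if __name__ == "__main__":
    files = [("work_b", 30), ("work_a", 20), (".hidden", 10), ("other", 5)]
    print(select_files(files, pattern="^work.*", sort_by="name"))
    print(remaining_sleep(last_scan_start=100.0, now=102.0))
```

With the defaults (date, ascending), the file with the oldest change time comes first; a scan that started 2 seconds ago with sleepTime 5.0 leaves 3 seconds to wait.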

    Assignments
    The DirectoryScan operator supports the following custom output functions:

    • rstring FilePath(): The pathname to the file in the directory, relative to the input directory parameter.

    • rstring FileName(): The last component of the pathname.

    • rstring FullPath(): The absolute pathname to the file in the directory.

    • rstring DestinationFullPath(): The absolute pathname to the file in the destination directory.

    • rstring Directory(): The value of the directory parameter.

    • rstring DestinationDirectory(): The value of the moveToDirectory parameter, or the directory parameter if moveToDirectory is not specified.

    • rstring DestinationFilePath(): The pathname to the file in the destination directory.

    • uint64 Size(): The size of the file in bytes.

    • uint64 Atime(): The access time (atime) of the file in seconds since the epoch.

    • uint64 Ctime(): The change time (ctime) of the file in seconds since the epoch.

    • uint64 Mtime(): The modification time (mtime) of the file in seconds since the epoch.

    Note: The atime, ctime, and mtime fields are set from the original file in the source directory.
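For readers less familiar with the file attributes behind Size(), Atime(), Ctime(), and Mtime(), the following Python sketch reads the same kind of values with os.stat. It is an illustration only: the helper name file_facts is hypothetical, and it assumes a Unix-like system, where st_ctime is the change time rather than a creation time.

```python
import os
import tempfile


def file_facts(path):
    """Return size in bytes and timestamps in whole seconds since the epoch."""
    st = os.stat(path)
    return {
        "Size": st.st_size,
        "Atime": int(st.st_atime),   # last access
        "Ctime": int(st.st_ctime),   # last status change on Unix-like systems
        "Mtime": int(st.st_mtime),   # last content modification
    }


if __name__ == "__main__":
    # Hypothetical file created only for this demonstration.
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"hello")
    print(file_facts(path))
    os.remove(path)
```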

    Metrics
    The DirectoryScan operator has the following metrics:

    • nScans: The number of times the DirectoryScan operator has read the directory.

    Exceptions
    The DirectoryScan operator will throw an exception and terminate in the following cases:

    • The directory or moveToDirectory does not exist.

    • The directory or moveToDirectory is not a directory.

    • The pattern is not a valid regular expression.

    • The .rename directories cannot be created when moveToDirectory is specified.

    composite Main {
      graph
        // DirectoryScan operator with a relative directory argument
        stream Dir1 = DirectoryScan()
        {
          param
            directory : "People.dir";
            initDelay : 10.0;
        }
        // DirectoryScan operator with an absolute directory argument and a file name pattern
        stream Dir2 = DirectoryScan()
        {
          param
            directory : "/tmp/work";
            pattern : "^work.*";
        }
        // use a FileSource operator to process the file names
        stream Beat6 = FileSource(Dir2)
        {
          param // note: param file is not specified
            format : line;
            deleteFile : true; // delete the file when processing is finished
        }
        // Use DirectoryScan operator to move files to a different directory.
        // Move the scanned files to the /tmp/active directory. Generate a tuple containing
        // the original filename in /tmp/work (sourceFile), and the moved filename
        // in /tmp/active (movedFile).
        // Generate the size of the file (fileSize).
        stream Dir3 = DirectoryScan()
        {
          param
            directory : "/tmp/work";
            moveToDirectory : "/tmp/active";
          output Dir3 : sourceFile = FilePath(), movedFile = DestinationFilePath(),
                        fileSize = Size();
        }
    }

    TCPSource

    Description
    The TCPSource operator reads data from a TCP socket and creates tuples out of it. It can be configured as a TCP server (listens for a client connection) or as a TCP client (initiates a connection to a server). In both modes it handles a single connection at a time. It works with both IPv4 and IPv6 addresses.

    Input Ports
    The TCPSource operator does not have any input ports.

    Output Ports
    The TCPSource operator is configurable with a single output port. The output port is mutating and its punctuation mode is Generating. The TCPSource operator will output a window marker punctuation when a TCP connection terminates.

    Parameters
    The TCPSource operator has the following parameters:

    role

    This mandatory parameter specifies whether the TCPSource operator is server-based or client-based. It takes one of the following two values: server and client.

    address

    In the case of a client-based TCPSource operator, this parameter specifies the destination server address of the TCP connection. The address parameter must be specified when the role parameter is set to client and the name parameter is not specified. In all other cases, it cannot be specified. It takes a single value of type rstring. This value could be a host name or an IP address. address may not be used for a server-based TCPSource operator, as the address used is always on the current host.

    port

    In the case of a server-based TCPSource operator, this parameter specifies the port on which connections will be accepted. In the case of a client-based TCPSource operator, it specifies the destination server port. It takes a single value of type rstring or type uint32. This can be a well-known port alias, such as "http" or "ftp" (see footnote 1), or a plain port number, such as 45134u. It is an optional parameter for server-based TCPSource operators; when omitted, its default value is 0, which picks any available port. For client-based TCPSource operators, the port parameter must be specified when the name parameter is not specified, and it cannot be specified otherwise.

    name

    In the case of a server-based TCPSource operator, this parameter specifies the name to be used to register the address and port pair for the server with the name service that is part of the Streams runtime. This name can be used by a corresponding client-based TCPSink operator to connect to this operator by just specifying the name. These names are automatically prefixed by the application scope, thus applications with differing scopes cannot communicate through the same name. The application scope can be set through the use of config applicationScope on the main composite in the application. It is an error for a name with the same application scope to be defined multiple times within an instance. If multiple operators attempt to define the same name, the second and subsequent operators will keep trying periodically to register the name, with an error message for each failure. In the case of a client-based TCPSource, this parameter specifies the name to be used to look up the address and port pair for the destination server from the name service that is part of the Streams runtime. It is an optional parameter that takes a single value of type rstring. The streamtool getnsentry command can be used to query server-based TCPSource addresses. The Value field will contain host:port. When the name parameter is specified in client mode, the port and address parameters cannot be specified.

    parsing

    This optional parameter can be specified to customize the parsing behavior of the TCPSource operator. There are three valid values: strict, permissive, and fast. When strict is specified, incorrectly formatted tuples will result in a runtime error and termination of the operator. When permissive is specified, incorrectly formatted tuples will result in a runtime log entry, and the parser will make an effort to skip to the next tuple (formats txt and csv) and continue. If format is bin, the parser will close the current connection, and start reading the next connection (if the reconnectionPolicy permits). permissive can only be used with txt, csv, and bin formats. When fast is specified, the input is assumed to be formatted correctly, and no runtime checks will be performed. Incorrect input in fast mode causes undefined behavior. The default parsing mode is strict.

    1. As specified under /etc/services

    interface

    This optional rstring parameter specifies the network interface to use to register when the name parameter is specified. interface is only valid when role is server and when name is specified. Using interface with name will ensure that a matching operator with a role of client and the same name parameter will use the desired interface.

    receiveBufferSize

    This is an optional parameter that is used to override the default kernel receive buffer size. It is of type uint32.

    reconnectionPolicy

    This is an optional parameter that specifies the reconnection policy. In the case of a server-based TCPSource operator, this parameter specifies if additional connections are allowed once the initial connection terminates. In the case of a client-based TCPSource operator, this parameter specifies if additional connection attempts will be made once the initial connection to the server terminates. The valid values are: NoRetry, InfiniteRetry, and BoundedRetry. If not specified, it is set to InfiniteRetry. When set to NoRetry, the TCPSource operator produces a final marker punctuation right away, after the initial connection is terminated and a window marker punctuation is sent.

    reconnectionBound

    This parameter specifies the number of successive connections that will be attempted for a client-based TCPSource operator or accepted for a server-based TCPSource operator. It is an optional parameter of type uint32. It must appear when the reconnectionPolicy parameter is set to BoundedRetry and cannot appear otherwise.

    format

    See the corresponding parameter in the FileSource operator for details.

    defaultTuple

    See the corresponding parameter in the FileSource operator for details.

    hasDelayField

    See the corresponding parameter in the FileSource operator for details.

    compression

    See the corresponding parameter in the FileSource operator for details.

    encoding

    See the corresponding parameter in the FileSource operator for details.

    eolMarker

    See the corresponding parameter in the FileSource operator for details.

    blockSize

    See the corresponding parameter in the FileSource operator for details.

    initDelay

    See the corresponding parameter in the FileSource operator for details.

    separator

    See the corresponding parameter in the FileSource operator for details.

    ignoreExtraCSVValues

    See the corresponding parameter in the FileSource operator for details.
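The rules that tie role, address, port, name, interface, reconnectionPolicy, and reconnectionBound together are spread across several parameter descriptions. The following Python sketch gathers them into a single check. It is a reading aid under stated assumptions, not the compiler's actual validation: the function name check_tcp_params and the error wording are hypothetical.

```python
def check_tcp_params(role, address=None, port=None, name=None,
                     interface=None, reconnection_policy="InfiniteRetry",
                     reconnection_bound=None):
    """Return a list of rule violations; an empty list means the set is valid."""
    if role not in ("server", "client"):
        return ["role must be server or client"]
    errors = []
    if role == "server":
        # A server always listens on the current host; port defaults to 0.
        if address is not None:
            errors.append("address may not be used with role server")
    elif name is not None:
        if address is not None or port is not None:
            errors.append("address and port cannot be specified with name")
    else:
        if address is None:
            errors.append("address is required when name is not specified")
        if port is None:
            errors.append("port is required when name is not specified")
    if interface is not None and (role != "server" or name is None):
        errors.append("interface is only valid with role server and a name")
    if reconnection_policy not in ("NoRetry", "InfiniteRetry", "BoundedRetry"):
        errors.append("unknown reconnectionPolicy")
    bounded = reconnection_policy == "BoundedRetry"
    if bounded and reconnection_bound is None:
        errors.append("reconnectionBound is required with BoundedRetry")
    if not bounded and reconnection_bound is not None:
        errors.append("reconnectionBound is only allowed with BoundedRetry")
    return errors


if __name__ == "__main__":
    print(check_tcp_params("client", name="my_server"))
    print(check_tcp_params("client", address="mynode.mydomain"))
```

A client that specifies only a name passes the check, while a client that specifies an address without a port does not.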

    Assignments
    The TCPSource operator does not allow assignments to output attributes.

    Metrics
    The TCPSource operator has the following metrics:

    • nReconnections: The number of times the TCPSource operator lost connection and reconnected to the other end of the TCP socket.

    • nInvalidTuples: The number of tuples that failed to read correctly in csv or txt format.

    • nConnections: The number of currently active TCP/IP connections. The value is 0 if the TCPSource operator is waiting for a connection or a reconnection, or 1 if the operator is currently connected.
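The difference between the strict, permissive, and fast values of the parsing parameter, and how a counter such as nInvalidTuples could relate to them, can be illustrated with a toy parser. This Python sketch is an assumption-laden model and not the operator's parser: it splits comma-separated lines naively, the names ParseStats and parse_csv_lines are hypothetical, and counting skipped lines in permissive mode is an interpretation of the metric.

```python
class ParseStats:
    """Counters for one parsing run; n_invalid_tuples models nInvalidTuples."""

    def __init__(self):
        self.n_invalid_tuples = 0


def parse_csv_lines(lines, n_fields, mode="strict"):
    """Split each line on commas and apply the chosen parsing mode.

    strict: a malformed line raises an error and parsing stops.
    permissive: a malformed line is counted and skipped.
    fast: no checks are made; malformed lines pass through unexamined.
    """
    if mode not in ("strict", "permissive", "fast"):
        raise ValueError("mode must be strict, permissive, or fast")
    stats = ParseStats()
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(",")
        if mode != "fast" and len(fields) != n_fields:
            if mode == "strict":
                raise ValueError("incorrectly formatted tuple: %r" % line)
            stats.n_invalid_tuples += 1  # permissive: record and skip
            continue
        rows.append(fields)
    return rows, stats


if __name__ == "__main__":
    data = ["alice,30", "broken", "bob,41"]
    rows, stats = parse_csv_lines(data, 2, mode="permissive")
    print(rows, stats.n_invalid_tuples)
```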

    Exceptions
    The TCPSource operator will throw an exception and terminate in the following cases:

    • The host cannot be resolved.

    • The name cannot be located.

    • Unable to set SO_REUSEADDR on TCP socket.

    • Unable to bind to port.

    composite Main {
      graph
        // server source with an alias string as port
        stream Beat = TCPSource()
        {
          param
            role : server;
            port : "ftp";
        }
        // server source with a number as port
        stream Beat1 = TCPSource()
        {
          param
            role : server;
            port : 23145u;
        }
        // server source with a name, registering interface eth1
        stream Beat2 = TCPSource()
        {
          param
            role : server;
            name : "my_server";
            interface : "eth1";
        }
        // server source with a name and port
        stream Beat3 = TCPSource()
        {
          param
            role : server;
            port : 23145u;
            name : "my_server";
        }
        // server source with a port and infinite reconnection
        stream Beat4 = TCPSource()
        {
          param
            role : server;
            port : "ftp";
            reconnectionPolicy : InfiniteRetry;
        }
        // server source with a port and reconnection (5 times)
        stream Beat4r = TCPSource()
        {
          param
            role : server;
            port : "ftp";
            reconnectionPolicy : BoundedRetry;
            reconnectionBound : 5u;
        }
        // client source with an IP address and port
        stream Beat5 = TCPSource()
        {
          param
            role : client;
            address : "99.2.45.67";
            port : "ftp";
        }
        // client source with a host name as the address
        stream Beat6 = TCPSource()
        {
          param
            role : client;
            address : "mynode.mydomain";
            port : 23145u;
        }
        // client source with name
        stream Beat7 = TCPSource()
        {
          param
            role : client;
            name : "my_server";
        }
        // client source with reconnection
        stream Beat8 = TCPSource()
        {
          param
            role : client;
            address : "mynode.mydomain";
            port : "ftp";
            reconnectionPolicy : InfiniteRetry;
        }
        // client source with bounded reconnection (10 connections)
        // Wait 5 seconds before starting
        stream Beat9 = TCPSource()
        {
          param
            role : client;
            address : "mynode.mydomain";
            port : "ftp";
            reconnectionPolicy : BoundedRetry;
            reconnectionBound : 10u;
            initDelay : 5.0;
        }
    }

    TCPSink

    Description
    The TCPSink operator writes data to a TCP socket in the form of tuples. It can be configured as a TCP server (listens for a client connection) or as a TCP client (initiates a connection to a server). In both modes it handles a single connection at a time.

    Input Ports
    The TCPSink operator is configurable with a single input port. The input port is non-mutating and its punctuation mode is Oblivious.

    Output Ports
    The TCPSink operator does not have any output ports.

    Parameters
    The TCPSink operator has the following parameters:

    role

    See the corresponding parameter in the TCPSource operator.

    address

    See the corresponding parameter in the TCPSource operator.

    port

    See the corresponding parameter in the TCPSource operator.

    name

    In the case of a server-based TCPSink operator, this parameter specifies the name to be used to register the address and port pair for the server with the name service that is part of the Streams runtime. This name can be used by a corresponding client-based TCPSource operator to connect to this operator by just specifying the name, without the need for an address or port number. These names are automatically prefixed by the application scope, thus applications with differing scopes cannot communicate through the same name. The application scope can be set through the use of config applicationScope on the main composite in the application. It is an error for a name with the same application scope to be defined multiple times within an instance. If multiple operators attempt to define the same name, the second and subsequent operators will keep trying periodically to register the name, with an error message for each failure. In the case of a client-based TCPSink, this parameter specifies the name to be used to look up the address and port pair for the destination server from the name service that is part of the Streams runtime. It is an optional parameter that takes a single value of type rstring. When the name parameter is specified in client mode, the port and address parameters cannot be specified.

    interface

    This optional rstring parameter specifies the network interface to use to register when the name parameter is specified. interface is only valid when role is server and when name is specified. Using interface with name will ensure that a matching operator with a role of client and the same name parameter will use the desired interface.

    sendBufferSize

    This is an optional parameter that is used to override the default kernel send buffer size. It is of type uint32.

    reconnectionPolicy

    See the corresponding parameter in the TCPSource operator.

    reconnectionBound

    See the corresponding parameter in the TCPSource operator.

    format

    See the corresponding parameter in the FileSink operator.

    hasDelayField

    See the corresponding parameter in the FileSink operator.

    compression

    See the corresponding parameter in the FileSink operator.

    encoding

    See the corresponding parameter in the FileSink operator.

    eolMarker