    IBM InfoSphere Streams Version 2.0.0.4

    IBM Streams Processing Language Introductory Tutorial


    Note: Before using this information and the product it supports, read the general information under Notices.

    Edition Notice

    This document contains proprietary information of IBM. It is provided under a license agreement and is protected by copyright law. The information contained in this publication does not include any product warranties, and any statements provided in this manual should not be interpreted as such.

    You can order IBM publications online or through your local IBM representative.

    - To order publications online, go to the IBM Publications Center at www.ibm.com/e-business/linkweb/publications/servlet/pbi.wss

    - To find your local IBM representative, go to the IBM Directory of Worldwide Contacts at www.ibm.com/planetwide

    When you send information to IBM, you grant IBM a nonexclusive right to use or distribute the information in any way it believes appropriate without incurring any obligation to you.

    Copyright IBM Corporation 2011, 2012.

    US Government Users Restricted Rights: Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.


    Summary of changes

    This topic describes updates to this documentation for IBM InfoSphere Streams Version 2.0 (all releases).

    Updates for Version 2.0.0.4 (Version 2.0, Fix Pack 4)

    This guide was not updated for Version 2.0.0.4.

    Updates for Version 2.0.0.3 (Version 2.0, Fix Pack 3)

    This guide was not updated for Version 2.0.0.3.

    Updates for Version 2.0.0.2 (Version 2.0, Fix Pack 2)

    This guide was not updated for Version 2.0.0.2.

    Updates for Version 2.0.0.1 (Version 2.0, Fix Pack 1)

    This guide was not updated for Version 2.0.0.1.


    Abstract

    This document is an introductory tutorial to the IBM Streams Processing Language (SPL), the programming language for IBM InfoSphere Streams. If you are new to SPL and want to learn it, this is a good document to read first.


    Contents

    Summary of changes
    Abstract
    Chapter 1. Getting started
    Chapter 2. Stream processing
    Chapter 3. Types and functions
    Chapter 4. Composite operators
    Chapter 5. Primitive operators
    Chapter 6. Next steps
    Notices
    Index


    Chapter 1. Getting started

    The best way to learn a new programming language is to write programs in it. But to write programs, you need to know how to compile and run programs, because you will want to frequently test whether what you wrote does what you wanted. Therefore, we will start this tutorial with a very simple program, whose purpose is less to illustrate language features than to try out the compiler. Here is the code:

    composite HelloWorld {                              //1
      graph                                             //2
        stream<rstring message> Hi = Beacon() {         //3
          param iterations : 1u;                        //4
          output Hi : message = "Hello, world!";        //5
        }                                               //6
        () as Sink = Custom(Hi) {                       //7
          logic onTuple Hi : printStringLn(message);    //8
        }                                               //9
    }                                                   //10

    Let's defer the discussion of how this code works, and instead focus on getting it compiled. You will get the most out of this tutorial if you try out things as you go along. Therefore, the tutorial frequently has instructions for you like the following: make sure you are on a machine that has IBM InfoSphere Streams installed, and thus has the SPL compiler, sc, available. For more information about installing InfoSphere Streams, see the IBM InfoSphere Streams: Installation and Administration Guide. Create a directory called HelloWorld on that machine. Create a file in that directory called HelloWorld.spl, enter the above program text in it, and save it. Make sure you are in that directory, and run the compiler by entering sc -T -M HelloWorld.

    The -T flag creates a standalone executable, that is, a program that can run as a single process on a single machine, without requiring a running InfoSphere Streams instance. The -M HelloWorld command-line option specifies that the main composite is called HelloWorld. Each SPL program has one main composite operator. A composite operator is an operator that encapsulates a stream graph, and the stream graph of a main composite can be run as a program. If you ran the compiler as recommended, and there were no compiler errors, then it created the executable file ./output/bin/standalone. Run the executable file. It should print Hello, world! to the console.

    Now we will discuss how the code works. Line 1 declares a composite operator: composite HelloWorld { ... }. Line 2 starts a graph clause, which means that Lines 3-9 describe a stream graph. The graph consists of two operator invocations. Line 3 is the head of the first operator invocation: stream<rstring message> Hi = Beacon() invokes operator Beacon to produce a stream Hi whose tuples have one attribute rstring message. Line 7 is the head of the second operator invocation: () as Sink = Custom(Hi) invokes operator Custom, which reads from stream Hi. The () as Sink part indicates that this operator invocation produces no stream, written (), and has the name Sink. Here is a visual representation of this stream graph:

    Figure 1. Stream graph of the HelloWorld program (Beacon -> Hi -> Sink)


    The operator invocations are shown as circles, and the stream is shown as an arrow. The operator invocations are decorated at the bottom right with little scratch-paper icons that indicate internal state: both Beacon and Sink are stateful in this program. The Beacon operator produces data. In this invocation, Line 4, param iterations : 1u;, tells it to produce just one tuple; the u suffix on the number 1 makes it an unsigned integer, since it would not make sense to have a negative number of iterations. In SPL, users or library writers define operators (like Beacon) and their parameters (like iterations) using a common framework; they are not built into the language. Line 5, output Hi : message = "Hello, world!";, assigns the string "Hello, world!" to attribute message of output stream Hi. Moving on to the second operator invocation, the Custom operator provides a clean slate for custom user logic. Line 8, logic onTuple Hi : printStringLn(message);, specifies that upon arrival of a tuple on stream Hi, the program should print the string attribute message from the tuple, followed by a newline character \n.

    At this point, you have compiled and run a first SPL program, and you understand what it does. This program only illustrates a tiny fraction of SPL, but before we move on to more interesting examples, we will take a look at the compiled code. Besides the standalone executable, the compiler also generated several other artifacts. Recall that we started out from just one directory HelloWorld and with just one file HelloWorld.spl. If you look at the directory after compiling, you will find something like the following:

    /+ HelloWorld
        /+ HelloWorld.spl        # SPL source code
        /* toolkit.xml           # toolkit index
        /* data                  # directory for data read/written by the program
        /* output                # directory for artifacts generated by the compiler
            /* HelloWorld.adl    # ADL (application description language) file
            /* bin               # compiled binaries
                /* standalone    # the standalone executable from earlier
            /* src               # generated C++ source code
                /* operator      # source code for operator invocations
                /* pe            # source code for PEs (processing elements)
                /* standalone    # source code for the standalone file
                /* type          # source code for types

    In this listing, authored files (files written by hand) are annotated with /+ and generated files (files written automatically by the compiler) are annotated with /*. For now, we do not need to cover all the generated artifacts in detail, but you are encouraged to look at a few of them to get a feeling for what they look like.

    The purpose of this tutorial is to provide an introduction to SPL. To focus on the essentials, it intentionally omits details that you do not immediately need to know, but can look up at your leisure in the more complete and precise reference documentation. This section gave an example for using sc, the SPL compiler. For more information about using the SPL compiler, see the IBM Streams Processing Language Compiler Usage Reference. This section also used the toolkit operators Beacon and Custom, and the toolkit function printStringLn. For more information about library operators and functions, see the IBM Streams Processing Language Standard Toolkit Reference.


    Chapter 2. Stream processing

    As the name implies, SPL is a language for processing data streams. But the HelloWorld example from the previous section hardly qualifies as stream processing, since there was only a single stream with a single tuple in that program. This section introduces a more idiomatic example that processes streams of a priori unknown length, using a graph of operator invocations that have pipeline parallelism. The purpose of the program is to list a file, prefix each line with a line number, and write the result to another file. It accomplishes this with the following stream graph:

    A stream is a (possibly infinite) sequence of tuples; in the example, Lines and Numbered are streams. A tuple is a data item on a stream. In the example, the stream Lines transports one tuple for each line in the input file. An operator is a reusable stream transformer: each operator invocation transforms some input streams into some output streams. The place where a stream connects to an operator is called a port. Many operators have one input port and one output port (like Functor in the example), but operators can also have zero input ports (FileSource), zero output ports (FileSink), or multiple input or output ports (which we will see in later examples).

    But back to the line-numbering program. We will call it NumberedCat as an homage to the Unix cat utility that, given the right command-line options, performs the same task. Here is the code:

    composite NumberedCat {                                          //1
      graph                                                          //2
        stream<rstring contents> Lines = FileSource() {              //3
          param format : line;                                       //4
                file   : getSubmissionTimeValue("file");             //5
        }                                                            //6
        stream<rstring contents> Numbered = Functor(Lines) {         //7
          logic state : mutable int32 i = 0;                         //8
                onTuple Lines : i++;                                 //9
          output Numbered : contents = (rstring)i + " " + contents;  //10
        }                                                            //11
        () as Sink = FileSink(Numbered) {                            //12
          param file   : "result.txt";                               //13
                format : line;                                       //14
        }                                                            //15
    }                                                                //16

    Figure 2. Stream graph of the NumberedCat program (FileSource -> Lines -> Functor -> Numbered -> Sink)

    Like in the previous example, there is a composite operator definition with a graph clause that contains operator invocations. The invocation of FileSource in Lines 3-6 reads one line at a time (param format : line), from a file specified at submission time (param file : getSubmissionTimeValue("file")). In a little bit, we will see how to supply the file name at submission time. The invocation of Functor in Lines 7-11 maintains a state variable mutable int32 i = 0 which it increments each time a tuple arrives (onTuple Lines : i++). SPL variables are immutable by default, so without the mutable modifier, the compiler would have prevented us from incrementing i. The output clause output Numbered : contents = (rstring)i + " " + contents assigns the contents attribute of the output stream by casting the line number i to a string, (rstring)i, and concatenating it with the contents attribute of the input stream. As the example shows, an output clause has assignments where the left-hand side is an attribute of the output stream, whereas attribute names in the right-hand side belong to input streams. Finally, the invocation of FileSink on Lines 12-15 writes the results to a file named result.txt.
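To make the data flow concrete, here is a minimal Python sketch of what the three operator invocations do. It is only a model of the semantics described above, not SPL and not how InfoSphere Streams executes the program. The function names file_source, number_lines, and file_sink are hypothetical stand-ins for the FileSource, Functor, and FileSink invocations, and in-memory text buffers replace the real files.

```python
import io

def file_source(text_file):
    """Model of FileSource with format : line (one tuple per input line)."""
    for raw in text_file:
        yield {"contents": raw.rstrip("\n")}

def number_lines(lines):
    """Model of the stateful Functor; i plays the role of the mutable state."""
    i = 0
    for tup in lines:
        i += 1  # corresponds to onTuple Lines : i++
        # The right-hand side reads the input attribute, the result is the output attribute.
        yield {"contents": str(i) + " " + tup["contents"]}

def file_sink(numbered, out_file):
    """Model of FileSink with format : line (one output line per tuple)."""
    for tup in numbered:
        out_file.write(tup["contents"] + "\n")

source = io.StringIO("first\nsecond\nthird\n")
result = io.StringIO()
file_sink(number_lines(file_source(source)), result)
print(result.getvalue(), end="")
```

The generators are chained so that each tuple flows through all three stages before the next line is read, which loosely mirrors the pipeline shape of the stream graph.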

    You should try out the following. Create a directory called NumberedCat. Put the example program in a file NumberedCat/NumberedCat.spl. Compile it to a stand-alone executable with sc -T -M NumberedCat. Put the following text in a file NumberedCat/data/catFood.txt:

    The Unix utility "cat" is so called
    because it can con"cat"enate files.
    Our program behaves like "cat -n",
    listing one file and numbering lines.

    When we run the program, we need to supply the input file name as a submission-time value. The FileSource operator expects a file name that is relative to the NumberedCat/data directory. Therefore, we run the program with ./output/bin/standalone file="catFood.txt". Look at the NumberedCat/data directory. If everything went fine, then the program created a file called result.txt that contains the numbered lines of catFood.txt.

    So far, we have run all our programs in stand-alone mode. That is common during testing and debugging. But a major strength of InfoSphere Streams is that it can run programs on a cluster of workstations. To do this, we need to compile without the -T (--standalone-application) option, and then create an instance of the runtime into which we submit the job. Please try the following sequence of commands:

    sc -M NumberedCat                                                 # compile
    streamtool mkinstance --template developer                        # make a runtime instance
    streamtool startinstance                                          # start the runtime instance
    streamtool submitjob -P file=catFood.txt output/NumberedCat.adl   # submit the job
    streamtool lsjobs                                                 # list running jobs
    # wait until data/result.txt contains the numbered lines of data/catFood.txt
    streamtool canceljob 0                                            # cancel the job
    streamtool stopinstance                                           # stop the runtime instance
    streamtool rminstance                                             # remove the runtime instance

    If everything went well, this accomplished the same result as running the program stand-alone. If anything went wrong, consult your system administrator, or try to diagnose the problem yourself by using the streamtool getlog/viewlog commands. As mentioned before, the best way to learn a language is to write and run programs in it, so now is a good time to ensure that you have the right setup to do that. Note how the streamtool submitjob command accepts submission-time values with the -P option, and uses the .adl file (application description language) to figure out which operators to submit.

    This section illustrated the flavor of SPL as a streaming language, and gave you a taste for how to run programs on an instance of the IBM InfoSphere Streams distributed runtime. We saw three new standard toolkit operators: FileSource, Functor, and FileSink. For more information about standard toolkit operators, see the IBM Streams Processing Language Standard Toolkit Reference. To learn more about working with the distributed runtime, type streamtool man, which contains a plethora of information about commands like submitjob and family. To learn more about SPL, see the IBM Streams Processing Language Specification.


    Chapter 3. Types and functions

    One of the most important goals of programming languages is to enable reuse and improve readability. Languages support this by allowing you to define your own types and functions. User-defined types and functions foster reuse, because they can be defined once and used multiple times. User-defined types and functions foster readability, because they give a descriptive name to a concept and unclutter the code that uses it. To illustrate this, we will develop a simple streaming application that counts the lines and words in a file. The WordCount program consists of the following stream graph:

    The FileSource operator invocation reads a file, sending lines on the Data stream. The Functor operator invocation counts the lines and words for each individual line of Data, sending the statistics on the OneLine stream. Unlike the invocation of the Functor operator in Chapter 2, "Stream processing", this invocation of the Functor is stateless; it has no side effects or dependencies between tuples. Finally, the Counter operator invocation aggregates the statistics for all lines in the file, and prints them at the end. Before we look at the main composite operator, let's define some helpers. We will use a type LineStat for the statistics about a line; a function countWords(rstring line) to count the words in a line; and a function addM(mutable LineStat x, LineStat y) to add two LineStat values and store the result in x. Here is the definition of these helpers:

    type LineStat = tuple<int32 lines, int32 words>;    //1
    int32 countWords(rstring line) {                    //2
      return size(tokenize(line, " \t", false));        //3
    }                                                   //4
    void addM(mutable LineStat x, LineStat y) {         //5
      x.lines += y.lines;                               //6
      x.words += y.words;                               //7
    }                                                   //8

    You can put this code in a file called WordCount/Helpers.spl. Line 1 defines type LineStat to be a tuple with two attributes for counting lines and words. Lines 2-4 define function countWords by using the standard toolkit function tokenize to split the line on spaces and tabs (" \t"), and then using the standard toolkit function size to count the resulting fragments. Lines 5-8 define function addM. As mentioned previously, SPL variables are immutable by default, so we had to explicitly declare parameter x as mutable to enable the function to add values to its attributes. Having the mutable modifier in the signature of the function makes it clear to the user what kind of side effects the function might have, and the compiler can also use this information for optimization.
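For readers who want to experiment with the helper logic outside of SPL, the following Python sketch mirrors the three definitions. It is a model under stated assumptions, not the SPL code itself: the names LineStat, count_words, and add_m are Python stand-ins, and the sketch assumes that the third argument false of tokenize means that empty fragments between consecutive separators are discarded.

```python
import re
from dataclasses import dataclass

@dataclass
class LineStat:
    """Stand-in for the SPL tuple type with attributes lines and words."""
    lines: int = 0
    words: int = 0

def count_words(line: str) -> int:
    # Split on single spaces or tabs, then drop empty fragments
    # (assumed to match tokenize(line, " \t", false) followed by size).
    return len([tok for tok in re.split(r"[ \t]", line) if tok])

def add_m(x: LineStat, y: LineStat) -> None:
    # x is updated in place; in SPL this is what the mutable modifier permits.
    x.lines += y.lines
    x.words += y.words

total = LineStat()
for text in ["The Unix utility", "cat\tis so  called", ""]:
    add_m(total, LineStat(lines=1, words=count_words(text)))
print(total)
```

Python does not enforce immutability of parameters, so the distinction that SPL draws between x and y is only visible here as a convention.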

    Now we are ready to define the main composite operator. You can put the following code in a file called WordCount/WordCount.spl.

    composite WordCount {                                                 //1
      graph                                                               //2
        stream<rstring line> Data = FileSource() {                        //3
          param file   : getSubmissionTimeValue("file");                  //4
                format : line;                                            //5
        }                                                                 //6
        stream<LineStat> OneLine = Functor(Data) {                        //7
          output OneLine : lines = 1, words = countWords(line);           //8
        }                                                                 //9
        () as Counter = Custom(OneLine) {                                 //10
          logic state : mutable LineStat sum = { lines = 0, words = 0 };  //11
                onTuple OneLine : addM(sum, OneLine);                     //12
                onPunct OneLine : if (currentPunct() == Sys.FinalMarker)  //13
                                    println(sum);                         //14
        }                                                                 //15
    }                                                                     //16

    Figure 3. Stream graph of the WordCount program (FileSource -> Data -> Functor -> OneLine -> Counter)

    By this point in the tutorial, you should be able to read and understand much of this code. Note how type LineStat is used both in Line 7 as a schema for stream OneLine, and in Line 11 as a type for variable sum. Line 12 adds the statistics from the newest tuple in stream OneLine into the accumulator variable sum by using the helper function addM defined before. Lines 13-14 illustrate punctuation handling, which is a new feature that we have not seen before. A punctuation is a control signal that appears interleaved with the tuples on a stream. The logic onPunct OneLine clause gets triggered each time a punctuation arrives on stream OneLine. If the punctuation is Sys.FinalMarker, that indicates that the end of the stream has been reached. In our example, the FileSource operator sends a FinalMarker at the end of the file, and the Functor operator forwards it after sending statistics for the last line.
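The interplay of tuples and the final punctuation can be modeled with a short Python sketch. This is an illustration only: FINAL_MARKER is a hypothetical sentinel standing in for Sys.FinalMarker, the function names are made up, and words are counted with a plain whitespace split as a simplification.

```python
FINAL_MARKER = object()  # hypothetical stand-in for Sys.FinalMarker

def one_line_stream(lines):
    """Yield one statistics tuple per line, then the end-of-stream marker."""
    for line in lines:
        yield {"lines": 1, "words": len(line.split())}
    yield FINAL_MARKER

def counter(stream):
    """Accumulate statistics and hand back the total once the marker arrives."""
    total = {"lines": 0, "words": 0}
    for item in stream:
        if item is FINAL_MARKER:            # plays the role of onPunct
            return total
        total["lines"] += item["lines"]     # plays the role of onTuple
        total["words"] += item["words"]
    return None  # the stream ended without a final marker

summary = counter(one_line_stream(["a b c", "d e"]))
print(summary)
```

The point of the sketch is that the total is reported in reaction to the marker, not in reaction to any particular tuple.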

    Compile and run the program as a standalone application, as you learned in the previous sections. You will need to provide an input file in the data directory, and provide the file name as a submission-time value on the command line of the standalone application. The program should print the total statistics to the console.

    When you learn a new programming language and start writing programs in it, you are bound to encounter error messages. These can be baffling, because you thought your program was fine, yet the compiler objected to something in it. Therefore, a good exercise when learning a language is to make some intentional errors, and familiarize yourself with the error messages. That way, when you see the same errors again "by accident", you will already be somewhat familiar with them. So let's inject an error into the example program. Go to file WordCount/Helpers.spl, and remove the mutable modifier from the signature of function addM. In other words, Line 5 should read void addM(LineStat x, LineStat y). Recompile by doing sc -T -M WordCount. You should get something like the following:

    Helpers.spl:6:11: CDISP0378E ERROR: The operand modified by += must be mutable.
    Helpers.spl:7:11: CDISP0378E ERROR: The operand modified by += must be mutable.

    The compiler complains because the += operator tries to modify the parameter x, but it has not been declared as mutable.

    In this section, you saw how to define your own types and functions, which enables reuse and improves readability. For more information about defining your own types and functions, see the IBM Streams Processing Language Specification. Types and functions form a sub-language that you can easily learn without any other materials as prerequisites. On the contrary, they serve as the foundation for more advanced language features like the ones we will cover in the remaining sections of this tutorial.


    Chapter 4. Composite operators

    Just like user-defined types and functions, user-defined composite operators help code reuse and readability. A composite operator encapsulates a stream sub-graph, which can then be used in different contexts. Each of the examples so far had a main composite operator, which encapsulates the stream graph that forms the whole program. Main composite operators are self-contained in the sense that their stream graph has no output or input ports, and they have no mandatory parameters. In this section, we will instead look at a composite operator that has both ports and parameters. The operator reads a stream from input port In, removes duplicate consecutive lines, and then writes the result as a stream to output port Out. We call the operator Uniq as an homage to the uniq utility that performs the same task. Internally, the Uniq operator uses a Custom operator to implement its functionality. Here is a diagram of the stream graph:

    To make things more interesting, the Uniq operator has a parameter type $key, which is the type containing the subset of attributes of the input tuple that are used to determine uniqueness. If two consecutive tuples are identical for these attributes, the second one is dropped even if it differs in some other attributes. The following code implements operator Uniq:

    namespace my.util;                               //1
    public composite Uniq(output Out; input In) {    //2
      param                                          //3
        type $key;                                   //4
      graph                                          //5
        stream<In> Out = Custom(In) {                //6
          logic state : {                            //7
                  mutable boolean first = true;      //8
                  mutable $key prev;                 //9
                }                                    //10
                onTuple In : {                       //11
                  $key curr = ($key)In;              //12
                  if (first || prev != curr) {       //13
                    submit(In, Out);                 //14
                    first = false;                   //15
                    prev = curr;                     //16
                  }                                  //17
                }                                    //18
        }                                            //19
    }                                                //20

    Figure 4. Stream graph of the body of the Uniq operator (In -> Custom -> Out, enclosed by Uniq)

    Line 1, namespace my.util, specifies a namespace for the operator. That means that the operator's full name is really my.util::Uniq. You should put the above source code in a file Uniq/my.util/Uniq.spl. Line 2, public composite Uniq(output Out; input In), specifies that the operator is public, meaning it can be used from other namespaces; and that it has one output port Out and one input port In. Lines 3 and 4 declare the mandatory formal parameter $key, which is a type. Line 12, $key curr = ($key)In;, declares a local variable curr of type $key, and initializes it with the expression ($key)In, which takes the current tuple from input stream In and casts it to type $key, in other words, drops any attributes that are not relevant for the comparison with the previous tuple. We have to consider one special case: for the very first tuple, there is no previous tuple, so we always treat it as unique.
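The comparison logic of the Custom operator inside Uniq can be sketched in Python as follows. This is a model of the described behavior, not SPL: tuples are represented as dictionaries, the key type is represented as a tuple of attribute names, and the function name uniq and the sample attribute names are hypothetical.

```python
def uniq(tuples, key):
    """Yield each tuple whose key attributes differ from the previous tuple's."""
    first = True
    prev = None
    for tup in tuples:
        # Projection onto the key attributes, the counterpart of ($key)In.
        curr = {name: tup[name] for name in key}
        if first or prev != curr:
            yield tup  # counterpart of submit(In, Out): the full tuple is forwarded
            first = False
            prev = curr

rows = [
    {"id": 1, "word": "a"},
    {"id": 2, "word": "a"},
    {"id": 3, "word": "b"},
    {"id": 4, "word": "a"},
]
kept = list(uniq(rows, key=("word",)))
print([row["id"] for row in kept])
```

Note that the row with id 4 is kept even though the word "a" appeared earlier: only consecutive duplicates are dropped, as with the uniq utility.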

    Now that we have defined our own operator my.util::Uniq, we need to test it. To do that, we will generate a stream All of tuples that have some duplicates, and send them through the Uniq operator to get the stream Some of unique tuples. We will print both All and Some so we can inspect whether the operator actually worked as expected. The stream graph for the test driver is:

    Note that as far as the driver is concerned, Uniq is just an ordinary operator, whose invocation can serve as a vertex in a stream graph just like any of the other operators we have used before. Note also that a single stream from a single output port, like All in the example, can be used as the input to multiple operators; in this case, all tuples are duplicated, once for each recipient. The following code implements the test driver:

    use my.util::Uniq;                                                       //1
    composite Main {                                                         //2
      type                                                                   //3
        KeyType = tuple<int32 j>;                                            //4
      graph                                                                  //5
        stream<int32 i, int32 j> All = Beacon() {                            //6
          logic state : mutable int32 n = 0;                                 //7
          param iterations : 10u;                                            //8
          output All : i = ++n, j = n / 3;                                   //9
        }                                                                    //10
        stream<All> Some = Uniq(All) {                                       //11
          param key : KeyType;                                               //12
        }                                                                    //13
        () as PrintAll = Custom(All) {                                       //14
          logic onTuple All : printString("All " + (rstring)All + "\n");     //15
        }                                                                    //16
        () as PrintSome = Custom(Some) {                                     //17
          logic onTuple Some : printString("Some " + (rstring)Some + "\n");  //18
        }                                                                    //19
    }                                                                        //20

    Note how Lines 11-13 invoke our operator Uniq, passing an actual parameter param key : KeyType, which indicates that only attribute j is to be used in the uniqueness test. Put this code into a file Uniq/Main.spl, and run sc -T -M Main to compile it as a stand-alone application. Now run ./output/bin/standalone. You should see the following output:

    All {i=1,j=0}
    Some {i=1,j=0}
    All {i=2,j=0}
    All {i=3,j=1}
    Some {i=3,j=1}
    All {i=4,j=1}
    All {i=5,j=1}
    All {i=6,j=2}
    Some {i=6,j=2}
    All {i=7,j=2}
    All {i=8,j=2}
    All {i=9,j=3}
    Some {i=9,j=3}
    All {i=10,j=3}

    Figure 5. Stream graph of the test driver for the Uniq operator (Beacon -> All -> PrintAll; All -> Uniq -> Some -> PrintSome)

    If you look just at All lines, you see that the i attribute just counts up iterations from 1 to 10, while the j attribute is always i/3 rounded down to the nearest integer. Since we used type tuple<int32 j> as the uniqueness key, only the first tuple of each run with the same j is considered unique, and therefore Some lines show roughly every third tuple.
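The pattern in this output can be checked with a few lines of Python. This is only a check of the arithmetic, not SPL; the variable names are made up for the illustration, and // is Python's integer division, which rounds down and so matches the description above for these positive values.

```python
# Model of the Beacon invocation: i counts from 1 to 10 and j is i / 3 rounded down.
all_tuples = [{"i": n, "j": n // 3} for n in range(1, 11)]

# Keep a tuple when it is the first one or when its j differs from the previous j.
some_tuples = [
    tup
    for pos, tup in enumerate(all_tuples)
    if pos == 0 or tup["j"] != all_tuples[pos - 1]["j"]
]

print([tup["i"] for tup in some_tuples])  # [1, 3, 6, 9]
```

The first run is shorter than the others because i starts at 1, so j stays at 0 for only two tuples.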

    In this section, you have seen how to define your own composite operators to encapsulate useful reusable functionality. You have also seen how the Beacon operator from the standard toolkit can serve as a useful workload generator for testing. We recommend that you test your own operators with test drivers like the one shown in this example. Besides helping you to iron out bugs during development, drivers like these are also useful to keep around later for regression testing.

    SPL composite operators are more powerful than the example in this section illustrates. They can encapsulate not just a single operator, but a whole graph; they can have multiple output and input ports; and they can have more parameters, of different kinds besides types. For more information about composite operators, see the IBM Streams Processing Language Specification.


    Chapter 5. Primitive operators

    Recall that an operator is a reusable stream transformer, and a composite operator encapsulates a stream graph. If all operators were composite, we would have a chicken-and-egg problem; therefore, SPL also has primitive operators, which encapsulate code in a native language. This is usually a more traditional, von Neumann language such as Java or C++. In this section, we will develop a primitive operator RoundRobinSplit in C++, but IBM InfoSphere Streams also enables you to write primitive operators in Java. If you are not a C++ programmer, or if you anticipate that you will mostly use operators from the standard toolkit or other toolkits, you can skip this section. We will start this presentation from the familiar, by giving an example of invoking RoundRobinSplit from SPL code. We will implement the following stream graph:

    Graphs like these are called split-joins, and are a common cause of non-determinism in streaming applications, because data may be processed at different speeds along the different paths. However, some applications require deterministic behavior, which is also useful for testing purposes. Our new RoundRobinSplit operator, together with the Pair operator from the standard toolkit, provides a simple way to implement a deterministic split-join without giving up much of the performance advantage afforded by the parallelism in the middle portion of the stream graph. Specifically, RoundRobinSplit deterministically alternates between sending data to each of its output ports, and Pair deterministically alternates between receiving data from each of its input ports. Here is the code for this stream graph:

    use my.util::RoundRobinSplit;                                                    //1
    composite Main {                                                                 //2
      graph                                                                          //3
        stream<int32 count> Input = Beacon() {                                       //4
          logic state : mutable int32 n = 0;                                         //5
          param iterations : 10u;                                                    //6
          output Input : count = n++;                                                //7
        }                                                                            //8
        (stream<int32 count> A0; stream<int32 count> A1) = RoundRobinSplit(Input) {  //9
          param batch : 2u;                                                          //10
        }                                                                            //11
        stream<int32 count, int32 path> B0 = Functor(A0) {                           //12
          output B0 : path = 0;                                                      //13
        }                                                                            //14
        stream<int32 count, int32 path> B1 = Functor(A1) {                           //15
          output B1 : path = 1;                                                      //16
        }                                                                            //17
        stream<int32 count, int32 path> Output = Pair(B0; B1) {}                     //18
        () as Writer = FileSink(Output) {                                            //19
          param file  : "/dev/stdout";                                               //20
                flush : 1u;                                                          //21
        }                                                                            //22
    }                                                                                //23

    Figure 6. Stream graph of the test driver for the RoundRobinSplit operator (Beacon -> Input -> RRSplit; RRSplit -> A0 -> Functor -> B0 -> Pair; RRSplit -> A1 -> Functor -> B1 -> Pair; Pair -> Output -> Writer)

    Line 9 invokes operator RoundRobinSplit on the stream Input to produce two output streams, A0 and A1. The operator takes a parameter, param batch : 2u, which indicates that it switches to the next output port after every two tuples. Line 18 invokes operator Pair on the two input streams B0 and B1 to produce the stream Output. For now, put this code into a file RoundRobinSplit/Main.spl. However, don't try to compile it yet; we need to implement the operator RoundRobinSplit first.

    Create a directory RoundRobinSplit/my.util/RoundRobinSplit, and change into that directory. Now, run spl-make-operator --kind c++. That will generate several skeleton files for you, including an operator model RoundRobinSplit.xml and two code generation templates (.cgt files), one for a header file RoundRobinSplit_h.cgt and one for a C++ implementation file RoundRobinSplit_cpp.cgt. When you write more sophisticated primitive operators, you will often need to edit the XML operator model, but in this case, the operator is simple enough that you do not need to change the operator model at all. Open the header file code generation template RoundRobinSplit_h.cgt. You will see a class definition with several method declarations. Remove the methods other than the constructor and process(Tuple & tuple, uint32_t port). Add two instance fields, Mutex _mutex and uint32_t _count. You should end up with the following code in RoundRobinSplit_h.cgt:

        #pragma SPL_NON_GENERIC_OPERATOR_HEADER_PROLOGUE
        class MY_OPERATOR : public MY_BASE_OPERATOR {
        public:
          MY_OPERATOR();
          void process(Tuple & tuple, uint32_t port);
        private:
          Mutex _mutex;
          uint32_t _count;
        };
        #pragma SPL_NON_GENERIC_OPERATOR_HEADER_EPILOGUE

    Next, open the C++ implementation file code generation template RoundRobinSplit_cpp.cgt. Remove the methods other than the constructor and process(Tuple & tuple, uint32_t port). Implement these methods as shown in the following listing of RoundRobinSplit_cpp.cgt:

        #pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_PROLOGUE       //1
        MY_OPERATOR::MY_OPERATOR() : _count(0) {}                      //2
        void MY_OPERATOR::process(Tuple & tuple, uint32_t port) {      //3
          uint32_t const nOutputs = getNumberOfOutputPorts();          //4
          uint32_t const batchSize = getParameter("batch");            //5
          AutoPortMutex apm(_mutex, *this);                            //6
          uint32 outputPort = (_count / batchSize) % nOutputs;         //7
          _count = (_count + 1) % (batchSize * nOutputs);              //8
          assert(outputPort < nOutputs);                               //9
          submit(tuple, outputPort);                                   //10
        }                                                              //11
        #pragma SPL_NON_GENERIC_OPERATOR_IMPLEMENTATION_EPILOGUE       //12

    The constructor just initializes the _count instance variable to zero. The process method queries the runtime APIs for the number of output ports (Line 4) and the batch size parameter (Line 5); acquires the mutex to guard against concurrent manipulation of the _count instance variable (Line 6); determines the output port (Line 7); updates _count (Line 8); and submits the input tuple to the appropriate output port (Line 10). The mutex is necessary because without it, if there are two threads T1 and T2, then T1's invocation of process might be interrupted in the middle of Line 8, after reading the old value of _count but before writing the new value; then T2 might call process and update _count; and finally, T1 might resume and overwrite T2's update to _count.
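The port-selection arithmetic on Lines 7 and 8 can be tried outside the Streams runtime. The following is a minimal sketch in plain C++, not the operator itself. It assumes a hypothetical class RoundRobinSelector that stands in for the operator's state, and it uses std::mutex with std::lock_guard in place of the SPL Mutex and AutoPortMutex; neither RoundRobinSelector nor portSequence is part of the SPL runtime API.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical stand-in for the operator's port-selection state.
// It mirrors Lines 6 to 8 of the listing but is not SPL runtime code.
class RoundRobinSelector {
public:
    RoundRobinSelector(uint32_t batchSize, uint32_t nOutputs)
        : _batchSize(batchSize), _nOutputs(nOutputs), _count(0) {
        assert(batchSize > 0 && nOutputs > 0);
    }

    // Returns the output port for the next tuple and advances the counter.
    // Holding the lock makes the read-modify-write of _count one atomic
    // step, so two callers can never both read the same old value.
    uint32_t nextPort() {
        std::lock_guard<std::mutex> lock(_mutex);
        uint32_t const port = (_count / _batchSize) % _nOutputs;
        _count = (_count + 1) % (_batchSize * _nOutputs);
        return port;
    }

private:
    uint32_t const _batchSize;
    uint32_t const _nOutputs;
    std::mutex _mutex;
    uint32_t _count;
};

// Calls nextPort() `calls` times on a fresh selector and returns the
// selected ports in order.
std::vector<uint32_t> portSequence(uint32_t batchSize, uint32_t nOutputs,
                                   uint32_t calls) {
    RoundRobinSelector selector(batchSize, nOutputs);
    std::vector<uint32_t> ports;
    for (uint32_t i = 0; i < calls; ++i) {
        ports.push_back(selector.nextPort());
    }
    return ports;
}
```

With a batch size of 2 and two output ports, as in the test driver, portSequence(2, 2, 8) yields the ports 0, 0, 1, 1, 0, 0, 1, 1. Wrapping _count at batchSize * nOutputs keeps the counter small without changing the selected port.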

    Now, we are finally ready to compile the application. Change to the RoundRobinSplit directory and run the SPL compiler with sc -T -M Main. The SPL compiler will invoke the C++ compiler to compile the instance of the RoundRobinSplit operator: the sources are the files A0.cpp and A0.h in directory RoundRobinSplit/output/src/operator, and the object file is RoundRobinSplit/output/build/operator/A0.o. Run the application by changing to directory RoundRobinSplit and executing ./output/bin/standalone. You will get the following output:

        0,0
        2,1
        1,0
        3,1
        4,0
        6,1
        5,0
        7,1

    Each line shows the count and path attributes separated by a comma. Since the split uses a batch size of two but the join uses a batch size of one, the counts (left column) have a progression of 0, 2, 1, 3, 4, 6, 5, 7, whereas the paths (right column) just alternate between 0 and 1. This output is deterministically repeatable, independent of the processing speed of the two paths.
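The interleaving can be checked by hand with a small simulation. The following sketch, in plain C++ rather than SPL, models the split and the join with in-memory queues. It assumes that Pair emits one tuple from each input port, in port order, as soon as every port has a tuple waiting. The names CountPath and simulateSplitJoin are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <string>
#include <vector>

// Hypothetical tuple with the two attributes printed by the test driver.
struct CountPath {
    int32_t count;
    int32_t path;
};

// Simulates Beacon -> RoundRobinSplit -> two Functors -> Pair and returns
// the lines that the sink would print. Assumes two paths, as in Figure 6.
std::vector<std::string> simulateSplitJoin(int32_t iterations, uint32_t batch) {
    assert(iterations >= 0 && batch > 0);
    uint32_t const nPaths = 2;
    std::vector<std::deque<CountPath>> paths(nPaths);

    // Split: send `batch` consecutive tuples to one path, then move on.
    // Each path tags its tuples with its own index, like the two Functors.
    for (int32_t count = 0; count < iterations; ++count) {
        uint32_t const port = (static_cast<uint32_t>(count) / batch) % nPaths;
        paths[port].push_back({count, static_cast<int32_t>(port)});
    }

    // Join (assumed Pair behavior): while every path has a tuple waiting,
    // emit one tuple from each path in port order.
    std::vector<std::string> lines;
    while (!paths[0].empty() && !paths[1].empty()) {
        for (auto& queue : paths) {
            CountPath const t = queue.front();
            queue.pop_front();
            lines.push_back(std::to_string(t.count) + "," + std::to_string(t.path));
        }
    }
    return lines;
}
```

Calling simulateSplitJoin(10, 2) returns the eight lines 0,0 2,1 1,0 3,1 4,0 6,1 5,0 7,1. In this model the ten iterations produce only eight lines because the tuples with counts 8 and 9 both land on path 0 and never find a partner on path 1.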

    It is instructive to introduce an error in the C++ code to see what happens. If we change the call on Line 10 of RoundRobinSplit_cpp.cgt to submit(outputPort, tuple), the C++ compiler reports an error message with the correct file name and line number:

        my.util/RoundRobinSplit/RoundRobinSplit_cpp.cgt:10: error: no matching function for call to 'SPL::_Operator::A0::submit(SPL::uint32&, SPL::Tuple&)'
        note: candidates are: virtual void SPL::Operator::submit(SPL::Tuple&, uint32_t)
        note:                 virtual void SPL::Operator::submit(const SPL::Tuple&, uint32_t)
        note:                 void SPL::Operator::submit(const SPL::Punctuation&, uint32_t)

    This section barely scratched the surface of developing primitive operators in SPL. There is a rich API for generating specialized code for performance, and for compile-time error checking on things like the number and types of ports. For more information about developing primitive operators, see the IBM Streams Processing Language Toolkit Development Reference. You may also want to take a look at the IBM Streams Processing Language Operator Model Reference to learn about the XML file for the primitive operator.


    Chapter 6. Next steps

    In this tutorial, you have been introduced to many topics, but only briefly. Along the way, you saw several links to more detailed documentation. A good way to continue learning the language is to practice by writing programs of your own. Table 1 lists several suggestions for topics you may want to study further, and exercises you may want to do to cement your SPL skills.

    Table 1. Topics and exercises for further study

    | Topic | Documentation | Exercise |
    |---|---|---|
    | Expression language | IBM Streams Processing Language Specification. | Write a program to reverse the lines of a small file. |
    | Type system | IBM Streams Processing Language Specification. | Create a histogram of the length of lines in a file. |
    | Other operators/functions in the standard toolkit | IBM Streams Processing Language Standard Toolkit Reference. Documentation for SPL standard toolkit types and functions is located in the $STREAMS_INSTALL/doc/spl/standard-toolkit/builtin-functions-and-types directory. | Merge two sorted streams such that the output is also sorted. |
    | Windows | IBM Streams Processing Language Specification. | Sort a file five lines at a time. |
    | Configs | IBM Streams Processing Language Config Reference and IBM Streams Processing Language Specification. | Change the logLevel config and look at the log files to see what happens. |
    | C++ primitive operators | IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference. | Write a C++ primitive operator that extracts groups matched by subexpressions of a regexp. |
    | Java primitive operators | IBM Streams Processing Language Toolkit Development Reference and IBM Streams Processing Language Operator Model Reference. | Write a Java primitive operator that extracts groups matched by subexpressions of a regexp. |
    | Writing native functions | IBM Streams Processing Language Toolkit Development Reference. | Turn a map into a list of (key,value) tuples. |
    | Dynamic application composition | IBM Streams Processing Language Specification. | Run the SchemaSharing sample that ships with SPL. |
    | Streams debugger | IBM Streams Processing Language Streams Debugger Reference. | Run the NumberedCat program and interactively drop a tuple. |


    Notices

    This information was developed for products and services offered in the U.S.A. Information about non-IBM products is based on information available at the time of first publication of this document and is subject to change.

    IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service.

    IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not grant you any license to these patents. You can send license inquiries, in writing, to:

        IBM Director of Licensing
        IBM Corporation
        North Castle Drive
        Armonk, NY 10504-1785
        U.S.A.

    For license inquiries regarding double-byte character set (DBCS) information, contact the IBM Intellectual Property Department in your country or send inquiries, in writing, to:

        Intellectual Property Licensing
        Legal and Intellectual Property Law
        IBM Japan Ltd.
        1623-14, Shimotsuruma, Yamato-shi
        Kanagawa 242-8502 Japan

    The following paragraph does not apply to the United Kingdom or any other country/region where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions; therefore, this statement may not apply to you.

    This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.


    Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk.

    IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

    Licensees of this program who wish to have information about it for the purpose of enabling: (i) the exchange of information between independently created programs and other programs (including this one) and (ii) the mutual use of the information that has been exchanged, should contact:

        IBM Canada Limited
        Office of the Lab Director
        8200 Warden Avenue
        Markham, Ontario
        L6G 1C7
        CANADA

    Such information may be available, subject to appropriate terms and conditions, including, in some cases, payment of a fee.

    The licensed program described in this document and all licensed material available for it are provided by IBM under terms of the IBM Customer Agreement, IBM International Program License Agreement, or any equivalent agreement between us.

    Any performance data contained herein was determined in a controlled environment. Therefore, the results obtained in other operating environments may vary significantly. Some measurements may have been made on development-level systems, and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.

    Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements, or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility, or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

    All statements regarding IBM's future direction or intent are subject to change or withdrawal without notice, and represent goals and objectives only.

    This information may contain examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. All of these names are fictitious, and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental.

    COPYRIGHT LICENSE:

    This information contains sample application programs, in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM for the purposes of developing, using, marketing, or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. The sample programs are provided AS IS, without warranty of any kind. IBM shall not be liable for any damages arising out of your use of the sample programs.

    Each copy or any portion of these sample programs or any derivative work must include a copyright notice as follows:

    (your company name) (year). Portions of this code are derived from IBM Corp. Sample Programs. Copyright IBM Corp. _enter the year or years_. All rights reserved.

    Trademarks

    IBM, the IBM logo, ibm.com and InfoSphere are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. A current list of IBM trademarks is available on the Web at Copyright and trademark information at www.ibm.com/legal/copytrade.shtml.

    The following terms are trademarks or registered trademarks of other companies:

    - Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
    - Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
    - UNIX is a registered trademark of The Open Group in the United States and other countries.
    - Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.

    Other product and service names might be trademarks of IBM or other companies.


