Java High-Level Stream API
Stream API For Apex
June 2016
Apex Overview
• YARN is the resource manager
• HDFS is used for storing any persistent state
Current Development Model
Directed Acyclic Graph (DAG)
[Figure: DAG of operators connected by streams of tuples; the streams are labeled Filtered Stream, Enriched Stream, and Output Stream]
● A stream is a sequence of data tuples
● A typical operator takes one or more input streams, performs computations, and emits one or more output streams
● Each operator is your custom business logic in Java, or a built-in operator from our open source library
● An operator has many instances that run in parallel, and each instance is single-threaded
● A Directed Acyclic Graph (DAG) is made up of operators and streams
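To make the model above concrete, here is a minimal sketch in plain Java. It does not use the Apex API: `ToyOperator` and `runPipeline` are hypothetical names invented for illustration, and a single in-process loop stands in for the engine, which would run many operator instances in parallel.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class ToyDagSketch {
    /** Hypothetical stand-in for an operator: consumes one tuple, emits zero or more. */
    interface ToyOperator<I, O> {
        void process(I tuple, Consumer<O> emit);
    }

    /** Wires input -> filter -> enrich -> output and pushes every tuple through. */
    static List<String> runPipeline(List<String> input) {
        // Drops empty tuples; what it emits plays the role of the "filtered stream".
        ToyOperator<String, String> filter = (tuple, emit) -> {
            if (!tuple.isEmpty()) {
                emit.accept(tuple);
            }
        };
        // Appends the tuple length; what it emits plays the role of the "enriched stream".
        ToyOperator<String, String> enrich =
            (tuple, emit) -> emit.accept(tuple + ":" + tuple.length());

        List<String> output = new ArrayList<>();
        for (String tuple : input) {
            // Each lambda passed as "emit" acts as the stream connecting two operators.
            filter.process(tuple, filtered -> enrich.process(filtered, output::add));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(runPipeline(List.of("apex", "", "dag")));
    }
}
```

Each operator only sees one tuple at a time and knows nothing about its neighbors; the wiring between them is what the DAG describes.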
Current Application Example
@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data);
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}
Why we need high-level API
o Easier for beginners to start with
o Fluent API
o Smaller learning curve
o Transform methods in one place vs operator library
o Operator API provides flexibility while high-level API provides ease of use
Stream API
<<interface>> ApexStream<T>: map(..), filter(..), …, addOperator(...), with(prop, val), …, window(Opt...)
<<interface>> WindowedStream<T>: group(..), groupByKey(...), reduce(..), fold(..), join(..), count(..), …, window(Opt...)
Stream API (Application Example)
@ApplicationAnnotation(name = "WordCountStreamingApiDemo")
public class ApplicationWithStreamAPI implements StreamingApplication
{
  @Override
  public void populateDAG(DAG dag, Configuration configuration)
  {
    String localFolder = "./src/test/resources/data";
    ApexStream<String> stream = StreamFactory
        .fromFolder(localFolder)
        .flatMap(new Split())
        .window(new WindowOption.GlobalWindow(),
            new TriggerOption().withEarlyFiringsAtEvery(Duration.millis(1000)).accumulatingFiredPanes())
        .countByKey(new ConvertToKeyVal()).print();
    stream.populateDag(dag);
  }
}
How it works
o ApexStream<T> represents a bounded or unbounded data set of type T
o ApexStream<T> also holds a graph data structure of all operators and the connections between them, from the input to the current point
o Each transform method attaches one or more operators to the current graph data structure and returns a new ApexStream object
o The graph data structure is not translated to an Apex DAG until the populateDag or run method is called
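The deferred-graph idea described above can be sketched in plain Java. This is only an illustration under stated assumptions, not the Apex implementation: `MiniStream`, `from`, `plan` and the in-memory `run` are hypothetical stand-ins, and a list of recorded steps plays the role of the graph data structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

final class LazyStreamSketch {
    /** Hypothetical stand-in for ApexStream<T>: an immutable record of the steps so far. */
    static final class MiniStream<T> {
        private final List<?> source;
        private final List<String> names;                         // recorded operator names
        private final List<Function<Object, List<Object>>> steps; // recorded operator logic

        private MiniStream(List<?> source, List<String> names,
                           List<Function<Object, List<Object>>> steps) {
            this.source = source;
            this.names = names;
            this.steps = steps;
        }

        static <T> MiniStream<T> from(List<T> source) {
            return new MiniStream<T>(source, List.of("input"), new ArrayList<>());
        }

        @SuppressWarnings("unchecked")
        <R> MiniStream<R> map(Function<? super T, ? extends R> f) {
            return extend("map", x -> List.of((Object) f.apply((T) x)));
        }

        @SuppressWarnings("unchecked")
        MiniStream<T> filter(Predicate<? super T> p) {
            return extend("filter", x -> p.test((T) x) ? List.of(x) : List.of());
        }

        /** Copies the recorded graph, appends one step, and returns a new stream object. */
        private <R> MiniStream<R> extend(String name, Function<Object, List<Object>> step) {
            List<String> newNames = new ArrayList<>(names);
            newNames.add(name);
            List<Function<Object, List<Object>>> newSteps = new ArrayList<>(steps);
            newSteps.add(step);
            return new MiniStream<R>(source, newNames, newSteps);
        }

        /** The graph recorded so far; nothing has been executed yet. */
        List<String> plan() {
            return Collections.unmodifiableList(names);
        }

        /** Only here are the recorded steps actually applied to the data. */
        @SuppressWarnings("unchecked")
        List<T> run() {
            List<Object> current = new ArrayList<>(source);
            for (Function<Object, List<Object>> step : steps) {
                List<Object> next = new ArrayList<>();
                for (Object x : current) {
                    next.addAll(step.apply(x));
                }
                current = next;
            }
            return (List<T>) current;
        }
    }

    public static void main(String[] args) {
        MiniStream<String> stream = MiniStream.from(List.of("apex", "yarn", "hdfs"))
            .filter(w -> !w.equals("yarn"))
            .map(w -> w.toUpperCase());
        System.out.println(stream.plan()); // the recorded graph
        System.out.println(stream.run());  // the result of executing it
    }
}
```

Because every transform returns a new object and leaves the previous one untouched, a partially built stream can be reused as the starting point for several different pipelines.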
How it works (Cont'd)
Current Status
○ Method chain for readability
○ Stateless transforms (map, flatMap, filter)
○ Some inputs and outputs are available (file, console, Kafka)
○ Some interoperability (addOperator, getDag, set property/attributes, etc.)
○ Local mode and distributed mode
○ Anonymous function class support
○ Extensible
Current Status (Cont'd)
○ WindowedStream is in pull request along with the operators that support it
○ A few window transforms (count, reduce, etc.)
○ 3 window types (fixed window, sliding window, session window)
○ 3 trigger types (early trigger, late trigger, at watermark)
○ 3 accumulation modes (accumulate, discard, accumulation_retraction)
○ In-memory window state (checkpointed)
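The difference between the accumulate and discard modes listed above is easiest to see on a small example. The following is a minimal sketch in plain Java, not the WindowedStream implementation: `countByKey`, `Mode` and the count-based trigger (`fireEvery`) are assumptions made for illustration, with the count-based trigger standing in for a time-based early firing. The accumulation_retraction mode is not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AccumulationModeSketch {
    enum Mode { ACCUMULATING, DISCARDING }

    /**
     * Counts words in a single global window and fires a pane after every
     * `fireEvery` tuples. Tuples that arrive after the last firing are not
     * emitted, because this sketch has no final trigger.
     */
    static List<Map<String, Long>> countByKey(List<String> words, int fireEvery, Mode mode) {
        List<Map<String, Long>> panes = new ArrayList<>();
        Map<String, Long> state = new TreeMap<>();
        int sinceLastFiring = 0;
        for (String word : words) {
            state.merge(word, 1L, Long::sum);
            sinceLastFiring++;
            if (sinceLastFiring == fireEvery) {
                panes.add(new TreeMap<>(state)); // emit a snapshot of the window state
                if (mode == Mode.DISCARDING) {
                    state.clear();               // next pane starts from scratch
                }
                sinceLastFiring = 0;
            }
        }
        return panes;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "a");
        System.out.println(countByKey(words, 2, Mode.ACCUMULATING));
        System.out.println(countByKey(words, 2, Mode.DISCARDING));
    }
}
```

In accumulating mode each pane repeats everything counted so far, so a downstream consumer can simply overwrite its previous result; in discarding mode each pane carries only the delta since the last firing, so the consumer has to add the panes up itself.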
Roadmap
○ Persistent window state for windowed operators (large state)
○ Fully follow the Beam model (window, trigger, watermark)
○ Rich selection of windowed transforms (group, combine, join)
○ Support custom window assignor
○ Support custom trigger
○ More inputs/outputs (HBase, Cassandra, JDBC, etc.)
○ Better schema support
○ More language support (Java 8, Scala, etc.)
○ What the community asks for
Resources
○ Apache Apex website - http://apex.apache.org/
○ Subscribe - http://apex.apache.org/community.html
○ Download - http://apex.apache.org/downloads.html
○ Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
○ Facebook - https://www.facebook.com/ApacheApex/
○ Meetup - http://www.meetup.com/topics/apache-apex
○ SlideShare - http://www.slideshare.net/ApacheApex/presentations
○ More Examples - https://github.com/DataTorrent/examples
○ Pull requests - https://github.com/apache/apex-malhar/pull/319 https://github.com/apache/apex-malhar/pull/327
Demo & Code Example
○ Word Count
○ AutoComplete