apache flink - thedataqueen.files.wordpress.com · apache flink ana richters & ian safie. 4...
TRANSCRIPT
Overview
What is Flink?Open source platform for distributedstreaming and batch data processing
Origin◇ German for quick / nimble◇ 2011 Berlin Technical Univ.◇ Replace Hadoop MapReduce◇ 2014 Apache incubator program
Pros & Cons
Pros◇ FREE! & easy to install◇ One of the fastest (high throughput)◇ Streaming & batch modes◇ Real time analysis with fault tolerance◇ Low latency (allows human-undetectable delays in outputting)◇ Runs locally, cluster, or cloud◇ Compatible with popular platforms, like Hadoop & YARN
Cons◇ No data storage◇ Lots of competitors◇ Relatively new
Overview
Capital OneReal time monitoring of customer behavior
Other Companies◇ ResearchGate
■ network analysis & near duplicate detection◇ Okkam SRL
■ big data analysis & tax assessment◇ Bouygues Telecom
■ real time event processing & analytics◇ Leipzig University
■ distributed graph analytics & labs
Competitors
There are lots.
Here are some:◇ Hadoop MapReduce◇ Apache Tez◇ Apache Storm◇ Apache Spark◇ Apache Apex
Competitors
All are engines for processing big data
Vary based on:◇ Type of processing (streaming, batch)◇ Throughput (processing rate)◇ Latency (time for packet of data to go from one point to another)◇ Fault tolerance (ability to operate despite component failures)◇ Ease of setup◇ Compatibility / integration◇ “Nuts & bolts”
Competitors
Flink:◇ Very easy to set up
◇ High throughput (one of the fastest)
◇ Stream & batch processing
◇ Fault tolerance (via lightweight distributed snapshots)
◇ Low latency (good for trading, games, VOIP)
◇ Good documentation & actively maintained
◇ Can’t beat ‘em? Integrate with ‘em!■ Flink integrates with MapReduce, YARN, Storm & more
Conclusion
BOTTOM LINE:
One of the fastest tools for processing big data in real time,
but Lots of competition.
Review of Architecture
◇ Download & install (easy from terminal)■ UNIX-like environment (Linux, Mac OS X, etc)
◇ Runs on local machine, cluster, or cloud■ Can run on YARN clusters on top of data stored in Hadoop
◇ Start flink
◇ Write program to process data in Java or Scala API
◇ Specify input location■ Integrates with messaging queues like Kafka■ Data stays in memory once read
◇ Specify output location■ No data storage system, but integrates with HDFS
◇ Jobs executed in Flink’s own runtime engine
1. Download + Quickstart Instructions
2. Navigate to flink directory:cd flink-1.0.1
3. Start local session:bin/start-local.sh
4. Make sure it’s running
5. Run batch example:./bin/flink run ./examples/batch/WordCount.jar
6. Run streaming example:./bin/flink run ./examples/streaming/WordCount.jar --input file:///home/richters/hamlet.txt --output file:///home/richters/output.txt
7. Stop local session:./bin/stop-local.sh
Example