apache flink - thedataqueen.files.wordpress.com · apache flink ana richters & ian safie. 4...

21
Apache Flink Ana Richters & Ian Safie

Upload: others

Post on 28-May-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Apache FlinkAna Richters & Ian Safie

4Case Study

2Architecture

Agenda

1Overview

3Pros & Cons

5Competitors

Overview

Overview

What is Flink?Open source platform for distributedstreaming and batch data processing

Origin◇ German for quick / nimble◇ 2011 Berlin Technical Univ.◇ Replace Hadoop MapReduce◇ 2014 Apache incubator program

Architecture

Architecture

Big Data Ecosystem

Architecture

Flink’s Architecture

Pros & Cons

Pros & Cons

Pros◇ FREE! & easy to install◇ One of the fastest (high throughput)◇ Streaming & batch modes◇ Real time analysis with fault tolerance◇ Low latency (allows human-undetectable delays in outputting)◇ Runs locally, cluster, or cloud◇ Compatible with popular platforms, like Hadoop & YARN

Cons◇ No data storage◇ Lots of competitors◇ Relatively new

Case Study

Overview

Capital OneReal time monitoring of customer behavior

Other Companies◇ ResearchGate

■ network analysis & near duplicate detection◇ Okkam SRL

■ big data analysis & tax assessment◇ Bouygues Telecom

■ real time event processing & analytics◇ Leipzig University

■ distributed graph analytics & labs

Competitors

Competitors

There are lots.

Here are some:◇ Hadoop MapReduce◇ Apache Tez◇ Apache Storm◇ Apache Spark◇ Apache Apex

Competitors

All are engines for processing big data

Vary based on:◇ Type of processing (streaming, batch)◇ Throughput (processing rate)◇ Latency (time for packet of data to go from one point to another)◇ Fault tolerance (ability to operate despite component failures)◇ Ease of setup◇ Compatibility / integration◇ “Nuts & bolts”

Competitors

Flink:◇ Very easy to set up

◇ High throughput (one of the fastest)

◇ Stream & batch processing

◇ Fault tolerance (via lightweight distributed snapshots)

◇ Low latency (good for trading, games, VOIP)

◇ Good documentation & actively maintained

◇ Can’t beat ‘em? Integrate with ‘em!■ Flink integrates with MapReduce, YARN, Storm & more

Conclusion

Conclusion

BOTTOM LINE:

One of the fastest tools for processing big data in real time,

but Lots of competition.

Demonstration

Review of Architecture

◇ Download & install (easy from terminal)■ UNIX-like environment (Linux, Mac OS X, etc)

◇ Runs on local machine, cluster, or cloud■ Can run on YARN clusters on top of data stored in Hadoop

◇ Start flink

◇ Write program to process data in Java or Scala API

◇ Specify input location■ Integrates with messaging queues like Kafka■ Data stays in memory once read

◇ Specify output location■ No data storage system, but integrates with HDFS

◇ Jobs executed in Flink’s own runtime engine

1. Download + Quickstart Instructions

2. Navigate to flink directory:cd flink-1.0.1

3. Start local session:bin/start-local.sh

4. Make sure it’s running

5. Run batch example:./bin/flink run ./examples/batch/WordCount.jar

6. Run streaming example:./bin/flink run ./examples/streaming/WordCount.jar --input file:///home/richters/hamlet.txt --output file:///home/richters/output.txt

7. Stop local session:./bin/stop-local.sh

Example

Thanks!Any questions?