stream processing

STREAM PROCESSING @ASHIC HTTP://WWW.HEARTYSOFT.COM

Upload: ashic-mahtab

Post on 06-Apr-2017


TRANSCRIPT

STREAM PROCESSING

@ASHIC

HTTP://WWW.HEARTYSOFT.COM

BIG DATA

• What?

BIG DATA

• Hadoop
• Map-Reduce
• Spark

BIG DATA

• Optimisations
• Parquet, etc.

BIG DATA

• Problems?

STREAMING DATA

• What?

STREAMING DATA

• Cheaper?
• Timely results?
• Approximations?

EXAMPLES

• Statistical Summaries

Hold n (the count), the sum, and the sum of squares =>

Mean and Standard Deviation can be computed at any time, without storing the stream
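
The three running values above are enough to recover both statistics. A minimal sketch, assuming a population (not sample) standard deviation; the class name `RunningStats` and its method names are made up for illustration:

```python
import math


class RunningStats:
    """Streaming mean and standard deviation from n, sum, and sum of squares."""

    def __init__(self):
        self.n = 0           # number of items seen
        self.total = 0.0     # running sum
        self.total_sq = 0.0  # running sum of squares

    def add(self, x):
        self.n += 1
        self.total += x
        self.total_sq += x * x

    def mean(self):
        return self.total / self.n

    def std(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        # max(..., 0.0) guards against tiny negative values from rounding.
        m = self.mean()
        return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))


stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.add(x)
```

For values with a large magnitude and a small spread, the sum-of-squares form can lose precision; Welford's online algorithm is a common, more stable alternative.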

EXAMPLES

• Statistical Summaries
* Start with a value
* If item > value, add learning rate
* If item < value, subtract learning rate

=> Approximation of Median
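
The update rule above can be sketched as follows. The function name, the starting value, and the learning rate of 0.1 are illustrative assumptions, as is the seeded uniform test stream (true median close to 50):

```python
import random


def approximate_median(stream, start=0.0, learning_rate=1.0):
    """Nudge an estimate towards the median, one item at a time."""
    estimate = start
    for item in stream:
        if item > estimate:
            estimate += learning_rate
        elif item < estimate:
            estimate -= learning_rate
        # Items equal to the estimate leave it unchanged.
    return estimate


rng = random.Random(42)
data = [rng.uniform(0, 100) for _ in range(10_000)]
estimate = approximate_median(data, start=0.0, learning_rate=0.1)
```

A smaller learning rate gives a steadier estimate but takes longer to reach the median from a poor starting value; only one number is held in memory either way.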

EXAMPLES

• Taking Representative Samples

- From weblogs (i.e. IP-timestamp tuples), approximate the average percentage of users who have revisited.
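
One way to approach this, which the slides do not spell out, is to sample by user rather than by log line: hash each IP into a fixed number of buckets and keep every record for the users that land in the chosen buckets. The sketch below assumes that approach; `in_sample`, `revisit_fraction`, and the bucket counts are hypothetical names and values.

```python
import hashlib
from collections import Counter


def in_sample(ip, buckets=10, keep=1):
    """Keep a user if their IP hashes into one of the first `keep` buckets."""
    digest = hashlib.md5(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets < keep


def revisit_fraction(log, buckets=10, keep=1):
    """Fraction of sampled users with more than one visit, or None if no user was sampled."""
    visits = Counter(ip for ip, _ts in log if in_sample(ip, buckets, keep))
    if not visits:
        return None
    return sum(1 for count in visits.values() if count > 1) / len(visits)


log = [("1.1.1.1", 1), ("1.1.1.1", 2), ("2.2.2.2", 3)]
everyone = revisit_fraction(log, buckets=10, keep=10)  # keep == buckets samples every user
```

Sampling individual log lines instead would split a user's visits between kept and dropped records and bias the revisit rate downwards, which is why the decision is made per user.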

EXAMPLES

• Filtering Streams

Filter Out (or In) Things That May Not Be Needed

EXAMPLES

• Filtering Streams

Bloom Filter
• Hash based on criterion
• Matching hash means entry may be in there
• Non matching hash means it’s definitely not
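
A small sketch of the idea, assuming string items and several hash positions per item derived from SHA-256 with a seed prefix. The class name, the sizes, and the hashing scheme are illustrative choices, not taken from the talk:

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # True: the item may have been added. False: it definitely was not.
        return all(self.bits[pos] for pos in self._positions(item))


allowed = BloomFilter()
allowed.add("alice@example.com")
```

Here `allowed.might_contain("alice@example.com")` is always true, while a lookup for an address that was never added is usually false but can occasionally be a false positive. More bits per stored item lower that false-positive rate.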

EXAMPLES

How Many Distinct Things Did We Get?

EXAMPLES

• Approximate Distinct Elements

Flajolet-Martin Algorithm

• Hash each element (or its identifier) to a long, using many hash functions. Count the trailing zeroes of each hash. Let it be r.

• For each hash function, approximation for distinct elements = 2^R where R = max(r) over the stream.

• Combine the estimates in groups: take the average for each group, then take the median of the averages.
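
The steps above can be sketched as follows. The seeded SHA-256 hashes stand in for "many hash functions", and the 8 groups of 8 hashes are arbitrary choices; both are assumptions made for illustration.

```python
import hashlib
from statistics import mean, median


def trailing_zeros(x, width=64):
    """Number of trailing zero bits in x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def hash64(item, seed):
    """A 64-bit hash of item; each seed acts as a different hash function."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_estimate(stream, num_groups=8, group_size=8):
    num_hashes = num_groups * group_size
    max_r = [0] * num_hashes  # R = max(r), tracked per hash function
    for item in stream:
        for seed in range(num_hashes):
            r = trailing_zeros(hash64(item, seed))
            if r > max_r[seed]:
                max_r[seed] = r
    estimates = [2 ** r for r in max_r]
    # Average within each group, then take the median of the group averages.
    group_means = [
        mean(estimates[i:i + group_size])
        for i in range(0, num_hashes, group_size)
    ]
    return median(group_means)


estimate = fm_estimate(f"user-{i}" for i in range(1000))
```

Only one small integer per hash function is kept, so memory use does not grow with the stream. Repeated elements hash to the same values and therefore leave the estimate unchanged.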

EXAMPLES

• Clustering

• Bradley, Fayyad, Reina (BFR) approach.
• BDMO algorithm.

BACK TO…

USEFUL TECHNOLOGY

• Apache Kafka
• Apache Cassandra
• Apache Spark

KAFKA

• Scale out, clustered, durable message broker.
• Fault tolerant, replicated.
• Uses topics, which have partitions.
• Messages within partitions have guaranteed ordering.
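
To illustrate why per-partition ordering is useful, the sketch below models a topic as a set of lists and routes each message by a hash of its key, so all messages for one key stay in order relative to each other. This is a plain-Python model, not the Kafka client API; `zlib.crc32` is a stand-in for the partitioner's real hash, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    """Pick a partition from the message key (stand-in hash, not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(messages, num_partitions=3):
    """Append each (key, value) message to the partition chosen by its key."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


messages = [("user-1", "a"), ("user-2", "x"), ("user-1", "b"), ("user-1", "c")]
topic = produce(messages)
user1_values = [v for k, v in topic[partition_for("user-1", 3)] if k == "user-1"]
```

Messages for different keys may land in different partitions, and no ordering is promised between partitions.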

KAFKA

• Kafka Streams: Lightweight Kafka => [x] library
• Kafka Connect: Enables streaming large amounts of data reliably between Kafka and other systems
• Schema Registry: Well…registry for schemas

KAFKA - GOTCHAS

• Messages in a partition are ordered, message processing may not be.

• At least once… downstream idempotence required.

• Disk.
• Rebalances.
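
The at-least-once point can be illustrated with a downstream sink that ignores redelivered messages. This is a plain-Python sketch rather than Kafka client code; it assumes each message is identified by its (partition, offset) pair, and the class and field names are hypothetical.

```python
class IdempotentSink:
    """Applies each message once, even if it is delivered more than once."""

    def __init__(self):
        self.balances = {}
        self.seen = set()  # (partition, offset) pairs already applied

    def apply(self, partition, offset, account, amount):
        msg_id = (partition, offset)
        if msg_id in self.seen:
            return False  # duplicate delivery: skip
        self.seen.add(msg_id)
        self.balances[account] = self.balances.get(account, 0) + amount
        return True


sink = IdempotentSink()
deliveries = [
    (0, 0, "acc-1", 10),
    (0, 1, "acc-1", 5),
    (0, 1, "acc-1", 5),  # redelivered, e.g. after a rebalance
]
applied = [sink.apply(*d) for d in deliveries]
```

In a real system the record of applied messages and the state change would need to be stored together, so that a crash between the two cannot cause a message to be applied twice.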

CASSANDRA

• Partitioned row store.
• Fault tolerant, Masterless.
• Very fast writes, fast reads.
• Tunable consistency.
• Multi-datacentre aware.
• OLTP + OLAP (via Spark).

CASSANDRA - DATACENTRES

CASSANDRA – SCHEMA

• Collection Types
• User defined types
• Static Columns
• Materialised Views

CASSANDRA - CQL

CASSANDRA – DATA MODELLING

• NOT a relational database
• KNOW YOUR QUERIES
• Model for queries, not normalisation
• Consolidate to minimal number of tables that get the job done.
• Unbounded partition growth will bring down nodes, then quorum.
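
One common way to keep partitions bounded, not stated in the slides, is to add a time bucket to the partition key so that a single key cannot accumulate rows forever. The sketch below assumes a hypothetical table of sensor readings partitioned by (sensor id, day); the function name and the daily bucket size are illustrative.

```python
from datetime import datetime, timezone


def partition_key(sensor_id, timestamp):
    """Composite partition key: one partition per sensor per UTC day."""
    day = timestamp.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


reading_time = datetime(2017, 4, 6, 13, 30, tzinfo=timezone.utc)
key = partition_key("sensor-1", reading_time)
```

The cost is that a query spanning several days has to read several partitions, which is why the bucket size should follow from the queries the table has to serve.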

CASSANDRA + SPARK

SPARK

• General purpose data processing
• Ability to cache things in memory, and re-use across steps.

SPARK STREAMING

• Microbatches
• Similar API to non-streaming Spark
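
The micro-batch idea can be modelled without Spark: cut the stream into fixed time intervals and run the same batch function on each interval. The sketch below is plain Python, not the Spark Streaming API; the function names, the 2-second interval, and the sample events are assumptions for illustration.

```python
from collections import Counter, defaultdict


def word_count(lines):
    """An ordinary batch word count over a list of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def micro_batches(events, batch_seconds):
    """Group (timestamp, line) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for ts, line in events:
        batches[int(ts // batch_seconds)].append(line)
    for index in sorted(batches):
        yield index, batches[index]


events = [(0.5, "a b"), (1.2, "a"), (2.1, "b b")]
results = [(i, word_count(lines)) for i, lines in micro_batches(events, 2)]
```

The batch function does not know it is being fed a stream, which is the sense in which the streaming and non-streaming code can look alike.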

SPARK STREAMING WC

SPARK + KAFKA

Kafka Direct Stream

SPARK + CASSANDRA

* rdd.saveToCassandra
* sc.cassandraTable

KAFKA + CASSANDRA

* Cassandra Sink
* Cassandra Connect

STREAM PROCESSING

• Lots of open problems
• RISE Labs (Real-time, Intelligent, and Secure Execution)

THANK YOU

@ashic

http://github/Heartysoft/cassy-up