stream processing
EXAMPLES
• Statistical Summaries
* Start with a value
* If item > value, add learning rate
* If item < value, subtract learning rate
=> Approximation of Median
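A minimal sketch of the update rule above, in Python. The function name, the default learning rate, and the synthetic Gaussian data are assumptions for illustration, not part of the slides.

```python
import random


def streaming_median(stream, learning_rate=0.1, initial=0.0):
    """Approximate the median of a stream while keeping one running value."""
    estimate = initial
    for item in stream:
        if item > estimate:
            estimate += learning_rate   # item above the estimate: nudge up
        elif item < estimate:
            estimate -= learning_rate   # item below the estimate: nudge down
    return estimate


# Hypothetical data: 100,000 draws from a Gaussian with median 50.
rng = random.Random(42)
data = [rng.gauss(50.0, 10.0) for _ in range(100_000)]
approx = streaming_median(data, learning_rate=0.1)
```

The estimate settles where about half the items push it up and half push it down, which is the median. A smaller learning rate gives a steadier estimate but takes longer to get there.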
EXAMPLES
• Taking Representative Samples
- From weblogs (i.e. IP-timestamp tuples), approximate the average percentage of users who have revisited.
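One common way to make such a sample representative is to sample by user rather than by log line, so that every visit from a sampled user is kept. The sketch below assumes that approach. The hash-to-bucket scheme, the function names, and the bucket counts are illustrative choices, not taken from the slides.

```python
import hashlib
from collections import Counter


def in_sample(ip, buckets=10, keep=1):
    """Keep a user if the hash of their IP lands in the first `keep` buckets."""
    digest = hashlib.sha256(ip.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets < keep


def revisit_fraction(log, buckets=10, keep=1):
    """Estimate the fraction of users with more than one visit.

    `log` is an iterable of (ip, timestamp) tuples.
    Returns None if no user fell into the sample.
    """
    visits = Counter()
    for ip, _timestamp in log:
        if in_sample(ip, buckets, keep):
            visits[ip] += 1
    if not visits:
        return None
    revisited = sum(1 for count in visits.values() if count > 1)
    return revisited / len(visits)
```

Sampling individual log lines instead would split up a user's visits and make revisits look rarer than they are.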
EXAMPLES
• Filtering Streams
Bloom Filter
• Hash based on criterion
• Matching hash means entry may be in there
• Non-matching hash means it’s definitely not
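A toy Bloom filter showing the "maybe / definitely not" behaviour. This is a sketch: the bit-array size, the number of hashes, and the use of seeded SHA-256 as the hash family are assumptions, and the class and method names are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True: the item may be present (false positives are possible).
        # False: the item was definitely never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )
```

A stream filter would call `might_contain` on each incoming entry and only do the expensive exact lookup when it returns True.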
EXAMPLES
• Approximate Distinct Elements
Flajolet-Martin Algorithm
• Hash each element (or its identifier) to a long using many hash functions. For each hash, count the trailing zeroes; call this r.
• For each hash function, approximation for distinct elements = 2^R where R = max(r) over the stream
• Combine the estimates in groups of hash functions: take the average within each group, then take the median of the averages.
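The steps above can be sketched as follows. The seeded SHA-256 hash family, the group sizes, and all function names are assumptions for illustration. The estimate is coarse by design, so expect it to be off by a sizeable factor on small inputs.

```python
import hashlib
from statistics import mean, median


def trailing_zeros(n, width=64):
    """Number of trailing zero bits in n (width if n is 0)."""
    if n == 0:
        return width
    return (n & -n).bit_length() - 1


def hash64(item, seed):
    """One member of a family of 64-bit hashes, selected by seed."""
    digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def fm_estimate(stream, num_groups=8, group_size=4):
    num_hashes = num_groups * group_size
    max_r = [0] * num_hashes              # R per hash function
    for item in stream:
        for seed in range(num_hashes):
            r = trailing_zeros(hash64(item, seed))
            if r > max_r[seed]:
                max_r[seed] = r
    estimates = [2 ** r for r in max_r]   # 2^R per hash function
    group_means = [
        mean(estimates[i:i + group_size])
        for i in range(0, num_hashes, group_size)
    ]
    return median(group_means)            # median of the group averages
```

Only one small integer per hash function is stored, so memory use does not grow with the stream, and repeated elements never change the result.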
KAFKA
• Scale out, clustered, durable message broker.
• Fault tolerant, replicated.
• Uses topics, which have partitions.
• Messages within partitions have guaranteed ordering.
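A toy in-memory model of a topic with partitions, to show why per-partition ordering gives per-key ordering when messages are routed by key. This is not Kafka client code. `ToyTopic` and `partition_for` are hypothetical names, and `zlib.crc32` stands in for the hash function a real Kafka partitioner would use.

```python
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition (stand-in hash, not Kafka's own)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyTopic:
    def __init__(self, num_partitions=3):
        # Each partition is an append-only list; the list index is the offset.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1   # (partition, offset)
```

Messages with the same key always land in the same partition, in append order. Nothing is guaranteed about the relative order of messages in different partitions.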
KAFKA
• Kafka Streams: Lightweight Kafka => [x] library
• Kafka Connect: Enables streaming large amounts of data reliably between Kafka and other systems
• Schema Registry: Well…registry for schemas
KAFKA - GOTCHAS
• Messages in a partition are ordered; message processing may not be.
• At-least-once delivery… downstream idempotence required.
• Disk.
• Rebalances.
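A minimal sketch of what downstream idempotence can look like under at-least-once delivery: the sink remembers which message ids it has already applied and ignores redeliveries. The class name, the message shape, and the in-memory id set are assumptions. A real system would persist the ids along with the state.

```python
class IdempotentSink:
    """Applies each message at most once, even if it is delivered repeatedly."""

    def __init__(self):
        self.balances = {}
        self.seen_ids = set()

    def apply(self, message_id, account, amount):
        if message_id in self.seen_ids:
            return False                 # duplicate delivery: no effect
        self.seen_ids.add(message_id)
        self.balances[account] = self.balances.get(account, 0) + amount
        return True


sink = IdempotentSink()
# "m1" is delivered twice, as can happen after a retry or a rebalance.
for message in [("m1", "acct", 10), ("m2", "acct", 5), ("m1", "acct", 10)]:
    sink.apply(*message)
```

Without the id check, the redelivered "m1" would be counted twice.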
CASSANDRA
• Partitioned row store.
• Fault tolerant, masterless.
• Very fast writes, fast reads.
• Tunable consistency.
• Multi-datacentre aware.
• OLTP + OLAP (via Spark).
CASSANDRA – DATA MODELLING
• NOT a relational database
• KNOW YOUR QUERIES
• Model for queries, not normalisation
• Consolidate to the minimal number of tables that get the job done.
• Unbounded partition growth will bring down nodes, then quorum.
SPARK
• General purpose data processing
• Ability to cache things in memory, and reuse across steps.