Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...

Post on 11-Apr-2017






Spark After Dark: Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

Barcelona Spark Meetup

Oct 20th, 2015

Chris Fregly Principal Data Solutions Engineer

IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **

Who Am I?


Streaming Data Engineer Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Spark Technology Center

Meetup Organizer Advanced Apache Spark Meetup

Book Author Advanced (2016)

Advanced Apache Spark Meetup
Total Spark Experts: ~1350+ in 3 mos!
#4 most active Spark Meetup in the world!

Main Goals
• Dig deep into the Spark & extended-Spark codebase
• Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
• Surface and share the patterns and idioms of these well-designed, distributed, big data components

What is Spark?

• Spark Streaming: real-time
• Spark SQL: structured data
• MLlib: machine learning
• GraphX: graph
• BlinkDB: approx queries

Spark Deployments In Production


Tools of the Talk

• Redis
• Docker
• Cassandra
• MLlib, GraphX
• Parquet, JSON
• Apache Zeppelin
• Spark Streaming, Kafka
• Spark SQL, DataFrames
• Spark JDBC/ODBC Hive ThriftServer
• ElasticSearch, Logstash, Kibana (ELK)


SMACK Stack!


• Spark (Data Processing)
• Mesos (Cluster Manager)
• Akka (Actors)
• Cassandra (NoSQL)
• Kafka (Streaming)

Themes of this Talk

• Parallelism
• Performance
• Streaming
• Approximations
• Similarity Measures
• Recommendations



Goals of Spark After Dark
• Generate high-quality recommendations
• Demonstrate Spark high-level libraries
  Spark Streaming -> Kafka, Approximations
  Spark SQL -> DataFrames, Cassandra
  GraphX -> PageRank, Shortest Path
  MLlib -> Matrix Factorization, Word2Vec


Popular Dating Sites


Parallelism


My First Experience With Parallelism
Brady Bunch circa 1980
Season 5, Episode 18: “Two Petes in a Pod”


Parallel Algorithm: O(log n)


O(log n)

Non-Parallel Algorithm: O(n)
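A minimal sketch of the contrast above, in plain Python rather than Spark. A pairwise (tree) reduction needs about log2(n) rounds when every pair in a round can be combined at the same time, while a left-to-right fold needs n - 1 sequential steps. The name `tree_reduce` is hypothetical, the rounds are only simulated in one process, and the input is assumed to be non-empty.

```python
def tree_reduce(values, combine):
    """Combine values pairwise, level by level; return (result, rounds)."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # All pairs in one round are independent of each other, so with
        # enough workers a whole round costs a single time step.
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element carries over unchanged
        level = nxt
        rounds += 1
    return level[0], rounds


total, rounds = tree_reduce(range(1, 17), lambda a, b: a + b)
print(total, rounds)  # 136 4  (a sequential fold would take 15 steps)
```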



Spark is Parallel!


Performance


Spark Beats Hadoop @ 100 TB GraySort

• On-disk only
• 28,000 partitions
• No in-memory caching


Improved Shuffle and Network Layer
• “Sort-based shuffle”
• Minimize OS resources
• Switched to async Netty
• Keep CPUs hot
• Reuse byte buffers to minimize GC
• Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory
• More JVM bytecode generation, JIT optimize
• CPU-cache-aware data structs and algos
• Custom memory management: serializers, performance, new HashMap


DataFrames and Catalyst Optimizer



Please Use DataFrames!


JVM bytecode generation

Columnar Storage Format


Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
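A small sketch of the min-max idea in plain Python, not any real file format's reader. Each chunk keeps the minimum and maximum of its values, so a range scan can skip chunks whose bounds cannot overlap the query. The names `Chunk`, `make_chunks` and `scan_range` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    values: list
    lo: int  # minimum value stored in this chunk
    hi: int  # maximum value stored in this chunk


def make_chunks(sorted_values, size):
    """Split sorted data into fixed-size chunks with min/max statistics."""
    chunks = []
    for i in range(0, len(sorted_values), size):
        part = sorted_values[i:i + size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_range(chunks, lo, hi):
    """Return (matching values, number of chunks actually read)."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if chunk.hi < lo or chunk.lo > hi:
            continue  # statistics prove nothing in this chunk can match
        chunks_read += 1
        hits.extend(v for v in chunk.values if lo <= v <= hi)
    return hits, chunks_read


chunks = make_chunks(list(range(100)), 10)
hits, chunks_read = scan_range(chunks, 42, 47)
print(hits, chunks_read)  # [42, 43, 44, 45, 46, 47] 1
```

Because the data is sorted, the chunk ranges do not overlap, which is why only one of the ten chunks is read here.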

Parquet File Format
• Based on Google Dremel
• Implemented by Twitter and Cloudera
• Columnar storage format
• Optimized for fast columnar aggregations
• Tight compression
• Supports pushdowns
• Nested, self-describing, evolving schema

Types of Compression
• Run Length Encoding: Repeated data
• Dictionary Encoding: Fixed set of values
• Delta, Prefix Encoding: Sorted data

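The three encodings listed above can be sketched in a few lines of plain Python. These are concept illustrations only and do not reproduce any file format's actual on-disk layout; the function names are hypothetical.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run as (value, run length)."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store a small dictionary plus integer codes."""
    codes = {}
    encoded = [codes.setdefault(v, len(codes)) for v in values]
    return list(codes), encoded


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the gaps between neighbours."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


print(run_length_encode(["a", "a", "a", "b", "b", "a"]))
# [('a', 3), ('b', 2), ('a', 1)]
print(dictionary_encode(["ES", "FR", "ES", "ES", "DE"]))
# (['ES', 'FR', 'DE'], [0, 1, 0, 0, 2])
print(delta_encode([1000, 1003, 1004, 1010]))
# [1000, 3, 1, 6]
```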

Types of Query Optimizations
• Column, Partition Pruning
• Row, Predicate Pushdown

SELECT b FROM table WHERE a in [a2,a3]
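A toy illustration of what the query above allows a columnar reader to do, under the assumption that the table is stored column by column. Only columns `a` and `b` are touched (column pruning), and the filter on `a` is evaluated during the scan so non-matching rows are never materialized (predicate pushdown). The column values and the `select` helper are hypothetical.

```python
# Hypothetical column-oriented table: one list per column.
table = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
}


def select(table, project, filter_col, allowed):
    """Return (projected values, set of columns that were read)."""
    touched = {filter_col, project}  # column "c" is never read
    # The predicate runs against the filter column alone, at scan time.
    row_ids = [i for i, v in enumerate(table[filter_col]) if v in allowed]
    return [table[project][i] for i in row_ids], touched


result, touched = select(table, project="b", filter_col="a", allowed={"a2", "a3"})
print(result)           # ['b2', 'b3']
print(sorted(touched))  # ['a', 'b']
```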


Streaming


Direct Kafka Streaming: KafkaRDD
• No single Receiver, no Write Ahead Log (WAL)
• Workers pull from Kafka in parallel
• Each KafkaRDD partition stores relevant offsets
• Upon Worker Node failure, rebuild from offsets
• Optimizes happy path by avoiding the WAL
• At least once delivery guarantee
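The offset-based recovery described above can be sketched without Kafka or Spark. In this sketch a "log" is just a Python list per (topic, partition), and `OffsetRange` is a plain dataclass stand-in, not the real Spark class. The point is that a batch defined only by its offsets can be re-read identically after a failure, which is also why downstream output may see the same records twice (at least once).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def read_range(log, r):
    """Re-read exactly the records described by an offset range."""
    return log[(r.topic, r.partition)][r.from_offset:r.until_offset]


# Hypothetical topic contents: "liker>liked" events.
log = {("likes", 0): ["u1>u2", "u3>u2", "u1>u4", "u5>u2"]}
batch = OffsetRange("likes", 0, 1, 3)

first_attempt = read_range(log, batch)
retry_after_failure = read_range(log, batch)  # same offsets, same records
print(first_attempt == retry_after_failure)   # True
```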

Approximations


Count Min Sketch
• Approximate counters
• Better than HashMap
• Low, fixed memory
• Known error bounds
• Large num of counters
• From Twitter’s Algebird
• Streaming example in Spark codebase

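A compact Count-Min Sketch in plain Python, as a sketch of the idea rather than Algebird's implementation. The table size is fixed up front (`depth` rows by `width` counters), every item increments one counter per row, and the estimate is the minimum over its counters, so it can overcount on collisions but never undercount. The class name, defaults and SHA-256 based hashing are assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    def __init__(self, depth=4, width=256):
        self.depth, self.width = depth, width
        # Memory is depth * width counters no matter how many items arrive.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One independent-looking hash per row, derived by salting with the row.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum is the
        # tightest upper bound on the true count.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for liked_user in ["alice", "bob", "alice", "alice", "carol", "alice"]:
    sketch.add(liked_user)
print(sketch.estimate("alice") >= 4)  # True
```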

HyperLogLog
• Approximate cardinality (approx count distinct)
• From Twitter’s Algebird
• Low memory: 1.5KB @ 2% error, 10^9 elements
• Streaming example in Spark codebase
• RDD: countApproxDistinctByKey()

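A simplified HyperLogLog in plain Python, to make the idea concrete. It is a teaching sketch, not Algebird's or Spark's implementation. The first `p` bits of a 64-bit hash pick one of `m = 2^p` registers, each register remembers the longest run of leading zeros seen, and the harmonic mean of the registers yields the estimate. The standard error is roughly 1.04 / sqrt(m). The class name and the SHA-256 based hash are assumptions.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        idx = h >> (64 - self.p)                     # first p bits: register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)        # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


hll = HyperLogLog(p=10)
for i in range(20000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
print(round(hll.estimate()))  # close to 20000, using only 1024 registers
```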
Monte Carlo Simulations
• From Manhattan Project (A-bomb): simulate movement of neutrons
• Law of Large Numbers (LLN): average of results of many trials converges on expected value
• SparkPi example in Spark codebase
• Pi ~ (# red dots / # total dots * 4)
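The Pi estimate above can be reproduced in a few lines of single-process Python. This is a sketch of the idea, not the SparkPi source. Random points are thrown into the unit square, and the "red dots" are assumed to be the points that land inside the quarter circle of radius 1. The seed is fixed only to make runs repeatable.

```python
import math
import random


def estimate_pi(trials, seed=0):
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:  # point fell inside the quarter circle
            inside += 1
    # Quarter-circle area / square area = pi / 4
    return 4.0 * inside / trials


for n in (100, 10_000, 200_000):
    print(n, abs(estimate_pi(n) - math.pi))  # error tends to shrink as n grows
```

Each trial is independent, which is what makes the simulation easy to spread across workers and sum at the end.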


Recommendations


Interactive Demo!


Audience Participation Needed!


  Navigate to

  Click 3 actresses and 3 actors

-> You are here


Types of Recommendations
• Non-personalized
  Cold Start: no preference or behavior data for user, yet
• Personalized
  User-Item Similarity: items that others with similar prefs have liked
  Item-Item Similarity: items similar to your previously-liked items


Non-Personalized Recommendations


Summary Statistics and Aggregations
• Top Users by Like Count

“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame = Aggregations
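As a plain-Python stand-in for the aggregation, the sketch below counts likes received per user and returns the top n. In Spark this would be a group-by and count over a DataFrame of likes; the `likes` data and the `top_liked` helper here are hypothetical.

```python
from collections import Counter

# Hypothetical like events: (from_user, to_user)
likes = [
    ("u1", "u2"), ("u3", "u2"), ("u5", "u2"),
    ("u1", "u4"), ("u2", "u4"),
    ("u4", "u1"),
]


def top_liked(likes, n):
    """Users ranked by the number of likes they received."""
    return Counter(to_user for _, to_user in likes).most_common(n)


print(top_liked(likes, 2))  # [('u2', 3), ('u4', 2)]
```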


Graph Analytics
• Top Influencers by Like Graph

“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank
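A minimal power-iteration PageRank in plain Python, to show what "walking the like graph" computes. It is a sketch, not GraphX's implementation; the damping factor of 0.85, the iteration count and the edge list are assumptions chosen for the example.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """edges: iterable of (liker, liked). Returns {user: rank}, ranks sum to 1."""
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes  # dangling node: spread rank evenly
            share = damping * rank[n] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


# Hypothetical like graph: u2 receives the most likes (3) ...
edges = [("u1", "u2"), ("u3", "u2"), ("u5", "u2"),
         ("u1", "u4"), ("u2", "u4"), ("u4", "u1")]
ranks = pagerank(edges)
# ... but u4 ranks highest, because it is liked by the well-liked u2.
print(max(ranks, key=ranks.get))  # u4
```

The example is deliberately built so that the like-count winner and the PageRank winner differ, which is the practical difference between the two non-personalized approaches.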


Demo!
Spark SQL/DataFrames + GraphX/PageRank



Similarities


Types of Similarity
• Euclidean: linear measure
  Magnitude bias
• Cosine: angle measure
  Adjust for magnitude bias
• Jaccard: (intersection / union)
  Popularity bias
• Log Likelihood
  Adjust for popularity bias
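The first three measures are short enough to write out directly. The sketch below uses plain Python lists and sets with made-up inputs; log likelihood is left out because it needs co-occurrence counts over the whole dataset.

```python
import math


def euclidean(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    """Angle-based similarity; ignores vector magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Set overlap: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a, b = [1, 2, 3], [2, 4, 6]           # same direction, different magnitude
print(round(euclidean(a, b), 3))      # 3.742
print(round(cosine(a, b), 3))         # 1.0
print(jaccard({"x", "y", "z"}, {"y", "z", "w"}))  # 0.5
```

The first two outputs show the magnitude bias: the vectors point the same way, so cosine calls them identical while Euclidean distance does not.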

Columns: Ali, Matei, Reynold, Patrick, Andy (column positions of the 1s not preserved)
Kimberly: 1 1 1 1
Leslie: 1 1
Meredith: 1 1 1
Lisa: 1 1 1
Holden: 1 1 1 1 1


All-Pairs Similarity Comparison
• Compare everything to everything, aka “pair-wise similarity” or “similarity join”
• Naïve shuffle: O(m*n^2); m=rows, n=cols
• Minimize shuffle through approximations!
• Reduce m (rows): sampling and bucketing
• Reduce n (cols): remove most frequent value (i.e. 0), Principal Component Analysis


Reduce m: DIMSUM Sampling
• “Dimension Independent Matrix Square Using MR”
• Remove rows with low similarity probability
• MLlib: RowMatrix.columnSimilarities(…)
• Twitter: 40% efficiency gain over Cosine Similarity

Reduce m: LSH Bucketing
• “Locality Sensitive Hashing”
• Split m into b buckets
• Use similarity hash algorithm
• Requires pre-processing of data
• Compare bucket contents in parallel
• Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
• e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
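A quick arithmetic check of the example figures above, reading the slide's expression m*n/b*b^2 literally from left to right. Integer arithmetic is used so there is no floating-point rounding.

```python
m = n = 500_000   # rows, cols
b = 50            # buckets

naive = m * n ** 2             # O(m*n^2)
bucketed = m * n // b * b ** 2  # m*n/b*b^2, evaluated left to right

print(f"{naive:.3e}")     # 1.250e+17
print(f"{bucketed:.3e}")  # 1.250e+13
print(naive // bucketed)  # 10000
```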

Reduce n: Remove Most Frequent Value
• Eliminate most-frequent value
• Represent other values with (index,value) pairs
• Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
• Note: Choose most frequent value (may not be 0)
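A small sketch of the (index, value) representation, in plain Python with hypothetical helper names. The most frequent value in a row becomes an implicit default, and only the exceptions are stored. As the note above says, that default is not always 0.

```python
from collections import Counter


def to_sparse(row):
    """Return (default value, row length, [(index, value), ...])."""
    default = Counter(row).most_common(1)[0][0]
    pairs = [(i, v) for i, v in enumerate(row) if v != default]
    return default, len(row), pairs


def to_dense(default, length, pairs):
    """Inverse of to_sparse."""
    row = [default] * length
    for i, v in pairs:
        row[i] = v
    return row


print(to_sparse([0, 0, 5, 0, 0, 0, 3, 0]))  # (0, 8, [(2, 5), (6, 3)])
print(to_sparse([7, 7, 7, 0, 7]))           # (7, 5, [(3, 0)])
```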




Personalized Recommendations


Recommendation Terminology
• User: user seeking recommendations
• Item: item that has been liked or rated
• Feedback
  Explicit: like, rating
  Implicit: search, click, hover, view, scroll
• Feature Engineering
  Dimension reduction


Collaborative Filtering Personalized Recs
• Like behavior of similar users

“I like the same people that you like. What other people did you like that I haven’t seen?”

MLlib: Matrix Factorization, User-Item Similarity
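The quote above can be turned into a tiny neighbourhood-based recommender. Note that this is a simpler stand-in for illustration, not the matrix factorization that MLlib provides: users are compared by the Jaccard overlap of their like sets, and unseen items are scored by the overlap of the users who liked them. The data and the `recommend` helper are hypothetical.

```python
def recommend(likes, me, top_n=3):
    """likes: {user: set of liked items}. Returns unseen items, best first."""
    mine = likes[me]
    scores = {}
    for other, theirs in likes.items():
        if other == me:
            continue
        union = mine | theirs
        overlap = len(mine & theirs) / len(union) if union else 0.0
        if overlap == 0.0:
            continue  # no shared taste, so no evidence either way
        for item in theirs - mine:  # only items I have not seen yet
            scores[item] = scores.get(item, 0.0) + overlap
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


likes = {
    "me": {"a", "b"},
    "u1": {"a", "b", "c"},  # very similar to me, also likes c
    "u2": {"b", "d"},       # somewhat similar, also likes d
    "u3": {"e"},            # nothing in common
}
print(recommend(likes, "me"))  # ['c', 'd']
```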


Demo!

Spark SQL/DataFrames + MLlib/Alternating Least Squares


Text-based Personalized Recs (1/3)
• Similar profiles to me

“Our profiles have similar, unique k-skip n-grams. We might like each other.”

MLlib: Word2Vec, TF/IDF, Doc Similarity
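A bare-bones version of the TF/IDF plus document similarity idea, in plain Python. It uses single whitespace-separated words rather than k-skip n-grams or Word2Vec, a smoothed IDF of log((N + 1) / (df + 1)), and made-up profile texts, so treat it as a sketch of the technique only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            # Rare terms get a high weight, ubiquitous terms get ~0.
            term: tf[term] * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term in tf
        })
    return vectors


def cosine(u, v):
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


profiles = ["hiking spark coffee", "hiking spark tea", "opera wine tea"]
v = tfidf_vectors(profiles)
print(cosine(v[0], v[1]) > cosine(v[1], v[2]) > cosine(v[0], v[2]))  # True
```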


Text-based Personalized Recs (2/3)
• Similar profiles from my past likes

“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”

MLlib: Word2Vec, TF/IDF, Doc Similarity

Text-based Personalized Recs (3/3)
• Relevant, High-Value Emails

“Your initial email has similar named entities to my profile. I might like you just for making the effort.”

MLlib: Word2Vec, TF/IDF, Entity Recognition


^ Her Email < My Profile

The Future of Recommendations!


Facial Recognition
• Eigenfaces

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of

Power of data. Simplicity of design. Speed of innovation. IBM Spark

Natural Language Processing: Convo Bot
• NLP and DecisionTrees

“If your responses to my trite opening lines are positive, I may read your profile.”

MLlib: TF/IDF, DecisionTree, Sentiment Analysis


Positive Negative

Maintaining the Spark!



Recommendations for Couples
• Pathways of Similarity

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path

similar plots -> <- similar actors
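One way to picture "something in between" is a shortest path through an item-item similarity graph. The sketch below uses breadth-first search in plain Python, not GraphX. The two film titles come from the quote; every other node ("Movie A" to "Movie E") and every edge is a hypothetical placeholder.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the list of nodes or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


# Hypothetical "is similar to" edges between films.
similar = {
    "Mad Max": ["Movie A", "Movie C"],
    "Movie A": ["Mad Max", "Movie B"],
    "Movie B": ["Movie A", "Message In a Bottle"],
    "Movie C": ["Mad Max", "Movie D"],
    "Movie D": ["Movie C", "Movie E"],
    "Movie E": ["Movie D", "Message In a Bottle"],
    "Message In a Bottle": ["Movie B", "Movie E"],
}

path = shortest_path(similar, "Mad Max", "Message In a Bottle")
print(path[1:-1])  # ['Movie A', 'Movie B'], the candidates "in between"
```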


Final Recommendation!

Get Off the Computer & Meet People!

Thank you!!

Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA

Relevant Links

Signup for the book and meetup!

Clone all code used today!

Run all demos presented today!


Image courtesy of

