Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...
TRANSCRIPT
![Page 1: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/1.jpg)
After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations
Barcelona Spark Meetup
Oct 20th, 2015
Chris Fregly Principal Data Solutions Engineer
IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **
![Page 2: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/2.jpg)
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Who Am I?
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center
Meetup Organizer, Advanced Apache Spark Meetup
Book Author Advanced (2016)
![Page 3: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/3.jpg)
Advanced Apache Spark Meetup
Total Spark Experts: ~1350+ in 3 mos!
#4 most active Spark Meetup in the world!
Main Goals
Dig deep into the Spark & extended-Spark codebase
Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc.
Surface and share the patterns and idioms of these well-designed, distributed, big data components
![Page 4: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/4.jpg)
What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
BlinkDB: approx queries
…
![Page 5: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/5.jpg)
Spark Deployments In Production
![Page 6: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/6.jpg)
Tools of the Talk
Redis
Docker
Cassandra
MLlib, GraphX
Parquet, JSON
Apache Zeppelin
Spark Streaming, Kafka
Spark SQL, DataFrames
Spark JDBC/ODBC Hive ThriftServer
ElasticSearch, Logstash, Kibana (ELK)
and…
![Page 7: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/7.jpg)
SMACK Stack!
Spark (Data Processing)
Mesos (Cluster Manager)
Akka (Actors)
Cassandra (NoSQL)
Kafka (Streaming)
![Page 8: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/8.jpg)
Themes of this Talk
Parallelism
Performance
Streaming
Approximations
Similarity Measures
Recommendations
and…
![Page 9: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/9.jpg)
Goals of Spark After Dark
Generate high-quality recommendations
Demonstrate Spark high-level libraries
Spark Streaming -> Kafka, Approximates
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factor, Word2Vec
![Page 10: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/10.jpg)
Popular Dating Sites
![Page 11: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/11.jpg)
Parallelism
![Page 12: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/12.jpg)
My First Experience With Parallelism
Brady Bunch circa 1980
Season 5, Episode 18: “Two Petes in a Pod”
![Page 13: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/13.jpg)
Parallel Algorithm: O(log n)
![Page 14: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/14.jpg)
Non-Parallel Algorithm: O(n)
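The O(log n) versus O(n) contrast can be illustrated with a sum. This is a minimal plain-Python sketch, not Spark code: the function names are made up for illustration, and the round count only translates into wall-clock time if enough workers exist to run each round's additions concurrently.

```python
def sequential_sum(values):
    """Add values one at a time: n - 1 dependent steps."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise (tree) reduction: about log2(n) rounds."""
    level, rounds = list(values), 0
    while len(level) > 1:
        # Pairs within a round are independent, so separate workers
        # could add them at the same time.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


data = list(range(16))
print(sequential_sum(data))  # (sum, number of sequential steps)
print(tree_sum(data))        # (sum, number of parallel rounds)
```

For 16 values the sequential version needs 15 steps, while the tree version needs 4 rounds.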
![Page 15: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/15.jpg)
Spark is Parallel!
![Page 16: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/16.jpg)
Performance
![Page 17: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/17.jpg)
Spark Beats Hadoop @ 100 TB GraySort
On-disk only
28,000 partitions
No in-memory caching
(2014) (2013) (2014)
![Page 18: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/18.jpg)
Improved Shuffle and Network Layer
“Sort-based shuffle”
Minimize OS resources
Switched to async Netty
Keep CPUs hot
Reuse byte buffers to minimize GC
Use epoll for I/O to stay in kernel space
![Page 19: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/19.jpg)
Project Tungsten: CPU and Memory
More JVM bytecode generation, JIT optimize
CPU-cache-aware data structs and algos
Custom memory management Serializers Performance New HashMap
![Page 20: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/20.jpg)
DataFrames and Catalyst Optimizer
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!
JVM bytecode generation
![Page 21: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/21.jpg)
Columnar Storage Format
Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
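A minimal sketch of min-max chunk skipping, assuming a single sorted column split into fixed-size chunks. The `build_chunks` and `scan_equal` helpers are hypothetical and do not mirror any particular file format's API.

```python
def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks, storing min/max stats per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        rows = sorted_values[i:i + chunk_size]
        chunks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return chunks


def scan_equal(chunks, target):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        # The stats alone prove this chunk cannot contain the target: skip it.
        if target < chunk["min"] or target > chunk["max"]:
            continue
        chunks_read += 1
        matches.extend(v for v in chunk["rows"] if v == target)
    return matches, chunks_read


chunks = build_chunks(list(range(100)), chunk_size=10)
print(scan_equal(chunks, 42))  # only one of the ten chunks is read
```

On unsorted data the min-max ranges of the chunks overlap heavily, so few chunks can be skipped, which is why the heuristic is tied to sorted data.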
![Page 22: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/22.jpg)
Parquet File Format
Based on Google Dremel
Implemented by Twitter and Cloudera
Columnar storage format
Optimized for fast columnar aggregations
Tight compression
Supports pushdowns
Nested, self-describing, evolving schema
![Page 23: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/23.jpg)
Types of Compression
Run Length Encoding: Repeated data
Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
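The three encodings can be sketched on small Python lists. This is a toy illustration with made-up helper names, not the on-disk layout of any real columnar format.

```python
from itertools import groupby


def run_length_encode(values):
    """Repeated data: store each run once as (value, run_length)."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def dictionary_encode(values):
    """Fixed set of values: store the distinct values once, then small codes."""
    dictionary = sorted(set(values))
    code = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [code[value] for value in values]


def delta_encode(sorted_values):
    """Sorted data: store the first value, then the gaps between neighbors."""
    if not sorted_values:
        return []
    gaps = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + gaps


print(run_length_encode(["a", "a", "a", "b", "b"]))
print(dictionary_encode(["es", "fr", "es", "es"]))
print(delta_encode([100, 101, 103, 110]))
```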
![Page 24: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/24.jpg)
Types of Query Optimizations
Column, Partition Pruning
Row, Predicate Pushdown
SELECT b FROM table WHERE a IN (a2, a3)
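A rough sketch of what the query on this slide gains from a columnar layout. It assumes the table is held as a dict of columns; the column `c` and all values are invented for illustration.

```python
# Hypothetical columnar table: one list per column.
table = {
    "a": ["a1", "a2", "a3", "a4"],
    "b": ["b1", "b2", "b3", "b4"],
    "c": ["c1", "c2", "c3", "c4"],
}


def select_b_where_a_in(columns, wanted):
    # Column pruning: only columns "a" and "b" are touched; "c" is never read.
    a_col, b_col = columns["a"], columns["b"]
    # Predicate pushdown: the filter runs during the scan, so rows that
    # fail it are never handed to later stages.
    return [b for a, b in zip(a_col, b_col) if a in wanted]


result = select_b_where_a_in(table, {"a2", "a3"})
print(result)
```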
![Page 25: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/25.jpg)
Streaming
![Page 26: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/26.jpg)
Direct Kafka Streaming – KafkaRDD
No single Receiver, no Write Ahead Log (WAL)
Workers pull from Kafka in parallel
Each KafkaRDD partition stores relevant offsets
Upon Worker Node failure, rebuild from offsets
Optimizes happy path by avoiding the WAL
At least once delivery guarantee
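The rebuild-from-offsets idea can be simulated without Kafka. In this toy sketch a dict of lists stands in for a topic's partitions, and `OffsetRange` is a local namedtuple that is only loosely modeled on the offset ranges a KafkaRDD partition carries. None of it is the Spark or Kafka API.

```python
from collections import namedtuple

# Local stand-in type; field names are illustrative.
OffsetRange = namedtuple("OffsetRange", "topic partition from_offset until_offset")

# Stand-in for a Kafka topic with two partitions (records are kept by the log).
log = {
    ("likes", 0): ["l0", "l1", "l2", "l3"],
    ("likes", 1): ["m0", "m1", "m2"],
}


def compute(offset_range):
    """Read exactly the records in [from_offset, until_offset)."""
    records = log[(offset_range.topic, offset_range.partition)]
    return records[offset_range.from_offset:offset_range.until_offset]


batch = [OffsetRange("likes", 0, 1, 3), OffsetRange("likes", 1, 0, 2)]
first_attempt = [compute(r) for r in batch]
# After a simulated worker failure, the same ranges are simply read again.
retry = [compute(r) for r in batch]
print(first_attempt == retry)
```

Because the data stays in the log and each partition knows its offsets, nothing has to be copied into a separate write ahead log beforehand. A retried batch may repeat output that was already emitted, which is consistent with an at-least-once guarantee.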
![Page 27: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/27.jpg)
Approximations
![Page 28: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/28.jpg)
Count Min Sketch
Approximate counters
Better than HashMap
Low, fixed memory
Known error bounds
Large num of counters
From Twitter’s Algebird
Streaming example in Spark codebase
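A minimal Count-Min Sketch in plain Python, to show why memory stays fixed and why estimates can overshoot but never undershoot. This is a teaching sketch, not Algebird's implementation: the width, depth and SHA-256-based hashing are arbitrary choices made here.

```python
import hashlib


class CountMinSketch:
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        # Fixed memory: depth * width counters, however many items arrive.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One hashed column per row; the row number salts the hash.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for user in ["ali"] * 5 + ["matei"] * 3 + ["reynold"]:
    sketch.add(user)
print(sketch.estimate("ali"), sketch.estimate("matei"))
```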
![Page 29: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/29.jpg)
HyperLogLog
Approximate cardinality
Approx count distinct
From Twitter’s Algebird
Low memory
1.5KB @ 2% error, 10^9 elements
Streaming example in Spark codebase
RDD: countApproxDistinctByKey()
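A compact HyperLogLog sketch in plain Python. It follows the textbook estimator with a 64-bit hash and the small-range correction, and is not the Algebird or Spark implementation. The standard error of HyperLogLog is about 1.04/sqrt(m) for m registers. With m = 2048 that is roughly 2.3%, and 2048 six-bit registers occupy 1536 bytes, which may be where the slide's 1.5KB figure comes from. The Python list used here takes far more memory than that.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers (2048 for p=11)
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        index = h >> (64 - self.p)                     # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)          # remaining bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates do not change the estimate
print(round(hll.count()))
```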
![Page 30: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/30.jpg)
Monte Carlo Simulations
From Manhattan Project (A-bomb)
Simulate movement of neutrons
Law of Large Numbers (LLN)
Average of results of many trials
Converge on expected value
SparkPi example in Spark codebase
Pi ~ (# red dots / # total dots * 4)
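The Pi estimate can be sketched in single-process Python. This is a simplified stand-in rather than the SparkPi source. It assumes the "red dots" are the random points that land inside the quarter circle of radius 1, and the seed is fixed only to make runs repeatable.

```python
import random


def estimate_pi(num_samples, seed=42):
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_samples):
        # Throw a dot uniformly at the unit square.
        x, y = rng.random(), rng.random()
        # Count it if it falls inside the quarter circle.
        if x * x + y * y <= 1.0:
            inside += 1
    # (inside / total) approximates pi / 4.
    return 4.0 * inside / num_samples


for n in (100, 10_000, 200_000):
    print(n, estimate_pi(n))
```

By the Law of Large Numbers, the estimate tends to settle closer to Pi as the number of trials grows.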
![Page 31: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/31.jpg)
Recommendations
![Page 32: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/32.jpg)
Interactive Demo!
![Page 33: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/33.jpg)
Audience Participation Needed!
Navigate to sparkafterdark.com
Click 3 actresses and 3 actors
![Page 34: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/34.jpg)
Types of Recommendations
Non-personalized
Cold Start: no preference or behavior data for user yet
Personalized
User-Item Similarity: items that others with similar prefs have liked
Item-Item Similarity: items similar to your previously-liked items
![Page 35: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/35.jpg)
Non-Personalized Recommendations
![Page 36: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/36.jpg)
Summary Statistics and Aggregations
Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
SparkSQL + DataFrame = Aggregations
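A plain-Python stand-in for the "top users by like count" aggregation. The like pairs are invented sample data, and `Counter` plays the role that a group-by-and-count would play on a DataFrame.

```python
from collections import Counter

# Hypothetical (from_user, to_user) like events.
likes = [
    ("kimberly", "ali"), ("leslie", "ali"), ("meredith", "ali"),
    ("lisa", "matei"), ("holden", "matei"),
    ("holden", "reynold"),
]

# Group by the liked user and count the likes each one received.
like_counts = Counter(to_user for _, to_user in likes)

# Non-personalized recommendation: the same top list for everyone.
top_users = like_counts.most_common(2)
print(top_users)
```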
![Page 37: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/37.jpg)
Graph Analytics
Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
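The random-walk intuition can be sketched with a textbook power-iteration PageRank. This is not GraphX's implementation (its normalization differs), and the like graph below is invented. Dangling users are assumed to spread their rank evenly over all users.

```python
def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Random jump: with probability (1 - damping) land on any user.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A user with no outgoing likes spreads rank evenly (assumption).
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: (liker, liked).
like_graph = [
    ("kimberly", "ali"), ("leslie", "ali"), ("meredith", "ali"),
    ("ali", "matei"), ("lisa", "matei"),
]
ranks = pagerank(like_graph)
print(max(ranks, key=ranks.get))
```

In this toy graph "ali" receives the most likes, yet "matei" ranks highest because the well-liked "ali" likes "matei". That is the difference between a plain like count and an influence score.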
![Page 38: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/38.jpg)
Demo!
Spark SQL/DataFrames + GraphX/PageRank
![Page 39: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/39.jpg)
Similarities
![Page 40: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/40.jpg)
Types of Similarity
Euclidean: linear measure; magnitude bias
Cosine: angle measure; adjusts for magnitude bias
Jaccard: (intersection / union); popularity bias
Log Likelihood: adjusts for popularity bias
Example matrix: columns Ali, Matei, Reynold, Patrick, Andy; entries per row: Kimberly 4, Leslie 2, Meredith 3, Lisa 3, Holden 5
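Three of these measures can be sketched in a few lines of Python. The vectors and sets below are invented examples. Log likelihood is left out because it needs more context than fits in a short sketch.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Angle between vectors; ignores magnitude."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_similarity(x, y):
    """Size of the intersection divided by size of the union."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)


light_user = [1, 1, 0]   # same taste, fewer interactions
heavy_user = [2, 2, 0]   # same taste, more interactions
print(euclidean_distance(light_user, heavy_user))  # non-zero: magnitude bias
print(cosine_similarity(light_user, heavy_user))   # ~1.0: same direction
print(jaccard_similarity({"ali", "matei"}, {"matei", "reynold"}))
```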
![Page 41: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/41.jpg)
All-Pairs Similarity Comparison
Compare everything to everything
aka. “pair-wise similarity” or “similarity join”
Naïve shuffle: O(m*n^2); m=rows, n=cols
Minimize shuffle through approximations!
Reduce m (rows): sampling and bucketing
Reduce n (cols): remove most frequent value (i.e. 0), Principal Component Analysis
![Page 42: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/42.jpg)
Reduce m: DIMSUM Sampling
“Dimension Independent Matrix Square Using MR”
Remove rows with low similarity probability
MLlib: RowMatrix.columnSimilarities(…)
Twitter: 40% efficiency gain over Cosine Similarity
![Page 43: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/43.jpg)
Reduce m: LSH Bucketing
“Locality Sensitive Hashing”
Split m into b buckets
Use similarity hash algorithm
Requires pre-processing of data
Compare bucket contents in parallel
Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
e.g. 500k x 500k matrix: O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash
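The numbers on this slide can be checked with integer arithmetic. The sketch evaluates the two cost expressions exactly as written, left to right, for the 500k x 500k example with b=50. It only verifies the arithmetic and says nothing about how a given LSH library behaves.

```python
m = n = 500_000   # rows, cols
b = 50            # buckets

naive = m * n ** 2                 # O(m*n^2)
bucketed = m * n // b * b ** 2     # O(m*n/b*b^2), evaluated left to right

print(f"{naive:.3e}")      # 1.250e+17
print(f"{bucketed:.3e}")   # 1.250e+13
print(naive // bucketed)   # 10000
```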
![Page 44: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/44.jpg)
Reduce n: Remove Most Frequent Value
Eliminate most-frequent value
Represent other values with (index,value) pairs
Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
Note: Choose most frequent value (may not be 0)
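A small sketch of the (index,value) representation, with hypothetical helper names. `to_sparse` drops whichever value is most frequent in a row. `sparse_dot` assumes the dropped value was 0, since only then can the missing entries be ignored in a dot product.

```python
from collections import Counter


def to_sparse(row):
    """Return the dropped (most frequent) value and the remaining (index, value) pairs."""
    most_frequent = Counter(row).most_common(1)[0][0]
    pairs = [(i, v) for i, v in enumerate(row) if v != most_frequent]
    return most_frequent, pairs


def sparse_dot(pairs_x, pairs_y):
    """Dot product that only visits stored entries (valid when the dropped value is 0)."""
    y = dict(pairs_y)
    return sum(v * y[i] for i, v in pairs_x if i in y)


print(to_sparse([0, 0, 1, 0, 0, 1, 0]))  # dropped value is 0
print(to_sparse([5, 5, 0, 5]))           # dropped value is 5, not 0
print(sparse_dot([(2, 1), (5, 1)], [(5, 1), (6, 1)]))
```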
![Page 45: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/45.jpg)
Personalized Recommendations
![Page 46: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/46.jpg)
Recommendation Terminology
User: user seeking recommendations
Item: item that has been liked or rated
Feedback
Explicit: like, rating
Implicit: search, click, hover, view, scroll
Feature Engineering
Dimension reduction
![Page 47: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/47.jpg)
Collaborative Filtering Personalized Recs
Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
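The quoted idea can be sketched with a simple neighbor-based recommender. It stands in for the general collaborative-filtering idea and is not MLlib's matrix factorization. The users, likes and the choice of Jaccard overlap as the user-similarity weight are all assumptions made here.

```python
# Hypothetical like data: user -> set of people they liked.
likes = {
    "me":       {"ali", "matei"},
    "kimberly": {"ali", "matei", "reynold"},
    "leslie":   {"patrick"},
}


def recommend(user, likes):
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        # How much does this user's taste overlap with mine? (Jaccard)
        overlap = len(mine & theirs) / len(mine | theirs)
        # Their likes that I have not seen inherit that overlap as a score.
        for item in theirs - mine:
            scores[item] = scores.get(item, 0.0) + overlap
    # Keep only items backed by at least one similar user, best first.
    return sorted((i for i in scores if scores[i] > 0), key=lambda i: -scores[i])


print(recommend("me", likes))
```

"kimberly" shares two likes with "me", so her remaining like is recommended. "leslie" shares none, so her like is filtered out.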
![Page 48: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/48.jpg)
Demo!
Spark SQL/DataFrames + MLlib/Alternating Least Squares
![Page 49: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/49.jpg)
Text-based Personalized Recs (1/3)
Similar profiles to me
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
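A minimal sketch of k-skip n-grams for the n=2 case, compared with Jaccard overlap. The helpers and sample profiles are hypothetical. This does not use Word2Vec or TF/IDF, and it does no weighting by how unique a pair is.

```python
def skip_bigrams(tokens, k):
    """All ordered word pairs with at most k words skipped between them."""
    return {
        (tokens[i], tokens[j])
        for i in range(len(tokens))
        for j in range(i + 1, min(i + k + 2, len(tokens)))
    }


def profile_similarity(text_a, text_b, k=1):
    """Jaccard overlap of the two profiles' k-skip bigram sets."""
    a = skip_bigrams(text_a.lower().split(), k)
    b = skip_bigrams(text_b.lower().split(), k)
    return len(a & b) / len(a | b) if a | b else 0.0


print(sorted(skip_bigrams(["a", "b", "c"], k=1)))
print(profile_similarity("I love distributed systems",
                         "I really love distributed systems"))
```

Allowing skips lets "I love" match "I really love", which plain adjacent bigrams would miss.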
![Page 50: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/50.jpg)
Text-based Personalized Recs (2/3)
Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
![Page 51: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/51.jpg)
Text-based Personalized Recs (3/3)
Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
^ Her Email < My Profile
![Page 52: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/52.jpg)
The Future of Recommendations!
![Page 53: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/53.jpg)
Facial Recognition
Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
![Page 54: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/54.jpg)
Natural Language Processing: Convo Bot
NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I may read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
Positive Negative
![Page 55: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/55.jpg)
Maintaining the Spark!
![Page 56: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/56.jpg)
Recommendations for Couples
Pathways of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
similar plots -> <- similar actors
![Page 57: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/57.jpg)
Final Recommendation!
![Page 58: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/58.jpg)
Get Off the Computer & Meet People!
Thank you!!
Chris Fregly @cfregly
IBM Spark Tech Center
San Francisco, CA, USA
Relevant Links
advancedspark.com
Signup for the book and meetup!
github.com/fluxcapacitor/pipeline
Clone all code used today!
hub.docker.com/r/fluxcapacitor/pipeline
Run all demos presented today!
Image courtesy of http://www.duchess-france.org/
![Page 59: Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project Tungsten, Text Analytics, Natural Language Processing](https://reader031.vdocument.in/reader031/viewer/2022022411/58ecb6e11a28abd5728b4695/html5/thumbnails/59.jpg)