Download - Scaling up data science applications
![Page 1: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/1.jpg)
Scaling up data science applicationsHow switching to Spark improved performance, realizability and reduced cost
Kexin Xie
Director, Data Science
@realstraw
Yacov Salomon
VP, Data Science
![Page 2: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/2.jpg)
![Page 3: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/3.jpg)
![Page 4: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/4.jpg)
![Page 5: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/5.jpg)
![Page 6: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/6.jpg)
![Page 7: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/7.jpg)
![Page 8: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/8.jpg)
![Page 9: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/9.jpg)
Marketers want to find more customers like their loyal customer
Lookalikes
![Page 10: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/10.jpg)
![Page 11: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/11.jpg)
Naive Bayes FrameworkModel
![Page 12: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/12.jpg)
Naive Bayes FrameworkModel
Linear Discriminant AnalysisFeature Selection
![Page 13: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/13.jpg)
Naive Bayes FrameworkModel
Linear Discriminant AnalysisFeature Selection
Correct for autocorrelation in feature space (paper pending)Science / Art
![Page 14: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/14.jpg)
Prepare Train Classify
![Page 15: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/15.jpg)
Prepare Train Classify
109 105 105
O(n)
1014
1014 105
O(n2)
105
1014 105
O(nm)
109
![Page 16: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/16.jpg)
![Page 17: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/17.jpg)
# jobs
![Page 18: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/18.jpg)
# jobs # failures
![Page 19: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/19.jpg)
# jobs # failures cost
![Page 20: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/20.jpg)
# jobs # failures cost
Map Reduce
Disk
Complexity
![Page 21: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/21.jpg)
Number of Features Total Population
Segment Populations Segment Population Overlap
![Page 22: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/22.jpg)
![Page 23: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/23.jpg)
userSegments
.flatMap(_.segments)
.distinct
.count
userSegments.count
userSegments
.flatMap(r => r.segments.map(_ -> 1L))
.reduceByKey(_ + _)
val userSegmentPairs = userSegments
.flatMap(r => r.segments.map(r.userId -> _))
userSegmentPairs
.join(userSegmentPairs)
.map { case (_, (feat1, feat2)) => (feat1, feat2) -> 1L }
.reduceByKey(_ + _)
![Page 24: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/24.jpg)
Reality: Data in many S3 prefixes/folders
val inputData = Seq(
"s3://my-bucket/some-path/prefix1/",
"s3://my-bucket/some-path/prefix2/",
"s3://my-bucket/some-path/prefix2/",
...
"s3://my-bucket/some-path/prefix2000/",
)
![Page 25: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/25.jpg)
How about this?
val myRdd = inputData
.map(sc.textFile)
.reduceLeft(_ ++ _)
![Page 26: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/26.jpg)
Or this?
val myRdd = sc.union(inputData.map(sc.textFile))
![Page 27: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/27.jpg)
Solution
// get the s3 objects
val s3Objects = new AmazonS3Client()
.listObjects("my-bucket", "some-path")
.getObjectSummaries()
.map(_.getKey())
.filter(hasPrefix1to2000)
// send them to slave nodes and retrieve content
val myRdd = sc
.parallelize(Random.shuffle(s3Objects.toSeq), parallelismFactor)
.flatMap( key =>
Source
.fromInputStream(
new AmazonS3Client().getObjectForKey("my-bucket", key)
.getObjectContent
)
.getLines
)
![Page 28: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/28.jpg)
Reality: Large Scale Overlap
val userSegmentPairs = userSegments
.flatMap(r => r.segments.map(r.userId -> _))
userSegmentPairs
.join(userSegmentPairs)
.map { case (_, (feat1, feat2)) => (feat1, feat2) -> 1L }
.reduceByKey(_ + _)
![Page 29: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/29.jpg)
user1 a, b, c
user2 a, b, c
user3 a, b, c
user4 a, c
user5 a, c
user1 a
user1 b
user1 c
user2 a
user2 b
user2 c
user3 a
user3 b
user3 c
user4 a
user4 c
user5 a
user5 c
user1 a b
user1 a c
user1 b c
user2 a b
user2 a c
user2 b c
user3 a b
user3 a c
user3 b c
user4 a c
user5 a c
a b 3
a c 5
b c 3
1 a b
1 a c
1 b c
1 a b
1 a c
1 b c
1 a b
1 a c
1 b c
1 a c
1 a c
![Page 30: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/30.jpg)
user1 a, b, c
user2 a, b, c
user3 a, b, c
user4 a, c
user5 a, c
hash1 a 3
hash1 b 3
hash1 c 3
hash2 a 2
hash2 c 2
hash1 a b 3
hash1 a c 3
hash1 b c 3
hash2 a c 2
a b 3
a c 5
b c 3
hash1 a, b, c 3
hash2 a, c 2
![Page 31: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/31.jpg)
Solution
// Reduce the user space
val aggrUserSegmentPairs = userSegmentPairs
.map(r => r.segments -> 1L)
.reduceByKey(_ + _)
.flatMap { case (segments, count) =>
segments.map(s => (hash(segments), (segment, count))
}
aggrUserSegmentPairs
.join(aggrUserSegmentPairs)
.map { case (_, (seg1, count), (seg2, _)) =>
(seg1, seg2) -> count
}
.reduceByKey(_ + _)
![Page 32: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/32.jpg)
Reality: Perform Join on Skewed Data
user1 a
user2 b
user3 c
user4 d
user5 e
user1 one
user1 two
user1 three
user1 four
user1 five
user1 six
user3 seven
user3 eight
user4 nine
user5 ten
X
data1.join(data2)
![Page 33: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/33.jpg)
Executor 1
Executor 2
Executor 3
user1 a
user1 one
user1 two
user1 three
user1 four
user1 five
user1 six
user3 c
user4 d
user5 e
user2 b
user3 seven
user3 eight
user4 nine
user5 ten
![Page 34: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/34.jpg)
Executor 1
Executor 2
Executor 3
user1 salt1 a
user1 salt1 one
user1 salt1 two
user1 salt2 three
user1 salt2 four
user1 salt3 five
user1 salt3 six
user1 salt2 a
user1 salt3 a
![Page 35: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/35.jpg)
user1 a
user1 one
user1 two
user1 three
user1 four
user1 five
user1 six
X
user3 seven
user3 eight
user4 nine
user5 ten
user2 b
user3 c
user4 d
user5 e
![Page 36: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/36.jpg)
Solution
val topKeys = data2
.mapValues(x => 1L)
.reduceByKey(_ + _)
.takeOrdered(10)(Ordering[(String, Long)].on(_._2).reverse)
.toMap
.keys
val topData1 = sc.broadcast(
data1.filter(r => topKeys.contains(r._1)).collect.toMap
)
val bottomData1 = data1.filter(r => !topKeys.contains(r._1))
val topJoin = data2.flatMap { case (k, v2) =>
topData1.value.get(k).map(v1 => k -> (v1, v2))
}
topJoin ++ bottomData1.join(data2)
![Page 37: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/37.jpg)
Smarter retrieval of data from S3
Condensed overlap algorithm
Hybrid join algorithm
Clients with more than 2000 S3 prefixes/folders
Before: 5 hours
After: 20 minutes
100x faster and 10x less data for segment
overlap
Able to process joins for highly skewed data
Hadoop to Spark Maintainable codebase
![Page 38: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/38.jpg)
Failure Rate
![Page 39: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/39.jpg)
Performance
![Page 40: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/40.jpg)
Cost
Before
After
![Page 41: Scaling up data science applications](https://reader031.vdocument.in/reader031/viewer/2022022415/5a647a8e7f8b9a2c568b49e5/html5/thumbnails/41.jpg)