deltaprocessing&for&social&...
TRANSCRIPT
![Page 1: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/1.jpg)
Delta Processing for Social Analy3cs
James Thomas, Xinhao Yuan Myeongjae Jeon, Jorgen Thelin,
Lidong Zhou
![Page 2: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/2.jpg)
DSoAP System Architecture (from before)
• Retrieve tweets with plain text search and perform some analysis on them
![Page 3: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/3.jpg)
Common Social Analy3cs Workflow
1. Plain text search to get target tweet set 2. Analysis within groups of tweets – e.g. each group has all tweets with a certain pair of
words co-‐occurring 3. Finally global analysis (e.g. sort the groups by some
field) – e.g. sort co-‐occurrence pairs by # of tweets
4. Refine search query if irrelevant results observed and repeat steps 1-‐3
Make this process faster, assuming fixed
analysis
![Page 4: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/4.jpg)
Mo3va3on
• Honing in on target tweet set with plain text search
![Page 5: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/5.jpg)
Problem
• Only realize tweet set is bad at end of computa3on pipeline
• Must rerun whole computa3on pipeline on updated tweet set, even though many tweets shared
Oh no…
![Page 6: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/6.jpg)
Solu3on
• Keep previous results, pass only deltas through computa3on pipeline and update previous results
![Page 7: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/7.jpg)
Design Goals / Limits of Prior Work
1. Have good distributed computa3on characteris3cs, like fault/straggler tolerance, communica3on avoidance – Modify Spark without losing these proper3es
2. Efficient fine-‐grained updates to prior results
![Page 8: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/8.jpg)
Design Goals Cont’d
3. Mix non-‐incrementalizable computa3ons with incrementalizable in general-‐purpose computa3on framework
4. Easy to roll back to and apply deltas to previous result sets
Not the desired result
![Page 9: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/9.jpg)
Spark Review • Par33oned datasets (RDDs) • One-‐to-‐one and shuffle dependencies (narrow and wide) • DAG of dependencies • Programmer-‐specified caching of datasets in memory
![Page 10: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/10.jpg)
Feeding Deltas through Computa3on
Don’t know other keys in group 6, so can’t pass deltas through this stage without saving ShuffledRDD
![Page 11: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/11.jpg)
Feeding Deltas through Shuffle
Arbitrary map func3ons can be applied to groups, so we must pass en3re old group as neg. delta and new group as pos. delta
![Page 12: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/12.jpg)
Target API (Design Goals 3 & 4)
• d = new DeltaComputation(rdd)!– rdd can be any map-‐filter-‐groupby computa3on
• newD = d.applyDeltas(pos, neg)!– pos, neg are RDDs that are deltas to the source dataset in the computa3on in d!
– d is not mutated, so we s3ll have a handle to it and can apply other deltas
• result = newD.getResult()!– result is an RDD on which any Spark transforma3ons (incrementalizable or not) can be run
![Page 13: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/13.jpg)
API Example result = source.map(...).groupByKey().map(...)!
! ! ! ! .groupByKey().filter()...!d = new DeltaComputation(result)!top10 = d.getResult().topK(10)!// top10 has bad results, refine query !(pos, neg) = [submit new text filter to storage
! ! ! ! layer, obtain +/- deltas to source]!dUpdated = d.applyDeltas(pos, neg)!newTop10 = dUpdated.getResult().topK(10)!// realize new query was completely wrong, go back to original query and refine it!(newPos, newNeg) = ...!dUpdated2 = d.applyDeltas(newPos, newNeg)!... !!!
!
Handle to old version allows us to apply deltas
to it
Mixing in non-‐
incremental computa3on
![Page 14: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/14.jpg)
Generality of map-‐filter-‐groupby
• Our system can pass deltas through any map-‐filter-‐groupby computa3on
• All single-‐dataset computa3ons can be expressed with these opera3ons (c.f. MapReduce)
• Social analy3cs generally groups records and then performs map-‐like transforma3ons on the groups, then groups again and repeats
![Page 15: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/15.jpg)
Fine-‐grained Updates to RDDs • Fast, space-‐efficient fine-‐grained updates required for deltas
not easy in Spark due to dataset immutability • IndexedRDD (AMP Lab) uses func3onal data structure idea to
solve problem • Space-‐efficiency means persis3ng old versions is cheap
(design goal 4)
(fig. by Ankur Dave, AMP Lab)
![Page 16: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/16.jpg)
Computa3on DAG Rewriter (Design Goals 1 & 2)
![Page 17: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/17.jpg)
Implementa3on Details
• Just a library – no modifica3ons to Spark core • ~300 LOC (DAG rewriter and modifica3ons to IndexedRDD)
• Spark par33oners used to avoid shuffles for IndexedRDD state maintenance
![Page 18: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/18.jpg)
Evalua3on Methodology
• 1e7 records, uniformly distributed over G keys • Computa3on: records.map(…).groupByKey().map(…)
• Feed D deltas into the computa3on • Measure percentage improvement in run3me over redoing whole computa3on
![Page 19: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/19.jpg)
Results Increasing because fewer deltas passed through computa3on pipeline
Good perf. even at
25%-‐50% of dataset changed
Good perf. even when most groups changed
![Page 20: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/20.jpg)
General Applicability of Delta Engine
• Can be used on top of any datastore that can return deltas
• Can be used to improve performance of itera3ve computa3ons where each itera3on can work with deltas from previous itera3on
![Page 21: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/21.jpg)
Conclusion
• General purpose delta processing system built on Spark, with applica3ons in social analy3cs
• Achieved design goals: – Retains good proper3es of Spark – Supports fine-‐grained updates to prior results – Supports mixing of incremental and non-‐incremental computa3on
– Efficiently saves prior dataset versions for reuse
![Page 22: DeltaProcessing&for&Social& Analy3cs&jjthomas/msr-intern.pdfCommon&Social&Analy3cs&Workflow& 1. Plain&textsearch&to&gettargettweetset 2. Analysis&within&groups&of&tweets& – e.g.&each&group&has&all&tweets&with&acertain&pair](https://reader035.vdocument.in/reader035/viewer/2022071114/5feb42171ff4c5477853342b/html5/thumbnails/22.jpg)
Future Work • Support wider class of Spark operators (joins, etc.)
• Test performance when intermediate IndexedRDDs must be recomputed and when delta volume large for several stages – Decide intelligently in these cases whether full pipeline recomputa3on is faster
• Support upda3ng the computa3on as well as the dataset – Iden3fying common subcomputa3ons and running the deltas through them