TRANSCRIPT
The Magic of Random Sampling: From Surveys to Big Data
Edith Cohen, Google Research and Tel Aviv University
Disclaimer: Random sampling is a classic and well-studied tool with enormous impact across disciplines. This presentation is biased and limited by its length, my research interests, experience, understanding, and my being a computer scientist. I will attempt to present some big ideas and selected applications. I hope to increase your appreciation of this incredible tool.
Harvard IACS seminar, 11/11/2016
What is a sample?
A sample is a summary of the data, in the form of a small set of representatives.
Why use samples? When to use samples?
- Inherent limited availability of the data.
- Have the data, but limited resources:
  - Storage: long or even short term
  - Transmission bandwidth
  - Survey cost of sampled items
  - Computation
  - Time: delay of processing larger data
A sample can be an adequate replacement for the full data set for purposes such as summary statistics, fitting models, and other computation/queries.
"Data, data, everywhere." The Economist, 2010
art: pokemon.com
History
Recent centuries: a tool for surveying populations.
- Graunt 1662: estimate the population of England.
- Laplace ratio estimator [1786, 1802]: estimate the population of France. Sampled "communes" (administrative districts), counting the ratio of population to live births in the previous year, then extrapolated from birth registrations in the whole country.
- Kiaer 1895: "representative method"; March 1903: "probability sampling".
- US census 1938: used a probability sample to estimate unemployment.
Recent decades: a ubiquitous, powerful tool in data modeling, processing, analysis, and algorithm design.
The basis of (human/animal) learning from observations.
Sampling schemes/algorithms: design goals
- Optimize sample size vs. information content (quality of estimates).
- Sometimes balance multiple query types/statistics over the same sample.
- Computational efficiency of algorithms and estimators.
- Efficient on modern platforms (streams, distributed, parallel).
A sample is a lossy summary of the data, from which we can approximate (estimate) properties of the full data.
Data x → Sample S. A query Q: f(x)? on the data is answered by an estimator f̂(S) computed from the sample.
Composable (Mergeable) Summaries
From Sample(A) of data A and Sample(B) of data B, we can compute Sample(A ∪ B) of data A ∪ B.
Why is composability useful?
- Streamed data
- Distributed data / parallelized computation
(Diagram: per-part samples Sample1, ..., Sample5 of distributed data are merged pairwise, e.g. S.1∪2, S.3∪4, then S.1∪2∪5, yielding a sample of 1 ∪ 2 ∪ 3 ∪ 4 ∪ 5.)
Outline
Selected "big ideas" and example applications:
- "Basic" sampling: uniform / weighted
- The magic of sample coordination
- Efficient computation over unaggregated data sets (streamed, distributed)
Uniform (equal probability) sampling
Bernoulli sampling: each item is selected independently with probability p.
Reservoir sampling [Knuth 1968, Vitter 1985]: selects a k-subset uniformly at random. Fixed sample size k; composable (Knuth's); state ∝ sample size k.
Bottom-k implementation:
- Associate with each item x an independent random number H(x) ∼ U[0,1] (can use a random hash function with limited precision).
- Keep the k items with smallest H(x).
- Correctness: all k-subsets have the same probability.
- Composability: bottom-k(A ∪ B) = bottom-k(bottom-k(A) ∪ bottom-k(B)).
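The bottom-k scheme above can be sketched in a few lines. This is an illustrative sketch, not the talk's code; the helper name `hash_unit` is mine, and SHA-256 stands in for the random hash H(x):

```python
import hashlib

def hash_unit(key):
    # Map a key to a pseudo-random value in [0, 1): a stand-in for H(x).
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k(keys, k):
    # Keep the k items with smallest H(x): a uniform-at-random k-subset.
    return sorted(set(keys), key=hash_unit)[:k]
```

Composability holds exactly as stated: the bottom-k of a union equals the bottom-k of the merged per-part bottom-k samples, since the k smallest hashes of A ∪ B must each be among the k smallest of their own part.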
Uniform (equal probability) distinct sampling
Distinct sampling: a (uniform at random) k-subset of the distinct "keys".
Distinct counting: the number of distinct keys in a (sub)population.
Example applications:
- Distinct search queries
- Distinct source-destination IP flows
- Distinct users in activity logs
- Distinct web pages from logs
- Distinct words in a corpus
- ...
Uniform distinct sampling & approximate counting
- Associate with each key x an independent random hash H(x) ∼ U[0,1].
- Keep the k keys with smallest H(x).
Distinct sampling: a (uniform at random) k-subset of the distinct keys.
Distinct counting: the number of distinct keys in a (sub)population.
Computation:
- (Naïve algorithm) Aggregate, then sample/count: state ∝ #distinct keys.
- Distinct reservoir sampling / approximate counting [Knuth 1968] + [Flajolet & Martin 1985]: a composable summary with state ∝ sample size k.
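Because duplicates of a key hash to the same value, the bottom-k of hashed keys is a distinct sample, and the k-th smallest hash yields a distinct-count estimate. A minimal sketch (the k-minimum-values estimator; `hash_unit` is an illustrative helper, and for brevity this version hashes the whole input rather than streaming with a bounded heap):

```python
import hashlib

def hash_unit(key):
    # Shared random hash H(x) in [0, 1); duplicates of a key collide by design.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def kmv_distinct_estimate(elements, k):
    # Keep the k smallest hashes of the distinct keys seen.
    hashes = sorted({hash_unit(x) for x in elements})[:k]
    if len(hashes) < k:
        return float(len(hashes))   # fewer than k distinct keys: exact count
    return (k - 1) / hashes[-1]     # the k-th smallest of n uniform hashes is ~ k/(n+1)
```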
Estimation from a uniform sample
Domain/segment queries: a statistic of a selected segment H of the population X.
Examples: approximate the number of Pokémon in our population that are
- college-educated water type;
- flying and hatched in Boston;
- weighing less than 1 kg.
Application: logs analysis for marketing, planning, targeted advertising.
Estimation from a uniform sample
Domain/segment queries: a statistic of a selected segment H of the population X.
Properties:
- Ratio estimator (when the total population size |X| is known): |S ∩ H| / |S| · |X|.
  - An unbiased, minimum-variance estimator.
  - Coefficient of variation (the "relative error"): decreases on the order of 1/√k, and grows as the segment gets smaller.
  - Concentration (Bernstein, Chernoff): the probability of a large relative error decreases exponentially.
- Inverse-probability estimator [Horvitz & Thompson 1952] (when |X| is not available, as with distinct reservoir sampling): |S ∩ H| / p, where p = Prob[H(x) ≤ min_{y∉S} H(y)] is given by the (k+1)-th smallest value of H.
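The inverse-probability route can be sketched as follows: a bottom-k sample together with the (k+1)-th smallest hash, which plays the role of the inclusion probability p. Illustrative helper names, not the talk's code:

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k_sample(keys, k):
    # Bottom-k sample plus the (k+1)-th smallest hash, which serves as the
    # inclusion probability p of each sampled key.
    ranked = sorted(set(keys), key=hash_unit)
    p = hash_unit(ranked[k]) if len(ranked) > k else 1.0
    return ranked[:k], p

def estimate_segment_size(sample, p, predicate):
    # Horvitz-Thompson inverse probability: |S ∩ H| / p; |X| is never used.
    return sum(1 for x in sample if predicate(x)) / p
```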
Approximate distinct counting of all data elements
Can use a distinct uniform sample, but we can do better for this special case.
HIP estimators [Cohen '14, Ting '15]: halve the variance.
- Idea: track an estimated count c as the structure (sketch) is constructed; whenever the structure S is modified, add the inverse probability of that modification: c += 1/p(modified).
- Applicable with all distinct counting structures.
- Surprisingly, a better estimator than is possible from the final structure alone.
HyperLogLog [Flajolet et al. 2007]:
- A really small structure: optimal size O(ε^{-2} + log log n) for CV = ε, with n distinct keys.
- Idea: for counting, there is no need to store the keys in a sample; it suffices to use k = ε^{-2} exponents of hashes. The exponent values are concentrated, so store one exponent and k offsets.
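A minimal HIP sketch, layered (for concreteness) on a bottom-k distinct-counting structure rather than HyperLogLog; class and helper names are mine, and this is an assumption-laden illustration of the "add 1/p on each modification" idea, not the papers' pseudocode:

```python
import hashlib
import heapq

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

class HipDistinctCounter:
    # HIP on a bottom-k (k-minimum-values) structure: the state is the
    # k smallest hashes seen plus one running count.
    def __init__(self, k):
        self.k = k
        self.neg_heap = []    # max-heap (negated) of the k smallest hashes
        self.members = set()  # the same hashes, for O(1) membership tests
        self.count = 0.0      # HIP estimate of the number of distinct keys

    def add(self, key):
        h = hash_unit(key)
        tau = -self.neg_heap[0] if len(self.neg_heap) == self.k else 1.0
        if h < tau and h not in self.members:
            self.count += 1.0 / tau   # inverse probability of this modification
            self.members.add(h)
            heapq.heappush(self.neg_heap, -h)
            if len(self.neg_heap) > self.k:
                self.members.remove(-heapq.heappop(self.neg_heap))
```

Duplicate elements never modify the structure (their hash is either already stored or above the threshold), so the running count tracks distinct keys only.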
Graphs: estimating cardinalities of neighborhoods, reachability, centralities
Graph G(V, E). Q: the number of nodes reachable from a node, or within distance 5 of it.
Applications: network analytics, social networks, other data with a graph representation.
- Exact computation is O(|E| · |V|).
- Sample-based sketches [C'1997]: near-linear Õ(|E|).
Idea:
- Associate an independent random H(v) ∼ U[0,1] with all nodes.
- Compute for each node u a sketch S(u): the ε^{-2} reachable nodes v with smallest H(v).
Estimating the sparsity structure of matrix products
Problem: compute a matrix product B = A_1 A_2 ⋯ A_n, n ≥ 3, where the A_i are sparse (many more zero entries than nonzero entries).
Q: find the best order (minimizing computation) for performing the multiplication.
- By associativity, B = A_1 A_2 A_3 can be computed as (A_1 A_2) A_3 or A_1 (A_2 A_3); the computation cost depends on the sparsity structure.
[C'1996]: the sparsity structure of sub-products (the number of nonzeros in rows/columns of products) can be approximated in near-linear time (in the number of nonzeros in {A_i}).
Solution: preprocess to determine the best order, then perform the computation.
Idea: define a graph for the product by stacking bipartite graphs G_i that correspond to the nonzeros in A_i. Row/column sparsity of the product corresponds to reachability-set cardinality.
Unequal probability (weighted) sampling
Keys x can have weights w_x.
Example applications (logs analysis):
- Key: (video, user) view; weight: watch time. Query: total watch time, segmented by user and video attributes.
- Key: IP network flow; weight: bytes transmitted. Query: total bytes transmitted for a segment of flows (src IP, dest IP, protocol, port, ...). Applications: billing, traffic engineering, planning.
Segment queries: for a segment H ⊂ X, estimate w(H) = Σ_{x∈H} w_x (the weight w(H) of my Pokémon bag H).
(Illustration: Pokémon with weights 10.0, 9.0, 460.0, 210.0, 210.0, 12.4, 30.0, 4.0, and 8.0 kg.)
Unequal probability (weighted) sampling
- Key x has weight w_x; the sample S includes x with probability p_x.
Segment query: w(H) = Σ_{x∈H} w_x.
Inverse-probability estimator [HT52]: ŵ(H) = Σ_{x∈H∩S} w_x / p_x.
- Unbiased (when p_x > 0).
- ! To obtain statistical guarantees on quality, we need to sample heavier keys with higher probability.
Poisson Probability Proportional to Size (PPS) [Hansen & Hurwitz 1943]: p_x ∝ w_x minimizes the sum of per-key variances.
- The CV of a segment estimate (the "relative error") decreases on the order of √(w(X)/(k · w(H))), with concentration bounds.
- Robust: if the probabilities are only approximate, the guarantees degrade gracefully.
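A minimal Poisson PPS sketch with the Horvitz-Thompson segment estimator from the previous slide; function names and the choice p_x = min(1, k·w_x/W) are illustrative assumptions, not the talk's code:

```python
import random

def pps_sample(weights, k, rng):
    # Poisson PPS: include key x independently with p_x = min(1, k * w_x / W),
    # so inclusion probability is proportional to size (weight).
    total = sum(weights.values())
    sample = {}
    for x, w in weights.items():
        p = min(1.0, k * w / total)
        if rng.random() < p:
            sample[x] = p
    return sample

def ht_segment_weight(sample, weights, segment):
    # Horvitz-Thompson: sum w_x / p_x over sampled keys that fall in H.
    return sum(weights[x] / p for x, p in sample.items() if x in segment)
```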
Unequal probability (weighted) sampling
Composable weighted sampling scheme with fixed sample size k (bottom-k / order samples / "weighted" reservoir sampling):
- Associate with each key x the rank value r(x) = u_x / w_x, for an independent random u_x.
- Keep the k keys with smallest r(x).
Example with u_x ∼ U[0,1]:
w_x (kg): 12.4  4.0  460.0  30.0  210.0  10.0  8.0  210.0
u_x:      0.22  0.12  0.31  0.81  0.06  0.72  0.45  0.57
r(x):     0.017  0.03  0.00067  0.027  0.00029  0.072  0.05625  0.00271
Unequal probability (weighted) sampling: choosing the distribution D of u_x
- Without replacement: D = Exp[1] [Rosén '72, C' '97, C'-Kaplan '07, Efraimidis-Spirakis '05].
- Priority (sequential Poisson) sampling: D = U[0,1] [Ohlsson '00, Duffield-Lund-Thorup '07].
- Essentially the same quality guarantees as Poisson PPS.
- !! Also works for max-distinct over key-value pairs (e.key, e.w), where w_x = max_{e | e.key = x} e.w.
As before: associate with each key x the value r(x) = u_x / w_x for independent random u_x ∼ D, and keep the k keys with smallest r(x).
Estimation (inverse probability): τ = the smallest r_x over x ∉ S; p_x = Prob[r(x) < τ] = Prob_{y∼D}[y < w_x τ].
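Priority sampling and its threshold estimator can be sketched as follows (a hedged illustration with my own function names; for D = U[0,1] the inclusion probability becomes p_x = min(1, w_x · τ)):

```python
import random

def priority_sample(weights, k, rng):
    # Priority (sequential Poisson) sampling: rank r(x) = u_x / w_x with
    # u_x ~ U[0,1]; keep the k keys of smallest rank. Requires len(weights) > k.
    ranks = {x: rng.random() / w for x, w in weights.items()}
    order = sorted(weights, key=lambda x: ranks[x])
    tau = ranks[order[k]]   # (k+1)-th smallest rank: the estimation threshold
    return order[:k], tau

def estimate_segment_weight(sample, tau, weights, segment):
    # Inverse probability with p_x = Prob[u < w_x * tau] = min(1, w_x * tau).
    return sum(weights[x] / min(1.0, weights[x] * tau)
               for x in sample if x in segment)
```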
Unequal probability (weighted) sampling: example applications
Cut/spectral graph sparsifiers of G = (V, E):
- [Benczúr & Karger 1996]: sample graph edges and re-weight by inverse probability to obtain G' = (V, E') such that |E'| = O(|V|) and all cut values are approximately preserved.
- [Spielman & Srivastava 2008]: spectral approximation L_G ≈ L_{G'} by using p_e ∝ the (approximate) effective resistance of e ∈ E.
Sampling nonnegative matrix products [Cohen & Lewis 1997]: for nonnegative matrices A, B, efficiently sample entries in the product AB proportionally to their magnitude (without computing the product). Idea: random walks using edge-weight probabilities.
Sample Coordination [Brewer, Early, Joyce 1972]
The same set of keys, multiple sets of weights ("instances"). Goal: make the samples of the different instances as similar as possible.
Survey sampling motivation: weights evolve, but surveys impose a burden. We want to minimize the burden and still have a weighted sample of the evolved set.
Example: the same keys weighed on Monday and on Tuesday, with a shared u_x:
Mon w_x (kg): 12.4  4.0  460.0  30.0  210.0  10.0  8.0  210.0
Tue w_x (kg): 15.0  6.0  300.0  50.0  110.0  5.0  4.0  300.0
u_x:          0.22  0.12  0.31  0.81  0.06  0.72  0.45  0.57
Mon r(x):     0.017  0.03  0.00067  0.027  0.00029  0.072  0.05625  0.00271
Tue r(x):     0.015  0.02  0.00103  0.0162  0.00055  0.144  0.1125  0.0019
How to coordinate samples
Coordinated bottom-k samples: use the same u_x = H(x) for all instances.
- Associate with each key x the value r(x) = H(x) / w_x, for an independent random hash H(x).
- Keep the k keys with smallest r(x).
!! The change between samples is minimized, given that we still have a weighted sample of each instance.
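Coordination is a one-line change to the weighted bottom-k sketch: draw u_x from a hash of the key instead of fresh randomness per instance (illustrative helper names, not the talk's code):

```python
import hashlib

def hash_unit(key):
    # The shared hash u_x = H(x): the source of coordination across instances.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def coordinated_bottom_k(weights, k):
    # Bottom-k by rank r(x) = H(x) / w_x, with the SAME H for every instance.
    ranks = {x: hash_unit(x) / w for x, w in weights.items()}
    return set(sorted(weights, key=lambda x: ranks[x])[:k])
```

Instances with similar weight vectors then get similar samples; in the extreme case of uniformly rescaled weights, the sample is identical.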
Coordination of samples: why it is useful
- Locality Sensitive Hashing (LSH): similar weight vectors have similar samples/sketches.
- Multi-objective samples (universal samples): a single sample (as small as possible) that provides statistical guarantees for multiple sets of weights/functions.
- Statistics/domain queries that span multiple "instances" (Jaccard similarity, L_p distances, distinct counts, union size, sketches of coverage functions, ...). MinHash sketches are a special case with 0/1 weights.
- Facilitates faster computation of samples. Example: [C'97] sketching/sampling the reachability sets and neighborhoods of all nodes in a graph in near-linear time.
- Facilitates efficient optimization over samples: optimize an objective over sets of weights/functions/parametrized functions. Examples: centrality/clustering objectives [CCK '15], learning [Kingma & Welling '14].
A very powerful tool for big data analysis, with applications well beyond what [Brewer, Early, Joyce 1972] could envision.
Multi-objective Sample
- The same keys can have different "weights": IP flows have bytes, packets, count.
- We want to answer segment queries with respect to all weights.
- Naïve solution: 3 disjoint samples.
- Smart solution: a single multi-objective sample.
(Illustration: each Pokémon has a weight, height, and age, e.g. 10.0 kg / 50 cm / 10 years, 460.0 kg / 180 cm / 50 years, ..., 8.0 kg / 60 cm / 12 years.)
Multi-objective priority (sequential Poisson) sampling
Example: nine keys x, a shared u_x ∼ U[0,1], and three objectives (count, cap5(w) = min{5, w}, and thresh10(w) = 1 iff w ≥ 10):
w_x:            135   2     9     18    21    4     11    4     2
count(w_x):     1     1     1     1     1     1     1     1     1
cap5(w_x):      5     2     5     5     5     4     5     4     2
thresh10(w_x):  1     0     0     1     1     0     1     0     0
u_x:            0.52  0.24  0.76  0.90  0.14  0.32  0.44  0.07  0.82
u_x/thresh10(w_x): 0.52  1  1  0.90  0.14  1  0.44  1  1  (keys with thresh10(w_x) = 0 get the maximal rank and are never sampled)
u_x/cap5(w_x):  0.104  0.120  0.152  0.18  0.064  0.080  0.088  0.0175  0.41
For k = 3, the MO sample for F = {count, thresh10, cap5} is the union of the three per-objective bottom-3 samples.
Edith Cohen, Scalable Weighted Sampling
Multi-objective sample of f-statistics
One set of weights w_x, but we are interested in different functions f(w_x).
Multi-objective Sample [C'-Kaplan-Sen '09, C' '15]
Set of functions w^(j) ∈ W, where w^(j): X → R_+.
- Compute coordinated samples S^(j) for each w^(j) ∈ W.
Our multi-objective sample S for W is:
- the union S = ⋃_j S^(j);
- with sampling probabilities p_x = max_j p^(j)(x).
Theorem: for any domain query w^(j)(H) = Σ_{x∈H} w^(j)_x, the inverse-probability estimator ŵ^(j)(H) = Σ_{x∈H∩S} w^(j)_x / p_x is unbiased and provides statistical guarantees at least as strong as those of an estimator applied to the dedicated sample S^(j).
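The union construction can be sketched directly on the slide's example (count, cap5, thresh10). A hedged illustration with my own names; keys with f(w) = 0 are excluded, matching "never sampled":

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def mo_sample(weights, funcs, k):
    # One coordinated bottom-k sample per objective f (shared u_x = H(x));
    # the multi-objective sample is the union of the per-objective samples.
    union = set()
    for f in funcs:
        ranks = {x: hash_unit(x) / f(w) for x, w in weights.items() if f(w) > 0}
        union.update(sorted(ranks, key=lambda x: ranks[x])[:k])
    return union
```

Because the per-objective samples are coordinated, they overlap heavily, so the union is typically much smaller than |W| · k disjoint samples.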
Multi-objective sample of f-statistics: priority (sequential Poisson) version
One set of weights w_x, different functions f(w_x).
Multi-objective sample for all monotone statistics M: all monotone non-decreasing functions f with f(0) = 0. Examples: f(w) = w^p; f(w) = min{10, w}; f(w) = log(1 + w), ...
Data of key-value pairs (x, w_x): for each f, the instance is f(w_x) for x ∈ X.
Theorem [C'97, C'K'07 (threshold functions); C'15 (all of M)]: the multi-objective sample for all monotone statistics M has
- (expected) sample size O(k ln n), where n = #keys with w_x > 0;
- a composable structure of size equal to the sample size.
⟹ Very efficient to compute on streamed/parallel/distributed platforms.
Next: applications to graphs and streams.
Multi-objective sample of monotone statistics
Application: time-decaying aggregations over data streams. For a monotone non-increasing decay function α(x) and a segment H ⊂ V:
A_α = Σ_{u∈H} α(t_u)
- t_u: the elapsed time from element u to the current time.
The theorem above applies: expected sample size O(k ln n), with a composable structure of the same size.
(Illustration: a stream of elements arriving between 12:00am and 4:00am.)
Multi-objective sample of monotone statistics
Application: centrality of all nodes in a graph G = (V, E) [C'97, C'-Kaplan '04, C'15]. For a node v, a monotone non-increasing α(x), and a segment H ⊂ V, the centrality of v for segment H is
C_α(v, H) = Σ_{u∈H} α(d_{vu})   (harmonic centrality: α(x) = 1/x).
Thm: All-Distances Sketches (ADS), which are MO samples, for all nodes can be computed in Õ(|E|) time. We can estimate C_α(v, H) for all α and H from ADS(v).
The theorem above applies: expected sample size O(k ln n), with a composable structure of the same size.
Multi-objective sample of distances to a set of points in a metric space [Chechik-C'-Kaplan '15]
- Metric space M; set of points P = {x_1, x_2, ..., x_n} ⊂ M.
- Each point v ∈ M defines weights w_v(x_i) = d_{v x_i}.
- A multi-objective sample of all w_v allows us to estimate, for segments H ⊂ P and any query point v, the sum of distances C(v, H) = Σ_{x_i∈H} d_{v x_i}.
!!! Can even relax the triangle inequality to d_{ab} ≤ ρ(d_{ac} + d_{cb}) (e.g., squared distances).
Theorem:
- The multi-objective overhead for distances is O(1)!
- Can be computed using a near-linear number of distance queries.
⟹ Sample size O(ε^{-2}) suffices for estimates of C(v, P) for every v ∈ M with CV ε.
Estimators for multi-instance aggregates
Set of functions w^(j) ∈ W where w^(j): X → R_+; coordinated samples S^(j) for each w^(j) ∈ W.
Example multi-instance aggregations:
- L_p^p distance: Σ_{x∈H} |w^(1)(x) − w^(2)(x)|^p
- One-sided L_p^p: Σ_{x∈H} max{0, w^(1)(x) − w^(2)(x)}^p
- max, min, k-th largest
- Union, Jaccard similarity
- Generalized coverage functions
Specific estimators for specific aggregations. 0/1 weights: size of union [C'95], Jaccard similarity [Broder '97]. General weights, tighter estimators: max, min, quantiles [C'-Kaplan 2009, 2011].
Monotone Estimation Problems [C'-Kaplan '13, C'14]:
- Characterization of all functions for which unbiased bounded-variance estimators exist.
- Efficient (Pareto-optimal) estimators (when they exist).
Graphs: influence estimation from node sketches.
!!! Coordination is essential to getting good estimators; independent samples will not work.
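The Jaccard-similarity case (0/1 weights, i.e. MinHash) shows why coordination matters: both sets are sampled with the same hash, so the samples can be meaningfully intersected. A minimal sketch with illustrative helper names:

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k_set(keys, k):
    # MinHash-style coordinated sample: the k smallest shared hashes.
    return set(sorted(set(keys), key=hash_unit)[:k])

def jaccard_estimate(A, B, k):
    sa, sb = bottom_k_set(A, k), bottom_k_set(B, k)
    u = bottom_k_set(sa | sb, k)   # = bottom-k of A ∪ B, by composability
    return len(u & sa & sb) / len(u)
```

With independent samples of A and B, the intersection of the samples would be nearly empty regardless of |A ∩ B|, and no such estimator exists.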
Distributed/streamed data elements: sampling/counting without aggregation
- A data element e has a key and a value: (e.key, e.value); multiple elements may have the same key.
- Sum of weights: w_x = Σ_{e | e.key = x} e.value; max weight: w_x = max_{e | e.key = x} e.value.
- Segment f-statistics Σ_{x∈H} f(w_x): computed from a (composable bottom-k) sampling scheme.
- f-statistics over the full data Σ_{x∈X} f(w_x): computed from an approximate "counting" structure.
- Naïve approach: aggregate the pairs (x, w_x), then sample: requires state linear in the number of distinct keys.
- Challenge: sample/count with respect to f(w_x) using small state (no aggregation).
- Sampling gold standard: the "aggregated" sample size/quality tradeoff, CV ≈ 1/√k.
- Counting gold standard: like HyperLogLog: O(ε^{-2} + log log n) state, CV = ε.
Distributed/streamed data elements: prior work on sampling/counting without aggregation
- Distinct (f(x) = 1 for x > 0): counting [Flajolet-Martin '85, Flajolet et al. '07]; sampling [Knuth '69].
- Sum (f(x) = x): counting [Morris '77]; sampling [Gibbons-Matias '98, Estan-Varghese '05, CDKLT '07].
- Frequency moments (f(x) = x^p): counting [Alon-Matias-Szegedy '99, Indyk '01].
- "Universal" sketches [Braverman-Ostrovsky '10] (counting).
But: except for sum and distinct, these are not even close to the "gold standard".
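The Morris '77 counter cited above is the simplest example of counting without aggregation: the state is a single small exponent. A minimal sketch (function name is mine):

```python
import random

def morris_count(n_events, rng):
    # Morris '77: store only a small exponent c. On each event, increment c
    # with probability 2**-c; the estimate 2**c - 1 has expectation n_events.
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1
```

A single run is noisy (the standard deviation is on the order of the count itself), but the estimator is unbiased, so averaging independent counters concentrates around the true count.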
Sampling/counting without aggregation: near-gold-standard results
[C'15, C'16]: sampling and counting near the "gold standard" (within a factor √(e/(e−1)) ≈ 1.26) for concave f with (sub)linear growth:
- Distinct, sum
- Low frequency moments: f(x) = x^p for p ∈ [0,1]
- Capping functions: Cap_T(x) = min{T, x}
- Logarithms: f(x) = log(1 + x)
Sampling ideas: an element-processing step that converts sum to max via suitable distributions, approximating the sampling probabilities; invert the sampling transform for unbiased estimation.
Counting ideas: element processing guided by the Laplace transform, converting the problem to max-distinct approximate counting.
Conclusion
We got a taste of sampling "big ideas" that have tremendous impact on analyzing massive data sets:
- Uniform sampling
- Weighted sampling
- Coordination of samples
- Multi-objective samples
- Estimation of multi-instance functions
- Sampling and computing statistics over unaggregated (distributed/streamed) elements
Future: still fascinated by sampling and its applications. Near-future directions:
- Extend "gold standard" sampling/counting over unaggregated data and understand the limits of the approach.
- Coordination for better mini-batch selection for metric embedding via SGD.
- Multi-objective samples for clustering objectives; understand optimization over coordinated samples.
Thank you!