TRANSCRIPT
The Magic of Random Sampling: From Surveys to Big Data
Edith Cohen, Google Research and Tel Aviv University
Disclaimer: Random sampling is a classic and well-studied tool with enormous impact across disciplines. This presentation is biased and limited by its length, my research interests, experience, understanding, and my being a computer scientist. I will attempt to present some big ideas and selected applications. I hope to increase your appreciation of this incredible tool.
Harvard IACS seminar, 11/11/2016
What is a sample?
A sample is a summary of the data, in the form of a small set of representatives.
Why use samples? When to use samples?
- Inherent limited availability of the data.
- Have the data, but limited resources:
  - Storage: long or even short term
  - Transmission bandwidth
  - Survey cost of sampled items
  - Computation
  - Time: delay of processing larger data
A sample can be an adequate replacement for the full data set for purposes such as summary statistics, fitting models, and other computation/queries.
"Data, data, everywhere." The Economist, 2010
art: pokemon.com
History
Recent centuries: a tool for surveying populations.
- Graunt 1662: estimate the population of England.
- Laplace ratio estimator [1786, 1802]: estimate the population of France. Sampled "communes" (administrative districts), counting the ratio of population to live births in the previous year, then extrapolated from birth registrations in the whole country.
- Kiaer 1895: "representative method"; March 1903: "probability sampling".
- US census 1938: used a probability sample to estimate unemployment.
Recent decades: a ubiquitous, powerful tool in data modeling, processing, analysis, and algorithm design.
The basis of (human/animal) learning from observations.
Sampling schemes/algorithms: design goals
- Optimize sample size vs. information content (quality of estimates).
- Sometimes balance multiple query types/statistics over the same sample.
- Computational efficiency of algorithms and estimators.
- Efficient on modern platforms (streams, distributed, parallel).
A sample is a lossy summary of the data, from which we can approximate (estimate) properties of the full data.
Data x → Sample S. A query Q: f(x)? on the data is answered by an estimator f̂(S) computed from the sample.
Composable (Mergeable) Summaries
From Sample(A) of data A and Sample(B) of data B, we can compute Sample(A ∪ B) of data A ∪ B.
Why is composability useful?
- Streamed data
- Distributed data / parallelized computation
(Diagram: per-part samples Sample1, ..., Sample5 of distributed data are merged pairwise, e.g. S.1∪2, S.3∪4, then S.1∪2∪5, yielding a sample of 1 ∪ 2 ∪ 3 ∪ 4 ∪ 5.)
Outline
Selected "big ideas" and example applications:
- "Basic" sampling: uniform / weighted
- The magic of sample coordination
- Efficient computation over unaggregated data sets (streamed, distributed)
Uniform (equal probability) sampling
Bernoulli sampling: each item is selected independently with probability p.
Reservoir sampling [Knuth 1968, Vitter 1985]: selects a k-subset uniformly at random. Fixed sample size k; composable (Knuth's); state ∝ sample size k.
Bottom-k implementation:
- Associate with each item x an independent random number H(x) ∼ U[0,1] (can use a random hash function with limited precision).
- Keep the k items with smallest H(x).
- Correctness: all k-subsets have the same probability.
- Composability: bottom-k(A ∪ B) = bottom-k(bottom-k(A) ∪ bottom-k(B)).
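The bottom-k scheme above can be sketched in a few lines. This is an illustrative sketch, not the talk's code; the helper name `hash_unit` is mine, and SHA-256 stands in for the random hash H(x):

```python
import hashlib

def hash_unit(key):
    # Map a key to a pseudo-random value in [0, 1): a stand-in for H(x).
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k(keys, k):
    # Keep the k items with smallest H(x): a uniform-at-random k-subset.
    return sorted(set(keys), key=hash_unit)[:k]
```

Composability holds exactly as stated: the bottom-k of a union equals the bottom-k of the merged per-part bottom-k samples, since the k smallest hashes of A ∪ B must each be among the k smallest of their own part.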
Uniform (equal probability) distinct sampling
Distinct sampling: a (uniform at random) k-subset of the distinct "keys".
Distinct counting: the number of distinct keys in a (sub)population.
Example applications:
- Distinct search queries
- Distinct source-destination IP flows
- Distinct users in activity logs
- Distinct web pages from logs
- Distinct words in a corpus
- ...
Uniform distinct sampling & approximate counting
- Associate with each key x an independent random hash H(x) ∼ U[0,1].
- Keep the k keys with smallest H(x).
Distinct sampling: a (uniform at random) k-subset of the distinct keys.
Distinct counting: the number of distinct keys in a (sub)population.
Computation:
- (Naïve algorithm) Aggregate, then sample/count: state ∝ #distinct keys.
- Distinct reservoir sampling / approximate counting [Knuth 1968] + [Flajolet & Martin 1985]: a composable summary with state ∝ sample size k.
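Because duplicates of a key hash to the same value, the bottom-k of hashed keys is a distinct sample, and the k-th smallest hash yields a distinct-count estimate. A minimal sketch (the k-minimum-values estimator; `hash_unit` is an illustrative helper, and for brevity this version hashes the whole input rather than streaming with a bounded heap):

```python
import hashlib

def hash_unit(key):
    # Shared random hash H(x) in [0, 1); duplicates of a key collide by design.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def kmv_distinct_estimate(elements, k):
    # Keep the k smallest hashes of the distinct keys seen.
    hashes = sorted({hash_unit(x) for x in elements})[:k]
    if len(hashes) < k:
        return float(len(hashes))   # fewer than k distinct keys: exact count
    return (k - 1) / hashes[-1]     # the k-th smallest of n uniform hashes is ~ k/(n+1)
```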
Estimation from a uniform sample
Domain/segment queries: a statistic of a selected segment H of the population X.
Examples: approximate the number of Pokémon in our population that are
- college-educated water type;
- flying and hatched in Boston;
- weighing less than 1 kg.
Application: logs analysis for marketing, planning, targeted advertising.
Estimation from a uniform sample
Domain/segment queries: a statistic of a selected segment H of the population X.
Properties:
- Ratio estimator (when the total population size |X| is known): |S ∩ H| / |S| · |X|.
  - An unbiased, minimum-variance estimator.
  - Coefficient of variation (the "relative error"): decreases on the order of 1/√k, and grows as the segment gets smaller.
  - Concentration (Bernstein, Chernoff): the probability of a large relative error decreases exponentially.
- Inverse-probability estimator [Horvitz & Thompson 1952] (when |X| is not available, as with distinct reservoir sampling): |S ∩ H| / p, where p = Prob[H(x) ≤ min_{y∉S} H(y)] is given by the (k+1)-th smallest value of H.
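The inverse-probability route can be sketched as follows: a bottom-k sample together with the (k+1)-th smallest hash, which plays the role of the inclusion probability p. Illustrative helper names, not the talk's code:

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k_sample(keys, k):
    # Bottom-k sample plus the (k+1)-th smallest hash, which serves as the
    # inclusion probability p of each sampled key.
    ranked = sorted(set(keys), key=hash_unit)
    p = hash_unit(ranked[k]) if len(ranked) > k else 1.0
    return ranked[:k], p

def estimate_segment_size(sample, p, predicate):
    # Horvitz-Thompson inverse probability: |S ∩ H| / p; |X| is never used.
    return sum(1 for x in sample if predicate(x)) / p
```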
Approximate distinct counting of all data elements
Can use a distinct uniform sample, but we can do better for this special case.
HIP estimators [Cohen '14, Ting '15]: halve the variance.
- Idea: track an estimated count c as the structure (sketch) is constructed; whenever the structure S is modified, add the inverse probability of that modification: c += 1/p(modified).
- Applicable with all distinct counting structures.
- Surprisingly, a better estimator than is possible from the final structure alone.
HyperLogLog [Flajolet et al. 2007]:
- A really small structure: optimal size O(ε^{-2} + log log n) for CV = ε, with n distinct keys.
- Idea: for counting, there is no need to store the keys in a sample; it suffices to use k = ε^{-2} exponents of hashes. The exponent values are concentrated, so store one exponent and k offsets.
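A minimal HIP sketch, layered (for concreteness) on a bottom-k distinct-counting structure rather than HyperLogLog; class and helper names are mine, and this is an assumption-laden illustration of the "add 1/p on each modification" idea, not the papers' pseudocode:

```python
import hashlib
import heapq

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

class HipDistinctCounter:
    # HIP on a bottom-k (k-minimum-values) structure: the state is the
    # k smallest hashes seen plus one running count.
    def __init__(self, k):
        self.k = k
        self.neg_heap = []    # max-heap (negated) of the k smallest hashes
        self.members = set()  # the same hashes, for O(1) membership tests
        self.count = 0.0      # HIP estimate of the number of distinct keys

    def add(self, key):
        h = hash_unit(key)
        tau = -self.neg_heap[0] if len(self.neg_heap) == self.k else 1.0
        if h < tau and h not in self.members:
            self.count += 1.0 / tau   # inverse probability of this modification
            self.members.add(h)
            heapq.heappush(self.neg_heap, -h)
            if len(self.neg_heap) > self.k:
                self.members.remove(-heapq.heappop(self.neg_heap))
```

Duplicate elements never modify the structure (their hash is either already stored or above the threshold), so the running count tracks distinct keys only.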
Graphs: estimating cardinalities of neighborhoods, reachability, centralities
Graph G(V, E). Q: the number of nodes reachable from a node, or within distance 5 of it.
Applications: network analytics, social networks, other data with a graph representation.
- Exact computation is O(|E| · |V|).
- Sample-based sketches [C'1997]: near-linear Õ(|E|).
Idea:
- Associate an independent random H(v) ∼ U[0,1] with all nodes.
- Compute for each node u a sketch S(u): the ε^{-2} reachable nodes v with smallest H(v).
Estimating the sparsity structure of matrix products
Problem: compute a matrix product B = A_1 A_2 ⋯ A_n, n ≥ 3, where the A_i are sparse (many more zero entries than nonzero entries).
Q: find the best order (minimizing computation) for performing the multiplication.
- By associativity, B = A_1 A_2 A_3 can be computed as (A_1 A_2) A_3 or A_1 (A_2 A_3); the computation cost depends on the sparsity structure.
[C'1996]: the sparsity structure of sub-products (the number of nonzeros in rows/columns of products) can be approximated in near-linear time (in the number of nonzeros in {A_i}).
Solution: preprocess to determine the best order, then perform the computation.
Idea: define a graph for the product by stacking bipartite graphs G_i that correspond to the nonzeros in A_i. Row/column sparsity of the product corresponds to reachability-set cardinality.
Unequal probability (weighted) sampling
Keys x can have weights w_x.
Example applications (logs analysis):
- Key: (video, user) view; weight: watch time. Query: total watch time, segmented by user and video attributes.
- Key: IP network flow; weight: bytes transmitted. Query: total bytes transmitted for a segment of flows (src IP, dest IP, protocol, port, ...). Applications: billing, traffic engineering, planning.
Segment queries: for a segment H ⊂ X, estimate w(H) = Σ_{x∈H} w_x (the weight w(H) of my Pokémon bag H).
(Illustration: Pokémon with weights 10.0, 9.0, 460.0, 210.0, 210.0, 12.4, 30.0, 4.0, and 8.0 kg.)
Unequal probability (weighted) sampling
- Key x has weight w_x; the sample S includes x with probability p_x.
Segment query: w(H) = Σ_{x∈H} w_x.
Inverse-probability estimator [HT52]: ŵ(H) = Σ_{x∈H∩S} w_x / p_x.
- Unbiased (when p_x > 0).
- ! To obtain statistical guarantees on quality, we need to sample heavier keys with higher probability.
Poisson Probability Proportional to Size (PPS) [Hansen & Hurwitz 1943]: p_x ∝ w_x minimizes the sum of per-key variances.
- The CV of a segment estimate (the "relative error") decreases on the order of √(w(X)/(k · w(H))), with concentration bounds.
- Robust: if the probabilities are only approximate, the guarantees degrade gracefully.
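A minimal Poisson PPS sketch with the Horvitz-Thompson segment estimator from the previous slide; function names and the choice p_x = min(1, k·w_x/W) are illustrative assumptions, not the talk's code:

```python
import random

def pps_sample(weights, k, rng):
    # Poisson PPS: include key x independently with p_x = min(1, k * w_x / W),
    # so inclusion probability is proportional to size (weight).
    total = sum(weights.values())
    sample = {}
    for x, w in weights.items():
        p = min(1.0, k * w / total)
        if rng.random() < p:
            sample[x] = p
    return sample

def ht_segment_weight(sample, weights, segment):
    # Horvitz-Thompson: sum w_x / p_x over sampled keys that fall in H.
    return sum(weights[x] / p for x, p in sample.items() if x in segment)
```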
Unequal probability (weighted) sampling
Composable weighted sampling scheme with fixed sample size k (bottom-k / order samples / "weighted" reservoir sampling):
- Associate with each key x the rank value r(x) = u_x / w_x, for an independent random u_x.
- Keep the k keys with smallest r(x).
Example with u_x ∼ U[0,1]:
w_x (kg): 12.4  4.0  460.0  30.0  210.0  10.0  8.0  210.0
u_x:      0.22  0.12  0.31  0.81  0.06  0.72  0.45  0.57
r(x):     0.017  0.03  0.00067  0.027  0.00029  0.072  0.05625  0.00271
Unequal probability (weighted) sampling: choosing the distribution D of u_x
- Without replacement: D = Exp[1] [Rosén '72, C' '97, C'-Kaplan '07, Efraimidis-Spirakis '05].
- Priority (sequential Poisson) sampling: D = U[0,1] [Ohlsson '00, Duffield-Lund-Thorup '07].
- Essentially the same quality guarantees as Poisson PPS.
- !! Also works for max-distinct over key-value pairs (e.key, e.w), where w_x = max_{e | e.key = x} e.w.
As before: associate with each key x the value r(x) = u_x / w_x for independent random u_x ∼ D, and keep the k keys with smallest r(x).
Estimation (inverse probability): τ = the smallest r_x over x ∉ S; p_x = Prob[r(x) < τ] = Prob_{y∼D}[y < w_x τ].
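Priority sampling and its threshold estimator can be sketched as follows (a hedged illustration with my own function names; for D = U[0,1] the inclusion probability becomes p_x = min(1, w_x · τ)):

```python
import random

def priority_sample(weights, k, rng):
    # Priority (sequential Poisson) sampling: rank r(x) = u_x / w_x with
    # u_x ~ U[0,1]; keep the k keys of smallest rank. Requires len(weights) > k.
    ranks = {x: rng.random() / w for x, w in weights.items()}
    order = sorted(weights, key=lambda x: ranks[x])
    tau = ranks[order[k]]   # (k+1)-th smallest rank: the estimation threshold
    return order[:k], tau

def estimate_segment_weight(sample, tau, weights, segment):
    # Inverse probability with p_x = Prob[u < w_x * tau] = min(1, w_x * tau).
    return sum(weights[x] / min(1.0, weights[x] * tau)
               for x in sample if x in segment)
```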
Unequal probability (weighted) sampling: example applications
Cut/spectral graph sparsifiers of G = (V, E):
- [Benczúr & Karger 1996]: sample graph edges and re-weight by inverse probability to obtain G' = (V, E') such that |E'| = O(|V|) and all cut values are approximately preserved.
- [Spielman & Srivastava 2008]: spectral approximation L_G ≈ L_{G'} by using p_e ∝ the (approximate) effective resistance of e ∈ E.
Sampling nonnegative matrix products [Cohen & Lewis 1997]: for nonnegative matrices A, B, efficiently sample entries in the product AB proportionally to their magnitude (without computing the product). Idea: random walks using edge-weight probabilities.
Sample Coordination [Brewer, Early, Joyce 1972]
The same set of keys, multiple sets of weights ("instances"). Goal: make the samples of the different instances as similar as possible.
Survey sampling motivation: weights evolve, but surveys impose a burden. We want to minimize the burden and still have a weighted sample of the evolved set.
Example: the same keys weighed on Monday and on Tuesday, with a shared u_x:
Mon w_x (kg): 12.4  4.0  460.0  30.0  210.0  10.0  8.0  210.0
Tue w_x (kg): 15.0  6.0  300.0  50.0  110.0  5.0  4.0  300.0
u_x:          0.22  0.12  0.31  0.81  0.06  0.72  0.45  0.57
Mon r(x):     0.017  0.03  0.00067  0.027  0.00029  0.072  0.05625  0.00271
Tue r(x):     0.015  0.02  0.00103  0.0162  0.00055  0.144  0.1125  0.0019
How to coordinate samples
Coordinated bottom-k samples: use the same u_x = H(x) for all instances.
- Associate with each key x the value r(x) = H(x) / w_x, for an independent random hash H(x).
- Keep the k keys with smallest r(x).
!! The change between samples is minimized, given that we still have a weighted sample of each instance.
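Coordination is a one-line change to the weighted bottom-k sketch: draw u_x from a hash of the key instead of fresh randomness per instance (illustrative helper names, not the talk's code):

```python
import hashlib

def hash_unit(key):
    # The shared hash u_x = H(x): the source of coordination across instances.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def coordinated_bottom_k(weights, k):
    # Bottom-k by rank r(x) = H(x) / w_x, with the SAME H for every instance.
    ranks = {x: hash_unit(x) / w for x, w in weights.items()}
    return set(sorted(weights, key=lambda x: ranks[x])[:k])
```

Instances with similar weight vectors then get similar samples; in the extreme case of uniformly rescaled weights, the sample is identical.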
Coordination of samples: why it is useful
- Locality Sensitive Hashing (LSH): similar weight vectors have similar samples/sketches.
- Multi-objective samples (universal samples): a single sample (as small as possible) that provides statistical guarantees for multiple sets of weights/functions.
- Statistics/domain queries that span multiple "instances" (Jaccard similarity, L_p distances, distinct counts, union size, sketches of coverage functions, ...). MinHash sketches are a special case with 0/1 weights.
- Facilitates faster computation of samples. Example: [C'97] sketching/sampling the reachability sets and neighborhoods of all nodes in a graph in near-linear time.
- Facilitates efficient optimization over samples: optimize an objective over sets of weights/functions/parametrized functions. Examples: centrality/clustering objectives [CCK '15], learning [Kingma & Welling '14].
A very powerful tool for big data analysis, with applications well beyond what [Brewer, Early, Joyce 1972] could envision.
Multi-objective Sample
- The same keys can have different "weights": IP flows have bytes, packets, count.
- We want to answer segment queries with respect to all weights.
- Naïve solution: 3 disjoint samples.
- Smart solution: a single multi-objective sample.
(Illustration: each Pokémon has a weight, height, and age, e.g. 10.0 kg / 50 cm / 10 years, 460.0 kg / 180 cm / 50 years, ..., 8.0 kg / 60 cm / 12 years.)
Multi-objective priority (sequential Poisson) sampling
Example: nine keys x, a shared u_x ∼ U[0,1], and three objectives (count, cap5(w) = min{5, w}, and thresh10(w) = 1 iff w ≥ 10):
w_x:            135   2     9     18    21    4     11    4     2
count(w_x):     1     1     1     1     1     1     1     1     1
cap5(w_x):      5     2     5     5     5     4     5     4     2
thresh10(w_x):  1     0     0     1     1     0     1     0     0
u_x:            0.52  0.24  0.76  0.90  0.14  0.32  0.44  0.07  0.82
u_x/thresh10(w_x): 0.52  1  1  0.90  0.14  1  0.44  1  1  (keys with thresh10(w_x) = 0 get the maximal rank and are never sampled)
u_x/cap5(w_x):  0.104  0.120  0.152  0.18  0.064  0.080  0.088  0.0175  0.41
For k = 3, the MO sample for F = {count, thresh10, cap5} is the union of the three per-objective bottom-3 samples.
Edith Cohen, Scalable Weighted Sampling
Multi-objective sample of f-statistics
One set of weights w_x, but we are interested in different functions f(w_x).
Multi-objective Sample [C'-Kaplan-Sen '09, C' '15]
Set of functions w^(j) ∈ W, where w^(j): X → R_+.
- Compute coordinated samples S^(j) for each w^(j) ∈ W.
Our multi-objective sample S for W is:
- the union S = ⋃_j S^(j);
- with sampling probabilities p_x = max_j p^(j)(x).
Theorem: for any domain query w^(j)(H) = Σ_{x∈H} w^(j)_x, the inverse-probability estimator ŵ^(j)(H) = Σ_{x∈H∩S} w^(j)_x / p_x is unbiased and provides statistical guarantees at least as strong as those of an estimator applied to the dedicated sample S^(j).
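The union construction can be sketched directly on the slide's example (count, cap5, thresh10). A hedged illustration with my own names; keys with f(w) = 0 are excluded, matching "never sampled":

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def mo_sample(weights, funcs, k):
    # One coordinated bottom-k sample per objective f (shared u_x = H(x));
    # the multi-objective sample is the union of the per-objective samples.
    union = set()
    for f in funcs:
        ranks = {x: hash_unit(x) / f(w) for x, w in weights.items() if f(w) > 0}
        union.update(sorted(ranks, key=lambda x: ranks[x])[:k])
    return union
```

Because the per-objective samples are coordinated, they overlap heavily, so the union is typically much smaller than |W| · k disjoint samples.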
Multi-objective sample of f-statistics: priority (sequential Poisson) version
One set of weights w_x, different functions f(w_x).
Multi-objective sample for all monotone statistics M: all monotone non-decreasing functions f with f(0) = 0. Examples: f(w) = w^p; f(w) = min{10, w}; f(w) = log(1 + w), ...
Data of key-value pairs (x, w_x): for each f, the instance is f(w_x) for x ∈ X.
Theorem [C'97, C'K'07 (threshold functions); C'15 (all of M)]: the multi-objective sample for all monotone statistics M has
- (expected) sample size O(k ln n), where n = #keys with w_x > 0;
- a composable structure of size equal to the sample size.
⟹ Very efficient to compute on streamed/parallel/distributed platforms.
Next: applications to graphs and streams.
Multi-objective sample of monotone statistics
Application: time-decaying aggregations over data streams. For a monotone non-increasing decay function α(x) and a segment H ⊂ V:
A_α = Σ_{u∈H} α(t_u)
- t_u: the elapsed time from element u to the current time.
The theorem above applies: expected sample size O(k ln n), with a composable structure of the same size.
(Illustration: a stream of elements arriving between 12:00am and 4:00am.)
Multi-objective sample of monotone statistics
Application: centrality of all nodes in a graph G = (V, E) [C'97, C'-Kaplan '04, C'15]. For a node v, a monotone non-increasing α(x), and a segment H ⊂ V, the centrality of v for segment H is
C_α(v, H) = Σ_{u∈H} α(d_{vu})   (harmonic centrality: α(x) = 1/x).
Thm: All-Distances Sketches (ADS), which are MO samples, for all nodes can be computed in Õ(|E|) time. We can estimate C_α(v, H) for all α and H from ADS(v).
The theorem above applies: expected sample size O(k ln n), with a composable structure of the same size.
Multi-objective sample of distances to a set of points in a metric space [Chechik-C'-Kaplan '15]
- Metric space M; set of points P = {x_1, x_2, ..., x_n} ⊂ M.
- Each point v ∈ M defines weights w_v(x_i) = d_{v x_i}.
- A multi-objective sample of all w_v allows us to estimate, for segments H ⊂ P and any query point v, the sum of distances C(v, H) = Σ_{x_i∈H} d_{v x_i}.
!!! Can even relax the triangle inequality to d_{ab} ≤ ρ(d_{ac} + d_{cb}) (e.g., squared distances).
Theorem:
- The multi-objective overhead for distances is O(1)!
- Can be computed using a near-linear number of distance queries.
⟹ Sample size O(ε^{-2}) suffices for estimates of C(v, P) for every v ∈ M with CV ε.
Estimators for multi-instance aggregates
Set of functions w^(j) ∈ W where w^(j): X → R_+; coordinated samples S^(j) for each w^(j) ∈ W.
Example multi-instance aggregations:
- L_p^p distance: Σ_{x∈H} |w^(1)(x) − w^(2)(x)|^p
- One-sided L_p^p: Σ_{x∈H} max{0, w^(1)(x) − w^(2)(x)}^p
- max, min, k-th largest
- Union, Jaccard similarity
- Generalized coverage functions
Specific estimators for specific aggregations. 0/1 weights: size of union [C'95], Jaccard similarity [Broder '97]. General weights, tighter estimators: max, min, quantiles [C'-Kaplan 2009, 2011].
Monotone Estimation Problems [C'-Kaplan '13, C'14]:
- Characterization of all functions for which unbiased bounded-variance estimators exist.
- Efficient (Pareto-optimal) estimators (when they exist).
Graphs: influence estimation from node sketches.
!!! Coordination is essential to getting good estimators; independent samples will not work.
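The Jaccard-similarity case (0/1 weights, i.e. MinHash) shows why coordination matters: both sets are sampled with the same hash, so the samples can be meaningfully intersected. A minimal sketch with illustrative helper names:

```python
import hashlib

def hash_unit(key):
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2.0**64

def bottom_k_set(keys, k):
    # MinHash-style coordinated sample: the k smallest shared hashes.
    return set(sorted(set(keys), key=hash_unit)[:k])

def jaccard_estimate(A, B, k):
    sa, sb = bottom_k_set(A, k), bottom_k_set(B, k)
    u = bottom_k_set(sa | sb, k)   # = bottom-k of A ∪ B, by composability
    return len(u & sa & sb) / len(u)
```

With independent samples of A and B, the intersection of the samples would be nearly empty regardless of |A ∩ B|, and no such estimator exists.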
Distributed/streamed data elements: sampling/counting without aggregation
- A data element e has a key and a value: (e.key, e.value); multiple elements may have the same key.
- Sum of weights: w_x = Σ_{e | e.key = x} e.value; max weight: w_x = max_{e | e.key = x} e.value.
- Segment f-statistics Σ_{x∈H} f(w_x): computed from a (composable bottom-k) sampling scheme.
- f-statistics over the full data Σ_{x∈X} f(w_x): computed from an approximate "counting" structure.
- Naïve approach: aggregate the pairs (x, w_x), then sample: requires state linear in the number of distinct keys.
- Challenge: sample/count with respect to f(w_x) using small state (no aggregation).
- Sampling gold standard: the "aggregated" sample size/quality tradeoff, CV ≈ 1/√k.
- Counting gold standard: like HyperLogLog: O(ε^{-2} + log log n) state, CV = ε.
Distributed/streamed data elements: prior work on sampling/counting without aggregation
- Distinct (f(x) = 1 for x > 0): counting [Flajolet-Martin '85, Flajolet et al. '07]; sampling [Knuth '69].
- Sum (f(x) = x): counting [Morris '77]; sampling [Gibbons-Matias '98, Estan-Varghese '05, CDKLT '07].
- Frequency moments (f(x) = x^p): counting [Alon-Matias-Szegedy '99, Indyk '01].
- "Universal" sketches [Braverman-Ostrovsky '10] (counting).
But: except for sum and distinct, these are not even close to the "gold standard".
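The Morris '77 counter cited above is the simplest example of counting without aggregation: the state is a single small exponent. A minimal sketch (function name is mine):

```python
import random

def morris_count(n_events, rng):
    # Morris '77: store only a small exponent c. On each event, increment c
    # with probability 2**-c; the estimate 2**c - 1 has expectation n_events.
    c = 0
    for _ in range(n_events):
        if rng.random() < 2.0 ** -c:
            c += 1
    return 2 ** c - 1
```

A single run is noisy (the standard deviation is on the order of the count itself), but the estimator is unbiased, so averaging independent counters concentrates around the true count.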
Sampling/counting without aggregation: near-gold-standard results
[C'15, C'16]: sampling and counting near the "gold standard" (within a factor √(e/(e−1)) ≈ 1.26) for concave f with (sub)linear growth:
- Distinct, sum
- Low frequency moments: f(x) = x^p for p ∈ [0,1]
- Capping functions: Cap_T(x) = min{T, x}
- Logarithms: f(x) = log(1 + x)
Sampling ideas: an element-processing step that converts sum to max via suitable distributions, approximating the sampling probabilities; invert the sampling transform for unbiased estimation.
Counting ideas: element processing guided by the Laplace transform, converting the problem to max-distinct approximate counting.
Conclusion
We got a taste of sampling "big ideas" that have tremendous impact on analyzing massive data sets:
- Uniform sampling
- Weighted sampling
- Coordination of samples
- Multi-objective samples
- Estimation of multi-instance functions
- Sampling and computing statistics over unaggregated (distributed/streamed) elements
Future: still fascinated by sampling and its applications. Near-future directions:
- Extend "gold standard" sampling/counting over unaggregated data and understand the limits of the approach.
- Coordination for better mini-batch selection for metric embedding via SGD.
- Multi-objective samples for clustering objectives; understand optimization over coordinated samples.
Thank you!