
OPTIMUM ALGORITHMS FOR
TWO RANDOM SAMPLING PROBLEMS*
(extended abstract)

Jeffrey Scott Vitter
Department of Computer Science
Brown University
Box 1910
Providence, RI 02912

*Some of this research was done while the author was consulting for the IBM Palo Alto Scientific Center. Support was also provided in part by NSF Grant MCS-81-05324, by an IBM research contract, and by ONR and DARPA under Contract N00014-83-K-0146 and ARPA Order No. 4786.

Abstract

Several fast new algorithms are presented for sampling n records at random from a file containing N records. The first problem we solve deals with sampling when N is known, and the second problem considers the case when N is unknown. The two main results in this paper are Algorithms D and Z. Algorithm D solves the first problem by doing the sampling with a small constant amount of space and in O(n) time, on the average; roughly n uniform random variates are generated, and approximately n exponentiation operations are performed during the sampling. The sample is selected sequentially and online; this answers an open problem in [Knuth 81]. Algorithm Z solves the second problem by doing the sampling using O(n) space, roughly n ln(N/n) uniform random variates, and O(n(1 + log(N/n))) time, on the average. Both algorithms are time- and space-optimum and are short and easy to implement.

1. INTRODUCTION

Many computer science and statistics applications call for a sample of n records selected randomly from a file containing N records or for a random sample of n integers from the set {1, 2, 3, ..., N}. (Both types of random sampling are essentially equivalent, so in this paper we will refer to the former type of sampling, in which records are selected.) Some important uses of sampling include market surveys, quality control in manufacturing, and probabilistic algorithms. The author's interest in the subject stems from work on a fast new external sorting method called BucketSort that uses random sampling for preprocessing ([Lindstrom and Vitter 81, 82]).

In this paper we find and analyze efficient algorithms for the two random sampling problems described below. Access to the records is always done in a sequential manner. For each case we measure only the CPU time and not the I/O time. This is reasonable for records stored on random-access devices like RAM or disk, and it's even reasonable for tape storage, since tapes have a fast-forward speed that can quickly skip over unwanted records. In terms of random sampling of n integers out of N, the I/O time is insignificant, since there is no file of records being read.

Problem 1: When N is known. If the value of N is known, one way to select the n records is to repeatedly generate a random integer k between 1 and N and select the kth record if it has not already been selected. The algorithm terminates when n records have been selected. (If n > N/2, it is faster to select the N - n records not in the sample.) This is an example of a nonsequential algorithm, because the records in the sample may not be selected in linear order. For example, the 84th record in the file may be selected before the 16th record in the file is selected. The algorithm requires the generation of O(n) uniform random variates, and it runs in O(n) time, on the average, but it requires excessive space.
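As a concrete illustration, here is a minimal Python sketch of this nonsequential method; the function name and interface are ours, not the paper's:

    import random

    def nonsequential_sample(n, N):
        # Repeatedly draw a uniform index in [1, N] until n distinct
        # indices have been chosen; the set of chosen indices is what
        # makes the space requirement excessive for large n.
        assert 0 <= n <= N
        chosen = set()
        while len(chosen) < n:
            chosen.add(random.randint(1, N))   # duplicates are simply retried
        return chosen

    print(nonsequential_sample(5, 100))        # e.g. {3, 27, 41, 58, 96}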

Often we want the n records in the sample to be in the same order that they appear in the file so that, for example, they can be accessed sequentially if they reside on disk or tape. In order to accomplish this using a nonsequential algorithm, we must sort the records by their indices after the sampling is done. This takes nonlinear time, or else the space requirement is very large and the algorithm is complicated.

More importantly, the n records cannot be output in sequential order online: it takes O(n) time to output the first element, since the sorting can begin only after all n records have been selected.

The alternative we take for this problem is to investigate sequential random sampling algorithms, which select the records in the same order that they appear in the file. The sequential sampling algorithms we present (Algorithms A, B, C, and D) are ideally suited to online use, since they iteratively select the next record for the sample in an efficient way. They also have the advantage that they require a small constant amount of space and are simple to implement.

The main result for this problem is the design and analysis of Algorithm D, which does the sequential sampling with about n uniform random variates and in O(n) time, on the average. This yields the optimum running time, and it solves the open problem listed in exercise 3.4.2-8 in [Knuth 81]. The method is much faster than the previously fastest-known sequential algorithm (Algorithm S), and it is much faster and simpler than the nonsequential algorithms mentioned above. The algorithms are covered in Section 2; their performance is summarized in the table below.

   Alg.   ave. # unif. random vars.   ave. exec. time
   S      (N+1)n/(n+1)                O(N)
   A      n                           O(N)
   B      n                           O(n² log log N)
   C      n(n+1)/2                    O(n²)
   D      ~n                          O(n)

Section 2.6 gives CPU timings for FORTRAN implementations of Algorithms S, A, and D on an IBM 3081 mainframe; the constants of proportionality in the above big-oh terms for these three algorithms (in microseconds) are approximately 30 (Algorithm S), 4 (Algorithm A), and 50 (Algorithm D).

Problem 2: When N is unknown. This problem arises, for example, if the records to be sampled from reside on a tape of indeterminate length. One way to get a random sample when N is unknown is first to determine the value of N, which reduces this problem to the previous one, and then to use Algorithm D. However, predetermining N may not always be practical or possible. For example, in the tape analogy, rewinding may not be allowed.

We will look at methods that access the records sequentially when N cannot be predetermined. In this case the best we can do is use a type of "reservoir" algorithm, which cannot be online, since the entire sample must be selected before the first record is output. A reservoir algorithm starts out with the first n records as candidates for the sample. It maintains the invariant that the candidates form a true random sample of all the records processed so far. When the last record is processed, the candidates are sorted internally by index and returned as the final sample. We can show that every algorithm that processes the records in a sequential manner is a type of reservoir algorithm.

The method of choice up until now was Algorithm R in [Knuth 81], which was developed by Alan Waterman. We present three new reservoir methods (Algorithms X, Y, and Z) that use the same amount of space as Waterman's method, but are much faster. They are discussed in Section 3; their performance is summarized in the table below. The time for the final internal sort is not included. Algorithm Z, which is the main result for this problem, achieves the optimum running time.

   Alg.   ave. # unif. random vars.   ave. exec. time
   R      N - n                       O(N)
   X      ~n ln(N/n)                  O(N)
   Y      ~n ln(N/n)                  O(n²(1 + log(N/n)) log log N)
   Z      ~n ln(N/n)                  O(n(1 + log(N/n)))

Typical values of the constant r are in the range 5-15. Much of the analysis and many of the derivations are not included in this version of the paper for purposes of brevity. For the same reason, Algorithms D and Z are the only new algorithms that are covered in detail.

2. PROBLEM 1: WHEN N IS KNOWN

This section reviews Algorithm S, which up until now was the method of choice for sequential random sampling, and then covers the four new sequential methods. The main result of this section is Algorithm D, which is faster than all other random sampling algorithms, both sequential and nonsequential.

2.1. ALGORITHM S

In this section we discuss the sequential random sampling method introduced in [Fan, Muller, and Rezucha 62] and [Jones 62]. The algorithm processes the records of the file in sequence and decides for each one whether it should be included in the sample. When n records have been selected, the algorithm terminates. The algorithm never runs past the last record in the file, and the n selected records form a true random sample of size n.

If m records have already been selected from among the first t records in the file, the (t+1)st record is selected with probability

\[
\binom{N-t-1}{n-m-1} \bigg/ \binom{N-t}{n-m} = \frac{n-m}{N-t}. \tag{2-1}
\]

In the implementation below, the values of n and N decrease during the course of execution. All the algorithms in Section 2 follow the convention that n is the number of records remaining to be selected and N is the number of records that have not yet been processed. (This is different from the implementations of Algorithm S in [Fan, Muller, and Rezucha 62], [Jones 62], and [Knuth 81], in which n and N remain constant and auxiliary variables like m and t in (2-1) are used.) With this convention, the probability of selecting the next record for the sample is simply n/N. This can be proved directly by the following short but subtle argument: if at any given time we must select n more records at random from a pool of N remaining records, then the next record should be chosen with probability n/N.

   while n > 0 begin
      Generate an independent uniform r. v. U;
      if NU < n then
         begin
         Select the next record for the sample;
         n := n - 1
         end
      else Skip over the next record
         (do not include it in the sample);
      N := N - 1;
      end;
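A direct transcription of Algorithm S into Python, under the same convention that n and N count the records still to be selected and still to be processed, might look as follows (a sketch; the generator interface is ours):

    import random

    def algorithm_s(n, N):
        # Yield the 0-based indices of an ordered random sample of
        # size n from a file of N records, one index at a time.
        index = 0
        while n > 0:
            U = random.random()      # independent uniform variate
            if N * U < n:            # select with probability n/N
                yield index
                n -= 1
            N -= 1                   # one record has been processed
            index += 1

    print(list(algorithm_s(5, 100)))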

2.2. THREE NEW SEQUENTIAL ALGORITHMS

In this section we present three new methods (Algorithms A, B, and C) for sequential random sampling that outperform Algorithm S. A fourth new method (Algorithm D) is described in Sections 2.3-2.5. Much of the analyses of the sampling methods are omitted for lack of space.

We define S(n, N) to be the random variable that counts the number of records to skip over before selecting the next record for the sample. The parameter n is the number of records remaining to be selected, and N is the total number of records left in the file. In other words, the (S(n, N) + 1)st record is the next one selected. Often we will abbreviate S(n, N) by S, in which case the parameters n and N will be implicit.

Each method decides which record to sample next by generating S and skipping that many records. The general form of all four algorithms is as follows:

   while n > 0 begin
      Generate an independent random variate S(n, N);
      Skip over the next S(n, N) records and
         select the following one for the sample;
      N := N - S(n, N) - 1;
      n := n - 1;
      end;

Algorithms A, B, C, and D differ from one another in how they generate S(n, N). The range of S(n, N) is the set of integers in the interval 0 ≤ s ≤ N - n. The distribution function F(s) = Prob{S ≤ s}, for 0 ≤ s ≤ N - n, can be expressed in two ways:

\[
F(s) = 1 - \frac{(N-s-1)^{\underline{n}}}{N^{\underline{n}}} = 1 - \frac{(N-n)^{\underline{s+1}}}{N^{\underline{s+1}}}. \tag{2-2}
\]

(We use the notation $a^{\underline{b}}$ to denote the "falling power" $a(a-1)\cdots(a-b+1) = a!/(a-b)!$.) Similarly, we have two equivalent expressions for the probability function f(s) = Prob{S = s}, for 0 ≤ s ≤ N - n:

\[
f(s) = \frac{n}{N}\,\frac{(N-s-1)^{\underline{n-1}}}{(N-1)^{\underline{n-1}}} = \frac{n}{N}\,\frac{(N-n)^{\underline{s}}}{(N-1)^{\underline{s}}}. \tag{2-3}
\]

The expected value of S is (N - n)/(n + 1), and the standard deviation of S is approximately the same. The probability function f(s) is graphed in Figure 1.
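These expressions are easy to check numerically; the following Python sketch (helper name ours) evaluates the second form of (2-3) in product form and confirms that f sums to 1 and that E[S] = (N - n)/(n + 1):

    def f(s, n, N):
        # f(s) from (2-3): (n/N) times the falling-power ratio
        # (N-n) to the s over (N-1) to the s, as a running product.
        prob = n / N
        for i in range(s):
            prob *= (N - n - i) / (N - 1 - i)
        return prob

    n, N = 3, 10
    probs = [f(s, n, N) for s in range(N - n + 1)]
    print(sum(probs))                               # 1.0 (up to roundoff)
    print(sum(s * p for s, p in enumerate(probs)))  # (N-n)/(n+1) = 1.75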

Algorithm A. We can generate S by setting it equal to the minimum value s ≥ 0 such that U ≤ F(s), where U is uniformly distributed on the unit interval. By (2-2), we have

\[
U \le 1 - \frac{(N-n)^{\underline{s+1}}}{N^{\underline{s+1}}}, \qquad\text{or equivalently,}\qquad \frac{(N-n)^{\underline{s+1}}}{N^{\underline{s+1}}} \le 1 - U.
\]

The random variable V = 1 - U is uniformly distributed, as U is, so we can generate V directly, as in the following algorithm.

   while n > 0 begin
      Generate an independent uniform r. v. V;
      Search sequentially for the minimum s ≥ 0
         such that (N − n)^{s+1} ≤ N^{s+1}·V (falling powers);
      S := s;
      Skip over the next S records and select
         the following one for the sample;
      N := N - S - 1;
      n := n - 1;
      end;
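In Python, the sequential search can maintain the falling-power ratio incrementally, so that each step of the search costs one multiplication and one division (a sketch; the names are ours):

    import random

    def algorithm_a(n, N):
        # Yield the 0-based indices of the sample in file order.
        index = 0
        while n > 0:
            V = random.random()
            s = 0
            ratio = (N - n) / N          # (N-n)^(1)/N^(1), falling powers
            while ratio > V:             # search for the minimum s
                s += 1
                ratio *= (N - n - s) / (N - s)
            index += s                   # skip s records, take the next
            yield index
            index += 1
            N -= s + 1
            n -= 1

    print(list(algorithm_a(5, 100)))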

Algorithms S and A both require O(N) time, but the number n of uniform variates generated by Algorithm A is much less than (N + 1)n/(n + 1), which is the average number of variates generated by Algorithm S. Algorithm A is several times faster.

Algorithm A is similar to the one proposed in [Fan, Muller, and Rezucha 62], except that in the latter method, the minimum s ≥ 0 satisfying U ≤ F(s) is found by recomputing F(s) from scratch for each successive value of s. The resulting algorithm takes O(nN) time. As the authors noted, that algorithm was definitely slower than their implementation of Algorithm S.

Algorithm B. In Algorithm A, the minimum value s satisfying U ≤ F(s) is found by means of a sequential search. Another way to do that is to find the "approximate root" s of the equation F(s) = U by using a variant of Newton's interpolation method. We call the resulting method Algorithm B.

Since F(s) does not have a continuous derivative, we use in its place the difference function

\[
\Delta F(s) = F(s+1) - F(s) = f(s+1).
\]

Newton's method can be shown to converge for this situation. We can obtain a higher order of convergence in the search for the minimum s by replacing Newton's method with an interpolation scheme that uses the higher order differences $\Delta^k F(s) = \Delta^{k-1}F(s+1) - \Delta^{k-1}F(s)$. Each difference $\Delta^k F(s)$ can be computed in constant time from $\Delta^{k-1}F(s)$ by one multiplication and one division. Algorithm B does not seem to be of practical interest, compared to Algorithms A and D, so we will omit the details.

Figure 1. The probability function f(s) = Prob{S = s} is graphed as a function of s. The mean and standard deviation of S are both approximately N/n. The functions cg(s) and h(s) that are used in Algorithm D are also graphed; the random variable X is assumed to be integer-valued.



Algorithm C. The distribution function F(s) = Prob{S ≤ s} can be expressed algebraically as

\[
F(s) = 1 - \prod_{1\le k\le n} \frac{N-s-k}{N-k+1} = 1 - \prod_{1\le k\le n} \bigl(1 - F_k(s+1)\bigr), \tag{3-8}
\]

where we let $F_k(x) = \text{Prob}\{(N-k+1)U_k \le x\} = x/(N-k+1)$ be the distribution function of the random variable $(N-k+1)U_k$. The random variables $U_1, U_2, \ldots, U_n$ are independent and uniformly distributed on the unit interval. By independence, we have

\[
\prod_{1\le k\le n}\bigl(1 - F_k(s+1)\bigr) = \prod_{1\le k\le n} \text{Prob}\{(N-k+1)U_k > s+1\} = \text{Prob}\Bigl\{\min_{1\le k\le n}\{(N-k+1)U_k\} > s+1\Bigr\}.
\]

Substituting this back into (3-8), we get

\[
F(s) = 1 - \text{Prob}\Bigl\{\min_{1\le k\le n}\{(N-k+1)U_k\} > s+1\Bigr\} = \text{Prob}\Bigl\{\Bigl\lfloor \min_{1\le k\le n}\{(N-k+1)U_k\}\Bigr\rfloor \le s\Bigr\}.
\]

(The notation ⌊x⌋, which is read "floor of x," denotes the largest integer ≤ x.) This shows that S has the same distribution as the floor of the minimum of the n independent random variables $NU_1, (N-1)U_2, \ldots, (N-n+1)U_n$. Algorithm C is based on this idea: it iteratively generates S by setting $S := \lfloor \min_{1\le k\le n} \{(N-k+1)U_k\}\rfloor$.

The selection of the jth record in the sample, where 1 ≤ j ≤ n, requires the generation of n - j + 1 independent uniform random variates and takes O(n - j + 1) time. Hence, Algorithm C requires n + (n-1) + ··· + 1 = n(n+1)/2 uniform variates, and it runs in O(n²) time.
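A Python sketch of Algorithm C follows directly from this observation (names ours):

    import random

    def skip_length_c(n, N):
        # S = floor of the minimum of N*U1, (N-1)*U2, ..., (N-n+1)*Un.
        return int(min((N - k + 1) * random.random()
                       for k in range(1, n + 1)))

    def algorithm_c(n, N):
        index = 0
        while n > 0:
            S = skip_length_c(n, N)
            index += S
            yield index
            index += 1
            N -= S + 1
            n -= 1

    print(list(algorithm_c(5, 100)))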

2.3. ALGORITHM D

It is important to note that if the term (N - k + 1)U_k in the above expression for F(s) were instead replaced by NU_k, the resulting expression would be the distribution function for the minimum of n real numbers in the range from 0 to N. That distribution is the continuous counterpart of S, and it approximates S well. One of the key ideas in this section is that we can generate S in constant time by generating its continuous counterpart and then "correcting" it so that it has exactly the desired distribution function F(s).

Algorithm D has the general form described at the beginning of Section 2.2. The random variable S is generated by an application of a discrete version of von Neumann's rejection-acceptance method. We use a random variable X that is easy to generate and that has a distribution which approximates F(s) well. For simplicity, we assume that X is either a continuous or an integer-valued random variable. Let g(x) denote the density function of X if X is continuous, or else the probability function of X if X is integer-valued. We choose a constant c ≥ 1 so that

\[
f(\lfloor x\rfloor) \le c\,g(x) \tag{2-4}
\]

for all real numbers x. In order to generate S, we generate X and a random variate U that is uniformly distributed on the unit interval. If U > f(⌊X⌋)/cg(X) (which occurs with low probability), we reject ⌊X⌋ and start all over by generating a new X and U. When the condition U ≤ f(⌊X⌋)/cg(X) is finally satisfied, then we accept ⌊X⌋ and make the assignment S := ⌊X⌋. The following lemma shows that S has the desired distribution.

Lemma 1. The random variate S generated by the above procedure has distribution (2-2).

The comparison U > f(⌊X⌋)/cg(X) that is made in order to decide whether ⌊X⌋ should be rejected involves the computation of f(⌊X⌋), which requires O(min{n, ⌊X⌋ + 1}) time using (2-3). Since the probability of rejection is very small, we can avoid this expense most of the time by substituting for f(s) a more quickly computed approximation h(s) such that

\[
h(s) \le f(s). \tag{2-5}
\]

With high probability, we have U ≤ h(⌊X⌋)/cg(X). When this occurs, it follows that U ≤ f(⌊X⌋)/cg(X), so we can accept ⌊X⌋ and set S := ⌊X⌋. The value of f(⌊X⌋) must be computed only when U > h(⌊X⌋)/cg(X), which happens rarely. The functions f(s), cg(s), and h(s) are graphed in Figure 1 for the case in which X is an integer-valued random variable.

When n is large with respect to N, the rejection technique may be slower than the previous algorithms in this section due to the overhead involved in generating X, h(⌊X⌋), g(X), and f(⌊X⌋). For large n, Algorithm A is probably the fastest sampling method. The following algorithm utilizes a constant α < 1 that specifies where the tradeoff is: if n < αN, the rejection technique is used to do the sampling; otherwise, if n ≥ αN, the sampling is done by Algorithm A. The value of α depends on the particular computer implementation. Typical values of α can be expected to be in the range 0.05-0.20.

In the code below, n is the number of records that remain to be selected, and N is the number of unprocessed records in the pool. The functions g(x) and h(s) and the constant c ≥ 1 depend on the current values of n and N, and they must satisfy (2-4) and (2-5). The constant α is in the range 0 < α < 1.

   while (n > 0) and (n ≤ αN) begin
      repeat
         Generate an independent r. v. X
            with density or probability fn. g(x);
         Generate an independent uniform r. v. U;
         if U ≤ h(⌊X⌋)/cg(X) then break-loop
      until U ≤ f(⌊X⌋)/cg(X);
      S := ⌊X⌋;
      Skip over the next S records and select
         the following one for the sample;
      N := N - S - 1;
      n := n - 1;
      end;
   if n > 0 then
      Call Algorithm A to finish the sampling.
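The following Python sketch instantiates this skeleton with the first parameter choice given below in (2-6) and (2-7); the threshold value and helper names are our own assumptions, and a plain Algorithm-S-style scan stands in for Algorithm A at the end:

    import random

    ALPHA = 0.07   # assumed threshold in (0, 1); the paper suggests 0.05-0.20

    def f(s, n, N):
        # Exact f(s) from (2-3), computed in O(min(n, s+1)) time.
        prob = n / N
        for i in range(s):
            prob *= (N - n - i) / (N - 1 - i)
        return prob

    def algorithm_d(n, N):
        index = 0
        while n > 0 and n < ALPHA * N:
            while True:                          # rejection loop for S
                V = random.random()
                X = N * (1.0 - V ** (1.0 / n))   # X1 as in (2-7)
                s = int(X)
                if s > N - n:                    # outside the range of S
                    continue
                U = random.random()
                cg = (n / (N - n + 1)) * (1.0 - X / N) ** (n - 1)   # c1*g1(X)
                h = (n / N) * (1.0 - s / (N - n + 1)) ** (n - 1)    # h1(s)
                if U <= h / cg or U <= f(s, n, N) / cg:
                    break                        # accept floor(X)
            index += s
            yield index
            index += 1
            N -= s + 1
            n -= 1
        while n > 0:                             # stand-in for Algorithm A
            if N * random.random() < n:
                yield index
                n -= 1
            N -= 1
            index += 1

    print(list(algorithm_d(50, 10000)))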

Choosing the Parameters. We will now present two good ways for choosing the parameters X, c, g(x), and h(s). The first way works better when n²/N is small, and the second way is better when n²/N is large. The rule for deciding which to use is as follows: if n²/N ≤ β, then we use X₁, c₁, g₁(x), and h₁(s); else if n²/N > β, then X₂, c₂, g₂(s), and h₂(s) are used. The value of the constant β is implementation-dependent. In order to minimize the average number of uniform variates generated by Algorithm D, we should set β ≈ 1; the details are omitted.

Our first choice of the parameters is

\[
g_1(x) = \begin{cases} \dfrac{n}{N}\left(1 - \dfrac{x}{N}\right)^{n-1}, & 0 \le x \le N; \\[1ex] 0, & \text{otherwise}; \end{cases}
\]
\[
c_1 = \frac{N}{N-n+1};
\]
\[
h_1(s) = \begin{cases} \dfrac{n}{N}\left(1 - \dfrac{s}{N-n+1}\right)^{n-1}, & 0 \le s \le N-n; \\[1ex] 0, & \text{otherwise}. \end{cases} \tag{2-6}
\]

The random variable X₁ with density g₁(x) has the beta distribution scaled to the interval [0, N] and with parameters a = 1 and b = n. It is the continuous counterpart of S: the value of X₁ can be thought of as the smallest of n numbers chosen independently and uniformly from the real numbers 0 ≤ x ≤ N. We can generate X₁ in constant time by setting

\[
X_1 := N(1 - V^{1/n}) \qquad\text{or}\qquad X_1 := N(1 - e^{-Y/n}), \tag{2-7}
\]

where V is a uniform random variate and Y is an exponential random variate.

Lemma 2. The choices of g₁(x), c₁, and h₁(s) in (2-6) satisfy the relation

\[
h_1(s) \le f(s) \le c_1\,g_1(s+1), \qquad 0 \le s \le N-n.
\]

Note that since g₁(x) is a nonincreasing function, this immediately implies (2-4).

The second choice for the parameters is

\[
g_2(s) = \frac{n-1}{N-1}\left(1 - \frac{n-1}{N-1}\right)^{s}, \qquad s \ge 0;
\]
\[
c_2 = \frac{n}{n-1}\,\frac{N-1}{N};
\]
\[
h_2(s) = \begin{cases} \dfrac{n}{N}\left(1 - \dfrac{n-1}{N-s}\right)^{s}, & 0 \le s \le N-n; \\[1ex] 0, & \text{otherwise}. \end{cases} \tag{2-8}
\]

The random variable X₂ with probability function g₂(s) has the geometric distribution. Its range of values is the set of nonnegative integers. It can be generated very quickly with a single uniform or exponential random variate.
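Since g₂ is the geometric probability function with success probability p = (n - 1)/(N - 1) on {0, 1, 2, ...}, X₂ can be obtained by inverting one uniform variate; a sketch, assuming 2 ≤ n < N (function name ours):

    import math, random

    def generate_x2(n, N):
        # Geometric variate with probability function g2 of (2-8):
        # g2(s) = p * (1 - p)**s for s = 0, 1, 2, ...
        p = (n - 1) / (N - 1)
        U = 1.0 - random.random()          # U in (0, 1]
        return int(math.log(U) / math.log(1.0 - p))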

Lemma 3. The choices g₂(s), c₂, and h₂(s) in (2-8) satisfy relations (2-4) and (2-5).

2.4. OPTIMIZATION TECHNIQUES

This section gives a nice example of the interplay of theory and practice by studying how Algorithm D can be optimized. We'll show that by using some statistical and programming techniques, the running time of the naive implementation of the algorithm can be cut by a factor of 2.

For simplicity, let's assume that X₁ is used for X. The random variable X₁ can be generated in constant time by first generating an independent uniform random variate V, as in (2-7). Hence the naive way of generating X and U requires two uniform random variates per loop. We can reduce that number from two per loop to just one per loop by generating V based on the previous values of U and X in a way that guarantees independence, as follows: During the previous repeat-loop, it was determined that either U ≤ y₁, y₁ < U ≤ y₂, or y₂ < U, where y₁ = h(⌊X⌋)/cg(X) and y₂ = f(⌊X⌋)/cg(X). We can compute V for the next loop by setting

\[
V := \begin{cases} \dfrac{U}{y_1}, & \text{if } U \le y_1; \\[1ex] \dfrac{U-y_1}{y_2-y_1}, & \text{if } y_1 < U \le y_2; \\[1ex] \dfrac{U-y_2}{1-y_2}, & \text{if } y_2 < U. \end{cases}
\]

The following lemma can be proven using the definitions of independence and V:

Lemma 4. The value V computed above is a uniform random variate that is independent of all previous values of X and of whether or not each X was accepted.

At this point, only one uniform random variate must be generated during each loop, but each loop still requires two exponentiation operations: one to compute X from V and the other to compute h(⌊X⌋)/cg(X). When X₁ is used for X, we have

\[
\frac{h(\lfloor X\rfloor)}{c\,g(X)} = \frac{N-n+1}{N}\left(\frac{N-n-\lfloor X\rfloor+1}{N-n+1}\cdot\frac{N}{N-X}\right)^{n-1}.
\]

We can cut down the number of exponentiations to roughly one per loop in the following way: Instead of doing the test U ≤ h(⌊X⌋)/cg(X), we use the equivalent test

\[
\left(\frac{N}{N-n+1}\,U\right)^{1/(n-1)} \le \frac{N-n-\lfloor X\rfloor+1}{N-n+1}\cdot\frac{N}{N-X}.
\]

If the test is true, which is almost always the case, we set V' to the quotient of the LHS divided by the RHS; the resulting V' has the same distribution as the (n-1)st root of a uniform random variate. Since n decreases by 1 before the start of the next loop, we can generate the next value of X without doing an exponentiation by setting

\[
X := N(1 - V').
\]

(Cf. (2-7).) Thus, in almost all cases, only one exponentiation operation is required per loop. Similar optimizations can be used when X₂ is used for X.
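The recycling step of Lemma 4 can be sketched in Python as follows, with y₁ and y₂ as above (this illustration is ours, not the paper's code):

    def recycle_uniform(U, y1, y2):
        # Rescale the subinterval of [0, 1] that U was observed to fall
        # in during the last acceptance test; the result is again
        # uniform on [0, 1] and independent of the test's outcome.
        if U <= y1:
            return U / y1
        elif U <= y2:
            return (U - y1) / (y2 - y1)
        else:
            return (U - y2) / (1.0 - y2)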

2.5. ANALYSIS OF THE REJECTION METHOD

In this section we prove that the average number of uniform random variates generated by Algorithm D and the average running time are both O(n). We also discuss how the correct choice of X₁ or X₂ during each iteration of the algorithm can improve performance further. The derivations are omitted for purposes of brevity.

Average Number V(n, N) of Uniform Random Variates. We will denote the average number of uniform random variates generated during Algorithm D by the symbol V(n, N). The following theorem shows that V(n, N) is bounded by a linear function of n.

Theorem 1. The average number V(n, N) of uniform random variates used by Algorithm D is bounded by

\[
V(n, N) \le \begin{cases} \dfrac{nN}{N-n+1}, & \text{if } n < \alpha N; \\[1ex] n, & \text{if } n \ge \alpha N. \end{cases} \tag{2-9}
\]

When n < αN, we have V(n, N) ≈ n(1 + n/N). The proof is by induction on n. The basic ideas of the proof are that U and X must be generated roughly 1 + n/N times, on the average, in order to generate each S, and that the ratio n/N does not change much during each of the n times S is generated.

In our derivation of (2-9), we assume that X₁ is used for X throughout Algorithm D. We can do better if we sometimes use X₂ for X. The details will be included in the full version of the paper.

Average Execution Time T(n, N). We will use T(n, N) to represent the total average running time of Algorithm D.

Theorem 2. The average running time T(n, N) of Algorithm D is bounded by a linear function of n. The constant implicit in the big-oh notation is very small.

Outline of Proof. If n ≥ αN, then Algorithm A is used, and the running time is O(n). All we need to show is that the case n < αN also requires linear time. We assume that X₁, g₁(x), c₁, and h₁(s) are used in place of X, g(x), c, and h(s) throughout Algorithm D. The repeat-loop is executed c₁ times, on the average, when S is generated; each execution of the repeat-loop (not counting the test U ≤ f₁(⌊X⌋)/c₁g₁(X)) takes d₁ time units, for some very small constant d₁. The proof of Theorem 1 shows that the total contribution to T(n, N) from the statements in the repeat-loop is bounded by

\[
\frac{d_1\,n N}{N-n+1}.
\]


The tricky part of the proof is to consider the contribution to T(n, N) from the test U ≤ f₁(⌊X⌋)/c₁g₁(X) at the end of the repeat-loop. The time for each execution of the test is bounded by d₂ · min{n, ⌊X₁⌋ + 1} ≤ d₂(⌊X₁⌋ + 1), for some very small constant d₂. The repeat-loop is executed an average of c₁ times per generation of S. The probability that U > h₁(⌊X⌋)/c₁g₁(X) (which is the probability that the test U ≤ f₁(⌊X⌋)/c₁g₁(X) will be executed next) is ≤ 1 - h₁(⌊X⌋)/c₁g₁(X). Hence, the time spent executing the test U ≤ f₁(⌊X⌋)/c₁g₁(X) in order to generate S is bounded by

\[
c_1 d_2 \int_0^N (\lfloor x\rfloor + 1)\,g_1(x)\,dx \;-\; d_2 \int_0^N (\lfloor x\rfloor + 1)\,h_1(\lfloor x\rfloor)\,dx.
\]

The first integral is at most

\[
c_1 d_2\,\frac{N+n+1}{n+1}.
\]

The second integral equals

\[
\frac{d_2}{c_1}\,\frac{N+2}{n+1} \;=\; \frac{N-n+1}{N}\,\frac{d_2\,(N+2)}{n+1}.
\]

The difference of the two integrals is bounded by

\[
\frac{3\,d_2\,N}{N-n+1}.
\]

The proof of Theorem 1 shows that the total contribution of the test U ≤ f₁(⌊X⌋)/c₁g₁(X) to T(n, N) is at most

\[
\frac{3\,d_2\,n N}{N-n+1}.
\]

This completes the proof of Theorem 2. ∎

As before, we can do better when n²/N > β (for some constant β ≈ 1) by using X₂ for X instead of X₁. The details are omitted for lack of space.

2.6. COMPARISON OF RUNNING TIMES

Algorithms S, A, and D were implemented in FORTRAN on a large mainframe IBM 3081 computer, in order to get an idea of the limit of their performance. Their average CPU times are given by the following approximations:

   Algorithm   average execution time (seconds)
   S           ≈ 3 × 10⁻⁵ N
   A           ≈ 4 × 10⁻⁶ N
   D           ≈ 5 × 10⁻⁵ n

For example, for the case n = 10², N = 10⁸, the CPU times were 0.75 hours (Algorithm S), 6.3 minutes (Algorithm A), and 0.0048 seconds (Algorithm D). The optimizations discussed in Section 2.4 cut the CPU time by roughly half of what it was previously.

These timings give a good lower bound on how fast these algorithms run in practice and show the relative speeds of the algorithms. On a smaller computer, the running times should be several times longer.


3. PROBLEM 2: WHEN N IS UNKNOWN

The best solution to the problem when N cannot be predetermined is a "reservoir" algorithm, which operates as follows: Initially, the first n records are put in the reservoir as "candidates" for the final sample. From time to time, more records are put in the reservoir as candidates, taking away candidate status from randomly-chosen former candidates. The algorithm maintains the invariant that the current set of n candidates in the reservoir forms a true random sample of the records processed so far. When the end of the file is reached, the candidates are sorted by index and returned as the final sample. We can show that any algorithm that solves this problem without predetermining N and by processing the records in sequential order must be a type of reservoir algorithm.

Many more records are put in the reservoir than end up as part of the final sample. It is shown in [Knuth 81] that the resulting size of the reservoir is n(1 + H_N - H_n) ≈ n(1 + ln(N/n)). We can show that, on the average, this size of reservoir is required for any algorithm, which implies a lower bound of Ω(n(1 + log(N/n))) time for the sampling.

This section begins with a description of Algorithm R, due to Alan Waterman, which appears in [Knuth 81]. Three new, more efficient reservoir methods (Algorithms X, Y, and Z) are then presented. Algorithm Z, which is the main result for this problem, realizes the above optimum time bound.

All the reservoir methods use an n-slot array I of pointers to keep track of which records in the reservoir are currently candidates. The pointer values I[k], for 1 ≤ k ≤ n, are the addresses of the actual candidates. The array I is small enough to fit in internal memory, so the final sorting phase is very fast.

3.1. ALGORITHM R

After initializing the reservoir to the first n records, Algorithm R repeatedly decides whether the next record in the file should be moved into the reservoir and made a candidate. When the (t + 1)st record is being considered, the record has an n/(t + 1) chance of being in a random sample of size n out of a pool of the first t + 1 records, so Algorithm R makes it a candidate with probability n/(t + 1). The candidate it replaces is chosen randomly. The complete algorithm is below. The built-in boolean function eof returns true iff the end of the file has been reached.

   { Initialize the reservoir to store the first n records }
   Copy the first n records into the reservoir;
   for j := 1 to n do Set I[j] to point to the jth record;
   t := n;

   while not eof do {Process the rest of the records}
      begin
      t := t + 1;
      Generate an independent random integer M, 1 ≤ M ≤ t;
      if M ≤ n then
         begin
         Copy the next record into the reservoir;
         Set I[M] to point to that record
         end
      else Skip over the next record
         (do not include it in the reservoir);
      end;

   { Final internal sorting phase }
   Sort the array I so that I[1] < ... < I[n];
   for j := 1 to n do Output the record at address I[j];
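In Python, Algorithm R can be phrased over an arbitrary iterator, which plays the role of the file of unknown length (a sketch; the interface is ours):

    import random

    def algorithm_r(stream, n):
        # Maintain n candidates; record t replaces a random candidate
        # with probability n/t. Returns the sample in file order.
        reservoir = []                    # (index, record) pairs
        for t, record in enumerate(stream, start=1):
            if t <= n:
                reservoir.append((t, record))
            else:
                M = random.randint(1, t)
                if M <= n:
                    reservoir[M - 1] = (t, record)
        reservoir.sort()                  # final internal sorting phase
        return [record for _, record in reservoir]

    print(algorithm_r(iter(range(1000)), 10))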

3.2. TWO NEW RESERVOIR METHODS

Using the same idea as in Section 2.2, we will define S(n, t) to be the random variable that counts the number of records to skip over before selecting the next record for the reservoir. The parameter n is the number of records in the sample, t is the number of records processed so far, and N is the (unknown) number of records in the file.

We present in this section two new methods for reservoir sampling, called Algorithms X and Y. These two methods, along with Algorithm Z, which is covered in the next three sections, follow the same basic approach. After the reservoir is initialized, as in Algorithm R, these three algorithms repeatedly determine which record to put in the reservoir next by generating S and skipping over that many records. When the end of file is reached, the selecting terminates. The general format is below. The initialization of the reservoir and the final internal sorting phase are not included, since they are the same as before.

   while not eof do {Process the rest of the records}
      begin
      Generate an independent r. v. S(n, t);
      Skip over the next S(n, t) records;
      t := t + S(n, t) + 1;
      if not eof then
         begin
         Copy the next record into the reservoir;
         Generate an indep. random integer M, 1 ≤ M ≤ n;
         Set I[M] to point to that record
         end
      end;

Algorithms X, Y, and Z differ in how they generate S. The range of S(n, t) is the set of nonnegative integers. The distribution function F(s) = Prob{S ≤ s}, for s ≥ 0, can be expressed in two ways:

\[
F(s) = 1 - \frac{t^{\underline{n}}}{(t+s+1)^{\underline{n}}} = 1 - \frac{(t+1-n)^{\overline{s+1}}}{(t+1)^{\overline{s+1}}}. \tag{3-1}
\]

(The notation $a^{\overline{b}}$ denotes the "rising power" $a(a+1)\cdots(a+b-1) = (a+b-1)!/(a-1)!$.) The two corresponding expressions for the probability function f(s) = Prob{S = s}, for s ≥ 0, are

" t!! n (t - n)8+1I(s) = t+s+l(t+s)~ = t-n - (t+l)8-Ff"

(3-2)The expected value of S is (t - n + l)/(n - I), andthe standard deviation of S is slightly more.

For reasons of brevity, Algorithms X and Y are not included in this extended abstract. They are based on the ideas of Algorithms A and B. Algorithm X sets S to the minimum s ≥ 0 such that U ≤ F(s). Algorithm Y uses Newton's method to find the minimum s.

3.3. ALGORITHM Z

This method follows the basic form described at the beginning of Section 3.2. It uses the rejection technique developed in Section 2.3. There is some value of r > 1 such that when t ≤ rn, it is faster to generate S à la Algorithm X. Typical values of r are in the range 5-15. The subprogram below is the portion of Algorithm Z that generates S.

   if t ≤ rn then Use Algorithm X to generate S
   else begin
      repeat
         Generate an independent r. v. X
            with density or probability fn. g(x);
         Generate an independent uniform r. v. U;
         if U ≤ h(⌊X⌋)/cg(X) then break-loop
      until U ≤ f(⌊X⌋)/cg(X);
      S := ⌊X⌋
      end;

Our choice of parameters is

\[
g(x) = \frac{n}{t+x}\left(\frac{t}{t+x}\right)^{n}, \qquad x \ge 0;
\]
\[
c = \frac{t+1}{t-n+1};
\]
\[
h(s) = \frac{n}{t+1}\left(\frac{t-n+1}{t+s-n+1}\right)^{n+1}, \qquad s \ge 0. \tag{3-3}
\]

Lemma 5. The choices of g(x), c, and h(s) in (3-3) satisfy the relation

\[
h(s) \le f(s) \le c\,g(s+1), \qquad s \ge 0.
\]

This satisfies condition (2-5). Since g(x) is a monotone decreasing function, this also implies (2-4). The distribution for X is not a common distribution, as are the distributions for X₁ and X₂ in Problem 1, but it can also be generated quickly in constant time.
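Putting the pieces together, here is a Python sketch of the S-generator with this choice of parameters. X is generated in constant time by inverting its distribution function (here G(x) = 1 − (t/(t + x))ⁿ, obtained by integrating g); the fallback branch is a simple sequential search standing in for Algorithm X; and the threshold and helper names are ours:

    import random

    R = 10   # assumed threshold; typical values of r are 5-15

    def f(s, n, t):
        # f(s) from (3-2): n/(t+s+1) times the falling-power ratio
        # t to the n over (t+s) to the n, as a running product.
        prob = n / (t + s + 1)
        for i in range(n):
            prob *= (t - i) / (t + s - i)
        return prob

    def skip_length_z(n, t):
        # Number of records to skip before the next reservoir insertion.
        if t <= R * n:
            # Stand-in for Algorithm X: sequential search for the minimum
            # s >= 0 with U <= F(s), via the rising-power form of (3-1).
            U, s = random.random(), 0
            ratio = (t + 1 - n) / (t + 1)
            while ratio > 1.0 - U:
                s += 1
                ratio *= (t + 1 - n + s) / (t + 1 + s)
            return s
        while True:
            V = 1.0 - random.random()              # V in (0, 1]
            X = t * (V ** (-1.0 / n) - 1.0)        # invert G(x)
            s, U = int(X), random.random()
            cg = ((t + 1) / (t - n + 1)) * (n / (t + X)) * (t / (t + X)) ** n
            h = (n / (t + 1)) * ((t - n + 1) / (t + s - n + 1)) ** (n + 1)
            if U <= h / cg or U <= f(s, n, t) / cg:
                return s

In the skeleton of Section 3.2, t then advances by skip_length_z(n, t) + 1 before the replacement step.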

3.4. OPTIMIZATION TECHNIQUES

As in Section 2.4, we can reduce the number of uniform random variates required to generate S to roughly one per loop. In addition, similar techniques can be applied to Algorithms X, Y, and Z so that the random integer M need not be generated in order to decide which record to replace in the reservoir. Similarly, the number of exponentiation operations can be reduced from two per loop to one per loop using techniques similar to those in Section 2.4. The details are omitted for lack of space.

3.5. ANALYSIS OF ALGORITHM Z

Average Number V(n, N) of Uniform Random Variates.

Theorem 3. The average number V(n, N) of uniform random variates is bounded by

\[
V(n, N) \le \begin{cases} n(H_N - H_{rn}) + \dfrac{n(n+1)}{rn - n - 1}, & \text{if } rn < N; \\[1ex] n(H_N - H_n), & \text{if } rn \ge N. \end{cases} \tag{3-4}
\]

Typical values of r are in the range 5-15. The notation H_k denotes the kth harmonic number, defined by $H_k = \sum_{1\le i\le k} 1/i$.

Average Execution Time T(n, N). We let T(n, N) denote the total average running time of Algorithm Z.

Theorem 4. The average running time T(n, N) of Algorithm Z is bounded by O(n(1 + log(N/n))).

The main task in the proof is showing that the evaluation of f(⌊X⌋), which requires O(min{n, ⌊X⌋ + 1}) time, does not contribute much to T(n, N). The fact that t changes during the course of the algorithm adds extra difficulty.

4. CONCLUSIONS AND RELATED WORK

This paper discusses the problem of how to select n records for a sample out of a pool of N records, or equivalently, how to select n integers out of the first N whole numbers. We have presented several algorithms that solve two problems of random sampling: one in which N is known, and the other in which N cannot be predetermined efficiently. All the algorithms for the first problem are ideally suited to online use, since the records are selected iteratively in the same order as they appear in the file. The performances of all the algorithms discussed in this paper are summarized in the tables at the end of Section 1. Timings of three algorithms for the first problem are given in Section 2.6. The detailed proofs and algorithms are given in [Vitter 83a, 83b].

The main results of this paper are the design and analysis of Algorithms D and Z, which use a rejection technique to do the sampling in optimum time. Algorithms D and Z are also short and easy to implement, since the complexity is in the choice of g(x), c, and h(s), not in the program. Algorithm D, which solves the case when N is known, runs in O(n) time, on the average. The constant factor associated with the big-oh term is very small, due to the statistical and programming techniques used in Section 2.4 to further optimize the algorithm; it requires the generation of approximately n uniform random variates and roughly n exponentiation operations, on the average. The mechanism for generating S(n, N) gives an optimum solution to the open problem listed in exercise 3.4.2-8 of [Knuth 81]. The analysis is outlined in Section 2.5. Algorithm Z, which handles the case when N cannot be predetermined, uses approximately n ln(N/n) uniform random variates, and it runs in O(n(1 + log(N/n))) time, on the average.

There are a couple of other interesting methods that have been developed independently. The online sequential algorithms in [Kawarasaki and Sibuya 82] use a rather complicated version of the rejection technique that does not run in O(n) time; preliminary analysis indicates that the algorithms run in O(N/n²) time. They are fast only when n is not too small, but not too large; Algorithm A may well be faster in practical situations.

Bentley [Bentley 82] has proposed the following two-pass method that is not online, but does run in O(n) time. In the first pass, a random sample of size greater than n is generated by truncating a random sample of cn uniform real numbers in the range [0, N + 1), for some constant c > 1; the real numbers can be generated sequentially by the algorithms in [Bentley and Saxe 80]. If the resulting sample has size m ≥ n, then Algorithm S (or A) is applied to the sample of size m to produce the final sample of n records; if m < n, then the first pass is repeated. The parameter c > 1 is chosen to be as small as possible, but large enough to make it very unlikely that the first pass must be repeated; the optimum value of c can be determined easily for any given implementation. During the first pass, the indices of the records chosen for the sample are stored in an array or linked list, which requires O(cn) space; however, this storage requirement can be avoided if the random number generator can be re-seeded for the second pass, so that the program can regenerate the indices on the fly. When re-seeding is done, assuming that the first pass does not have to be repeated, the program requires m + cn random number generations and the equivalent of cn exponentiation operations.
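A Python sketch of the two-pass method (without the re-seeding refinement; the constant and names here are our assumptions):

    import math, random

    def two_pass_sample(n, N, c=1.5):
        # First pass: floor ceil(c*n) uniforms on [0, N+1) to get m >= n
        # distinct candidate indices; repeat the (rare) failing pass.
        while True:
            candidates = sorted({int(random.random() * (N + 1))
                                 for _ in range(math.ceil(c * n))})
            if len(candidates) >= n:
                break
        # Second pass: thin the m candidates to exactly n with an
        # Algorithm-S-style scan over the candidate list.
        sample, remaining, needed = [], len(candidates), n
        for index in candidates:
            if remaining * random.random() < needed:
                sample.append(index)
                needed -= 1
            remaining -= 1
        return sample

    print(two_pass_sample(5, 100))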

Preliminary study indicates that roundoff error is insignificant in the algorithms in this paper. The random variates S generated by Algorithm D pass the standard statistical tests.

The ideas in this paper have other applications as well. Research is currently underway to see if the rejection technique can be extended to generate the kth record of a random sample of size n in constant time, on the average. One possible approach is to approximate the distribution of the index of the kth record by the beta distribution with parameters a = k and b = n - k + 1 and normalized to the interval [0, N]. An alternate approximation is the negative binomial distribution. Possibly the rejection technique combined with a partitioning approach can give the desired result.

ACKNOWLEDGMENTS

The author would like to thank Phil Heidelberger for discussions on ways to reduce the number of random variates generated per loop in Algorithm D from two to one.

REFERENCES

1. J. L. Bentley. Private communication (December 1982).

2. J. L. Bentley and J. B. Saxe. Generating Sorted Lists of Random Numbers. ACM Trans. on Math. Software 6, 3 (Sept. 1980), 359-364.

3. C. T. Fan, M. E. Muller, and I. Rezucha. Development of Sampling Plans by Using Sequential (Item by Item) Selection Techniques and Digital Computers. American Statistical Assn. Journal 57 (June 1962), 387-402.

4. T. G. Jones. A Note on Sampling a Tape File. Communications of the ACM 5, 6 (June 1962), 343.

5. J. Kawarasaki and M. Sibuya. Random Numbers for Simple Random Sampling Without Replacement. Keio Math. Sem. Rep. No. 7 (1982), 1-9.

6. D. E. Knuth. The Art of Computer Programming. Volume 2: Seminumerical Algorithms. Addison-Wesley, Reading, MA (second edition 1981).

7. E. E. Lindstrom and J. S. Vitter. The Design and Analysis of BucketSort: A Fast External Sorting Method for Use with Associative Secondary Storage. Technical Report ZZ20-6455, IBM Palo Alto Scientific Center (December 1981).

8. E. E. Lindstrom and J. S. Vitter. Analysis of BucketSort for Bubble Memory Secondary Storage. Technical Report ZZ20-6458, IBM Palo Alto Scientific Center (October 1982). Patent pending.

9. J. S. Vitter. Faster Methods for Random Sampling. Technical Report CS-82-21, Brown University (August 1982).

10. J. S. Vitter. Random Sampling with a Reservoir. Technical Report CS-83-17, Brown University (July 1983).